Approximate Bayesian Discrimination between Alternative DNA Mosaic Structures

Dirk Husmeier and Frank Wright




Biomathematics and Statistics Scotland
SCRI, Invergowrie, Dundee DD2 5DA, United Kingdom
Phone: ++44 1382 562731
E-mail: dirk@bioss.ac.uk





We derive an approximate Bayesian hypothesis test to discriminate between alternative mosaic structures of DNA sequence alignments, and test the viability of this approach on a set of synthetic and real-world DNA sequence alignments.



INTRODUCTION

Interest has recently grown in sporadic recombination as an important, and previously underestimated, source of genetic diversification in bacteria and viruses. The exploitable consequence of this process, in which DNA subsequences are exchanged between different strains or species, is that the DNA sequence alignment of the involved taxa has a mosaic structure, with different regions corresponding to different phylogenetic topologies. While several methods for identifying the nature and the breakpoints of this mosaic structure have been developed, they do not satisfactorily address the question of whether the inferred mosaic structure is statistically significant. The aim of this paper, therefore, is to devise a hypothesis test to discriminate between alternative candidate mosaic structures.



METHOD

Let D denote a DNA sequence alignment, and H a hypothesis about its mosaic structure. In the absence of prior knowledge favouring either hypothesis, we should discriminate between alternative mosaic structures HA and HB on the basis of the Bayes factor P(D|HB) / P(D|HA) which, if greater than 1, suggests rejecting HA in favour of HB. Bayesian hypothesis testing and model selection are thus based on the computation of the marginal log likelihood or evidence

$$\log P(D \mid H) = \log \int P(D \mid q, H)\, P(q \mid H)\, dq$$

where q is the vector of all model parameters (the branch lengths of the phylogenetic trees and the nucleotide substitution parameters), which has the prior probability distribution P(q|H). While this integral is analytically intractable, a partial factorisation of the posterior distribution P(q|D, H) and the Laplace method lead to a tractable approximate expression. The technical details have to be omitted here due to space restrictions, but they can be found in a technical report available from the following URL: http://www.bioss.sari.ac.uk/~dirk/papers/BayesSegmentation.pdf.
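To make the role of the Laplace method concrete, the following is a minimal sketch on a toy problem, not the phylogenetic model of this paper. It assumes a single parameter q = mu, Gaussian data with known standard deviation sigma, and a Gaussian prior with standard deviation tau; all function names are hypothetical. The hypothesis HA fixes mu = 0 and has no free parameter, while HB lets mu vary. In this toy model the posterior is itself Gaussian, so the Laplace approximation happens to be exact, which is not the case for alignments.

```python
import math


def log_lik(data, mu, sigma=1.0):
    """Gaussian log likelihood log P(D | mu) with known noise sd sigma."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - ss / (2.0 * sigma ** 2)


def log_prior(mu, tau):
    """Gaussian prior log P(mu | H) with mean 0 and sd tau (an assumption)."""
    return -0.5 * math.log(2.0 * math.pi * tau ** 2) - mu ** 2 / (2.0 * tau ** 2)


def laplace_log_evidence(data, tau, sigma=1.0):
    """Laplace approximation of log P(D | H) for one free parameter (k = 1):
    log P(D|q*) + log P(q*|H) + (k/2) log(2 pi) - 0.5 log A,
    where q* is the posterior mode and A the curvature of the
    negative log posterior at q*."""
    n = len(data)
    curvature = n / sigma ** 2 + 1.0 / tau ** 2
    mu_star = (sum(data) / sigma ** 2) / curvature  # posterior mode
    return (log_lik(data, mu_star, sigma) + log_prior(mu_star, tau)
            + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(curvature))


def log_bayes_factor(data, tau, sigma=1.0):
    """log P(D|HB) - log P(D|HA); HA has no free parameter, so its
    evidence is just the likelihood at mu = 0."""
    return laplace_log_evidence(data, tau, sigma) - log_lik(data, 0.0, sigma)
```

A positive value of `log_bayes_factor` corresponds to a Bayes factor greater than 1 and so favours HB. When the data are centred near zero, the value is negative even though HB fits at least as well as HA, because the prior and curvature terms penalise the extra parameter.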



TEST DATA

Synthetic Data.   We simulated sporadic recombination events in a synthetic population of 8 strains. The DNA sequences of the taxa were evolved down the branches of the phylogenetic tree of Figure 1, using the Kimura 2-parameter model of nucleotide substitution with a transition-transversion ratio of 2. Partial sequences were generated from different topologies, as indicated in the figure, and then spliced together. This simulates the exchange of DNA sequences between different strains of the population. We generated DNA sequence alignments for different values of the unit branch length, varying between 0.1 and 0.01.
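As an illustration of the simulation step, here is a minimal sketch of evolving a sequence along a single branch under the Kimura 2-parameter model. It is not the simulation code used for this study. The function names are hypothetical, the branch length d is taken to be the expected number of substitutions per site, and kappa is taken to be the ratio of the transition rate to the rate of each individual transversion. If the ratio quoted above is instead defined as the transition rate over the total transversion rate, kappa would be twice that value.

```python
import math
import random

# Purine <-> purine and pyrimidine <-> pyrimidine changes are transitions.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def k2p_probs(d, kappa):
    """Return (p_same, p_transition, p_each_transversion) for a branch of
    length d (expected substitutions per site) under Kimura 2-parameter."""
    beta_t = d / (kappa + 2.0)   # rate x time for each transversion type
    alpha_t = kappa * beta_t     # rate x time for the transition
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2
    p_tv = 0.25 - 0.25 * e1
    p_same = 1.0 - p_ts - 2.0 * p_tv
    return p_same, p_ts, p_tv


def evolve(seq, d, kappa, rng):
    """Evolve each site of seq independently along one branch."""
    p_same, p_ts, _ = k2p_probs(d, kappa)
    out = []
    for base in seq:
        u = rng.random()
        if u < p_same:
            out.append(base)
        elif u < p_same + p_ts:
            out.append(TRANSITION[base])
        else:
            out.append(rng.choice(TRANSVERSIONS[base]))
    return "".join(out)
```

A full simulation would apply `evolve` recursively from the root to every leaf of each topology, and a mosaic alignment would then be obtained by concatenating, for each taxon, the partial sequences generated under the different topologies.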


Figure 1: Left: Phylogenetic tree for the synthetic problem. DNA sequences (5000 bp long) were evolved along the tree, using the Kimura 2-parameter model (transition-transversion ratio = 2) of nucleotide substitution. Two recombination events, involving closely related and distantly related taxa, were simulated by swapping the indicated lineages. Right: Mosaic structures of DNA sequence alignments. The first two rows show the null hypothesis (top) and the true mosaic structure (second from the top). The remaining rows show alternative mosaic structures: 1) Segmentation of only the left recombinant region. 2) Segmentation of only the right recombinant region. 3) Subdivision of both recombinant regions. 4) Subdivision of the three non-recombinant regions. 5) Subdivision of all regions. 6) Correct segmentation with a slight misplacement of the breakpoints, shifted by 10 nucleotides to the right.

Hepatitis B Virus.   Hepatitis B is caused by a DNA virus with a short genome of 3200 bp. Evidence for recombination was first found in [1]. The sequences used in this paper include two recombinant strains (accession numbers D00329 and X68292), and eight non-recombinant strains (accession numbers D00330, D00630, L27106, M32138, M54923, M57663, V00866, X01587). The sequences were aligned with ClustalW, using the default parameters. Columns with gaps were discarded, giving a total alignment length of 3049 nucleotides. The recombinant breakpoints found in [1] are at positions 603, 1882, 2071, and 2238. (The recombination breakpoints in the original data set prior to discarding columns with gaps were at positions 735, 2014, 2203, 2370.)
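The renumbering of breakpoints after discarding gapped columns can be sketched as follows. Each of the four positions quoted above shifts by the same amount, 132. The helper names are hypothetical, the toy alignment in the usage note is not taken from the data, and "-" is assumed to be the gap character.

```python
import bisect


def gap_free_columns(alignment):
    """Return the 1-based indices of columns in which no sequence has a gap."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [j + 1 for j in range(length)
            if all(row[j] != "-" for row in alignment)]


def strip_gap_columns(alignment):
    """Discard every column that contains at least one gap."""
    keep = gap_free_columns(alignment)
    return ["".join(row[j - 1] for j in keep) for row in alignment]


def remap_position(position, kept):
    """Map a 1-based position in the original alignment to the gap-free
    alignment: the number of retained columns at or before that position."""
    return bisect.bisect_right(kept, position)
```

For the toy alignment `["AC-GT", "ACAGT", "A-AGT"]`, columns 2 and 3 are discarded, so original position 4 becomes position 2 in the gap-free alignment.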

Neisseria.   We analysed a 787 bp DNA sequence alignment of the argF gene of eight strains of Neisseria with the following GenBank accession numbers: X64860, X64861, X64866, X64869, X64870, X64871, X64872, X64873. This data set was used in [2], where a recombinant region between positions 1 and 202 and a differently diverged region between nucleotides 508 and 538 were found. (The numbering scheme for the bases in [2] starts at 296 bp and ends at 1083 bp, so the locations of the breakpoints have to be shifted by 295 bp.)


Figure 2: Mosaic structures of real-world DNA sequence alignments. The null hypothesis and the true mosaic structure are shown in the first two lines; the other lines show alternative mosaic structures. Left: Neisseria. 1) Only the left recombinant region resolved. 2) Only the right recombinant region resolved. 3) Subdividing the left non-recombinant region. 4) Subdividing the right non-recombinant region. 5) Subdividing both non-recombinant regions. Right: Hepatitis B virus. 1) Merging of the first three regions. 2) Merging of the first two regions. 3) Merging of the last three regions. 4) Merging of the last two regions. 5) Subdividing the largest region. 6) Subdividing the last region. 7) Subdividing both regions.



RESULTS

Figures 1 and 2 show different candidate mosaic structures of the DNA sequence alignments discussed in the previous section. These segmentations include the null hypothesis (one homogeneous region), the true mosaic structure, and various alternative mosaic structures in which the true sub-regions have either been partially merged or further subdivided. The task is to find the true mosaic structure, and to test how reliable a selection criterion the approximate Bayesian evidence is. We first selected the mosaic structure with the maximum likelihood. This identified the correct mosaic structure in only a single case and usually preferred the more finely tessellated alternatives (overfitting). In contrast, the approximate evidence was consistently maximised for the correct mosaic structure, and thus proved to be a reliable selection criterion. For details, see http://www.bioss.sari.ac.uk/~dirk/papers/BayesSegmentation.pdf.


REFERENCES

  1. Bollyky, Rambaut, Harvey, Holmes: J. Mol. Evol. 42, 97-102, 1996
  2. Zhou, Spratt: Molecular Microbiology 6, 2135-2146, 1992