Biomathematics and Statistics Scotland
SCRI, Invergowrie, Dundee DD2 5DA, United Kingdom
Phone: ++44 1382 562731
E-mail: dirk@bioss.ac.uk
We derive an approximate Bayesian hypothesis test to discriminate between alternative mosaic structures of DNA sequence alignments, and test the viability of this approach on a set of synthetic and real-world DNA sequence alignments.
There has recently been increased interest in sporadic recombination as an important, and previously underestimated, source of genetic diversification in bacteria and viruses. The exploitable consequence of this process, in which DNA subsequences are exchanged between different strains or species, is that the DNA sequence alignment of the taxa involved has a mosaic structure, with different regions corresponding to different phylogenetic topologies. While several methods for identifying the nature and the breakpoints of this mosaic structure have been developed, they do not satisfactorily address the question of whether the inferred mosaic structure is statistically significant. The aim of this paper, therefore, is to devise a hypothesis test to discriminate between alternative candidate mosaic structures.
Let D denote a DNA sequence alignment, and H a hypothesis about its mosaic structure. In the absence of prior knowledge favouring either hypothesis, we should discriminate between alternative mosaic structures H_A and H_B on the basis of the Bayes factor P(D|H_B) / P(D|H_A) which, if greater than 1, suggests rejecting H_A in favour of H_B. Bayesian hypothesis testing and model selection are thus based on the computation of the marginal log likelihood, or evidence,
$$\log P(D \mid H) = \log \int P(D \mid q, H)\, P(q \mid H)\, dq,$$
where q is the vector of all model parameters (the branch lengths of the phylogenetic trees and the nucleotide substitution parameters), with prior probability distribution P(q | H). Although this integral is intractable, a partial factorisation of the posterior distribution P(q | D, H) combined with the Laplace method leads to a tractable approximate expression. The technical details are omitted here for reasons of space; they can be found in a technical report available at http://www.bioss.sari.ac.uk/~dirk/papers/BayesSegmentation.pdf.
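For orientation, the generic textbook form of the Laplace approximation to the evidence is sketched below. It is not necessarily the expression derived in the technical report, which additionally relies on a partial factorisation of the posterior. The symbols q* (the posterior mode), d (the number of parameters in q) and A (the negative Hessian of the log joint at the mode) are notation introduced here for illustration.

```latex
\log P(D \mid H) \;\approx\; \log P(D \mid q^{*}, H) + \log P(q^{*} \mid H)
  + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det A,
\qquad
A = -\nabla_q \nabla_q \log\bigl[ P(D \mid q, H)\, P(q \mid H) \bigr] \Big|_{q = q^{*}}
```

In this generic form, the first term rewards goodness of fit, while the remaining terms depend on the prior and on the curvature at the mode, so a hypothesis with more parameters is not automatically favoured.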
Synthetic Data. We simulated sporadic recombination events in a synthetic population of 8 strains. The DNA sequences of the taxa were evolved down the branches of the phylogenetic tree of Figure 1, using the Kimura 2-parameter model of nucleotide substitution with a transition-transversion ratio of 2. Partial sequences were generated from different topologies, as indicated in the figure, and then spliced together. This simulates the exchange of DNA sequences between different strains of the population. We generated DNA sequence alignments for different values of the unit branch length, varying between 0.1 and 0.01.
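The simulation procedure can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiments: the four-taxon topologies, the segment lengths, the nested-tuple tree encoding and all function names are hypothetical, and every branch is given the same unit length. The parameter kappa is taken to be the ratio of the transition rate to the per-type transversion rate; whether this matches the convention behind the ratio of 2 quoted above is an assumption.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def k2p_probs(d, kappa):
    """Kimura 2-parameter probabilities for a branch of length d
    (expected substitutions per site).

    Returns (p_same, p_transition, p_each_transversion).
    """
    beta_t = d / (kappa + 2.0)   # transversion rate (per type) times time
    alpha_t = kappa * beta_t     # transition rate times time
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2
    p_tv = 0.25 - 0.25 * e1
    return p_same, p_ts, p_tv


def evolve(seq, d, kappa, rng):
    """Evolve a sequence along one branch, site by site."""
    p_same, p_ts, _ = k2p_probs(d, kappa)
    out = []
    for base in seq:
        u = rng.random()
        if u < p_same:
            out.append(base)
        elif u < p_same + p_ts:
            out.append(TRANSITION[base])
        else:
            out.append(rng.choice(TRANSVERSIONS[base]))
    return "".join(out)


def simulate(tree, parent_seq, d, kappa, rng):
    """Evolve parent_seq down a nested-tuple tree; leaves are taxon names."""
    if isinstance(tree, str):
        return {tree: parent_seq}
    leaves = {}
    for child in tree:
        child_seq = evolve(parent_seq, d, kappa, rng)
        leaves.update(simulate(child, child_seq, d, kappa, rng))
    return leaves


def random_root(n, rng):
    """Draw a uniformly random root sequence of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


# Hypothetical example: two topologies over four taxa, spliced per taxon.
rng = random.Random(0)
topo_1 = (("s1", "s2"), ("s3", "s4"))
topo_2 = (("s1", "s3"), ("s2", "s4"))
left = simulate(topo_1, random_root(200, rng), 0.05, 2.0, rng)
right = simulate(topo_2, random_root(100, rng), 0.05, 2.0, rng)
alignment = {name: left[name] + right[name] for name in left}
```

The resulting alignment has a single breakpoint after column 200, with the two regions generated under different topologies.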
Hepatitis B Virus. Hepatitis B is caused by a DNA virus with a short genome of 3200 bp. Evidence for recombination was first reported in [1]. The sequences used in this paper include two recombinant strains (accession numbers D00329 and X68292) and eight non-recombinant strains (accession numbers D00330, D00630, L27106, M32138, M54923, M57663, V00866, X01587). The sequences were aligned with ClustalW, using the default parameters. Columns with gaps were discarded, giving a total alignment length of 3049 nucleotides. The recombination breakpoints found in [1] are at positions 603, 1882, 2071, and 2238. (The recombination breakpoints in the original data set, prior to discarding columns with gaps, were at positions 735, 2014, 2203, and 2370.)
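Discarding gap columns changes the coordinate system, which is why two sets of breakpoint positions are quoted above (they differ by a constant 132 columns). The bookkeeping can be sketched as follows, using a toy alignment rather than the Hepatitis B data. The function names and the convention that a position maps to the number of retained columns at or before it are assumptions made for illustration.

```python
import bisect


def drop_gap_columns(alignment):
    """Remove every column that contains a gap in any sequence.

    Returns the gap-free alignment and the list of retained
    column numbers (1-based, in the original coordinates).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    kept = [i for i in range(length) if all(s[i] != "-" for s in seqs)]
    stripped = {
        name: "".join(s[i] for i in kept) for name, s in alignment.items()
    }
    return stripped, [i + 1 for i in kept]


def remap_position(pos, kept_columns):
    """Map a 1-based original position to the gap-free coordinate system
    by counting the retained columns at or before it."""
    return bisect.bisect_right(kept_columns, pos)


# Toy alignment with gaps in columns 3 and 5.
toy = {"a": "AC-GTAC", "b": "ACGG-AC"}
stripped, kept = drop_gap_columns(toy)
new_pos = remap_position(6, kept)  # original column 6 in new coordinates
```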
Neisseria. We analysed a 787 bp DNA sequence alignment of the argF gene of eight strains of Neisseria with the following GenBank accession numbers: X64860, X64861, X64866, X64869, X64870, X64871, X64872, X64873. This data set was used in [2], where a recombinant region between positions 1 and 202 and a differently diverged region between nucleotides 508 and 538 were found. (The numbering scheme for the bases in [2] starts at 296 bp and ends at 1083 bp, so the locations of the breakpoints have to be shifted by 295 bp.)
Figures 1 and 2 show different candidate mosaic structures of the DNA sequence alignments discussed in the previous section. These segmentations include the null hypothesis (one homogeneous region), the true mosaic structure, and various alternative mosaic structures in which the true sub-regions have either been partially merged or further sub-divided. The task is to find the true mosaic structure, and to test how reliable the approximate Bayesian evidence is as a selection criterion. We first selected the mosaic structure with the maximum likelihood. This criterion identified the correct mosaic structure in only a single case and usually preferred the more finely tessellated alternatives (overfitting). In contrast, the approximate evidence was consistently maximised for the correct mosaic structure, and thus proves to be a reliable selection criterion. For details, see http://www.bioss.sari.ac.uk/~dirk/papers/BayesSegmentation.pdf.
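One way to enumerate the merged candidates is to delete subsets of the breakpoints, since removing a breakpoint merges the two adjacent regions. The sketch below illustrates this with the Hepatitis B breakpoints quoted earlier. It does not reproduce the exact candidate sets of Figures 1 and 2, it omits the further sub-divided alternatives, and it ignores the topology assigned to each region. The function names and the convention that a breakpoint b ends one region at b and starts the next at b + 1 are assumptions.

```python
from itertools import combinations


def merged_candidates(breakpoints):
    """All breakpoint sets obtained by deleting breakpoints,
    from the null hypothesis (none) to the full set."""
    bps = sorted(breakpoints)
    candidates = []
    for k in range(len(bps) + 1):
        for subset in combinations(bps, k):
            candidates.append(list(subset))
    return candidates


def regions(breakpoints, length):
    """Convert breakpoints into 1-based inclusive (start, end) regions."""
    edges = [0] + sorted(breakpoints) + [length]
    return [(a + 1, b) for a, b in zip(edges[:-1], edges[1:])]


hbv_breakpoints = [603, 1882, 2071, 2238]
candidates = merged_candidates(hbv_breakpoints)
null_regions = regions(candidates[0], 3049)    # one homogeneous region
full_regions = regions(candidates[-1], 3049)   # all four breakpoints kept
```

Each candidate would then be scored, for example by its approximate log evidence, and the highest-scoring segmentation selected.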