In Silico Biology 5, 0027 (2005); ©2005, Bioinformation Systems e.V.  

CBCAnalyzer: inferring phylogenies based on compensatory base changes in RNA secondary structures

Matthias Wolf, Joachim Friedrich, Thomas Dandekar* and Tobias Müller

Department of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, D-97074 Würzburg, Germany

* Corresponding author; phone: +49-931-888 4551; fax: +49-931-888 4552

Edited by E. Wingender; received December 20, 2004; revised and accepted March 03, 2005; published March 16, 2005


The CBCAnalyzer (CBC = compensatory base change) is a custom-written software toolbox consisting of three parts: CTTransform, CBCDetect, and CBCTree. CTTransform reads several ct-file formats and generates a so-called "bracket-dot-bracket" format that is typically used as input for other tools such as RNAforester, RNAmovie or MARNA. MARNA creates a multiple alignment based on primary sequences and secondary structures, which can then be used as input for CBCDetect. CBCDetect counts CBCs between all pairs of aligned sequences. This is important for detecting species that are discriminated by their sexual incompatibility. The count (distance) matrix obtained by CBCDetect is used as input for CBCTree, which reconstructs a phylogram using the BIONJ algorithm. In this note we describe the features of the toolbox as well as application examples. The toolbox provides a graphical user interface. It is written in C++ and freely available at:

Keywords: compensatory base change (CBC), internal transcribed spacer 2 (ITS2), phylogeny, rRNA, secondary structure


When there is a CBC in the internal transcribed spacer 2 (ITS2), there is no sexual compatibility

According to Coleman and Vacquier, 2002, "all [...] eukaryote groups where a broad array of species has been compared for both [rRNA] ITS2 sequence secondary structure and tested for any vestige of interspecies sexual compatibility, an interesting correlation has been found. When sufficient evolutionary distance has accumulated to produce even one CBC in the relatively conserved pairing positions of the ITS2 transcript secondary structure, taxa differing by the CBC are observed experimentally to be totally incapable of intercrossing." However, to analyze CBCs in secondary structures of rRNA gene sequences, that is, to ascertain where in a phylogenetic lineage the first CBC appears that involves a pairing of two relatively conserved positions, two types of information are necessary besides a phylogenetic tree: (1) predicted secondary structures, and (2) a multiple alignment guided by the primary sequences and the secondary structures.
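To make the notion of a CBC concrete, the following minimal Python sketch classifies the change at one paired position between two aligned sequences. It is an illustration only and not code from the CBCAnalyzer: the function name `classify_pair_change` is hypothetical, G-U wobble pairs are assumed to count as valid pairs, and the label "hemi-CBC" (a change on one side of the pair only) is a convention from the wider ITS2 literature that this note does not use.

```python
# Watson-Crick and G-U wobble pairs; accepting wobble pairs is an assumption of this sketch.
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def _normalize(pair):
    """Upper-case both bases and read DNA-style T as U."""
    return tuple(base.upper().replace("T", "U") for base in pair)


def classify_pair_change(pair_a, pair_b):
    """Compare the same paired position in two aligned sequences.

    Returns "CBC" if both bases differ while pairing is retained,
    "hemi-CBC" if exactly one base differs while pairing is retained,
    and "none" otherwise (identical pairs, or pairing lost in either sequence).
    """
    pair_a, pair_b = _normalize(pair_a), _normalize(pair_b)
    if pair_a not in VALID_PAIRS or pair_b not in VALID_PAIRS:
        return "none"
    differing = sum(x != y for x, y in zip(pair_a, pair_b))
    if differing == 2:
        return "CBC"
    if differing == 1:
        return "hemi-CBC"
    return "none"
```

Under these assumptions, a G-C pair replaced by A-U is a CBC, whereas G-C replaced by G-U changes only one side of the pair.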

About the application and programs connected to it

For the prediction of RNA secondary structures, many bioinformatics applications are available, e.g., the Vienna package [Hofacker, 2003] or Mfold [Zuker, 2003]. We prefer RNAstructure [Mathews et al., 1999] because one of its features can be used to set structural constraints: when sequences are folded, lower-case bases are not allowed to base pair. This is a very important feature: knowing one structure, we can use structural constraints to fold closely related sequences in the same way, and then look for CBCs. However, the output of RNAstructure (ct-files) cannot be used directly as input for, e.g., MARNA [Siebert and Backofen, 2003], which generates a multiple alignment guided by the primary sequences and the secondary structures. Hence, we use a flexible format conversion routine (CTTransform) to convert ct-files into the bracket notation known from the Vienna package, which is readable by MARNA and/or by RNAforester [Höchsmann et al., 2003]. (Note that RNAforester already visualizes CBCs, but generates only optimal pairwise alignments guided by the primary sequences and the secondary structures.) CTTransform inputs are single ct-files (optimal secondary structures) from different RNA sequences. Note that several ct-formats (ct, RnaViz ct and Mac ct) can be transformed.

Once the converted ct-files have been processed by MARNA or RNAforester, the resulting output is used as input for our second routine (CBCDetect, Fig. 1), which counts compensatory base changes between all pairs of aligned sequences. If a multiple alignment is used, the output is a count matrix (uncorrected p-distances; see Fig. 1) that can be used as input for PHYLIP [Felsenstein, 1993] or BIONJ [Gascuel, 1997]. The BIONJ algorithm is implemented in CBCTree to reconstruct a simplified phylogram based directly and only on CBCs. Output is given in Newick format, which can be used, e.g., as input for Treeview [Page, 1996], NJplot [Perriere and Gouy, 1996] or a hyperbolic tree [Lamping et al., 1995] viewer such as HyperGeny [De Praetere et al., 2004]. A program flowchart is given in Fig. 2.
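The all-against-all counting step can be sketched in a few lines of Python. This is a hedged illustration, not the CBCDetect implementation (which is written in C++ and not shown in this note). It assumes that the alignment has already been parsed into (name, aligned sequence, aligned bracket string) tuples, that a CBC is counted only where a column pair is paired in both structures, and that G-U wobble pairs are valid. All function names and the toy data are invented; the actual MARNA output format is not parsed here.

```python
from itertools import combinations

# Watson-Crick and G-U wobble pairs; treating wobble pairs as valid is an assumption.
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def paired_columns(structure):
    """Return the set of (i, j) alignment columns joined by matching brackets."""
    stack, pairs = [], set()
    for col, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(col)
        elif symbol == ")":
            pairs.add((stack.pop(), col))
    return pairs


def count_cbcs(seq_a, struct_a, seq_b, struct_b):
    """Count column pairs that are paired in both structures and changed on both sides."""
    seq_a = seq_a.upper().replace("T", "U")
    seq_b = seq_b.upper().replace("T", "U")
    shared = paired_columns(struct_a) & paired_columns(struct_b)
    count = 0
    for i, j in shared:
        pair_a, pair_b = seq_a[i] + seq_a[j], seq_b[i] + seq_b[j]
        both_valid = pair_a in VALID_PAIRS and pair_b in VALID_PAIRS
        if both_valid and pair_a[0] != pair_b[0] and pair_a[1] != pair_b[1]:
            count += 1
    return count


def cbc_matrix(entries):
    """Build the symmetric count matrix from (name, sequence, structure) tuples."""
    n = len(entries)
    matrix = [[0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        _, seq_a, struct_a = entries[a]
        _, seq_b, struct_b = entries[b]
        matrix[a][b] = matrix[b][a] = count_cbcs(seq_a, struct_a, seq_b, struct_b)
    return matrix


def to_phylip(names, matrix):
    """Format a square matrix in PHYLIP distance layout (names padded to 10 characters)."""
    rows = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        rows.append(name[:10].ljust(10) + " ".join(f"{value:.4f}" for value in row))
    return "\n".join(rows)


# Invented toy alignment: three sequences sharing one hairpin.
example = [
    ("taxon1", "GGCAAAGCC", "(((...)))"),
    ("taxon2", "GACAAAGUC", "(((...)))"),  # G-C -> A-U at the middle pair: one CBC vs taxon1
    ("taxon3", "GGCAAAGUC", "(((...)))"),  # G-C -> G-U: one side changed, not counted
]
matrix = cbc_matrix(example)
print(to_phylip([name for name, _, _ in example], matrix))
```

In this toy data only taxon1 and taxon2 differ by a CBC. The resulting square matrix could then be passed to a distance-based tree program; the tree reconstruction itself (BIONJ in CBCTree) is not reproduced here.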

Figure 1: CBCAnalyzer screenshot. CBCDetect input (MARNA alignment) and output (distance matrix).

Figure 2: CBCAnalyzer flowchart, connecting CTTransform, CBCDetect and CBCTree. All connections are explained in the text. Note especially RNAmovie, which helps to detect flexibility in secondary structure formation as well as secondary structure configuration switches.

Examples and pitfalls

A test version of the toolbox was successfully applied to the data in Schmitt et al., 2004, dealing with the phylogeny of sponges. Furthermore, an ITS2 dataset of closely related scenedesmacean taxa, originally published by Hegewald and Wolf, 2003, was re-examined: reconstructing a phylogeny from this dataset using either the complete alignment or the CBC information only yields an identical tree topology. Moreover, 'subspecies', as expected, accumulated no CBCs at all. This example is shown in the help section of the program. However, it should be mentioned that the link between CBCs and the incapability of intercrossing is a one-way logical implication: if there is a CBC, this is a barrier against intercrossing, but if there is no CBC at all, the corresponding sequences do not necessarily belong to the same species. A CBC may be a good phylogenetic marker, but the accumulation of a CBC is generally a rare event in evolution; because of the small number of CBCs, CBCTree should be used only for reconstructing the phylogeny of a small set of closely related taxa. If there are unexpectedly many CBCs, visualizing a CBC tree will more likely reveal alignment shifts caused by misaligned sequences and/or misaligned structures.

Additional features

CTTransform also accepts a multi-structure ct-file (optimal and suboptimal secondary structures) from a single sequence. The output is then a movie script that can be visualized by RNAmovie [Evers and Giegerich, 1999]. Hence, flexibility in secondary structure formation as well as secondary structure configuration switches can be detected.
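As an illustration of the conversion step, the following Python sketch reads a ct-file containing one or more structures and returns one bracket string per structure. It is not the CTTransform code and rests on several assumptions: only the common ct layout is handled (a header line starting with the number of bases, then one line per base with the pairing partner in the fifth column, 0 meaning unpaired); the RnaViz and Mac variants and pseudoknots are not handled; and the movie script itself is not generated, since its format is not described in this note. The function name `parse_ct` and the toy input are invented.

```python
def parse_ct(text):
    """Yield (title, sequence, dot_bracket) for every structure in ct-formatted text."""
    lines = [line for line in text.splitlines() if line.strip()]
    pos = 0
    while pos < len(lines):
        header = lines[pos].split()
        length = int(header[0])          # first header token: number of bases
        title = " ".join(header[1:])     # remainder: energy and/or sequence name
        bases, brackets = [], []
        for row in lines[pos + 1 : pos + 1 + length]:
            fields = row.split()
            index, base, partner = int(fields[0]), fields[1], int(fields[4])
            bases.append(base)
            if partner == 0:
                brackets.append(".")     # unpaired base
            elif partner > index:
                brackets.append("(")     # partner lies downstream: opening bracket
            else:
                brackets.append(")")     # partner lies upstream: closing bracket
        yield title, "".join(bases), "".join(brackets)
        pos += 1 + length                # jump to the next structure's header


# Invented toy file: the same 5-base sequence with a paired and an unpaired structure.
CT_TEXT = """
5 ENERGY = -1.0 toy
1 G 0 2 5 1
2 A 1 3 0 2
3 A 2 4 0 3
4 A 3 5 0 4
5 C 4 0 1 5
5 ENERGY = -0.5 toy
1 G 0 2 0 1
2 A 1 3 0 2
3 A 2 4 0 3
4 A 3 5 0 4
5 C 4 0 0 5
"""
structures = list(parse_ct(CT_TEXT))
for title, sequence, dot_bracket in structures:
    print(title, sequence, dot_bracket)
```

A file with a single structure is simply the special case of one header block, so the same routine covers the single ct-files described earlier for CTTransform.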


Special thanks to Annette Wolf for valuable discussions.