| In Silico Biology 3, 0003 (2003); ©2003, Bioinformation Systems e.V. |
| BGRS 2002 |
Institute for Information Transmission Problems RAS, Moscow, Russia
1Email: lyubetsk@iitp.ru
2Email: vld@vyugin.mccme.rssi.ru
* corresponding author
Edited by E. Wingender; received September 30, 2002; revised and accepted February 22, 2003; published March 16, 2003
A new approach for comparative analysis of multiple trees reconstructed for representative protein families is proposed. This approach is based on the hypothesis of gene duplication, gene loss and horizontal gene transfer and makes use of stochastic methods and optimization. We present a species tree of 40 prokaryotic organisms obtained by our algorithm on the basis of 132 clusters of orthologous groups of proteins (COGs) from the GenBank of the National Center for Biotechnology Information (USA). We also present a computer technology intended to determine horizontally transferred genes. Some application results of the technology, based on comparative analysis of protein and species trees, are given.
Key words: evolution, phylogenetic trees, consensus trees, horizontal gene transfer, mathematical models of evolution, stochastic optimization
The availability of multiple complete genome sequences from diverse taxa makes it possible to develop new phylogenetic approaches, which incorporate information derived from comparative analysis of large gene sets. The main goal of this paper is to offer algorithmic and computer-implemented methods for the detection of horizontal gene transfer. Even though events of horizontal (lateral) gene transfer occur on the molecular level, our methods are exclusively based on comparative analysis of phylogenetic protein trees and species trees.
The first component of our computer technology is an algorithm for the reconstruction of phylogenetic species trees. We have improved our algorithm for this reconstruction from [V'yugin and Lyubetsky, 2001; V'yugin et al., 2002] by adding a new module for stochastic optimization. This algorithm uses a set of possibly contradictory gene (protein) trees as the initial data and produces a species tree as the consensus for these gene trees. The algorithm resorts to a model of gene duplications and losses that explains and measures the dissimilarity between any two given gene and species trees [V'yugin and Lyubetsky, 2001]. In the second part of our work we develop two methods for the prediction of horizontal gene transfer on the basis of the given gene and species trees. The first of the methods is based on the comparison of the combinatorial structures of the given gene and species trees. The second method uses a model of gene loss and duplication, which is extended by the lengths of the edges.
Several computer experiments were conducted using these methods. All of them were based on the data obtained from the National Center for Biotechnology Information (USA). The data include 132 clusters of orthologous groups of proteins (COGs). The genome sequences extracted from 40 living organisms related to this data formed the following 13 taxonomic groups:
Archaea (10 organisms), gamma-proteobacteria (7 organisms), beta-proteobacteria (1), alpha-proteobacteria (3), epsilon-proteobacteria (2), Gram-positive bacteria (8), chlamydiae (2), spirochaetes (2), DMS (3 species: Deinococcus radiodurans, Mycobacterium tuberculosis, Synechocystis from cyanobacteria), thermotogae & aquifex (2, precisely: Thermotoga maritima, Aquifex aeolicus). The list of the corresponding living organisms is given in Section "Experimental results". We use DMS as an acronym for these three species following [Wolf et al., 2001], and analogously the grouping of thermotogae & aquifex.
The corresponding 132 maximum-likelihood gene (protein) trees were constructed on the basis of these 132 clusters, and then were used as input data for our method of constructing the resulting consensus species tree. It is exactly this species tree that is constructed as the tree nearest to these 132 gene trees. Finally we present a method and an algorithm for the detection of horizontal gene transfer, and the corresponding list of genes that our algorithm has selected from all COGs as candidates for horizontal transfer.
Mapping of a gene tree into a species tree
A classification task often starts with a set of initial operational taxonomic units. In our case this unit is a gene or a group of genes (for example, a genome or a species). We consider gene trees and species trees. A node in a gene tree corresponds to a gene divergence event. The terminal nodes of a gene tree correspond to genes extracted from the genomes of living organisms. On the other hand, a node in a species tree corresponds to a species divergence event, and terminal nodes of a species tree correspond to living organisms whose evolution history is presented in the given species tree. Any gene divergence event can be the result of either speciation or duplication. If gene divergence occurs only with speciation, the gene and species tree coincide. However, if a gene is duplicated, gene divergence can occur without a speciation event, and then two duplicated alleles are located in one line of descent. One of them can disappear in the evolution process; this event is called a gene loss. To make these considerations precise, we formalize the concept of mapping a gene tree into a species tree [Goodman et al., 1979]. Consider two binary trees denoted by G in the case of any gene tree and by S in the case of any species tree. We identify any terminal node of a gene tree G with one-element set containing the identification mark of a gene. Any node of a species tree S is identified with the whole set of genes (rather, with their identification marks) included in the corresponding species genome. An internal node of trees G or S is assigned a set formed as union of the two sets assigned to its offsprings as follows. If the direct offsprings of an internal node are assigned the sets A and B then we assign the union A
B to this node. In this case the root of a tree corresponds to the union of all sets assigned to all terminal nodes. A node is identified with its set. It is evident that, for any two node sets, either one is a subset of the other (if one node is an offspring of the other) or else they have an empty intersection (if the opposite is true). Therefore, each phylogenetic tree is composed of clusters that are subsets of the set of all genes assigned to the terminal nodes. The gene tree is constructed for a given family of proteins (or corresponding genes). All proteins from this family have the same function and similar amino acid sequences. The structure of a gene tree reflects the evolution of proteins from this family. The corresponding species form species trees that reflect their evolution. Let it be recalled that we consider each species as a set composed of its genes (as part of the genome). Any gene tree G and any species tree S produce a unique map
: G
Sdefined as follows: for each g
G the value of
(g) is defined as the minimal (relative to the set-theoretic inclusion) s
S such that g
s. This mapping is obviously a tree homomorphism, since if g
g' then
(g)
(g').
Let us consider an internal node or the root g of a tree. We will designate its left direct offspring as cg and its right direct offspring as Cg. If g is not a root then pg designates the direct ancestor of g. If a gene tree and a species tree have the same structure then evidently the mapping
is an isomorphism. Consider the main features that show the difference between a tree homomorphism and a tree isomorphism. They are duplications, i. e. two nodes g and g' where g' is a direct offspring of g but
(g) =
(g'), and also intermediate gaps (in the range of values), i. e. for a node s in a species tree S we have
(g)
s
(pg), which means that a node s is located strictly between the nodes
(g) and
(pg). This node s is named the g-gap (or the g-intermediate node). Consider the set of all g-intermediate nodes Ig. Obviously, the set of all gaps corresponds to

The pair (g,s) where s = a(g), is named one-sided duplication provided only one of the following is true: either
(g) =
(cg) or
(g) =
(Cg). If both these equations are true, than the pair (g,s) is named two-sided duplication. We will denote the set of all one-sided duplications by O(G,S). Eulenstein and Vingron, (1995) introduced a measure of dissimilarity for a gene tree G and a species tree S as the function of comparison "cost":
They considered the function c(G,S) as a measure of difference between the homomorphism
and isomorphism (see also [Page and Charleston, 1997; Page, 1998; Guigo et al., 1996]). In V'yugin and Lyubetsky (2001) this idea was modified as follows.
Let the lengths c(a,b) of tree edges be given. We considered the function

measuring the dissimilarity between a gene tree G and a species tree S.
Here the first sum represents the total cost of all one-sided duplications. Similarly, the second sum represents the total cost of all intermediate gaps (in the range of values of tree homomorphism
). We exploit lengths c(a,b) in the sums instead of just counting the duplication and intermediate gaps as is done in the papers referred to above. So, short edges of a gene tree contribute less to the cost of the given gene and species trees comparison.
In the next section we will discuss a method of searching for a species tree S which would minimize the sum of all costs L(G,S) for all given gene trees G. According to our approach, edges with small lengths (they correspond to a short period of gene evolution) exert a weak influence on the final choice of S.
A stochastic algorithm of species tree reconstruction
The algorithm is based on the hypothesis that dissimilarity between gene trees constructed for different protein families is caused by the events of gene loss and duplication. Let a family of gene trees G1, G2 ,..., Gn be given. We proceed from the assumption that a species tree S is unknown and so we are faced with the problem of its reconstruction. We will find a species tree S as a consensus tree for all given gene trees G1, G2,..., Gn. Specifically, we will look for a species tree S so as to minimize the number of events explaining the dissimilarity between all gene trees and the unknown species tree. These events are gene duplications and losses. The differential cost functional measuring this dissimilarity is defined as F = c(S), where
is the total cost of the molecular events of gene duplications and losses during the evolution of species. We would like to find a species tree S by minimizing the value of this functional F. Theoretically speaking, the optimal tree S could be found by an exhaustive search through all possible binary trees built on a given set of species (as the set of terminal nodes). However in practice such a tree is impossible to find because the number of trees grows exponentially with the number of species. For this reason, we apply a specific option of local minimization. On the whole, the general idea of our approach is well known [Guigo et al., 1996]. As the search algorithm produces a local minimum depending on the initial species tree S0, we developed for this algorithm a new module that generates initial species trees. The search algorithm runs on a set of randomly generated initial species trees S0 (for example, 5000 trees). Any such tree S0 is transformed using the method of nearest neighbor interchange to obtain a local minimum of the functional F. To specify this local minimum we use an a priori probability distribution in the set of initial species trees. This probability distribution is defined automatically using 132 (or all known) gene trees as follows. Let it first be recalled that the distance between two nodes a and b is now defined as the number of edges in the path that connects them. For any species a, an empirical probability distribution p(b|a) is defined as the probability for both b and a to form an elementary two-element tree (i. e. to be located at a distance of 2).
It would be appropriate to define even more detailed conditional distributions (for small trees of species located at a distance of 3, 4 and so on) but this requires larger sets of initial data than we now have at our disposal. The required empirical probability distribution is defined using the statistics of distribution of genes in 132 COGs trees. Let Na be the number of COGs containing species a, and let Na,b be the number of COGs containing a and b located at a distance of 2. We define p(b|a) = Na,b / Na. Then
is the probability of the event that a forms a one-element elementary tree (i. e. no species are located at a distance of 2 from a). The initial species a is defined using a uniform pseudo random numbers generator, its neighbor b is defined using a generator of conditional probability distribution p(·|a). Then we repeat this procedure to generate the next pair, and so on. A random binary tree S0 is generated on the basis of these elementary trees. This tree S0 serves as an initial tree for the algorithm searching for the optimal tree that gives a local minimum to the cost functional F. A variety of initial trees generate a variety of resulting trees produced by a search algorithm. As the final result, we yield the consensus tree computed on the basis of a subset of these trees with a sufficiently small value of the functional F. The edges of this consensus tree are supplied with numbers that suggest the reliability of the corresponding clusters.
The resulting species tree obtained in this way explains evolutionary phylogenetic relationships of 40 living organisms mentioned above. This tree is close to the one obtained in [Wolf et al., 2001, approach (v)]. Our species tree shows a fairly good matching for pairs of nodes and a sufficiently good matching for the main groups of organisms mentioned above (Figure 1).
A method for selecting the genes that may have been horizontally transferred
Horizontal gene transfer (HGT) is transfer of genes between organisms without reproduction. There are several hypotheses concerning possible mechanisms of such transfer. To give an example, DNA can be transferred by an infected bacteriophage virus or by mating mediated by a plasmid [Lorenz and Wackernagel, 1994]. In this section we present two computer methods of selecting the genes that may have been horizontally transferred.
The first method is based on the comparison of the combinatorial structures of gene and species trees. We use the tree mapping
defined in the previous section and compare the neighborhood of a gene (in a gene tree) and its image under
(in a species tree). If two genes are located at a small distance in a gene tree but their images under a are dispersed (over a species tree), this may point to a possible horizontal transfer of one of the genes. We measure this dispersion by the relative size of the image under the mapping a of the gene neighborhood. More precisely, we proceed as follows.
Let g1, g2,...,gn be all genes (excluding g) located in the neighborhood of a gene g of a fixed radius r, and let s1, s2, ..., sn and s be the corresponding species (the images in the species tree of these genes). The distances r(g,gi) in the gene tree and the distances r(s,si) in the species tree are calculated, where i =1, ..., n, and then we calculate average values

The ratio p(g) = r(s)/r(g) reflects the degree of the average dissipation of the vicinity of the gene g in the species tree. Large values of this ratio can be interpreted as reflecting a pathology in the location of the gene g in the species tree. We compute p-values for the statistics p(g) by the formula

where m is the number of all terminal nodes. The computer program produces the list of genes with p-value not exceeding the given threshold, i. e. it selects all genes g for which p(g)
p0, where p0 is the threshold. These genes are considered as candidates for the analysis of the horizontal transfer. We also select all cases where genes g1, g2,..., gn from the neighborhood of a gene g are mapped by
in species s1, s2,..., sn which is a part of a taxonomical group that does not contain the species s (the image of the gene g). This implies that this group is a possible source of the horizontal transfer of the gene g. The computer program also lists information of the type r(g,gi) / r(s,si), i = 1,...n. This information is useful for expert analysis.
The second approach uses the mapping
and a differential cost of the functional F. We suppose that any horizontally transferred gene generates a series of invalid duplications and losses under the mapping
, which are needed to explain the dissimilarity between the gene and the species trees in our model of tree comparison. So, a temporary deletion of the transferred gene g from a gene tree G and an update of
after this deletion entails an essential decrease in the value of the functional F. At first, for any COGs tree G the cost F of the embedding of a gene tree into a species tree is calculated. After that we remove for a moment each gene g from the gene tree G to obtain the reduced gene tree Gg and compute the cost Fg of the embedding of the gene tree Gg into the same species tree. The relative change of the cost of the embedding is calculated by the formula dFg = (Fg-F)/F. We compute also p-values for the statistics dFg for all genes g from a given COG G by the formula

Similarly, the computer program selects all genes g for which p(g)
p0, where p0 is the threshold. The mean and standard deviation of the statistics dFg are also computed. They can be used if the empirical distribution of statistics dFg is Gaussian.
Our data show a sufficiently good agreement with the hypothesis of normality for the empirical distribution of dFg for most COGs.
We also implement normalization of all gene trees. Some numbers (lengths reflecting the time of evolution of the corresponding genes) were assigned to the edges of maximum likelihood gene trees. The longest lengths have the biggest impact on the total value of the functional F. So wrong locations of the longest edges (which can occur due to an inaccuracy of the maximum likelihood method) are responsible for the most serious errors in the value of the functional. To correct these wrong influences, we normalize the too long lengths of the terminal edges in a gene tree.
Combined lists of genes which may have been horizontally transferred were formed using both methods. Only genes selected by both of them were included in the list of suspected genes in Table 1. Each gene in the list is supplied by confidence information. This information can serve as an additional tool for an expert analyzing the molecular events in the process of evolution.
| Table 1: List of all genes produced by our algorithm. |
| COG name [species mark], gene mark |
value of statistics dFg · 100 (p-value) |
value of statistics r(s)/r(g) (p-value) |
value of (mean - sigma)/ sigma1 |
group of organisms2 | distance to neighbors in gene (numerator) and species (denominator) trees3 |
| COG0012 [Syn]sll0245 |
-3.78% (0.1) | 2.67 (0.1) | -1.37 | (3/8 3/8) | |
| COG0012 [Buc]BU191 |
-10.74% (0.02) | 3.14 (0.03) | -3.73 | chlamydiae | (3/12 3/12 4/10 4/10) |
| COG0018 [Syn]sll0502 |
-7.07% (0.07) | 2.64 (0.1) | -1.74 | chlamydiae | (3/8 3/8 4/11 4/10) |
| COG0018 [Mtu]Rv1292 |
-10.74% (0.02) | 2.9 (0.05) | -2.62 | (2/10 4/10 4/9) | |
| COG0061 [Xfa]XF2090 |
-9.29% (0.03) | 1.89 (0.15) | -2.75 | alpha-proteobacteria | (2/6 3/6 4/5) |
| COG0072 [Mtu]Rv1650_2 |
-11.45% (0.03) | 2.16 (0.03) | -3.49 | alpha-proteobacteria | (3/9 4/9 4/10 4/10 4/3) |
| COG0080 [Bha]BH0409 |
-6.91% (0.02) | 2.17 (0.02) | -2.77 | spirochaetes | (4/10 4/10 4/6) |
| COG0085 [Aae]aq_1939 |
-15.91% (0.03) | 2.33 (0.08) | -3.77 | epsilon-proteobacteria | (3/7 3/7) |
| COG0102 [Dra]DR0174 |
-12.29% (0.03) | 2.8 (0.03) | -4.43 | gamma-proteobacteria | (2/9 4/11 4/8) |
| COG0126 [Aae]aq_118 |
-11.59% (0.03) | 2.0 (0.05) | -2.87 | (3/7 3/7 4/7 4/7) | |
| COG0143 [Mtu]Rv1007c |
-8.95% (0.05) | 3.8 (0.05) | -2.32 | alpha-proteobacteria | (2/10 3/9) |
| COG0143 [Mlo]mlr5926 |
-12.92% (0.02) | 4.6 (0.02) | -3.45 | (2/11 3/12) | |
| COG0162 [Pae]PA4138 |
-8.12% (0.02) | 2.22 (0.05) | -3.05 | (2/7 3/7 4/6) | |
| COG0173 [Mtu]Rv2572c |
-17.47% (0.03) | 2.37 (0.03) | -2.89 | alpha-proteobacteria | (3/9 4/8 4/10 4/10 4/8) |
| COG0178 [Hbs]VNG2636G |
-8.38% (0.03) | 2.6 (0.03) | -2.18 | DMS | (2/8 4/9 4/9) |
| COG0193 [Dra]DR2372 |
-16.18% (0.03) | 2.67 (0.03) | -2.41 | gamma-proteobacteria | (2/7 4/9) |
| COG0198 [Bbu]BB0489 |
-15.81% (0.03) | 2.63 (0.03) | -4.23 | gamma-proteobacteria | (3/7 4/12 4/10 4/12 4/9) |
| COG0200 [Bbu]BB0497 |
-6.64% (0.05) | 2.36 (0.08) | -2.75 | (3/8 4/9 4/9) | |
| COG0203 [Mtu]Rv3456c |
-11.92% (0.03) | 2.67 (0.03) | -2.73 | (3/8 3/8) | |
| COG0215 [Xfa]XF0995 |
-13.86% (0.05) | 1.89 (0.05) | -2.5 | alpha-proteobacteria | (2/6 3/6 4/5) |
| COG0215 [Hbs]VNG1097G |
-20.64% (0.03) | 2.5 (0.03) | -3.78 | DMS | (2/8 4/9 4/9 4/9) |
| COG0221 [Mlo]mlr8562 |
-9.12% (0.03) | 3.29 (0.1) | -2.24 | (3/10 4/13) | |
| COG0222 [Dra]DR2043 |
-10.31% (0.03) | 3.4 (0.03) | -2.23 | alpha-proteobacteria | (2/9 3/8) |
| COG0242 [Syn]slr1549 |
-10.14% (0.03) | 3.2 (0.03) | -2.64 | spirochaetes | (2/8 3/8) |
| COG0250[Aae]aq_1931 | -9.29% (0.02) | 2.2 (0.09) | -3.55 | (3/7 3/7 4/8) | |
| COG0272 [Rpr]RP720 |
-8.8% (0.1) | 1.67 (0.16) | -1.16 | (3/5 3/5) | |
| COG0272 [Eco]yicF |
-29.13% (0.03) | 4.8 (0.03) | -4.22 | spirochaetes | (2/12 3/12) |
| COG0292 [Tma]TM1592 |
-20.24% (0.03) | 2.8 (0.03) | -3.04 | DMS | (2/7 3/7) |
| COG0294 [Ccr]CC3224 |
-7.61% (0.06) | 3.22 (0.06) | -1.6 | DMS | (3/10 3/10 3/9) |
| COG0335 [Dra]DR0755 |
-6.19% (0.07) | 3.6 (0.03) | -1.7 | chlamydiae | (2/9 3/9) |
| COG0343 [Afu]AF1485 |
-8.58% (0.06) | 3.33 (0.03) | -2.2 | chlamydiae | (3/10 3/10) |
| COG0359 [Aae]aq_2042 |
-13.19% (0.03) | 2.8 (0.03) | -2.64 | DMS | (2/7 3/7) |
| COG0441 [Ape]APE0809 |
-14.09% (0.02) | 3.0 (0.07) | -3.25 | DMS | (2/8 3/7) |
| COG0452 [Bbu]BB0812 |
-8.69% (0.03) | 2.33 (0.03) | -2.34 | (2/7 4/7) | |
| COG0504 [Tpa]TP0305 |
-21.5% (0.03) | 3.2 (0.05) | -3.95 | DMS | (2/8 3/8) |
| COG0525 [Rpr]RP687 |
-16.16% (0.03) | 3.67 (0.03) | -3.99 | (3/12 3/12 3/9) | |
| COG0547 [Cje]Cj0346_2 |
-12.66% (0.03) | 3.6 (0.09) | -2.6 | (2/9 3/9) | |
| COG0552 [Dra]DR2260 |
-13.72% (0.03) | 2.27 (0.05) | -3.14 | alpha-proteobacteria | (3/8 4/9 4/9 4/8) |
| COG0556 [Hbs]VNG2390G |
-9.23% (0.03) | 2.33 (0.03) | -1.93 | DMS | (3/9 3/8 3/4) |
| COG0571 [Syn]slr0346 |
-19.72% (0.07) | 2.67 (0.1) | -2.43 | chlamydiae | (3/8 3/8 3/8) |
| COG0587 [Bsu]BS_yorL |
-10.28% (0.03) | 3.11 (0.03) | -2.5 | (3/10 3/10 3/8) | |
|
1 mean is the average value of statistics dFg for the COG, and sigma is its standard deviation
2 source of the horizontal transfer 3 for the neighborhood of the radius 4, i. e. r(g,gi)/ r(s,si) | |||||
Results of species trees reconstruction
As was mentioned in the Introduction, the initial data include 132 maximum-likelihood trees constructed on the basis of 132 clusters of orthologous groups of proteins (COGs). Genome sequences extracted from 40 living organisms form the following groups:
The corresponding 132 maximum-likelihood protein trees were used as the initial data for our method of constructing a census species tree. The search algorithm runs on a set of 5000 randomly generated initial species trees. Two sets of resulting species trees were selected:
After that, consensus trees for these sets were constructed using the program CONSENS from the PHYLIP package [Felsenstein, 1989]. These two trees have the same structure and are a very good support for the main groups of species (Fig. 1). The only difference between our tree and the best species tree from Wolf et al. (2001) is the relative position of epsilon-proteobacteria group and (Aae,Tma)-pair. We believe that the difference may be attributable to a slight setback in our algorithm.
|
Figure 1: The consensus tree for the best 182 trees constructed by our algorithm. Our tree practically coincides with the tree from [Wolf et al., 2001, approach (v)]. |
HGT in selected COGs
All 132 COGs were studied. 15 COGs failed the quality test and consequently were excluded (our quality test for COGs is beyond the scope of this paper). In the remaining 117 COGs, 47 suspected genes (from 35 COGs) were selected. We only consider HGT between organisms from different groups of the species tree. Our algorithm found the following sources of HGT: chlamydiae (7 HGT), alpha-proteobacteria (7 HGT), gamma-proteobacteria (3 HGT), epsilon-proteobacteria (2 HGT), spirochaetes (3 HGT), DMS (8 HGT), archaea (1 HGT). No sources for the remaining part of HGT were found. All cases of HGT that were found by our algorithm are presented in Table 1. We will now give some of them a more detailed consideration.
| Figure 2: The examples of horizontal transferring events. |
A well known method of comparing the gene and species trees based on simple counting of the number of duplications and losses was modified. Our method considers the lengths of the edges that define duplication or loss events. The local minimization algorithm of the nearest neighbor branch swapping supplied with a module generating random initial species trees yields species trees similar to the best known trees. When special a priori probability distribution over the trees, computed in advance, is used to generate random initial trees, the resulting consensus tree remains robust under different long runs of the algorithm. These results show that our model of gene duplication and gene loss and the corresponding algorithm present a sufficiently adequate tool for tree comparison and can be used for the detection of horizontal gene transfer.
The algorithms for the prediction of horizontal gene transfer were developed. Each of them produces a specific list of genes that are located at the bottom of the essential dissimilarity of gene and species trees detected by the corresponding algorithm. The final list (presented in Table 1) consists of genes common to both specific lists. We suppose that most of the genes from this list were horizontally transferred during the process of evolution. But the final solution should be made by an expert who would use our data.
We thank M. Gelfand for very useful discussions and explanations. We are also grateful to A. Koonin and Y. Wolf who let us have the initial data. We highly appreciate very useful remarks of the reviewer that helped us to fully express our ideas.