In Silico Biology 5, 0025 (2005); ©2005, Bioinformation Systems e.V. |
^{1} Institute of Computational Modeling, Russian Academy of Science, Russia
^{2} Centre for Mathematical Modelling, University of Leicester, UK
^{3} Institut des Hautes Etudes Scientifiques, Bures-sur-Yvette, France
^{4} Service Bioinformatique, Institut Curie, Paris, France
* Corresponding author
Email: andrei.zinovyev@curie.fr
Edited by E. Wingender; received November 12, 2004; revised January 25, 2005; accepted January 30, 2005; published March 02, 2005
Coding information is the main source of heterogeneity (non-randomness) in the sequences of microbial genomes. The heterogeneity corresponds to a cluster structure in triplet distributions of relatively short genomic fragments (200-400bp). We found a universal 7-cluster structure in microbial genomic sequences and explained its properties. We show that codon usage of bacterial genomes is a multi-linear function of their genomic G+C-content with high accuracy. Based on the analysis of 143 completely sequenced bacterial genomes available in Genbank in August 2004, we show that there are four "pure" types of the 7-cluster structure observed. All 143 cluster animated 3D-scatters are collected in a database which is made available on our web-site (http://www.ihes.fr/~zinovyev/7clusters). The findings can be readily introduced into software for gene prediction, sequence alignment or microbial genomes classification.
Keywords: word frequency, codon usage, clustering, visualization, symmetry
Bacterial genomes are compact ones: most of the sequence contains coding information. Hence any statistical study of bacterial genomic sequence will detect coding information as the main source of heterogeneity (non-randomness). This is confirmed by mining sequences "from scratch", without use of any biological information, using entropic or Hidden Markov Modeling (HMM) statistical approaches (for examples, see [1, 2, 3, 4]). Some of these methods can be seen as specific clustering of relatively short genomic fragments of length in the range 200-400bp comparable to the average length of a coding information piece.
Surprisingly, not much is known about the properties of the cluster structure itself, independently of gene recognition problems, where it is implicitly used since long time ago (see, for example, the early paper about the famous GENMARK gene-predictor [5] or [6] about the GLIMMER approach).
The idea of separating a microbial genome into seven homogeneous parts corresponding to six possible frameshifts and non-coding regions is, of course, not at all new. Nevertheless, existence of clusters (the fact that the clusters are indeed well separated) is a purely empirical fact, and one of the best ways to prove it is to perform cluster structure visualization.
Only recently the 7-cluster structure was described explicitly. In [7, 8, 9] and [10] the structure was visualized in the 64-dimensional space of non-overlapping triplet distributions for several genomes. Also the same dataset was visualized in [11] and [12] using non-linear principal manifolds. In [13] several particular cases of this structure were observed in the context of the Z-curve methodology in the 9-dimensional space of Z-coordinates: it was claimed that the structure has an interesting flower-like pattern but can be observed only for GC-rich genomes. This is somehow in contradiction with the results of [9], published before, where the same flower-like picture was demonstrated for AT-rich genome of Helicobacter pylori. This fact shows that this simple and basic structure is far from being completely understood and described.
The problem can be stated in the following way: there is a set of genomic fragments of length 100-1000 bp representing a genome almost uniformly. There are various ways to produce this set, for example, by sliding window with a given step of sliding (in this case sequence assembly is not generally needed), or it might be a full set of ORFs (in this case one needs to know the assembled sequence). We construct a distribution of points in a multidimensional space of statistics calculated on the fragments and study the cluster structure of this distribution. The following questions arise: What is the number of clusters? What is the character of their mutual locations? Is there a "thin structure" in the clusters? How the structure depends on the properties of genomic sequence, can we predict it?
One natural way to describe composition of a text fragment is a "frequency dictionary" of short words (see examples in [14, 15, 16, 17]). For our purposes we use frequencies of non-overlapping triplets, counted from the first basepair of a fragment. Thus every fragment is a point in 64-dimensional space of triplet frequencies. The cluster structure we are going to describe is universal in the sense that it is observed in any bacterial genome and with any type of statistics which takes into account coding phaseshifts. The structure is basic in the sense that it is revealed in the analysis in the first place, serving as the principal source of sequence non-randomness. In [7, 8, 9, 13] it was shown that even simple clustering methods like K-Means or Fuzzy K-Means give good results in application of the structure to gene-finding.
One example of the observed 7-cluster structure is shown on Fig. 1. In short, this is a PCA (Principal Components Analysis) plot of the point distribution. In Fig. 1 and further in this paper F0 stands for the (spatial) center of the group of "coding" fragments of F-type in which non-overlapping triplets have been red in the correct frame. The center is calculated as a simple mean point and it is a 64-dimensional vector. F1 and F2 correspond to the fragments where the triplets have been red with a frameshift (on one or two positions). Analogously, the B0, B1 and B2 labels stand for the centers of B-type fragment groups, where the triplets have been red with one of three possible frameshifts, respectively.
Referring for the details of the visualization to the Methods section, we stop now on basic properties of the structure. First, it consists of 7 clusters. This fact is rather natural. Indeed, we clip fragments only from the forward strand and every fragment can contain 1) a piece of coding region from the forward strand, with three possible shifts relatively to the first fragment position; 2) coding information from the backward strand, with three possible frameshifts; 3) non-coding region; 4) mix of coding and non-coding information: these fragments introduce noise in our distribution, but their relative concentration is not high. Second, the structure is well pronounced, the clusters are separated from each other with visible gaps. This means that most of learning (and even self-learning) techniques aiming at separation of the clusters from each other will work very well, which is the case for bacterial gene-finders that have performance more than 90% in most cases (for recent overview, see [18]). Third, the structure is well represented by a 3D-plot (in this case it is even almost flat, i. e. 2D). Fourth, it has indeed symmetric and appealing flower-like pattern, indicating that there should be a symmetry in our statistics governing the pattern formation.
In this paper we show how the structure depends on very general properties of genomic sequence and that it almost uniquely depends on a single parameter: the genomic G+C content. Also, based on analysis of 143 completely sequenced genomes, available in Genbank in August 2004, we describe four "pure" types of the structure observed in bacterial genomic sequences.
The outline of the paper is the following: after description of the methods utilized we introduce phaseshift and complementary reverse operators, helping to describe the structure, after we show that in nature we have almost one-dimensional set of triplet distributions. After that we explain the properties of the 7-cluster structure and describe four "pure" types of the structures, observed for bacterial genomes.
143 microbial genomes (124 eubacteria and 19 archaea) were selected from Genbank in August 2004. To reduce the bias connected with sequencing projects for some species of specific interest, only one genome per species was selected (the choice was done considering the annotation length). For every Genbank file the full annotation for CDS was considered, including those CDS marked as hypothetical or computationally predicted. No discrimination on CDS length was made.
For visualization of the 7-cluster structure a data set for each of the 143 genomes was prepared in the following way:
1) A Genbank file with the completed genome was downloaded from the Genbank FTP-site. Using BioJava package the complete sequence and the annotation was parsed. In the case when the genome had two chromosomes, the sequences of both were concatenated. Plasmid sequences were ignored.
2) Let N be length of a given sequence S and S_{i} be a letter in the i-th position of S. We define a step size p and a fragment size W. For given p, W and we clip a fragment of length W in the sequence, centered in i = w/2 + pk. In every fragment we count frequencies of non-overlapping triplets:
(1) |
where comp = (word1, word2) is a string comparison function that has value 1 if word1 equals word2 and 0 otherwise. Here c_{i} is a letter from genetic alphabet (c_{1} = A, c_{2} = C, c_{3} = G, c_{4} = T).
For every fragment a frequency vector is defined:
(2) |
All words that contain non-standard letters like N, S, W, are ignored.
The data set , is normalized to have unity standard deviation and zero mean.
3) We assign to the fragment a label according to the Genbank annotation of the CDS features (including hypothetical ones) that include the center of the fragment. If the center is inside a CDS feature then the reading frame and the strand of the CDS feature are determined and the fragment is assigned one of F0,F1,F2,B0,B1,B2 label. In the second case the label is J.
4) Standard PCA-analysis is performed and the first three principal components are calculated. They form a 3D-orthonormal basis in the 64-dimensional space. Every point is projected into the basis, thus we assign three coordinates for every point.
To create a schema of 7-cluster structure (see Fig. 1) the following method was utilized. We calculated the mean point y^{L} for every subset with a given label L.
For the set of centroids y^{F}^{0}, y^{F}^{1}, y^{F}^{2}, y^{B}^{0}, y^{B}^{1}, y^{B}^{2} a distance matrix of Euclidean distances was calculated and visualized using classical MDS.
To visualize the "radii" of the subsets, a mean squared distance d^{(}^{p}^{)} to the centroid p was calculated (intraclass dispersion). To visualize the value on 2D plane, we have to introduce dimension correction factor, so the radius drawn on the picture equals
(3) |
The form of the cluster is not always spherical and often intersection of radii do not reflect real overlapping of classes in high-dimensional space. To show how good the classes are actually separated, we developed the following method for cluster contour visualization. To create a contour for class p, we calculate averages of all positive and negative projections on the vectors connecting centroid p and 6 other centroids i = 1 .. 6.
(4) |
(5) |
Then, using the 2D MDS plot where every vector (y^{ p})' has 2 coordinates, we put 12 points t^{ f}, t^{ b} analogously.
(6) |
(7) |
Using a smoothing procedure in polar coordinates we create a smooth contour approximating these 12 points.
Let us denote frequencies of non-overlapping triplets for a given fragment as f_{ijk}, where i, j, k {A, C, G, T}, such as f^{ ACT}, for example, is a relative (normalized) frequency of ACT triplet.
One can introduce such natural operations over frequency distribution as phase shifts P^{(1)}, P^{(2)} and complementary reversion C^{R}:
(8) |
where is complementary to i, i. e., = T, = G, etc.
The phase-shift operator P^{(}^{n}^{)} calculates a new triplet distribution, counted with a frame-shift on n positions, in the hypothesis that no correlations exist in the order of triplets in the initial phase. Complementary reversion constructs the distribution of codons from a coding region in the complementary strand, counted from the forward strand ("shadow" frequency distribution).
Phaseshift operators approximate the shifted triplet frequency as superposition of a phase-specific nucleotide frequency and a dinucleotide frequency. This can be better understood if we rewrite definitions (8) in the following way:
(9) |
We introduce the notion of mean-field (or context free) approximation of the triplet distributions in the following way:
(10) |
i. e. the mean-field approximation is the distribution constructed from the initial triplet distribution neglecting all possible correlations in the order of nucleotides. The p_{i}^{(}^{k}^{)} are the frequencies of the i-th nucleotide (i {A,C,G,T}) at the k-th position of a codon (k = 1..3). In this way we model the 64 frequencies of the triplet distribution using only 12 frequencies of the three position-specific nucleotide distributions. This approximation is widely used in the literature (see, for example, [3, 19]). All triplet distributions that can be represented in the form (10) belong to a 12-dimensional curved manifold M, parametrized by 12 frequencies p_{i}^{(}^{k}^{)}. The manifold is embedded in the 64-dimensional space of all possible triplet distributions T. The normalization equality makes all distributions to form a standard 63-dimensional simplex in R^{64}. For M one has 3 independent normalizations: , these equalities distinguish a 9-dimensional set (image of the product of three 3-dimensional standard simplexes) in M, where all normalized distributions are located.
It is easy to understand that any phase-shift for m_{ijk} only rotates the upper (position) indexes:
(11) |
Also it is worth noticing that applying the P^{(1)} (or P^{(2)} operator several times to the initial triplet distribution we get the
(12) |
Operator acts as a projector operator from full 64-dimensional distribution space onto the 12-dimensional manifold M:
(13) |
On the manifold M of all possible m_{ijk} we have P^{(2)} = (P^{(1)})^{2}, therefore, there are only two operators: phaseshift P : Pm_{ijk} = m_{jki} and reversion C : . There are the following basic equalities:
(14) |
Let us consider a point m on M. It corresponds to a set of 12 phase-specific nucleotide frequencies p_{i}^{(1)}, p_{i}^{(2)} and p_{i}^{(3)}, i {A, C, G, T}. Applying operators P and C in all possible combinations we obtain an orbit on M, consisting of 6 points: m, Pm, P^{2}m, Cm, PCm, P^{2}Cm. Theoretically, some points can coincide, but only in such a way that the resulting orbit will consist of 1 (fully degenerated case), 3 (partially degenerated case) or 6 (non-degenerated case) points. The fully degenerated case corresponds to the triplet distribution with the highest possible entropy among all distributions with the same nucleotide composition:
(15) |
This distribution (completely random) is described by 4 nucleotide frequencies, with any information about position in the triplet lost. In our T space it is a 3-dimensional (due to normalization equality) simplex on M. For bacterial genomes this distribution can serve as an approximate (zero-order accuracy) model for triplet composition in non-coding regions.
Now let us consider triplet distributions corresponding to the codon usage of bacterial genomes, i. e. the subset of naturally occured triplet distributions. It was found that they could be reasonably well approximated by their mean-field distributions (see the next section), i. e. they are located close to M. Moreover, in the next section we show that in nature, for 143 completely sequenced microbial genomes, the m_{ijk} distributions are tightly located along two one-dimensional curves on M. The curves can be parametrized by the genomic G+C-content.
Twelve dependencies p_{i}^{(1)}(GC), p_{i}^{(2)}(GC) and p_{i}^{(3)}(GC), i {A, C, G, T}, where GC is genomic G+C-content, are presented in Fig. 2a-d for 143 fully sequenced bacterial genomes available in Genbank in August, 2004. These dependencies are almost linear.
Fig. 2e demonstrates that the codon position-specific G+C-content is a linear function of genomic G+C-content, in each position. In [13] analogous results were shown for 33 completed genomes.
The dependencies shown in the figures were already discussed in literature in different forms and aspects. In [19] the authors created a "heuristic" estimation of Hidden Markov Model parameters using codon position-specific nucleotide frequencies calculated from sufficiently long pieces of a genomic sequence. On 10 eubacterial genomes they demonstrated that the p_{A}^{(}^{k}^{)}(p_{A}), p_{C}^{(}^{k}^{)}(p_{C}), p_{G}^{(}^{k}^{)}(p_{G}) and p_{T}^{(}^{k}^{)}(p_{T}), k {1, 2, 3}, dependencies are very close to linear. Thus, it was shown in [19] that the codon usage of microbial genomes is a 4-parametric multi-linear function. In the case p_{A} = p_{T}, p_{C} = p_{G} (PR2-parity condition), these graphs are equivalent to Fig. 2, but many microbial genomes are shown to be not PR2-pared (see [20], for example). Fig. 2 shows that the linear dependencies on GC-content are highly significant.
It was shown that many properties of the codon usage are correlated with genomic G+C-content. For example, the strong correlation between the amino-acid composition and genomic G+C content was proven in [21] for 59 bacterial species.
The codon usage bias has been widely reported to correlate with G+C composition, and recently the quantitative regression between codon usage bias and GC3 (G+C-content in the third position) was published [22]. The regression equations are based on 70 eubacterial and 16 archaeal genomes.
Some fragments of observed correlations can be found in the Sueoka's neutrality plots [23, 24]. A theory of directional mutation pressure was proposed in 1962 [25]. It explained the wide variation of DNA base composition observed among different bacteria and its small heterogeneity within individual bacterial species. The theory was based on the assumption that the effect of mutation on a genome has a directionality toward higher or lower G+C content of DNA, and this pressure generates directional changes more in neutral parts of the genome than in functionally significant parts.
For analysis of codon bias evolution a population genetic model is developed taking into account population size and selective differences between synonymous codons [26].
Following the Sueoka theory, in regression analysis one mostly uses GC3 (the G+C content in the third position), and not the overall genomic G+C content. The reason is that GC1 and GC2 are under strong selection control, while for GC3 and G+C content of intergenic regions this control is much weaker, and the the overall genomic G+C content is the linear combination of these quantities. Sometimes, the difference between the GC3 and overall genomic G+C content as reference variable might be significant, but in our case, as it is presented on Fig. 2e, the correlation between GC3 and genomic G+C content is strong, one of them is practically a linear function of the other, and both of them can serve as reference variables with the same success.
It is necessary to mention that for the moment we don't know any theory that would give a solid explanation of the observed accuracy and linearity of the dependencies, presented in Figs. 2. Why the bacterial genomes form a straight line in the 9-dimensional space of codon-position specific nucleotide frequencies? We don't know.
It seems natural to apply "mutation+selection" arguments and models [23, 25, 26, 27]. Such models are in good agreement with some data of codon usage [24] (and some quantitative discrepancies [26] and doubts are also reported, even for the problem of genetic code optimality [28]). The problem is that we need to prove the models on another material, it is desirable to verify them independently, by direct measurements. It should be proven that the mutation+selection processes have kinetic constants that could provide such an accuracy, and that the whole process keeps all genomes near the observed straight line. And now we can only ask again, for the new data [29]: Codon usage: mutational bias, translational selection, or both? (Or anything else?)
For our purposes, it is important to notice that for the genomes with G+C-content higher than 60% there is the same well-defined structure in their codon usage: the G+C-content in the first codon position is close to the average of three values, the second is lower than the average and the third is essentially higher than the average. This pattern can be denoted in the form of simple GC-signature: 0 − +. In the next section we develop more complicated signature to classify 7-cluster structures, corresponding to the orbits, generated by P and C operators in a set of genomic fragments of 300-400bp length.
The linear functions, describing codon usage, are slightly different for eubacteria and achaea genomes. Significant deviations are observed for p_{A}^{(1)}, p_{C}^{(1)}, p_{G}^{(3)}, p_{G}^{(1)}, p_{T}^{(3)} functions. For the others, the dependencies are statistically indistinguishable.
A geometrical interpretation of Fig. 2 is the following: if we take the set of triplet frequencies, occurring in nature and corresponding to the codon usage of bacterial genomes, then in the 12-dimensional space of codon position-specific nucleotide frequencies this set appears almost as a straight line (more precisely, two close lines, one for eubacteria and the other for archaea). If we look at this picture from the 64-dimensional space of triplet frequencies T, then one sees that the distributions are located close to the curved M manifold of the mean-field approximations, embedded in the space. As a result, when analyzing the structure of the distribution of bacterial codon usage, one detects that the points are located along two curves. Moving along these curves one meets all bacterial genomes. Genomes with close G+C-content generally have close codon usage. Evidences for this structure were reported in studies on multivariate analysis of bacterial codon usage (for example, see Figure 6 from [30]).
To give a feeling of how different approximations work on different genomes, we provide Fig. 3, where different approximations are shown for three genomes (W. brevipalpis with the lowest G+C-content from our set, S. coelicolor with the highest G+C-content and E. coli which is in the middle of the scale).
Figure 3: Three examples of various codon usage approximations. Four triplet distributions are shown, for the extremes and the middle of the GC-scale. The codons are ordered alphabetically (for example, codons 49, 51 and 57 are the stop codons). Average (mn) distribution is f_{ijk} = p_{i} p_{j} p_{k}. 1D approximation is calculated using formulas from Fig. 2. |
In Fig. 4 we show the values of Kullback distance (entropy-based distance) between various codon usage approximations for 143 bacterial genomes plotting them against the value of G+C-content of the coding regions. To understand the scale, we mention that the "cu-mf" distances calculated for a very strongly correlated (although purely imaginal) triplet distribution f_{AAA} = p_{A}, f_{TTT} = p_{T}, f_{CCC} = p_{C}, f_{GGG} = p_{G} and f_{ijk} = 0 otherwise, would give values around 3 (not shown on the figure). "Random" interval shown on Fig. 4 represents the average and the standard deviation for imaginal triplet frequency distribution r_{ijk}, constructed in the
following way: we take random value v_{ijk} = rand(0,1), where rand(0,1) is a uniformly distributed random distribution and after construct
.
Figure 4: Graph of Kullback distances for 143 bacterial genomes between codon usage (cu) and various triplet distributions (mf is mean-field approximation, mn is the average distribution f_{ijk} = p_{i} p_{j} p_{k}, m0 is completely random distribution f_{ijk} = 1/64, 1d is distribution calculated with use of formulas from Fig. 2); "Random" interval shows the average and the standard deviation for imaginal triplet "random" frequency. |
As one can see from Fig. 4, the information contained in the nucleotide correlations in the coding regions ("cu-mf" distance) does not depend on GC-content having a constant value comparable with purely "random" correlation level appearing simply by chance in a "randomly" taken triplet frequencies (even slightly less). The absolute value of accuracy of one dimensional approximation also does not depend on G+C-content. Nevertheless, one should rather compare the accuracy ("mf-1d" distance) as well as mean-field approximation accuracy ("cu-mf" distance) relatively to the absolute information content ("cu-m0") of codon usage. It means that effectively the one-dimensional codon usage approximation is worse for the region of GC 50%, in correspondence to Fig. 3 and other studies.
The plots in Fig. 4 do not show any signs of difference between archaeal and eubacterial genomes (though it is not shown in the plots).
In [13] the authors claim that the codon position-specific nucleotide frequencies (represented as z-coordinates) in GC-rich genomes show flower-like cluster structure, and the phenomenon is not observed in other genomes. Here we explain the phenomenon and demonstrate other types of structures observed in genomes and that the type of the structure is related to the pattern of symmetric properties of codon usage.
First of all, we point out the fact that the space used in [13] is a specific projection from 64-dimensional space of triplet frequencies. The 9-dimensional phenomenon can also be observed in the 64-dimensional space and in the 12-dimensional space of codon position-specific nucleotide frequencies.
Let us consider the context free approximation of codon usage introduced above:
(16) |
and consider 3D space with the following coordinates:
(17) |
In fact, x, y and z are deviations of GC-content in the first, second and the third position from average GC-content f_{GC} of coding regions. In all GC-rich genomes (starting from f_{GC} > 60%) their codon usage context-free approximation has the following structure (see Fig. 2e): x 0, y < 0, z > 0. We can denote this pattern as 0 − +. Applying phaseshift and reverse operators defined above (notice that C operator does not change G+C-content, it only reverses the signture), we obtain the following orbit: {0 − +, − + 0, + 0 −} and {+ − 0, − 0 +, 0 + −}. If we consider now a 3D grid consisting of 27 nodes as shown in Fig. 5, corresponding to all possible patterns (GC-signatures), then it is easy to understand that the orbit corresponds to the points of where the grid is cross-sectioned by a plane, coming through the point perpendicular to the {−−−, +++} diagonal. It is a well-known fact that in this situation the form of the intersection is a regular hexagon. The point in our picture corresponds to the center of the non-coding cluster (this is the fully degenerated distribution described above), where all phases have been mixed. The {−−−, +++} diagonal corresponds to the direction of the fastest G+C-content increase. Hence, this model explains the following features of the flower-like structure observed in GC-rich (G + C > 60%) genomes:
1) In the 64-dimensional space the centers of clusters are situated close to a distinguished 2D-plane, forming a regular hexagonal structure.
2) The third principal component (perpendicular to the cluster plane) is the direction of G+C-content increase (i. e., the gradient of G+C-content linear function, defined in the 64-dimensional triplet space).
Figure 5: Model of the flower-like cluster structure. The broken line corresponds to the direction of the fastest G+C-content increase. |
In most flower-like structures the cluster that corresponds to the non-coding regions is slightly displaced in the direction perpendicular to the main cluster plane. This happens because G+C-content of non-coding regions is generally slightly lower than of coding regions.
Now let us consider the general case of a genome with any given genomic G+C-content. The type of the 7-cluster structure depends on the values of 12 functions p_{i}^{(1)}, p_{i}^{(2)}, p_{i}^{(3)}, i {A, C, G, T}. Applying phaseshift and reverse operators, one obtains an orbit which serves as a skeleton of the cluster structure. The orbit structure reflects symmetries in the set of values of these 12 functions with respect to the P and C operators.
We describe these symmetries in the following simplified manner. Let us order the 12 values in the form of a 6 x 2 table:
(18) |
Then the reverse operator C simply reads the table from right to the left:
(19) |
The phaseshift operator P rotates the values in the table by threes, for every letter:
(20) |
We reduce the description of s in the following way: every entry in the table is substituted by "+", "−" and "0", if the corresponding value is bigger than the average over the same letter frequencies, smaller than the average or in the interval [average − 0.01, average + 0.01], respectively. For example, for a set of frequencies p_{A}^{(1)} = 0.3, p_{A}^{(3)} = 0.5, p_{A}^{(3)} = 0.401, we substitute p_{A}^{(1)} −, p_{A}^{(2)} +, p_{A}^{(3)} 0. We call "signature" the new table with reduced description.
Using linear formulas from the Fig. 2a-d and calculating the tables for the range [0.2; 0.8] of G+C-content, we obtain 19 possible signatures in the intervals of genomic G+C-content, listed in Table 1.
Table 1: | Nineteen possible signatures for one-dimensional codon usage model |
−−+−−+ +−−++− |
[0.200; 0.255) | 000−++ +−−0+− |
[0.331; 0.373) | 0+−−++ +−−−0+ |
[0.434; 0.482) |
−−+−−+ +−−0+− |
[0.255; 0.265) | 0+0−++ +−−0+− |
[0.373; 0.385) | 0+−−++ +−0−0+ |
[0.482; 0.487) |
−−+−0+ +−−0+− |
[0.265; 0.289) | 0+−−++ +−−0+− |
[0.385; 0.388) | 0+−−++ +−0−−+ |
[0.487; 0.502) |
−0+−0+ +−−0+− |
[0.289; 0.316) | 0+−−++ +−−−+− |
[0.388; 0.391) | 0+−−+0 +−0−−+ |
[0.502; 0.515) |
00+−0+ +−−0+− |
[0.316; 0.326) | 0+−−++ +−−−+0 |
[0.391; 0.424) | ++−−+0 +−0−−+ |
[0.515; 0.542) |
000−0+ +−−0+− |
[0.326; 0.331) | 0+−−++ +−−−00 |
[0.424; 0.434) | ++−−+0 +−+−−+ |
[0.542; 0.545) |
++−−+− +−+−−+ |
[0.545; 0.800) |
There are 67 different signatures observed for really occurred p_{i}^{(}^{k}^{)}-values for 143 genomes considered in this work (see our web-site with supplementary materials). Most of them differ from the signature in Table 1 with corresponding G+C value only by changing one of the "+" or "−" for "0" or vice versa.
From Table 1 one can see that the only conserved positions, independent on genomic G+C-content for the interval [0.2; 0.8] are p_{T}^{(1)} (always "−"), p_{G}^{(1)} (always "+"), p_{G}^{(2)} (always "−"). This holds true also for all really observed signatures. This observation confirms already known "invariants" of codon usage described in [31, 32, 33].
Let us look at several typical examples (see Fig. 6). First let us make clear the abbreviations. Triangles F0-F1-F2 and B0-B1-B2 denote centers of point subsets calculated as described in the Methods section. They can be approximated from known codon usage f_{ijk} by the following expressions: F0 f_{ijk}, F1 P^{(1)}f_{ijk}, F2 P^{(2)}f_{ijk}, B0 C^{(}^{R}^{)}f_{ijk}, B1 P^{(1)}C^{(}^{R}^{)}f_{ijk}, B2 P^{(2)}C^{(}^{R}^{)}f_{ijk}.
All genomes with genomic G+C-content higher than 60% have the following genomic signature:
(21) |
This signature uniformly reflects the previously mentioned GC-signature ("0 − +"): pairs p_{A}^{(1)}, p_{T}^{(1)} and p_{C}^{(1)}, p_{G}^{(1)} compensate the signs of each other to give "0" in the first position of GC-signature, while in the second position we have "+" for A and T and "−" for G and C, and vice versa for the third position. As a result, we obtain the flower-like structure. In Fig. 6 we visualize the orbit for Streptomyces coelicolor, with a genome of high G+C-content (72%). Together with the orbit we visualize the distance matrix for the skeleton, where the distances are calculated in the full 64-dimensional triplet frequency space T. Black color on the plot corresponds to zero distance, white for the biggest value in the matrix. The most informative 3 x 3 block of the matrix is in the left bottom corner (or top right, by symmetry): it describes mutual distances between the vertexes of two skeleton triangles. The left top and right bottom 3 x 3 blocks contain equal values, since the sides of the triangles have the same length.
Our second example is the genome of Fusobacterium nucleatum (AT-rich genome, G+C-content is 27%), Fig. 6b. The signature is
(22) |
This pattern, commonly observed in AT-rich genomes, can be called "parallel triangles". Notice that two parallel triangles are rotated with respect to their corresponding phase labels: the F0 vertex is located in front of the B1 vertex.
The third example is genome of Bacillus halodurans (G+C-content is 44%):
(23) |
We refer to this pattern as "perpendicular triangles". Another example of the pattern is the genome of Bacillus subtilis. All non-diagonal distances in the distance matrix have in this case approximately the same value. This structure can be easily recognized from its signature: the second row has three zeros while the first one is almost palindromic. As we will see in the next example, palindromic rows in the signature (or such that can be made palindromic applying the phaseshift operator) make zero contribution to the diagonal of the "inter-triangle" part of the distance matrix. This is easy to understand, because the reverse operator reads the signature from right to the left. The rows with three zeros in different phase positions (when, for example, the phase specific nucleotide frequencies for one letter are equal to their average, as happened in this case) give approximately equal contribution to every value in the "inter-triangle" part of the distance matrix. The resulting matrix corresponds to the "perpendicular triangles" pattern. We should notice that the distance matrix showed on Fig. 6c can not be effectively represented as a distribution of 6 points in 3D. Thus the "perpendicular triangles" structure shown on Fig. 6c is only an approximate picture, the real configuration is almost 6-dimensional, due to the distance matrix symmetry.
In the region of G+C-content about 51% we observe a group of genomes with almost palindromic signatures. One typical example is the genome of Escherichia coli:
(24) |
The resulting pattern is a degenerated case: two triangles are co-located, without phase label rotation (F0 is approximately in the same point as B0). The distance matrix consists of 4 almost identical blocks. As a result, we have a situation when the 7-cluster structure effectively consists of only 4 clusters, one for every pair F0-B0, F1-B1, F2-B2 and a non-coding cluster.
The same degenerated case but with rotation of labels (F0-B1, F1-B2, F2-B0) is observed for some AT-rich genomes. For example, for the genome of Wigglesworthia brevipalpis (G+C-content equals 22%) the signature
(25) |
becomes a perfect palindrom after applying the phaseshift operator:
(26) |
One possible biological consequence (and even explanation) of this degeneracy is the existence of overlapping genes: in this case the same codons can be used to encode proteins simultaneously in the forward and backward strands on a regular basis (without frameshift for G+C-content around 50% and with a frameshift for AT-rich genomes), with the same codon usage.
The four patterns are typical for triplet distributions of bacterial genomes observed in nature. The other ones combine features from these four "pure" types. In general, going along the G+C-content scale, we meet first "parallel triangles" which will transform gradually to "perpendicular triangles". This way one can even meet structures resembling flower-like type in one of the 2D-projections, like for the genome of Helicobacter pylori (see our web-site and in [9] for the illustration). Then the pattern goes to the degenerated case with genomic G+C-content around 50% and signatures close to palindromic. After the degeneracy disappears, the pairs F0-B0, F1-B1, F2-B2 diverge in the same 2D-plane and after 55% threshold in G+C-content we almost exclusively have the flower-like structures. It is possible to browse the animated scatters of 7-cluster structures observed for each of the 143 genomes on our web-site.
We want to underline that the possibilities shown on Fig. 6 are not all possible configurations of the 7-cluster structure. For example, there are 3 possible degenerated orientations but only two of them happen in observed genomic sequences.
To make the images and graphs of the 7-cluster structures of 143 genomes available for a wide public, we established a web-site for cluster structures in genomic word frequency distributions.
For the moment our database contains 143 completely sequenced bacterial genomes and two types of cluster structures: the 7-cluster structure and the gene codon usage cluster structure. When browsing the database, a user can look at animated 3D-representations of these multidimensional cluster structures. For the description of the structures and the methods we refer the reader to the "intro" and "methods" pages of the web-site.
Another possibility which is provided on our web-site is browsing large-scale "maps" of various spaces where all 143 genomes can be embedded simultaneously. One example is the codon usage map: one point on the map is a genome, and close points correspond to the genomes with close codon usage. In fact, this is the same 64-dimensional triplet frequency space, used for construction of the 7-cluster structure, "observed" from a big distance. This gives the following hierarchy of maps: general map of codon usage in 143 genomes, then the 7-cluster structure of in-phase triplet distributions, then the "thin structure" of every coding cluster, i. e. the gene codon usage map. Clicking on a genome at the first map, the user "zooms" into its more detailed representations.
We strongly believe that the information in the database will help to advance existing tools for the analysis of bacterial genomes. Also it can serve as rich illustrative material for those who study sequence bioinformatics.
In this paper we prove the universal 7-cluster structure in triplet distributions of bacterial genomes. Some hints on this structure appeared long time ago, but only recently it was explicitly demonstrated and studied.
The 7-cluster structure is the main source of sequence heterogeneity (non-randomness) in the genomes of bacterial genomes. In this sense, our 7 clusters are the basic model of bacterial genome sequence. We demonstrated that there are four basic "pure" types of this model observed in nature: "parallel triangles", "perpendicular triangles", degenerated case and the flower-like type (see Fig. 6). These four types are not all possible symmetric configurations: there are possible symmetric configurations not found in the known bacterial genomes. The existence of 4 basic symmetry types in the 7-cluster structure is determined by dependencies of codon position-specific nucleotide frequencies on genomic G+C-content.
To explain the properties and types of the structure, which occur in natural bacterial genomic sequences, we studied 143 bacterial genomes available in Genbank in August, 2004. We showed that the codon usage of the genomes can be very closely approximated by a multi-linear function of their genomic G+C-content (more precisely, by two similar functions, one for eubacterial genomes and the other one for archaea). In the 64-dimensional space of all possible triplet distributions the bacterial codon distributions are close to two curves. When moving along these curves we meet all naturally occurring 7-cluster structures in the following order: "parallel triangles" for the AT-rich genomes (G+C-content is around 25%), then "perpendicular triangles" for G+C-content is around 35%, switching gradually to the degenerated case in the regions of GC = 50% and, finally, the degeneracy is resolved in one plane leading to the flower-like symmetric pattern (starting from GC = 60%). All these events can be illustrated using the material from the web-site we established.
The properties of the 7-cluster structure have natural interpretations in the language of Hidden Markov Models. Locations of clusters in a multdimensional space correspond to in-state transition probabilities, the way how clusters touch each other reflects inter-state transition probabilities. Our clustering approach is independent of the Hidden Markov Modeling, though can serve as a source of information to adjust the learning parameters.
In this paper we have analyzed only triplet distributions. It is easy to generalize our approach for longer (or shorter) words. In-phase hexamers, for example, are characterized by the same 7-cluster structure. However, our experience shows that most of the information is contained in triplets: the correlations in the order of codons are small and formula (8) works reasonably well. Other papers confirm this observation (see, for example, [1, 7]).
The subject of the paper has a lot of possible continuations. There are several basic questions: how can one explain the one-dimensional model of codon usage or why have the signatures palindromic structures in the middle of G+C-content scale? There are questions about how our model is connected with codon bias in translationally biased genomes: the corresponding cluster structure is the second hierarchical level or the "thin structure" in every cluster of the 7-cluster structure (see, for example, [34]). Also the following question is important: is it possible to detect and use the universal 7-cluster structure for higher eukaryotic genomes, where this structure also exists (see [9]), but is hidden by the huge non-coding cluster?
The information about the 7-cluster structure can be readily introduced into existing or new software for gene-prediction, sequence alignment and genome classification.