In Silico Biology 3, 0039 (2003); ©2003, Bioinformation Systems e.V.  


Seven clusters in genomic triplet distributions

Alexander N. Gorban1,2, Andrei Y. Zinovyev3, * and Tatyana G. Popova1




1Institute of Computational Modeling, Russian Academy of Science
2Institute of Polymer Physics, ETH, Switzerland
3Institut des Hautes Etudes Scientifiques, Bures-sur-Yvette, France

* corresponding author
  Email: zinovyev@ihes.fr





Edited by E. Wingender; received June 03, 2003; revision received September 26, 2003; accepted September 28, 2003; published November 19, 2003



Abstract

In several recent papers new gene-detection algorithms were proposed for detecting protein-coding regions without requiring a learning dataset of already known genes. The fact that unsupervised gene-detection is possible is closely connected to the existence of a cluster structure in oligomer frequency distributions. In this paper we study the cluster structure of several genomes in the space of their triplet frequencies, using a pure data exploration strategy. Several complete genomic sequences were analyzed, using the visualization of tables of triplet frequencies in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters, corresponding to protein-coding information in three possible phases in one of the two complementary strands and in the non-coding regions with high accuracy (higher than 90% on nucleotide level). Visualizing and understanding the structure allows to analyze effectively the performance of different gene-prediction tools. Since the method does not require extraction of ORFs, it can be applied even for unassembled genomes.

Key words: visualization, gene recognition, unsupervised learning, codon usage



Introduction

With few exceptions, almost all commonly used gene-finding programs employ a learning dataset for tuning parameters of a learning rule (see, for example, reviews of gene recognition methods in Burge and Karlin, 1998, and Fickett, 1996). In several recent papers new algorithms were proposed for detecting coding regions without requiring a learning dataset of already known genes. In Bernaola-Galvan et al., 2000, the authors proposed a method developed for unsupervised segmentation of whole DNA texts corresponding to the division into coding and non-coding regions. In Audic and Claverie, 1998, the authors proposed a clustering procedure which uses all available annotated genomic data for its calibration. This iterative procedure uses genomic sequences to adjust parameters (that are initialized by randomly partitioning a number of small subsequences) of several probabilistic sequence models. The algorithm converges fast and gives an accuracy of up to 90%. In Baldi, 2000, it was explained that this algorithm is essentially a form of the expectation maximization algorithm applied to the corresponding probabilistic mixture model.

The fact that unsupervised gene detection is possible and effective (and, to lesser extent, supervised learning as well) is closely connected to the existence of a cluster structure in oligomer distributions. Implicitly the existence of this structure is known (see, for example, Borodovsky and McIninch, 1993) and is widely used in gene recognition, but was never visualized and studied in terms of data exploration strategies. Visual representations of the structure allows deeper understanding of its properties and can serve as useful display for analyzing the characteristics of existing gene finders.

Previously, we tested several unsupervised approaches for gene recognition [Gorban et al., 2001]. In this paper we use a method of data visualization to explore the space of frequencies of triplets in a sliding window. We demonstrate and analyze the structure of datasets used both for supervised and unsupervised learning.

In this work we do not have the purpose to invent principally new gene prediction tools, but rather provide illustrations of the datasets that are utilized by existing prokaryotic gene finders. Just as one quantative illustration, we propose a simple clustering method for detecting coding regions and assess it, using standard measures. Its performance turns out to be similar to that of the methods mentioned above, as well as of modern prokaryotic gene finders. In addition, even this simplified method, which does not take into account many important details, can have practical value when application of ORF extraction strategies is not possible (in the case of unassembled genomes, for example, when only chunks of 200-500 bp length are available).




Methods

Let us denote a codon frequency distribution as fijk, where i,j,k are in the {A,C,G,T} set, i. e., for example, fACG is equal to the relative frequency of the ACG codon in a given coding region. One can introduce such natural operations over frequency distribution as phase shifts P(1) , P(2) and complementary reversion CR:

where is complementary to i, i. e., , etc.

The phase-shift operator P(n) calculates a new triplet distribution, counted with a frame-shift on n positions, in the hypothesis that no correlations exist in the codon order. Complementary reversion constructs the distribution of triplets in the complementary strand, counted in the direct strand ("shadow" codon usage).
Let us introduce the distance between two distributions as

It is then natural to expect that the problem of gene recognition can be solved if one of the numbers, or , is big enough. It follows from that remark that after a large number of insertion and deletion operations of one base-pair at a time, we would have

This is expected to happen in non-coding regions, where frameshifts do not necessarily lead to misreading of genetic material, and may happen due to mutations.

Let us introduce a measure of how far fijk is from the shifted distributions:

Real (counted directly from genetic texts) triplet frequency distributions in the first and the second phases will be denoted as . Let us introduce the term "codon correlation contribution measure" as the average distance between real and calculated distributions:

We constructed datasets of triplet frequencies for several real genomes and for several model genetic sequences, as follows:

 

  1. Only direct strands of genomes are used for counting triplets;

  2. Every p positions in a sequence, we open a window (x-W/2,x+W /2) of size W and centered at position x; here p>>1 is used to reduce the sample size of the resulting dataset, otherwise one has to deal with a sample size of millions of points, while introducing p the sample size s [L/p] points, where L is the entire length of the sequence;

  3. Every window, starting from the first base pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets fijk are calculated;

  4. The dataset consists of N = [L/p] points; every data point Xi = {xis} corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets s = 1, . . ., 64. Resulting datasets were used both for visualization and clustering.

The standard centering and normalization on the unit dispersion procedure is then applied, i. e., , where is the value of the sth coordinate of the ith point after normalization, and is the mean value of the sth coordinate, 5 is the standard deviation of the sth coordinate.

Then we applied a principal components algorithm in order to visualize these 64-dimension datasets projected onto the 3-dimensional linear manifold spanned by the first three principal vectors of the distribution. It is known that a projection onto this manifold is only as informative as high the value of v(3) = D(3)/D is, where D is the dispersion of the dataset, calculated in 64-dimensional data-space and D(3) is the analogous quantity calculated after projecting the vectors into the 3-dimensional space. In practice, even if the value of v(3) is not high enough (say, it equals 0.1-0.3), we may still try to visualize the dataset, in the hope of being able to pick up qualitative "signals" of the presence of patterns in the data distribution, as well as to visually represent the dataset.



Results

Figure 1 presents several distributions calculated for 4 genetic texts.



Figure 1: Visualization of genetic sequences in the space of triplet frequencies.


In addition to the distribution itself, we introduced two triangles, formed by the points and into the figures. The large spheres correspond to the points fijk and , where fijk was calculated from the genome's known annotation. Data points have different shapes and colors, according to whether they are coding or non-coding in one of the two strands.

The explanation of the structures is quite evident: Coding information from windows in the direct strand has one of three possible phase shifts. Since this phase shift is not known in advance, approximately one-third of the windows fall into the vicinity of the point that corresponds to the fijk (0-shift), one-third are close to the (1-shift), and the last third are close to the (2-shift). This is also true for the reverse strand, but with the centers corresponding to complementary distributions.

The plots shown in Figure 1 are two-dimensional projections of 3D scatters. The 3D scatters are shown as rotating gif-animations in Figure 2. The first 2 distributions show very clear separation on seven clusters; no surprise, that in these cases unsupervised gene-prediction gives both high specificity and sensitivity. The distribution of triples in S. cerevisiae, chromosome IV forms seven clusters as well; though they are not clearly seen on 2D-pictures, because two "phase triangles", projected into the principal subspace are positioned on two parallel planes, perpendicular to the direction of the third principal component. The situation is worse in case of   P. wickerhhamii mitochondrion genome. In this case distributions of triplets in the direct and reverse strands indeed overlap. Figure 1 shows that this is not simply because the triplet distributions in direct and reverse strands are similar, but 0-phase of the distribution of triplets in the direct strand overlaps with 1-phase of the distribution in the reverse strand, and so on. One can predict in this case that gene-recognition procedures will often mix genes in the direct and reverse strands, though ORF-strategies can probably resolve this conflict.



Figure 2: Rotating animations of the distributions in 3D.


One can see from the pictures that the centers of phase-shifted distributions are close enough to the calculated points, demonstrating the absence of significant correlations in the order of codons. Indeed, the calculated values of CC are not high (see Table 1, CC column.) It means that in real texts correlation between subsequent codons is much less than the "inter-phase" difference.


Table 1: Summary table of results for assessing the method on the nucleotide level.
Sequence L W p v(3) % of coding bases CP CC Sn1 Sp1 Sn2 Sp2
Helicobacter pylori 1643831 300 120 0.35 90 0.68 0.28 0.93 0.97 0.93 0.98
Caulobacter crescentus 4016947 300 300 0.21 91 1.07 0.16 0.93 0.97 0.94 0.98
Prototheca wickerhamii 55328 120 18 0.17 49 0.83 0.11 0.82 0.93 0.84 0.95
Saccharomyces cerevisiae chromosome III 316613 399 99 0.16 69 0.45 0.10 0.90 0.88 0.90 0.90
Saccharomyces cerevisiae chromosome IV 1531929 399 120 0.15 73 0.48 0.09 0.89 0.91 0.92 0.92



Clusterization

Having in mind the visual representation of the distribution of data-points, it is possible to propose a natural way of segmenting sequences into regions that are homogeneous with respect to the coding phase. One would expect that regions with the same coding phase correspond to protein-coding regions.

We must underline, that in this work we do not have the purpose to invent a principally new gene-prediction tool, but rather provide illustrations on the datasets that are utilized by existing prokaryotic gene finders. The gene finding method we describe below is just an illustration to the "seven-clusters" structure of the triplet distributions, and it is intentionally made as simple as it can be. Nevertheless, it can be used in the situations where application of ORF strategies is not possible, giving rather good performance characteristics.

Trying to be as simple as we can, we make use of one of the simplest clusterization strategies, namely unsupervised K-Means clustering. The clustering is accomplished in the 64-dimensional space and the positions of all seven clusters are identified. This is the result of the learning step of the procedure. Also one must store the coefficients which were used for normalization (on unit dispersion) of the dataset.

After that step, during classification of all sequence positions, one must assign the corresponding cluster label along the whole sequence. This can be done by opening a window of width W at every 3rd position x of the sequence (x-W/2,x+W/2) and calculating non-overlapping triplet distribution inside this window. This distribution, after normalization with the coefficients stored at the previous step, corresponds to a point in 64-dimnesional space. Then the closest cluster is determined in this space. If it is the central cluster, that point is assigned to be non-coding; otherwise it is assigned to one of three possible coding phases. These basic steps of the method are presented graphically in Figure 3.



Figure 3: Graphical representation of the method.


To evaluate the ability of this procedure to differentiate between "coding" and "non-coding" base-pairs, we used base-level sensitivity and specificity of exon recognition, the measures which are commonly used in this case:

where TP is the number of true-positives, i. e., coding bases predicted to be coding;
TN is the number of true-negatives, i. e., non-coding bases predicted to be non-coding; FP is the number of false-positives, i. e., non-coding bases predicted to be coding, and FN is the number of false-negatives, i. e., coding bases predicted to be non-coding.

The results are shown in the Sn1 and Sp1 columns of Table 1.

We must underline that the procedure is fully automated and does not require any human intervention. Neither genomic annotation nor ORF strategy is used. The procedure is as simple as 1) open sliding window every p positions in a text; 2) count frequencies of non-overlapping triplets, starting from the first base-pair. A visualization of datasets can be useful to evaluate how reliable the prediction will be (measuring compactness of the clusters, for example) and to compare the prediction with known information. The only parameter - window size - may be visually evaluated by comparing pictures of data constructed with various values of W (see the full version of the paper on the accompanying web-page.) In fact, the dependence of effectiveness on window-size is not strong over a rather long interval of W.


Using known data

In the method described above, the learning process used no other information than the sequence itself; it was completely "unsupervised". One can also try to make use of some extra knowledge, as discussed in the next paragraph.

Studying a set of training examples (for example, following the strategy of GLIMMER, using long ORFs as a training set), it is possible to calculate the centers of all seven clusters. We have done this, using the annotation of the analyzed genomes. First, half of the genes were used to calculate the centers, and the rest for accuracy testing. Using the seven centers, we calculated new values for the sensitivity and specificity of gene recognition. They are shown in the columns Sn2 and Sp2 of Table 1. Here no clusterization is made at all. We provide these numbers only to show how unsupervised learning is close to the supervised classification based on heuristics and biological intuition (using long ORFs or homology search, for example).


Our method and GLIMMER gene finder

Choosing GLIMMER [Salzberg et al., 1998; Delcher et al., 1999] for our analysis was dictated by our desire to use a gene-predictor that does not use any learning information, except of what can be extracted from the genetic sequence itself. In GLIMMER, a learning dataset is formed by extracting long ORFs (usually longer than 500 bp) and then a variant of HMM-based predictor is used. Thus GLIMMER extrapolates the model of genetic sequence derived from the longest ORFs onto the shorter ORFs, which are the genes candidates. It is known that GLIMMER has a tendency to produce a lot of false-positive predictions. That version of GLIMMER that we used (version 2.02) did not have any model for non-coding regions. It was interesting to understand if and how many of GLIIMER false-positive predictions are due to this lack.

GLIMMER gene finder uses some ORF strategy to detect potential genes. Because of this, we have to introduce simple rules for deciding if a given ORF is coding or non-coding in our "seven-clusters" methodology. For every ORF, we calculate a 64-dimensional vector of its triplet frequencies and find the closest centroid in the space of triplet frequencies (the positions of the centroids are calculated as it was described earlier). If the closest centroid is the one which corresponds to the correct coding phase (let us denote it by P0), then this ORF is assigned to be coding. After this, from all such ORFs in P0 phase we filter out all ORFs that are too distant from the P0-centroid (the threshold is determined by an additional parameter), and all ORFs which are inside other ORFs in the P0 phase (it means that we take the longest ORF in the P0 phase).

To test this procedure, we analyzed the output of GLIMMER gene finder (using default settings), using the list of ORFs, produced by GLIMMER. Thus, we compare only the effectiveness of the measures used and not the details of the ORF extraction strategy.

In Table 2 we show the results of this comparison, using existing annotations of the genomes in GenBank. One must understand that the annotations are far from being perfect and some part of the ORFs that we denoted as false positives in the GLIMMER prediction can be unknown putative genes (as it is claimed by the authors of GLIMMER). Nevertheless, we find a significantly lower false-positive rate of our method comparing to the GLIMMER prediction. Analyzing this, we found in some genomes that a cluster structure exists in the distribution of false-positive GLIMMER predictions. In Figure 4 the visualization of GLIMMER predictions on the principal components plane is shown for Escherichia coli and Caulobacter crescentus for which GLIMMER produces many predictions of "additional genes". For example, our analysis shows that 62% of false-positives predictions for Escherichia coli and 80% of false positives for Caulobacter crescentus in the 64-dimensional space of triplet frequencies are closer to the centroid, which corresponds to the CR fijk distribution (C0-centroid), while only 2% of true-positive predictions for Caulobacter crescentus are close to the C0-centroid. Such discrepancy cannot be explained simply by the "presence of unknown genes" but is due to some effect of this HMM-based predictor, which often takes "shadow" genes as positive predictions.


Table 2: Comparing the method with GLIMMER gene-predictor
Sequence CLUSTER GLIMMER
Sn Sp Sn Sp
Helicobacter pylori 0.94 0.95 0.96 0.78
Haemophilus influenza 0.93 0.88 0.96 0.84
Escherichia coli 0.91 0.87 0.96 0.76
Bacillus subtilis 0.89 0.95 0.97 0.79
Caulobacter crescentus 0.89 0.76 0.94 0.60



Figure 4: Visualization of the distribution of predictions of GLIMMER gene finder in 64-dimensional space of triplet frequencies.


As one can see from Table 2, the sensitivity of our method is lower in all cases, compared to the GLIMMER gene finder, having a significatly better specificity. Using the annotation of E. Coli, we found that out of 228 genes predicted by GLIMMER, and not predicted by our method, 121 are annotated as predicted only by computational methods, 11 ribosomal genes and 12 transposases. Of 24 genes predicted by our method and not predicted by GLIMMER, 17 are annotated as predicted only by computational methods and 3 as ribosomal genes. This is not surprising; it is known that ribosomal genes, some other highly expressed genes as well as horizontally transferred genes (the percentage of which is estimated as 10% from the overall number, [Médigue et al., 1991]) can have rather different (with respect to the average) codon usage, for example, strongly translationally biased codon usage in the case of the ribosomes. It is known also that preliminary clusterization of genes can enhance existing gene finding procedures [Mathé et al., 1999; 2000].


Window-size dependence

Figure 5 presents our study of window-size dependence of the algorithm effectiveness for two genomes. One can see that the minimal window length, which can be used for the algorithm, is about 100 bp. This value is often characterized as a barrier for all gene-prediction methods based on the analysis of compositional differences. Then, the sensitivity of the algorithm drops monotonically, and, with a window size larger than 400-500 bp, becomes poor.



Figure 5: Window-size dependence of the algorithm.



Implementation

All datasets were prepared from sequences in the GenBank flat-file format. The programs used for data analysis, including simple implementation of the K-means clusterization algorithm, were written in Java and are available with instructions at the accompanying web page: http://www.ihes.fr/~zinovyev/bullet/. These programs actively use the BioJava programming package. Technically, the data visualization and all illustrations were produced using the ViDaExpert data visualization tool under Windows and are available at the supplementary web page.




Discussion

The seven clusters structure of oligomer distributions in genetic texts plays an important role in the ability of modern gene finders for unsupervised (and, to lesser extent, also for supervised) learning in prokaryotic genomes. Actually existence of the structure makes the prokaryotic gene finding so efficient. Using seven hidden states for hidden Markov modeling approach in gene-prediction was introduced long ago (see, for example, Borodovsky and McIninch, 1993). Though being widely exploited, this structure was never visually presented and analyzed by a pure data exploratory study .

Our study shows relatively high performance of using only short n-mers like triplets. It means that an essential part of the information needed to discriminate between coding and non-coding regions is already contained in their triplet distributions. Using hexamer frequencies (that is common practice in modern gene finders) can be more sensitive, but also can lead to certain undesirable effects. One needs more sequence information to evaluate hexamer frequencies, and, as a result, this fact can lead to "overfitting" effects, leading to a lower specificity. We demonstrated this fact, using visual analysis of positive predictions of GLIMMER gene finder.

We demonstrated as well that the correlations in the order of codons are small with respect to the interphase and "coding-noncoding" distance. This fact is interesting by itself, and is not completely trivial. In particular, it implies that in seven-states HMMs which are commonly used for gene prediction, the weights of different (coding) states are not at all independent: their dependence in the case of HMMs of order 2, in zero approximation is given by the formulas in the beginning of the "Methods" section. Another somehow unexpected observation is that the sizes of clusters in the phase triangle are similar. It would be natural to expect the cluster which corresponds to the "correct" P0 (or C0) phase to be more compact than P1 (C1) and P2 (C2) clusters, but this is not the case.

From the constructed representations of datasets it is clear that the spatial structure of triplet distributions is almost completely determined by two factors: 1) the frequency distribution of the 64 codons in the coding phase; 2) the dispersion of the codon frequency distribution. The latter one is related to the structure of codon usage over all genes in a genome, which is known to be inhomogeneous (see, for example, Médigue et al., 1991), especially in such fast-growing organisms as E. coli and B. subtilis, where the translational bias shapes the codon usage differently for different groups of genes. Nevertheless, the dispersion is smaller with respect to the phase-phase and coding-noncoding distances, which makes the gene-prediction possible even without preliminary gene classification.

From the figures it is evident that the distribution structure renders linear discrimination analysis (sometimes applied in this situation) inapplicable. Applying linear methods in this case would lead to the incorrect conclusion that the dataset is not well-separable and that this measure is less effective than others with respect to linear discrimination function. For example, in the case of Helicobacter pylori, linear discrimination yields a specificity of ~0.83 (which means many false positives), while the method we proposed yields ~0.97. This fact stresses that understanding the spatial structure of a learning dataset is necessary for reasonable applications of pattern recognition methods.

Frequency normalization plays an important role in cluster structure formation. It indicates the importance in distinguishing coding and non-coding regions of those codons that may not have high frequency values but considerably change their frequencies after phase-shift (codons that are "prohibited,", due to codon bias.)

Basically, distribution of non-overlapping triplets that is efficient for gene recognition corresponds to a high value of mutual information in three consecutive letters, i. e.,

where Pik is the average frequency of letter i {A, C, G, T} at the k-th place in a triplet. This value may be zero only in the case . In this case, we would have , i. e., the phase-shift does not change the triplet distribution. High values of M guarantee the presence of a "three-phase triangle" in the data space, as well as formation of the cluster structure.

In this paper, using visual analysis of spatial dataset structure and very simple clustering technique, we have shown that using a learning dataset is not necessary in order to accurately solve gene recognition tasks, at least in DNA segments with high concentrations of coding information (compact genomes). This property is very useful, since the problem of choosing a "good" learning dataset is not very well defined (see, for example, Mathé et al., 2002).

The method proposed can be applied for the rough annotation of unassembled genomes, since it does not require preliminary extraction of ORFs. This makes it useful for inexpensive genome survey projects. Also it allows efficient analysis of performance of existing prokaryotic gene finders. One more (and not the least) advantage of the visual representation of oligomer distributions is that it facilitates understanding of the subject by those who just enter this field.



Acknowledgements

The authors thank Alessandra Carbone for very fruitful discussions, Misha Gromov for the interest he expressed in this work, Noah Hardy and Arndt Benecke for their help with preparing the manuscript.



References