In Silico Biology 3, 0039 (2003); ©2003, Bioinformation Systems e.V.  

Seven clusters in genomic triplet distributions

Alexander N. Gorban1,2, Andrei Y. Zinovyev3,* and Tatyana G. Popova1

1Institute of Computational Modeling, Russian Academy of Sciences;
2Institute of Polymer Physics, ETH, Switzerland;
3Institut des Hautes Etudes Scientifiques, Bures-sur-Yvette, France

* corresponding author

Edited by E. Wingender; received June 03, 2003; revision received September 26, 2003; accepted September 28, 2003; published November 19, 2003


Several recent papers have proposed gene-detection algorithms that find protein-coding regions without requiring a learning dataset of already known genes. That unsupervised gene detection is possible at all is closely connected to the existence of a cluster structure in oligomer frequency distributions. In this paper we study the cluster structure of several genomes in the space of their triplet frequencies, using a pure data exploration strategy. Several complete genomic sequences were analyzed by visualizing tables of triplet frequencies computed in a sliding window. The distribution of the 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters: six correspond to protein-coding information in the three possible phases on each of the two complementary strands, and the seventh corresponds to the non-coding regions. The clusters match these region types with high accuracy (higher than 90% at the nucleotide level). Visualizing and understanding this structure makes it possible to analyze effectively the performance of different gene-prediction tools. Since the method does not require extraction of ORFs, it can be applied even to unassembled genomes.

Key words: visualization, gene recognition, unsupervised learning, codon usage
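
As an illustration of the data underlying the analysis, the following minimal Python sketch builds 64-dimensional triplet frequency vectors in a sliding window. It is not the authors' implementation: the function names, the window width of 300 nucleotides, the step of 12 nucleotides, and the choice to count non-overlapping triplets from the start of each window are all assumptions made for this example.

```python
from itertools import product

# All 64 triplets over the alphabet A, C, G, T, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Return a 64-element list of triplet frequencies for one fragment.

    Triplets are read without overlap, starting at the first nucleotide
    of the fragment (an assumption of this sketch). Triplets containing
    symbols other than A, C, G, T are skipped.
    """
    counts = [0] * 64
    for i in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3].upper())
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


def sliding_window_vectors(sequence, window=300, step=12):
    """Return (start, frequency_vector) pairs for each window position.

    The window and step values are illustrative defaults only.
    """
    vectors = []
    for start in range(0, len(sequence) - window + 1, step):
        fragment = sequence[start:start + window]
        vectors.append((start, triplet_frequencies(fragment)))
    return vectors
```

Each window yields one point in the 64-dimensional space; the resulting table of points is what would then be passed to a visualization or clustering method of the reader's choice.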