In Silico Biology 5, 0025 (2005); ©2005, Bioinformation Systems e.V.  

Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences

Alexander N. Gorban1,2, Tatyana G. Popova1 and Andrei Y. Zinovyev3,4,*

1 Institute of Computational Modeling, Russian Academy of Science, Russia
2 Centre for Mathematical Modelling, University of Leicester, UK
3 Institut des Hautes Etudes Scientifiques, Bures-sur-Yvette, France
4 Service Bioinformatique, Institut Curie, Paris, France

* Corresponding author; Email:

Edited by E. Wingender; received November 12, 2004; revised January 25, 2005; accepted January 30, 2005; published March 02, 2005


Coding information is the main source of heterogeneity (non-randomness) in the sequences of microbial genomes. This heterogeneity appears as a cluster structure in the triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in microbial genomic sequences and explained its properties. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C content. Based on an analysis of the 143 completely sequenced bacterial genomes available in GenBank in August 2004, we show that four "pure" types of the 7-cluster structure are observed. Animated 3D scatter plots of the cluster structure for all 143 genomes are collected in a database available on our website. The findings can be readily incorporated into software for gene prediction, sequence alignment, or classification of microbial genomes.

Keywords: word frequency, codon usage, clustering, visualization, symmetry
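
To make the phrase "triplet distributions of relatively short genomic fragments" concrete, the following is a minimal Python sketch of how such distributions might be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names `triplet_frequencies` and `window_vectors`, the window width of 300 bp (chosen from the 200-400 bp range quoted above), the step of 50 bp, and the use of non-overlapping triplets read from the first position of each window are all hypothetical choices made for this example.

```python
from itertools import product

# All 64 triplets in a fixed order; the ordering is an arbitrary choice for this sketch.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Return 64 relative frequencies of non-overlapping triplets in a fragment."""
    counts = [0] * 64
    # Read triplets at positions 0, 3, 6, ... counted from the start of the fragment.
    for i in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3].upper())
        if idx is not None:  # skip triplets containing ambiguous bases such as N
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


def window_vectors(sequence, width=300, step=50):
    """Slide a window along the sequence and collect one 64-dimensional vector per window.

    The width and step defaults are assumptions made for illustration only.
    """
    return [
        triplet_frequencies(sequence[start:start + width])
        for start in range(0, len(sequence) - width + 1, step)
    ]
```

Each window yields one point in a 64-dimensional space of triplet frequencies. Any cluster structure would then be sought among these points, for example after projecting them onto a few principal components for a 3D scatter plot. That projection step is not shown here.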