| In Silico Biology 2, 0008 (2002); ©2002, Bioinformation Systems e.V. |
| Dagstuhl Seminar "Functional Genomics" |
*Corresponding author
Laboratory of
Computational Genomics,
CIFN, UNAM, A.P. 565-A,
Cuernavaca, Morelos 62100,
Mexico
Email: moreno@cifn.unam.mx
Edited by E. Wingender; received October 26, 2001; revised and accepted December 18, 2001; published January 28, 2002
We have previously demonstrated that genes within experimentally characterized operons of Escherichia coli are conserved together in other genomes more frequently than genes at the borders of transcription units. Here we expand the analyses and show that, as the phylogenetic distance between the genomes compared increases, the gene pairs that remain together must be genes associated into operons in other prokaryotes, regardless of the operon organization of the corresponding orthologous gene pair in E. coli. At the same time, we show that the tendency of genes within operons to keep very short inter-genic distances, observed in E. coli, is the same in every other prokaryote whose genome is currently available. We also show the relationship between our analyses of conservation and the inference of functional relationships from genomic context.
Keywords: genome context, functional relationship, operon conservation,
intergenic distance
Genome rearrangements have been shown to occur so fast that gene order is not preserved [1], and gene order is lost much faster than amino-acid identity [2]. Thus, it was noticed very early that conservation of neighborhood is biologically meaningful. For instance, conservation of gene order was shown to occur among genes whose protein products either interact [3] or have functional relationships [4] (e.g. the proteins are part of a hetero-multimeric enzyme, or are enzymes involved in consecutive steps in a metabolic pathway). Such works were based on the well-known tendency of operons (adjacent genes that are transcribed into a single messenger RNA) to be formed from the association of genes with related functions. Recently, it was indeed demonstrated that conservation of gene order is strongly correlated with the association of genes into operons [5, 6].
Here, we expand our previous analyses contrasting the conservation of genes
found within experimentally characterized operons from Escherichia coli,
against that of adjacent genes found in the same DNA strand at the borders of
transcription units (TUs). We show that conserved pairs in other prokaryotes
contain mixed populations of genes within operons and of genes at TU
boundaries, but that the populations purify to contain mostly genes within
operons as the phylogenetic distances of the genomes compared become greater.
At the same time, we show that the previously observed tendency of genes within
operons to be kept at very short distances in E. coli is the same in any
prokaryote. We also show that the quantitative evaluation of different
parameters, proposed to infer functional relationships from genomic context
[7], can be performed using databases derived from the knowledge dispersed in
the literature. We used data from RegulonDB,
a database of TU organization and regulation in Escherichia coli [8].
Data sets of pairs of adjacent genes found in the same operons (WO pairs), and of pairs of adjacent genes at transcription unit boundaries (TUB pairs, being the last gene in a TU and the first one in the next), were prepared as previously described [9]. We derived 637 WO pairs from the 296 operons reported in the database of RegulonDB, and 428 TUB pairs from the comparison of the 495 TUs in RegulonDB with the 1298 directons (stretches of adjacent genes in the same DNA strand with no intervening gene in the opposite strand) of E. coli [9].
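The definitions of directons, WO pairs and TUB pairs can be illustrated with a minimal sketch. It is not the procedure of [9]: the `Gene` record, the function names and the toy input are hypothetical, the chromosome is treated as linear, and each gene is assumed to belong to at most one known TU.

```python
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class Gene(NamedTuple):
    name: str
    start: int   # leftmost coordinate on the chromosome (assumed convention)
    strand: str  # "+" or "-"


def directons(genes: Iterable[Gene]) -> List[List[Gene]]:
    """Group genes, in chromosomal order, into runs on the same strand."""
    runs: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if runs and runs[-1][-1].strand == gene.strand:
            runs[-1].append(gene)   # same strand: extend the current directon
        else:
            runs.append([gene])     # strand switch: start a new directon
    return runs


def classify_pairs(
    genes: Iterable[Gene], tus: Sequence[Sequence[str]]
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (WO pairs, TUB pairs) among adjacent genes in the same directon.

    `tus` lists the gene names of each known transcription unit.
    Pairs involving a gene with no known TU are left unclassified.
    """
    tu_of = {name: i for i, tu in enumerate(tus) for name in tu}
    wo: List[Tuple[str, str]] = []
    tub: List[Tuple[str, str]] = []
    for run in directons(genes):
        for left, right in zip(run, run[1:]):
            if left.name not in tu_of or right.name not in tu_of:
                continue
            pair = (left.name, right.name)
            if tu_of[left.name] == tu_of[right.name]:
                wo.append(pair)     # both genes in the same TU
            else:
                tub.append(pair)    # adjacent genes in different known TUs
    return wo, tub
```

Because the genes of a TU are contiguous, two adjacent genes assigned to different TUs are necessarily the last gene of one TU and the first gene of the next, which is why the sketch does not check transcription order explicitly.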
To find orthologs of E. coli genes in other organisms, we ran gapped BLASTP [10] comparisons of all the protein sequences corresponding to all the open reading frames (ORFs) of E. coli, against every protein sequence corresponding to the ORFs of all other genomes obtained from GenBank [11] (see Table 1). We used an expectation value cutoff of 0.001. We kept only those results where the alignment covered at least 50% of one of the sequences. Our putative orthologs were those genes whose protein products were overall bi-directional best hits to the E. coli query proteins. Fusions were those other genes that had the same best hit as a bi-directional best hit, but covered a different part of the sequence in the alignment. We used the annotations of each genome to find the locations of all genes and thus be able to find genes conserved adjacent to each other in the same strand, and their inter-genic distances.
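The selection of putative orthologs as bi-directional best hits can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record and its fields are hypothetical (they are not BLAST output names), the best hit is taken to be the one with the highest score, hits at exactly the cutoff are accepted, and the detection of fusions is not covered.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    score: float
    query_cov: float    # fraction of the query covered by the alignment
    subject_cov: float  # fraction of the subject covered by the alignment


def best_hits(
    hits: Iterable[Hit], max_evalue: float = 1e-3, min_cov: float = 0.5
) -> Dict[str, str]:
    """Map each query to its best-scoring subject among the accepted hits."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # expectation value cutoff
        if max(hit.query_cov, hit.subject_cov) < min_cov:
            continue  # alignment must cover >= 50% of at least one sequence
        if hit.query not in best or hit.score > best[hit.query].score:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def bidirectional_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> Dict[str, str]:
    """Return pairs (a, b) where b is the best hit of a and a the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}
```

Here `a_vs_b` would hold the hits of E. coli proteins against another genome, and `b_vs_a` the hits of the reciprocal comparison.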
As of the writing of this report, there are 57 prokaryotic genomes in GenBank. Since this
number increases rapidly, we automated a filter to avoid redundant
genomes, leaving for instance a single E. coli genome (K12) as
representative of all E. coli strains, and so on. The method consists of
calculating the average BLAST score of the bi-directional best hits of a given
genome versus itself (e.g. E. coli K12 versus E.
coli K12) (self-average score), and the average BLAST score of all the
bi-directional best hits of that genome versus every other genome (e.g.
E. coli K12 versus Bacillus subtilis) (comparison-average score). We
eliminated any genome if the ratio of comparison-average score to self-average
score exceeded 0.80, making the comparisons and eliminations in alphabetical
order. This method worked well, always keeping a single representative of
each species in the database and leaving a total of 45 representative genomes
(Table 1). We reduced the data sets of WO and TUB pairs to include only those
pairs in which both genes had orthologs in at least one genome in the
non-redundant genome collection (532 WO pairs and 307 TUB pairs).
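One plausible reading of this redundancy filter is a greedy pass over the genomes in alphabetical order, sketched below. The text does not state which member of a redundant pair is discarded, so keeping the genome that comes first alphabetically is an assumption; the function name, the `avg_score` table and the scores used in any example are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def filter_redundant(
    genomes: Iterable[str],
    avg_score: Dict[Tuple[str, str], float],
    threshold: float = 0.80,
) -> List[str]:
    """Keep one representative among closely related genomes.

    avg_score[(g, h)] is the average score of the bi-directional best hits
    of genome g versus genome h; avg_score[(g, g)] is the self-average score.
    """
    kept: List[str] = []
    for genome in sorted(genomes):  # alphabetical order
        self_average = avg_score[(genome, genome)]
        # Redundant if too similar to any genome already retained.
        redundant = any(
            avg_score[(genome, other)] / self_average > threshold
            for other in kept
        )
        if not redundant:
            kept.append(genome)
    return kept
```

With this reading, a strain whose comparison-average score against an already retained genome is, say, 95% of its self-average score would be dropped, while a distant genome with a ratio of 25% would be retained.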
Table 1: Genomes available at GenBank as of the writing of this study.
| Species | GenBank Accession |
|---|---|
| Crenarchaeota | |
| Aeropyrum pernix | NC_000854 |
| Sulfolobus solfataricus | AE006641 |
| Sulfolobus tokodaii | BA000023 |
| Euryarchaeota | |
| Archaeoglobus fulgidus | NC_000917 |
| Halobacterium sp. NRC-1 | AE004437 |
| Methanococcus jannaschii | L77117 |
| Methanobacterium thermoautotrophicum | AE000666 |
| Pyrococcus abyssi | NC_000868 |
| Pyrococcus horikoshii* | NC_000961 |
| Thermoplasma acidophilum | AL139299 |
| Thermoplasma volcanium | NC_002689 |
| Aquificales | |
| Aquifex aeolicus | AE000657 |
| Chlamydiales | |
| Chlamydia muridarum | AE002160 |
| Chlamydophila pneumoniae AR39 | AE002161 |
| Chlamydophila pneumoniae CWL029* | AE001363 |
| Chlamydophila pneumoniae J138* | BA000008 |
| Cyanobacteria | |
| Synechocystis PCC6803 | AB001339 |
| Firmicutes | |
| Bacillus halodurans | BA000004 |
| Bacillus subtilis | AL009126 |
| Clostridium acetobutylicum | AE001437 |
| Lactococcus lactis subsp. lactis | AE005176 |
| Mycobacterium leprae | AL450380 |
| Mycobacterium tuberculosis CDC1551* | AE000516 |
| Mycobacterium tuberculosis* | AL123456 |
| Mycoplasma genitalium | NC_000908 |
| Mycoplasma pneumoniae | NC_000912 |
| Mycoplasma pulmonis | AL445566 |
| Staphylococcus aureus subsp. aureus Mu50 | BA000017 |
| Staphylococcus aureus N315* | NC_002745 |
| Streptococcus pneumoniae R6 | AE007317 |
| Streptococcus pneumoniae TIGR4* | AE005672 |
| Streptococcus pyogenes | AE004092 |
| Ureaplasma urealyticum | NC_002162 |
| Proteobacteria | |
| Agrobacterium tumefaciens (Chromosome 1) | AE007869 |
| Agrobacterium tumefaciens (Chromosome 2) | AE007870 |
| Buchnera sp. APS | AP000398 |
| Caulobacter crescentus | AE005673 |
| Campylobacter jejuni | AL111168 |
| Escherichia coli K12 | U00096 |
| Escherichia coli O157:H7* | BA000007 |
| Escherichia coli O157:H7* | AE005174 |
| Haemophilus influenzae Rd | L42023 |
| Helicobacter pylori 26695 | AE000511 |
| Helicobacter pylori J99* | AE001439 |
| Mesorhizobium loti | NC_002678 |
| Neisseria meningitidis | AE002098 |
| Neisseria meningitidis Z2491* | AL157959 |
| Pseudomonas aeruginosa | AE004091 |
| Pasteurella multocida | AE004439 |
| Rickettsia conorii | AE006914 |
| Rickettsia prowazekii* | AJ235269 |
| Sinorhizobium meliloti | AL591688 |
| Vibrio cholerae (Chromosome 1) | AE003852 |
| Vibrio cholerae (Chromosome 2) | AE003853 |
| Xylella fastidiosa | AE003849 |
| Spirochaetales | |
| Borrelia burgdorferi | AE000783 |
| Treponema pallidum | AE000520 |
| Thermotogales | |
| Thermotoga maritima | AE000512 |
| Thermus/Deinococcus | |
| Deinococcus radiodurans (Chromosome 1) | AE000513 |
| Deinococcus radiodurans (Chromosome 2) | AE001825 |
* Species eliminated to avoid over-representations.
If two proteins have inter-dependent functions, the presence of the gene
coding for one of them would be meaningless unless the gene coding for the
other protein is present too. Thus, it has been suggested that genes
whose protein products have related functions should be evidenced by their
concerted appearance and disappearance in different genomes in what is
called a similar "phyletic pattern" [12, 13] or "phylogenetic profile"
[14]. We thus built binary phylogenetic profiles of all genes within the
genome of E. coli in the following manner: if we find an ortholog for a
given gene in another genome, we annotate a "1", otherwise we annotate a
"0". For instance, if the gene "b0002" has detectable orthologs in the
first three genomes examined (say in alphabetical order), none in the
fourth and fifth, then orthologs again within the sixth and seventh, the
corresponding phylogenetic profile would be "1110011". We then counted the
number of differences between the phylogenetic profiles of the two genes
in each WO pair and in each TUB pair. For instance, if the phylogenetic
profile of gene "b0003" were "1110111", the number of differences of the
WO pair "b0002/b0003" would be one. In Fig 1 we show the distribution of pairs as
a function of the number of differences in phylogenetic profiles. We
included only those pairs containing genes having orthologs in at least ten
other genomes (229 WO pairs and 100 TUB pairs). As shown, WO pairs have
more similar phylogenetic profiles than TUB pairs. This is not surprising,
but it is important to note that there is no perfect relationship; that is,
phyletic patterns of WO pairs can differ for several reasons,
such as limitations in the sensitivity of alignment methods that prevent a
complete detection of homologs (and thus of orthologs), the occurrence of
non-orthologous gene displacement [15], the association into operons of
some genes having inter-related yet not completely dependent functions, and
other factors. Phyletic patterns, however, are sufficiently similar to allow
a functional relationship to be inferred for genes having up to five, or a
few more, differences in their binary profiles, as long as both genes are
present in at least ten out of more than 40 non-redundant genomes.
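The construction and comparison of binary profiles can be sketched in a few lines. The function names and the layout of the ortholog table are assumptions made for illustration; only the profiles "1110011" and "1110111" come from the example in the text.

```python
from typing import Dict, Sequence, Set


def phylogenetic_profile(
    gene: str, orthologs: Dict[str, Set[str]], genomes: Sequence[str]
) -> str:
    """Write "1" for each genome with an ortholog of `gene`, "0" otherwise.

    orthologs[genome] is the set of E. coli genes with an ortholog there.
    """
    return "".join("1" if gene in orthologs[g] else "0" for g in genomes)


def profile_differences(profile_a: str, profile_b: str) -> int:
    """Count the positions at which two binary profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def present_in_enough_genomes(profile: str, minimum: int = 10) -> bool:
    """Check the requirement of orthologs in at least `minimum` genomes."""
    return profile.count("1") >= minimum
```

For the pair "b0002/b0003" of the example, `profile_differences("1110011", "1110111")` gives one difference, at the fifth genome.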
We have shown before that WO pairs are always conserved together in other genomes in higher numbers than TUB pairs, when conservation is measured as the ratio of the number of conserved pairs to the number of co-occurring pairs in each genome [6]. Here we give a different perspective. In Fig 2 we display the number of pairs found together as a function of the number of genomes where they are found. As seen, TUB pairs suffer an abrupt fall in the number of pairs conserved, leaving very few conserved in more than 2 genomes. The few exceptionally conserved pairs, belonging to TUB pairs in E. coli, seem to be associated in operons in other genomes, as will be shown in the inter-genic studies of the next section. The curve corresponding to WO pairs shows a much higher conservation that extends to many more genomes. Conservation of neighborhood [3, 4] and appearance of fusions [16, 17, 18] are also useful hints to detect functional relationships among genes. Fusions occur only among conserved WO pairs (not shown).
Figure 2: Conservation of adjacency. Notice here that WO pairs are conserved adjacent in higher numbers than TUB pairs. Very few TUB pairs surpass a presence in more than 5 genomes. These TUB pairs might be in operons in other organisms, since their inter-genic distances are those typical of E. coli WO pairs (see Fig. 3).
E. coli WO pairs display an inter-genic distance distribution (IDD) with a characteristic peak between -20 bp and 30 bp [9]. If WO pairs in any other organism have the same tendency to keep such short inter-genic distances as E. coli WO pairs, we would expect them to display similar distributions. In our previous analysis we found that conserved WO pairs have an IDD similar to that of WO pairs of E. coli, while the distribution of conserved TUB pairs showed that these are mixed populations of TUB pairs and WO pairs. However, we did not have enough data to plot the IDD of different groups of genomes. The number of available genomes is now much higher, and thus we can explore more precisely the tendencies of conserved pairs, in terms of distance distributions and presumably of the association of genes into operons, in different groups of genomes. We plotted the IDD of genes conserved adjacent to each other among genomes, against those of WO pairs and TUB pairs of E. coli (Fig. 3). In Figs 3a and 3b, we compare the IDD of conserved WO pairs and conserved TUB pairs in Proteobacteria. Figs 3c and 3d correspond to conserved pairs in Firmicutes, and Figs 3e and 3f to conserved pairs in Archaea. Notice that the peak corresponding to conserved WO pairs is always very similar to that of WO pairs of E. coli, though the peak is lower in Proteobacteria and Firmicutes, and slightly higher in Archaea. If we assume that the tendencies are exactly the same in any prokaryotic operon, then the explanation of this result would be as follows. Despite being conserved together more frequently than unrelated genes, operons are unstable [19]. Thus, a few conserved pairs orthologous to WO pairs of E. coli might not be in operons in other organisms (Proteobacteria and Firmicutes), and their conservation would be a product of chance.
At longer phylogenetic distances (Archaea), conservation is most probably of biological significance, and thus the peak reveals a complete association of conserved pairs into operons. A similar reasoning applies to conserved TUB pairs. At short and intermediate phylogenetic distances, conserved TUB pairs reveal some pairs associated into operons, while in Archaea all of the conserved pairs display inter-genic distances typical of WO pairs of E. coli. This means that pairs conserved together in evolutionarily distant species, belonging to TUB pairs in E. coli, might be in operons in other prokaryotes. We have found further support for these assumptions in the analysis of a collection of one hundred operons of Bacillus subtilis compiled from the literature [19]. We used this collection to build a data set of WO pairs (310 pairs). We also used it to find TU boundaries by comparison with the genome sequence and annotation for this organism [20], and built a data set of TUB pairs (123 pairs). Fifty-nine out of 62 (about 95%) pairs of genes found in operons in E. coli are also in operons in B. subtilis (the remaining three are boundary pairs in B. subtilis). Among those found at TU boundaries in E. coli, 3 out of 4 are in operons in B. subtilis, and are conserved as neighbors in at least 15 other genomes. These results also explain the exceptionally conserved TUB pairs in many prokaryotic genomes, since such pairs might actually be in operons in other organisms (see Fig 2).
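An inter-genic distance distribution of the kind discussed above can be computed as in the following sketch. The coordinate convention (1-based, inclusive, so that overlapping genes give negative distances), the 10 bp bin width and the function names are assumptions; only the window from -20 bp to 30 bp comes from the text.

```python
from collections import Counter
from typing import Dict, Sequence


def intergenic_distance(upstream_end: int, downstream_start: int) -> int:
    """Bases between two adjacent genes; negative when the genes overlap."""
    return downstream_start - upstream_end - 1


def distance_distribution(
    distances: Sequence[int], bin_width: int = 10
) -> Dict[int, float]:
    """Fraction of pairs per bin, keyed by the lower edge of each bin."""
    if not distances:
        raise ValueError("no distances given")
    # Floor division keeps negative distances in the correct bin.
    counts = Counter((d // bin_width) * bin_width for d in distances)
    return {edge: counts[edge] / len(distances) for edge in sorted(counts)}


def fraction_in_window(
    distances: Sequence[int], low: int = -20, high: int = 30
) -> float:
    """Fraction of pairs whose distance lies within [low, high] bp."""
    if not distances:
        raise ValueError("no distances given")
    return sum(low <= d <= high for d in distances) / len(distances)
```

Comparing `fraction_in_window` between conserved pairs of a group of genomes and the E. coli WO pairs gives a rough numerical counterpart to the visual comparison of peaks in Fig. 3.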
The value of the databases derived from the careful examination of
experimental bench work is enormous. Based on data compiled in RegulonDB,
we show analyses useful for the quantitative evaluation of functional
inferences derived from genomic context. For instance, in the case of
phylogenetic profiles, we can see that functionally related genes (WO
pairs) can have many differences in their binary profiles, but that they
can be confidently identified even allowing up to five or a few more
differences (comparing 45 genomes). Conservation of adjacency is more
limited since it is scarce, but still, we can say that if there is
conservation of adjacency between pairs of genes in evolutionarily distant
organisms, like between Firmicutes and Archaea, such genes are most
probably functionally related and in the same operon in at least one of the
organisms compared. At the same time, we have shown that the inter-genic
distance distribution of WO pairs might be very similar in any prokaryote,
implying a universal structural consequence of the association of genes
into operons, and providing the basis to suspect that genes within operons
can be distinguished from TU boundaries by their inter-genic distances
alone in the same way as demonstrated for E. coli [9].
This work was supported by grant number 0028 from Consejo Nacional de Ciencia y Tecnología to J C-V. We acknowledge technical support from Víctor del Moral and César Bonavides.