In Silico Biology 2, 0008 (2002); ©2002, Bioinformation Systems e.V.  
Dagstuhl Seminar "Functional Genomics"

Operon conservation from the point of view of Escherichia coli, and inference of functional inter-dependence of gene products from genome context

Gabriel Moreno-Hagelsieb* and Julio Collado-Vides




*Corresponding author
Laboratory of Computational Genomics,
CIFN, UNAM, A.P. 565-A,
Cuernavaca, Morelos 62100, Mexico
Email: moreno@cifn.unam.mx





Edited by E. Wingender; received October 26, 2001; revised and accepted December 18, 2001; published January 28, 2002


Abstract

We have previously demonstrated that genes within experimentally characterized operons of Escherichia coli are conserved together in other genomes more frequently than genes at the borders of transcription units. Here we expand the analyses and show that, as the phylogenetic distance between the genomes compared increases, the genes remaining together must be associated into operons in other prokaryotes, regardless of the operon organization of the corresponding orthologous gene pair in E. coli. At the same time, we show that the tendency of genes within operons to keep very short inter-genic distances, previously observed in E. coli, is the same in any other prokaryote whose genome is currently available. We also show the relationship between our analyses of conservation and the inference of functional relationships from genomic context.


Keywords: genome context, functional relationship, operon conservation, intergenic distance



Introduction

Genome rearrangements have been shown to occur so fast that gene order is not preserved [1], and that it is broken much faster than amino-acid identities [2]. Thus, it was noticed very early that conservation of neighborhood has biological meaning. For instance, conservation of gene order was shown to occur among genes whose protein products either interact [3] or have functional relationships [4] (e.g. the proteins are part of a hetero-multimeric enzyme, or are enzymes involved in consecutive steps in a metabolic pathway). Such works were based on the well-known tendency of operons, sets of adjacent genes that are transcribed into a single messenger RNA, to be formed from the association of genes with related functions. Recently, it was indeed demonstrated that conservation of gene order is strongly correlated with the association of genes into operons [5, 6].

Here, we expand our previous analyses contrasting the conservation of genes found within experimentally characterized operons from Escherichia coli, against that of adjacent genes found in the same DNA strand at the borders of transcription units (TUs). We show that conserved pairs in other prokaryotes contain mixed populations of genes within operons and of genes at TU boundaries, but that the populations purify to contain mostly genes within operons as the phylogenetic distances of the genomes compared become greater. At the same time, we show that the previously observed tendency of genes within operons to be kept at very short distances in E. coli is the same in any prokaryote. We also show that the quantitative evaluation of different parameters, proposed to infer functional relationships from genomic context [7], can be performed using databases derived from the knowledge dispersed in the literature. We used data from RegulonDB, a database of TU organization and regulation in Escherichia coli [8].


Data preparation

Data sets of pairs of adjacent genes found in the same operons (WO pairs), and of pairs of adjacent genes at transcription unit boundaries (TUB pairs, that is, the last gene in a TU and the first gene in the next), were prepared as previously described [9]. We derived 637 WO pairs from the 296 operons reported in RegulonDB, and 428 TUB pairs from the comparison of the 495 TUs in RegulonDB with the 1298 directons (stretches of adjacent genes in the same DNA strand with no intervening gene in the opposite strand) of E. coli [9].
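The directon definition given above can be illustrated with a minimal Python sketch. It is not the code used in the study: the tuple layout `(name, start, end, strand)`, the function names, and the toy coordinates are assumptions made for illustration, and the sketch ignores the circularity of the chromosome.

```python
from itertools import groupby

def directons(genes):
    """Group genes into directons.

    `genes` is a list of (name, start, end, strand) tuples for one replicon
    (hypothetical layout). A directon is a maximal run of neighboring genes
    on the same strand, with no intervening gene on the opposite strand.
    """
    ordered = sorted(genes, key=lambda g: g[1])  # order by start coordinate
    return [[g[0] for g in run] for _, run in groupby(ordered, key=lambda g: g[3])]

def same_strand_adjacent_pairs(genes):
    """Return adjacent gene pairs that lie within the same directon."""
    pairs = []
    for run in directons(genes):
        pairs.extend(zip(run, run[1:]))
    return pairs

# Hypothetical coordinates, for illustration only.
toy_genes = [
    ("a", 100, 400, "+"),
    ("b", 420, 900, "+"),
    ("c", 950, 1300, "-"),
    ("d", 1400, 1800, "+"),
]
toy_directons = directons(toy_genes)
toy_pairs = same_strand_adjacent_pairs(toy_genes)
```

Only pairs produced by `same_strand_adjacent_pairs` are candidates for WO or TUB pairs; deciding which is which still requires the experimentally known TUs.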

To find orthologs of E. coli genes in other organisms, we ran gapped BLASTP [10] comparisons of all the protein sequences corresponding to all the open reading frames (ORFs) of E. coli, against every protein sequence corresponding to the ORFs of all other genomes obtained from GenBank [11] (see Table 1). We used an expectation value cutoff of 0.001. We kept only those results where the alignment covered at least 50% of one of the sequences. Our putative orthologs were those genes whose protein products were overall bi-directional best hits to the E. coli query proteins. Fusions were those other genes that had the same best hit as a bi-directional best hit, but covered a different part of the sequence in the alignment. We used the annotations of each genome to find the locations of all genes and thus be able to find genes conserved adjacent to each other in the same strand, and their inter-genic distances.
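The ortholog criterion described above (expectation value cutoff, coverage filter, bi-directional best hits) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: each hit is a hypothetical tuple `(query, subject, evalue, score, alignment_length, query_length, subject_length)`, coverage is approximated by alignment length over sequence length, and the best hit is taken as the one with the highest score. Fusion detection is not covered.

```python
def best_hits(hits, max_evalue=1e-3, min_coverage=0.5):
    """Map each query to its best-scoring subject among the hits that pass the filters."""
    best = {}
    for query, subject, evalue, score, aln_len, qlen, slen in hits:
        if evalue > max_evalue:
            continue  # fails the expectation value cutoff
        if max(aln_len / qlen, aln_len / slen) < min_coverage:
            continue  # alignment covers less than 50% of both sequences
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(forward_hits, reverse_hits):
    """Keep query/subject pairs that are each other's best hit."""
    fwd = best_hits(forward_hits)
    rev = best_hits(reverse_hits)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}

# Hypothetical hits, for illustration only (gene names other than b0002/b0003 are made up).
forward = [
    ("b0002", "BSU1", 1e-50, 300.0, 400, 820, 800),
    ("b0002", "BSU2", 1e-10, 80.0, 300, 820, 500),
    ("b0003", "BSU2", 1e-5, 60.0, 50, 310, 500),   # rejected: coverage too low
]
reverse = [
    ("BSU1", "b0002", 1e-50, 300.0, 400, 800, 820),
    ("BSU2", "b0002", 1e-10, 80.0, 300, 500, 820),
]
putative_orthologs = bidirectional_best_hits(forward, reverse)
```

In this toy case BSU2 also has b0002 as its best hit, but b0002 prefers BSU1, so only the b0002/BSU1 pair is retained.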

As of the writing of this report, there are 57 prokaryotic genomes in GenBank. Since this number increases rapidly, we automated a filter to avoid redundant genomes, leaving for instance a single E. coli genome (K12) as representative of all E. coli strains. The method consists of calculating the average BLAST score of the bi-directional best hits of a given genome versus itself (e.g. E. coli K12 versus E. coli K12), the self-average score, and the average BLAST score of all the bi-directional best hits of that genome versus every other genome (e.g. E. coli K12 versus Bacillus subtilis), the comparison-average score. We eliminated any genome for which the ratio of the comparison-average score to the self-average score exceeded 0.80, making the comparisons and eliminations in alphabetical order. This method worked well, always keeping a single representative of each species in the database, and left a total of 45 representative genomes (Table 1). We reduced the data sets of WO and TUB pairs to include only those pairs whose genes both had orthologs in at least one genome in the non-redundant genome collection (532 WO pairs and 307 TUB pairs).
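A minimal Python sketch of such a redundancy filter is given below. It rests on one reading of the procedure, which is an assumption: genomes are visited in alphabetical order, the first genome of a redundant group is kept, and a later genome is dropped when its comparison-average score against an already kept genome, divided by that genome's self-average score, exceeds 0.80. The nested dictionary `avg_score` and all numbers in it are hypothetical.

```python
def filter_redundant(avg_score, threshold=0.80):
    """Return the genomes kept after removing redundant ones.

    avg_score[a][b] is the average BLAST score of the bi-directional best hits
    of genome a versus genome b; avg_score[a][a] is the self-average score.
    """
    kept = []
    for genome in sorted(avg_score):  # alphabetical order
        redundant = any(
            avg_score[rep][genome] / avg_score[rep][rep] > threshold
            for rep in kept
        )
        if not redundant:
            kept.append(genome)
    return kept

# Hypothetical average scores, for illustration only.
toy_scores = {
    "Bacillus subtilis": {
        "Bacillus subtilis": 650.0,
        "Escherichia coli K12": 210.0,
        "Escherichia coli O157:H7": 209.0,
    },
    "Escherichia coli K12": {
        "Bacillus subtilis": 210.0,
        "Escherichia coli K12": 700.0,
        "Escherichia coli O157:H7": 690.0,
    },
    "Escherichia coli O157:H7": {
        "Bacillus subtilis": 209.0,
        "Escherichia coli K12": 690.0,
        "Escherichia coli O157:H7": 705.0,
    },
}
representatives = filter_redundant(toy_scores)
```

With these toy numbers the O157:H7 genome is dropped because 690/700 exceeds 0.80 relative to K12, while B. subtilis and E. coli K12 are both retained.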

Table 1: Genomes available at GenBank as of the writing of this study.

Species                                     GenBank accession

Crenarchaeota
  Aeropyrum pernix                          NC_000854
  Sulfolobus solfataricus                   AE006641
  Sulfolobus tokodaii                       BA000023
Euryarchaeota
  Archaeoglobus fulgidus                    NC_000917
  Halobacterium sp. NRC-1                   AE004437
  Methanococcus jannaschii                  L77117
  Methanobacterium thermoautotrophicum      AE000666
  Pyrococcus abyssi                         NC_000868
  Pyrococcus horikoshii*                    NC_000961
  Thermoplasma acidophilum                  AL139299
  Thermoplasma volcanium                    NC_002689
Aquificales
  Aquifex aeolicus                          AE000657
Chlamydiales
  Chlamydia muridarum                       AE002160
  Chlamydophila pneumoniae AR39             AE002161
  Chlamydophila pneumoniae CWL029*          AE001363
  Chlamydophila pneumoniae J138*            BA000008
Cyanobacteria
  Synechocystis PCC6803                     AB001339
Firmicutes
  Bacillus halodurans                       BA000004
  Bacillus subtilis                         AL009126
  Clostridium acetobutylicum                AE001437
  Lactococcus lactis subsp. lactis          AE005176
  Mycobacterium leprae                      AL450380
  Mycobacterium tuberculosis CDC1551*       AE000516
  Mycobacterium tuberculosis*               AL123456
  Mycoplasma genitalium                     NC_000908
  Mycoplasma pneumoniae                     NC_000912
  Mycoplasma pulmonis                       AL445566
  Staphylococcus aureus subsp. aureus Mu50  BA000017
  Staphylococcus aureus N315*               NC_002745
  Streptococcus pneumoniae R6               AE007317
  Streptococcus pneumoniae TIGR4*           AE005672
  Streptococcus pyogenes                    AE004092
  Ureaplasma urealyticum                    NC_002162
Proteobacteria
  Agrobacterium tumefaciens (Chromosome 1)  AE007869
  Agrobacterium tumefaciens (Chromosome 2)  AE007870
  Buchnera sp. APS                          AP000398
  Caulobacter crescentus                    AE005673
  Campylobacter jejuni                      AL111168
  Escherichia coli K12                      U00096
  Escherichia coli O157:H7*                 BA000007
  Escherichia coli O157:H7*                 AE005174
  Haemophilus influenzae Rd                 L42023
  Helicobacter pylori 26695                 AE000511
  Helicobacter pylori J99*                  AE001439
  Mesorhizobium loti                        NC_002678
  Neisseria meningitidis                    AE002098
  Neisseria meningitidis Z2491*             AL157959
  Pseudomonas aeruginosa                    AE004091
  Pasteurella multocida                     AE004439
  Rickettsia conorii                        AE006914
  Rickettsia prowazekii*                    AJ235269
  Sinorhizobium meliloti                    AL591688
  Vibrio cholerae (Chromosome 1)            AE003852
  Vibrio cholerae (Chromosome 2)            AE003853
  Xylella fastidiosa                        AE003849
Spirochaetales
  Borrelia burgdorferi                      AE000783
  Treponema pallidum                        AE000520
Thermotogales
  Thermotoga maritima                       AE000512
Thermus/Deinococcus
  Deinococcus radiodurans (Chromosome 1)    AE000513
  Deinococcus radiodurans (Chromosome 2)    AE001825

* Species eliminated to avoid over-representations.


Differences in phylogenetic profiles

If two proteins have inter-dependent functions, the presence of the gene coding for one of them would be meaningless unless the gene coding for the other protein were present too. Thus, it has been suggested that genes whose protein products have related functions should be evidenced by their concerted appearance and disappearance in different genomes, in what is called a similar "phyletic pattern" [12, 13] or "phylogenetic profile" [14]. We thus built binary phylogenetic profiles of all genes within the genome of E. coli in the following manner: if we find an ortholog for a given gene in another genome, we annotate a "1"; otherwise we annotate a "0". For instance, if the gene "b0002" has detectable orthologs in the first three genomes examined (say in alphabetical order), none in the fourth and fifth, then orthologs again in the sixth and seventh, the corresponding phylogenetic profile would be "1110011". We then counted the number of differences between the phylogenetic profiles of the two genes in each WO pair and in each TUB pair. For instance, if the phylogenetic profile of gene "b0003" were "1110111", the number of differences of the WO pair "b0002/b0003" would be one. In Fig. 1 we show the distribution of pairs as a function of the number of differences in phylogenetic profiles. We included only those pairs containing genes having orthologs in at least ten other genomes (229 WO pairs and 100 TUB pairs). As shown, WO pairs have more similar phylogenetic profiles than TUB pairs. This is not surprising, but it is important to note that the relationship is not perfect: phyletic patterns of WO pairs can differ for several reasons, such as limitations in the sensitivity of alignment methods that prevent a complete detection of homologs (and thus of orthologs), the occurrence of non-orthologous gene displacement [15], the association into operons of some genes having inter-related yet not completely dependent functions, and other factors.
Phyletic patterns, however, are sufficiently similar to allow a functional relationship to be distinguished for genes having up to five or a few more differences in their binary profiles, as long as both genes are present in at least ten out of more than 40 non-redundant genomes.
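The profile construction and the count of differences described above can be sketched in a few lines of Python. The function names and the `orthologs` mapping (genome name to the set of E. coli genes with an ortholog in that genome) are assumptions made for illustration; the two profiles at the end are the ones used as examples in the text.

```python
def phylogenetic_profile(gene, genomes, orthologs):
    """Binary profile: '1' where `gene` has an ortholog in the genome, '0' otherwise."""
    return "".join("1" if gene in orthologs[g] else "0" for g in genomes)

def profile_differences(profile_a, profile_b):
    """Number of positions at which two binary profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))

def present_in_at_least(profile, minimum=10):
    """True if the gene has orthologs in at least `minimum` genomes."""
    return profile.count("1") >= minimum

# Hypothetical ortholog sets for three made-up genomes.
toy_genomes = ["g1", "g2", "g3"]
toy_orthologs = {"g1": {"b0002"}, "g2": set(), "g3": {"b0002", "b0003"}}
toy_profile = phylogenetic_profile("b0002", toy_genomes, toy_orthologs)

# Profiles from the worked example in the text.
example_differences = profile_differences("1110011", "1110111")
```

The count is a Hamming distance between the two binary strings; for the example in the text it is one, because only the fifth genome differs.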

Figure 1: Differences between phylogenetic profiles of pairs of genes within operons (WO pairs), and of pairs of genes at the borders of transcription units (TUB pairs). Notice that there is a tendency of WO pairs to have more similar (lower number of differences) phylogenetic profiles than TUB pairs. TUB pairs seem to have a normal distribution of differences, while WO pairs start with very few differences, and the curve decays rapidly as the number of differences surpasses 15.


Pairs of genes conserved together

We have shown before that WO pairs are always conserved together in other genomes in higher numbers than TUB pairs when conservation is measured as the ratio of the number of conserved pairs to the number of co-occurring pairs in each genome [6]. Here we give a different perspective. In Fig. 2 we display the number of pairs found together as a function of the number of genomes where they are found. As seen, the number of conserved TUB pairs falls abruptly, leaving very few conserved in more than 2 genomes. The few exceptionally conserved pairs that are TUB pairs in E. coli seem to be associated in operons in other genomes, as will be shown in the inter-genic distance analyses of the next section. The curve corresponding to WO pairs shows a much higher conservation that extends to many more genomes. Conservation of neighborhood [3, 4] and appearance of fusions [16, 17, 18] are also useful hints to detect functional relationships among genes. Fusions occur only among conserved WO pairs (not shown).
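Counting the genomes in which a pair is conserved can be sketched as follows. This is an illustrative Python sketch under assumptions, not the code used in the study: `orthologs[genome]` maps E. coli genes to their orthologs in that genome, `neighbors[genome]` is a set of same-strand adjacent gene pairs in that genome, and the curve reports pairs conserved in exactly n genomes (a cumulative count would be an equally plausible reading).

```python
from collections import Counter

def genomes_with_conserved_adjacency(pair, orthologs, neighbors):
    """Count the genomes where the orthologs of both genes are same-strand neighbors."""
    gene_a, gene_b = pair
    count = 0
    for genome, mapping in orthologs.items():
        if gene_a in mapping and gene_b in mapping:
            if frozenset((mapping[gene_a], mapping[gene_b])) in neighbors[genome]:
                count += 1
    return count

def conservation_curve(pairs, orthologs, neighbors):
    """Map n (number of genomes) to the number of pairs conserved in exactly n genomes."""
    return Counter(
        genomes_with_conserved_adjacency(p, orthologs, neighbors) for p in pairs
    )

# Hypothetical data, for illustration only.
toy_orthologs = {
    "G1": {"a": "x1", "b": "x2", "c": "x3"},
    "G2": {"a": "y1", "b": "y2"},
}
toy_neighbors = {
    "G1": {frozenset(("x1", "x2"))},
    "G2": {frozenset(("y1", "y2"))},
}
toy_curve = conservation_curve([("a", "b"), ("b", "c")], toy_orthologs, toy_neighbors)
```

In the toy data the pair a/b is adjacent in both genomes, while b/c is adjacent in none, either because the orthologs are not neighbors (G1) or because one ortholog is missing (G2).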


Figure 2: Conservation of adjacency. Notice here that WO pairs are conserved adjacent in higher numbers than TUB pairs. Very few TUB pairs are conserved in more than 5 genomes. These TUB pairs might be in operons in other organisms, since their inter-genic distances are those typical of E. coli WO pairs (see Fig. 3).


Inter-genic distance distributions of pairs conserved together

E. coli WO pairs display an inter-genic distance distribution (IDD) with a characteristic peak between -20 bp and 30 bp [9]. If WO pairs in any other organism have the very same tendency to keep such short inter-genic distances as E. coli WO pairs, we would expect them to display similar distributions, and they could be distinguished in the same way. In our previous analysis we found that conserved WO pairs have an IDD similar to that of WO pairs of E. coli, while the distribution of conserved TUB pairs showed that these are formed of mixed populations of TUB pairs and WO pairs. However, we did not have enough data to plot the IDD of different groups of genomes. The number of available genomes is now much higher, and thus we can explore more precisely the tendencies of conserved pairs, in terms of distance distributions and presumably of the association of genes into operons, in different groups of genomes. We plotted the IDD of genes conserved adjacent to each other among genomes, against those of WO pairs and TUB pairs of E. coli (Fig. 3). In Figs. 3a and 3b, we compare the IDD of conserved WO pairs and conserved TUB pairs in Proteobacteria. Figs. 3c and 3d correspond to conserved pairs in Firmicutes, and Figs. 3e and 3f to conserved pairs in Archaea. Notice that the peak corresponding to conserved WO pairs is always very similar to that of WO pairs of E. coli, though the peak is lower in Proteobacteria and Firmicutes, and slightly higher in Archaea. If we assume that the tendencies are exactly the same in any prokaryotic operon, then the explanation of this result would be as follows. Despite being conserved together more frequently than unrelated genes, operons are unstable [19]; thus, a few conserved pairs, orthologous to WO pairs of E. coli, might not be in operons in other organisms (Proteobacteria and Firmicutes), and their conservation would be a product of chance.
At longer phylogenetic distances (Archaea), conservation is most probably of biological significance, and thus the peak reveals a complete association of conserved pairs into operons. A similar reasoning applies to conserved TUB pairs. At short and intermediate phylogenetic distances, conserved TUB pairs include some pairs associated into operons, while in Archaea all of the conserved pairs display inter-genic distances typical of WO pairs of E. coli. This means that pairs conserved together in evolutionarily distant species, although they are TUB pairs in E. coli, might be in operons in other prokaryotes. We have found further support for these assumptions in the analysis of a collection of one hundred operons of Bacillus subtilis compiled from the literature [19]. We used this collection to build a data set of WO pairs (310 pairs). We also used it to find TU boundaries by comparison with the genome sequence and annotation for this organism [20], and built a data set of TUB pairs (123 pairs). Fifty-nine out of 62 (about 95%) pairs of genes found in operons in E. coli are also in operons in B. subtilis (the remaining three are boundary pairs in B. subtilis). Among those found at TU boundaries in E. coli, 3 out of 4 are in operons in B. subtilis, and are conserved as neighbors in at least 15 other genomes. These results also explain the exceptionally conserved TUB pairs in many prokaryotic genomes, since such pairs might actually be in operons in other organisms (see Fig. 2).
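An inter-genic distance distribution of the kind discussed above can be computed with a short Python sketch. Several choices here are assumptions and are not taken from the study: genes are given as 1-based inclusive `(start, end)` coordinates, the distance is the number of bases between the two genes (negative when they overlap), and the bin width of 10 bp is arbitrary. Only the -20 bp to 30 bp peak interval comes from the text.

```python
from collections import Counter

def intergenic_distance(upstream, downstream):
    """Bases between two genes given as (start, end); negative when they overlap."""
    return downstream[0] - upstream[1] - 1

def distance_distribution(distances, bin_width=10):
    """Relative frequency of distances per bin, keyed by the lower edge of each bin."""
    counts = Counter((d // bin_width) * bin_width for d in distances)
    total = len(distances)
    return {edge: counts[edge] / total for edge in sorted(counts)}

def fraction_in_peak(distances, low=-20, high=30):
    """Fraction of pairs whose distance falls in the characteristic WO peak."""
    return sum(low <= d <= high for d in distances) / len(distances)

# Hypothetical gene coordinates, for illustration only.
toy_pairs = [
    ((100, 400), (397, 900)),   # overlapping genes
    ((100, 400), (412, 900)),   # short spacer
    ((100, 400), (650, 900)),   # long spacer, as expected at a TU boundary
]
toy_distances = [intergenic_distance(up, down) for up, down in toy_pairs]
toy_idd = distance_distribution(toy_distances)
toy_peak_fraction = fraction_in_peak(toy_distances)
```

Comparing `fraction_in_peak` between a set of conserved pairs and the E. coli WO and TUB reference sets gives a rough numerical counterpart to the visual comparison of peak heights.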

Figure 3: Inter-genic distance distributions of pairs conserved together across genomes. Pairs conserved among Proteobacteria and Firmicutes show distributions that denote mixed populations, with a higher proportion of pairs of the corresponding E. coli data set, but at the highest phylogenetic distance from E. coli (Archaea), any pair conserved together has inter-genic distances typical of WO pairs.


Concluding remarks

The value of databases derived from the careful examination of experimental bench work is enormous. Based on data compiled in RegulonDB, we show analyses useful for the quantitative evaluation of functional inferences derived from genomic context. For instance, in the case of phylogenetic profiles, we can see that functionally related genes (WO pairs) can have many differences in their binary profiles, but that they can be confidently identified even allowing up to five or a few more differences (comparing 45 genomes). Conservation of adjacency is of more limited use because it is scarce; still, if adjacency is conserved between pairs of genes in evolutionarily distant organisms, such as Firmicutes and Archaea, such genes are most probably functionally related and in the same operon in at least one of the organisms compared. At the same time, we have shown that the inter-genic distance distribution of WO pairs might be very similar in any prokaryote, implying a universal structural consequence of the association of genes into operons, and providing the basis to suspect that genes within operons can be distinguished from TU boundaries by their inter-genic distances alone, in the same way as demonstrated for E. coli [9].


Acknowledgments

This work was supported by grant number 0028 from Consejo Nacional de Ciencia y Tecnología to J C-V. We acknowledge technical support from Víctor del Moral and César Bonavides.



References

  1. Mushegian, A. R. and Koonin, E. V. (1996). Gene order is not conserved in bacterial evolution. Trends Genet. 12, 289-290.
  2. Huynen, M. A. and Bork, P. (1998). Measuring genome evolution. Proc. Natl. Acad. Sci. USA 95, 5849-5856.
  3. Dandekar, T., Snel, B., Huynen, M. and Bork, P. (1998). Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324-328.
  4. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. and Maltsev, N. (1999). The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA 96, 2896-2901.
  5. Ermolaeva, M. D., White, O. and Salzberg, S. L. (2001). Prediction of operons in microbial genomes. Nucleic Acids Res. 29, 1216-1221.
  6. Moreno-Hagelsieb, G., Trevino, V., Perez-Rueda, E., Smith, T. F. and Collado-Vides, J. (2001). Transcription unit conservation in the three domains of life: a perspective from Escherichia coli. Trends Genet. 17, 175-177.
  7. Huynen, M., Snel, B., Lathe, W. 3rd and Bork, P. (2000). Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 10, 1204-1210.
  8. Salgado, H., Santos-Zavaleta, A., Gama-Castro, S., Millan-Zarate, D., Diaz-Peredo, E., Sanchez-Solano, F., Perez-Rueda, E., Bonavides-Martinez, C. and Collado-Vides, J. (2001). RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res. 29, 72-74.
  9. Salgado, H., Moreno-Hagelsieb, G., Smith, T. F. and Collado-Vides, J. (2000). Operons in Escherichia coli: genomic analyses and predictions. Proc. Natl. Acad. Sci. USA 97, 6652-6657.
  10. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
  11. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. and Wheeler, D. L. (2000). GenBank. Nucleic Acids Res. 28, 15-18.
  12. Gaasterland, T. and Ragan, M. A. (1998). Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics 3, 199-217.
  13. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997). A genomic perspective on protein families. Science 278, 631-637.
  14. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. and Yeates, T. O. (1999). Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96, 4285-4288.
  15. Koonin, E. V., Mushegian, A. R. and Bork, P. (1996). Non-orthologous gene displacement. Trends Genet. 12, 334-336.
  16. Das, S., Yu, L., Gaitatzes, C., Rogers, R., Freeman, J., Bienkowska, J., Adams, R. M., Smith, T. F. and Lindelien, J. (1997). Biology's new Rosetta stone. Nature 385, 29-30.
  17. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. and Ouzounis, C. A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86-90.
  18. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. and Eisenberg, D. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751-753.
  19. Itoh, T., Takemoto, K., Mori, H. and Gojobori, T. (1999). Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes. Mol. Biol. Evol. 16, 332-346.
  20. Kunst, F., Ogasawara, N., Moszer, I., Albertini, A. M., Alloni, G., Azevedo, V., Bertero, M. G., Bessières, P., Bolotin, A., Borchert, S., Borriss, R., Boursier, L., Brans, A., Braun, M., Brignell, S. C., Bron, S., Brouillet, S., Bruschi, C. V., Caldwell, B., Capuano, V., Carter, N. M., Choi, S. K., Codani, J. J., Connerton, I. F., Danchin, A. et al. (1997). The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390, 249-256.