Functional annotation of proteins encoded in newly sequenced genomes can be expected to meet two conflicting objectives: (i) provide as much information as possible, and (ii) avoid erroneous functional assignments and over-predictions. The continuing exponential growth of the number of sequenced genomes makes the quality of sequence annotation a critical factor in the efforts to utilize this new information. When dubious functional assignments are used as a basis for subsequent predictions, they tend to proliferate, leading to "database explosion". It is therefore important to identify the common factors that hamper functional annotation. As a first step towards that goal, we have compared the annotations of the Mycoplasma genitalium and Methanococcus jannaschii genomes produced in several independent studies. The most common causes of questionable predictions appear to be: (i) non-critical use of annotations from existing database entries; (ii) taking into account only the annotation of the best database hit; (iii) insufficient masking of low complexity regions (e.g. non-globular domains) in protein sequences, resulting in spurious database hits obscuring relevant ones; (iv) ignoring multi-domain organization of the query proteins and/or the database hits; (v) non-critical functional inferences on the basis of the functions of neighboring genes in an operon; (vi) non-orthologous gene displacement, i.e. involvement of structurally unrelated proteins in the same function. These observations suggest that case-by-case validation of functional annotation by expert biologists remains crucial for productive genome analysis.
Key words: Genome comparison; prediction of gene functions; systematic error; database annotation; automatic genome annotation; manual genome annotation; low complexity; non-globular domains; multi-domain proteins; non-orthologous gene displacement
The growing list of completely sequenced genomes presents a unique possibility to extract a wealth of information on biochemistry, genetics, and evolutionary history of these organisms. The success of such analysis will largely depend upon our ability to predict protein functions encoded in each of these genomes.
The history of functional annotations for the first completed genomes shows that functions of 40-60% of the proteins encoded in each genome could be predicted based solely on high-level sequence similarity with known proteins [Fleischmann et al., 1995; Fraser et al., 1995; Bult et al., 1996]. On the other hand, detailed computer analysis using advanced methods for sequence comparison (gapped BLAST, PSI-BLAST), amino acid motif detection (MoST, HMMer), prediction of secondary structure elements, and delineation of families of paralogs (e.g., the ones listed in Table 1) increases the fraction of proteins with predicted functions to 75-85% for the bacterial genomes and ~70% for the first published archaeal genome, that of M. jannaschii [Tatusov et al., 1996; Koonin et al., 1997]. This analysis is a prerequisite for more detailed characterization of each particular organism based on the genome sequence, e.g. prediction of the existing and missing metabolic pathways and the transcription regulatory network, but it is time- and labor-consuming.
Table 1: Selected protein sequence analysis utilities
Accordingly, and in anticipation of the expected exponential growth of the number of sequenced genomes, considerable effort has been devoted to the automation of genome analysis at different levels. These projects range in their scope from a set of simple programs for data handling intended to facilitate genome annotation by a biologist (e.g., SEALS [Walker and Koonin, 1997]) to a complete automated system that attempts annotation without any human intervention (GeneQuiz [Scharf et al., 1994]). While automation of the tedious process of sequence annotation is certainly an attractive possibility, the quality of automatically generated predictions remains suspect, given the many pitfalls that complicate even manual annotation by expert biologists [Bork and Bairoch, 1996; Bork and Gibson, 1996]. Thus, while the accuracy of functional assignments by GeneQuiz has been estimated at 95% or better [Ouzounis et al., 1996], few of the new functional predictions for M. genitalium proteins, even those specifically discussed by the authors, could be fully corroborated by manual analysis (see Koonin et al. [1997] for discussion). Several of these questionable predictions, listed in Table 2, suggest that the error rate of GeneQuiz is indeed far above the claimed level of 5% or less [Ouzounis et al., 1996].
Table 2: A comparison of the automatic and manual annotations of the M. genitalium genome

| Protein name | Best hit to a characterized protein, E-value | GeneQuiz annotation [Ouzounis et al., 1996] | Manual annotation [Koonin et al., 1997] | | Possible reasons, comments |
|---|---|---|---|---|---|
| MG123 | None | Arginine deiminase | Periplasmic protein | | Non-globular domains were not masked (see Table 6) |
| MG139 | None | Amps (fragment) | Zn-dependent hydrolase | | Not clear |
| MG140 | SMB2_HUMAN, 4.6e-06 | DNA-binding protein Smbp2 | Superfamily I helicase | | Only the best hit was considered; motif analysis was not performed |
| MG225 | LYSP_ECOLI, 0.019 | Histidine permease | Amino acid permease | | Only the best hit was considered (see text) |
| MG237 | None | Ile-tRNA synthetase domain | Unknown | | Not clear |
| MG294 | NARK_BACSU, 0.0016 | NarK, nitrate extrusion protein | Permease | | Only the best hit was considered |
| MG377 | None | Zn protease | Unknown | | Zn protease motif is present, but similarity to other peptidases is low |
| MG449 | SYFB_ECOLI, 4.8e-13 | Phe-tRNA synthetase N-terminal | Putative RNA-binding protein | | Multi-domain organization of Phe-tRNA synthetase ignored (see Fig. 3) |
| MG464 | SP3J_BACSU, 0.014 | Stage III sporulation protein J | Highly conserved membrane protein | | Only the best hit was considered; similar proteins in many bacteria |
| MG468 | DPO1_BACCA, 5.6e-39 | DNA polymerase I | 5'-3' exonuclease | | Multi-domain organization of DNA polymerase I was ignored (see Fig. 2) |

a - W, wrong prediction; U, underprediction; O, overprediction.
Likewise, several corrections to the original TIGR annotation [Fraser et al., 1995], offered by Ouzounis et al. [1996], turned out to be unjustified (Table 3). A similar contrast was found between the GeneQuiz predictions for the M. jannaschii genome [Andrade et al., 1997] and those obtained by mostly manual annotation (see MJ table [Koonin et al., 1997]; see MJ reconstruction [Selkov et al., 1997]). Several of these discrepancies are listed in Table 4.
Table 3: Evaluation of the "corrections" provided to the original TIGR annotations by GeneQuiz analysis

| Protein name | Best hit to a characterized protein, E-value | TIGR annotation [Fraser et al., 1995] | | GeneQuiz correction [Ouzounis et al., 1996] | | Latest annotation [Koonin et al., 1997] |
|---|---|---|---|---|---|---|
| MG061 | UHPT_SALTY, 0.23 | Hexose phosphate transport protein UhpT | | False positive | | Hexose phosphate transport protein |
| MG090 | RS6_HAEIN, 0.00024 | Ribosomal protein S6 | | False positive | | Ribosomal protein S6 |
| MG120 | RBSC_HAEIN, 0.27 | Ribose permease rbsC | | False positive | | Permease |
| MG406 | gi1209759, 1.6e-45 | Transport permease P69 | | False positive | | H+-ATPase I chain |
| MG006 | KTHY_BACSU, 1.1e-28 | Thymidylate kinase | | Putative kinase | | Thymidylate kinase |
| MG041 | PTHP_STAAU, 7.4e-9 | Phosphohistidinoprotein PtsH | | ptsH gene, HPr | | Phosphocarrier protein HPr |
| MG099 | HYIN_AGRRA, 1.2e-20 | Hydrolase Aux2 | | Indoleacetamide hydrolase | | Amidase |
| MG137 | GLF1_KLEPN, 9.5e-52 | dTDP-4-dehydro-rhamnose reductase RfbD | | Amine oxidase | | Dehydrogenase; LPS biosynthesis (see text) |
| MG278 | SPOT_ECOLI, 1.8e-56 | Stringent response-like protein | | Pyrophosphohydrolase | | ppGpp synthetase/pyrophosphatase |
| MG310 | PIP_BACCO, 1.7e-6 | Proline iminopeptidase | | Triacylglycerol lipase | | Hydrolase |
| MG409 | PHOU_ECOLI, 0.00021 | Peripheral membrane protein PhoU | | Pho negative regulator | | Phosphate transport regulator PhoU |

A, adequate prediction; O, overprediction; U, underprediction; W, wrong annotation; F, failure, correction of an adequate TIGR annotation.
Table 4: A comparison of the automatic and manual annotations of the Methanococcus jannaschii genome

| Protein name | Best hit to a characterized protein, E-value | GeneQuiz annotation [Andrade et al., 1997] | Manual annotation [Koonin et al., 1997] | | Possible reasons, comments |
|---|---|---|---|---|---|
| MJ0134 | MDMC_STRMY, 1e-18 | Protein beta-aspartate methyltransferase | SAM-dependent methyltransferase | | Only the best hit was considered |
| MJ0226 | HAM1_YEAST, 7.2e-21 | HAM1, controls hydroxylaminopurine mutagenesis | Unknown ACR | | Actual function is not known; similar proteins in many bacteria |
| MJ0252 | PYR5_DICDI, 1.0e-8 | UMP synthetase | Orotidine 5'-phosphate decarboxylase | | Only the best hit was considered; multi-domain structure of UMP synthase was ignored (see Fig. 1) |
| MJ0392 | IMDH_PYRFU, 2.1e-9 | IMP dehydrogenase homolog | Zn-dependent protease | | Multi-domain structure of the IMP-DH was ignored; it has only a 100 aa overlap with MJ0392; similar proteins in many bacteria (see text) |
| MJ0590 | SUCD_THEFL, 2.7e-9 | Succinyl-CoA ligase (GDP-forming), α chain | Succinyl-CoA ligase, α and β chains | | The differences in lengths of the query and the best hit were ignored; GDP- and ADP-forming enzymes are similar, MJ0590 can be either one |
| MJ0682 | DPOL_THELI, 0.0043 | DNA polymerase B replication factor C | Unknown ACR, intein-containing | | Intein region was not masked; only the best hit was considered |
| MJ0797 | FTSX_ECOLI, 0.0021 | Cell division protein FtsX homolog | Permease | | Membrane-spanning regions were not masked; only the best hit was considered |
| MJ1079 | None | Spore germination protein B2 | Integral membrane protein | | Not clear; similar proteins in many bacteria |
| MJ1129 | MRP_SYNY3, 4.4e-6 | MRP protein homolog | Unknown ACR | | The difference in length of the query and the best hit was ignored; MJ1129 is shorter than the MRP proteins, and does not contain the conserved ATP-binding site |
| MJ1207 | PAIA_BACSU, 3.3e-9 | Protease synthase and sporulation negative regulatory protein | Acetyltransferase | | Only the best hit was considered; motif search for HTH domain was not performed |
| MJ1310 | NULC_SYNY3, 0.012 | Na+/H+ antiporter system ORF3 | NADH-ubiquinone oxidoreductase chain 2 | | Membrane-spanning regions were not masked; actual function of the best hit is unknown |
| MJ1336 | CC31_YEAST, 0.018 | ADP-heptose synthase | Unknown ACR | | The difference in length of the query and the best hit was ignored; MJ1336 is shorter and lacks the conserved ATP-binding site; actual function of the best hit is unknown |
| MJ1375 | CAPF_STAAU, 2.0e-6 | Putative O-antigen transporter | Permease | | Only the best hit was considered; actual function of the best hit is not known |
| MJ1452 | HMT1_YEAST, 0.0037 | rRNA adenine N-6-methyltransferase | SAM-dependent methyltransferase | | Only the best hit was considered |
| MJ1533 | GSPE_ERWCA, 9.6e-5 | Mannose-sensitive hemagglutinin E | Glutamyl-tRNA transferase + KH domain + ATPase | | Multi-domain structure of MJ1533 was ignored |
| MJ1618 | CURC_STRCN, 3.7e-8 | Polyketide synthase CurC | Mannose-6-phosphate isomerase | | Only the best hit was considered; actual function of CurC is unknown; the name refers to the whole pathway, not just this enzyme (see text) |

W, wrong prediction; O, overprediction.
When dubious functional assignments are used as a basis for further predictions, they tend to proliferate, which has been referred to as "database explosion" [Bhatia et al., 1997]. It is therefore important to delineate possible factors that lead to questionable functional assignments. To this end, we compared the sets of functional annotations for the genomes of M. genitalium and M. jannaschii, produced by different groups [Fraser et al., 1995; Bult et al., 1996; Kyrpides et al., 1996; Ouzounis et al., 1996; Koonin et al., 1997; Selkov et al., 1997; Andrade et al., 1997], and examined the likely reasons for apparently erroneous predictions. We believe that the problems that affect prediction of gene functions most seriously are the same for automatic and manual analysis (as will be demonstrated by some of the examples discussed below) but of course manual analysis has more immediate flexibility to handle them. The approach we take is not benchmarking of different systems and methods for genome annotation but examination of some typical cases that highlight inherent difficulties in this process.
Incorrect annotation in protein databases
The simplest reason for unjustified function predictions is an incorrect annotation of the database entry that happens to be the closest homolog of the protein in question. Indeed, an open reading frame (ORF) is often assumed to code for an enzyme when it complements a known mutation, resulting in an increase in the enzyme activity. Such an effect, of course, can be due to suppression of the mutation, provision of a missing cofactor, and a plethora of other mechanisms. Thus, the Pseudomonas aeruginosa ORF (KHSE_PSEAE) that complemented the thrB mutation in Escherichia coli was assumed to code for homoserine kinase, even though it lacked any detectable sequence similarity with the same enzyme from other sources (e.g., KHSE_ECOLI), its sequence did not contain any known ATP-binding motif, and inactivation of the chromosomal copy of the gene did not confer threonine auxotrophy [Clepet et al., 1992]. Several similar cases, in which a protein has been included in curated databases like SwissProt and/or PIR while the enzyme activity assigned to it has never been supported by either direct experiments or significant sequence conservation, are listed in Table 5. Most likely, the functions of most of these proteins have been misidentified.
Table 5: Questionable enzyme identifications in SwissProt and PIR databases

| Enzyme (EC number) | SwissProt entry | PIR entry | | | Comments | Reference |
|---|---|---|---|---|---|---|
| Isopropylmalate dehydrogenase (EC 1.1.1.85) | LEU3_SCHOC | S55845 | BUD3_YEAST | LEU3_ECOLI | Changed to YLEU_SCHOC in SwissProt rel. 35, a warning added | Iserentant and Verachtert [1995] |
| Protoporphyrinogen oxidase (EC 1.3.3.4) | HEMG_ECOLI | JC2513 | FLAV_CLOAB, FLAV_DESGI | HEMG_BACSU | A flavodoxin component of PPO, not shown to have enzymatic activity | Sasarman et al. [1993]; Nishimura et al. [1995] |
| Dihydrofolate reductase (EC 1.5.1.3) | DYR_MYCTU | S21834 | - | DYRA_ECOLI, DYR2_ECOLI | | J. Dale, personal communication |
| Thymidylate synthase (EC 2.1.1.45) | TYSY_MYCTU | - | - | TYSY_ECOLI | Misannotated in SwissProt as a member of the thymidylate synthase family. Belongs to an uncharacterized protein family unrelated to thymidylate synthases. | J. Dale, personal communication |
| Uroporphyrin-III C-methyltransferase (EC 2.1.1.107) | HEMX_ECOLI | S02185 | - | CYSG_ECOLI | | Sasarman et al. [1988] |
| Lipopolysaccharide 1,2-N-acetylglucosaminetransferase (EC 2.4.1.56) | RFAK_ECOLI | C42981 | AF004712 | RFAK_SALTY | Misannotated based on the position of the gene in the rfa operon (see text) | Klena et al. [1992] |
| Queuine tRNA-ribosyltransferase (EC 2.4.2.29) | TGT_RABIT, TGT_HUMAN, TGT_CAEEL | S68430 | UBPF_YEAST, UBPD_MOUSE | TGT_ECOLI | Probable ubiquitin C-terminal hydrolases; similarity mentioned in SwissProt, not in PIR | Deshpande et al. [1996] |
| Homoserine kinase (EC 2.7.1.39) | KHSE_PSEAE | S27981 | - | KHSE_ECOLI | No known ATP-binding motifs (see text) | Clepet et al. [1992] |
| Acetylornithine deacetylase (EC 3.5.1.16) | ARGE_LEPBI | A31840 | RPOC_ECOLI | ARGE_ECOLI | Discussed by authors, warning given in SwissProt and PIR | Zuerner and Charon [1988] |
| Lactoylglutathione methylglyoxal lyase (EC 4.4.1.5) | LGUL_SOYBN | S47177 | GTXA_TOBAC | LGUL_HUMAN | Probable glutathione S-transferase | - |
| Chorismate mutase (EC 5.4.99.5) | PHEB_BACSU | D32804 | - | CHMU_BACSU | Warning given in PIR, not in SwissProt | Trach and Hoch [1989] |
| Folylpolyglutamate synthase (EC 6.3.2.17) | VG29_BPT4 | - | - | FOLC_ECOLI | | Ishimoto et al. [1988] |
Another group includes cases where the annotation of a protein, while technically correct, does not contain biological information that can be used for assigning functions to its homologs. Thus, MJ1618, annotated by GeneQuiz as polyketide synthase CurC, is indeed homologous to CURC_STRCN, the product of the third ORF in an operon coding for the biosynthesis of an antibiotic, curamycin, in Streptomyces curacoi [Bergh and Uhlen, 1992]. However, such an annotation is flawed, as M. jannaschii evidently does not produce this antibiotic. On the other hand, a detailed analysis of MJ1618 shows that it has statistically significant sequence similarity to several phosphomannose isomerases, such as ALGA_PSEAE, and, most probably, has the same activity [L. Aravind et al., manuscript in preparation].
Low sequence complexity
Low complexity regions, which are abundant in protein sequences, particularly eukaryotic ones, and typically correspond to non-globular domains, tend to produce spurious hits in database searches [Wootton, 1994]. While these regions are routinely masked using the SEG program prior to similarity searches with the BLAST family programs, the default settings of SEG are not suited for masking most of the non-globular protein domains [Wootton, 1994; Wootton and Federhen, 1996]. More stringent filtering, specifically adjusted for delineation of non-globular domains, is frequently needed for the detection of subtle but functionally relevant signals in the globular domains [Wootton, 1994]. The erroneous identification of an arginine deiminase homolog in M. genitalium, which became the basis for far-reaching conclusions on the existence of amino acid metabolism in this bacterium [Ouzounis et al., 1996], is a typical example of the misleading consequences of inadequate filtering of low complexity regions (Table 6). For a number of proteins, however, even masking with strict SEG parameters may be insufficient to detect all the non-globular domains that tend to produce spurious hits in database searches. Additional masking of coiled-coil domains or transmembrane helices may be required. Programs are now available for sequential sequence masking with a variety of methods [Walker and Koonin, 1997].
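The effect of stricter filtering can be illustrated with a deliberately simplified sketch. It is not a reimplementation of SEG: it uses the plain Shannon entropy of the residue composition in a sliding window as a stand-in for the complexity measure, applies only a single trigger threshold (no extension or optimization stage), and the function names `shannon_entropy` and `mask_low_complexity` are hypothetical. The defaults of 12 and 2.2 merely mirror the first two values of the default SEG parameter set quoted in this paper.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy, in bits per residue, of the residue composition."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2, mask_char: str = "X") -> str:
    """Mask every residue covered by a window whose entropy is <= threshold.

    Simplified illustration only: a real filter such as SEG uses a
    two-stage procedure with separate trigger and extension thresholds.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            # mask_char is assumed to be a single character
            masked[start:start + window] = mask_char * window
    return "".join(masked)


if __name__ == "__main__":
    # A made-up sequence: a compositionally diverse stretch followed by a
    # serine-rich, presumably non-globular tail.
    example = "ACDEFGHIKLMNPQRTVWY" + "S" * 15
    print(mask_low_complexity(example))
    # Raising the threshold in this sketch masks more aggressively.
    print(mask_low_complexity(example, threshold=3.0))
```

In this sketch, a higher threshold corresponds to stricter masking; the masked sequence, rather than the original one, would then be used as the query in the database search.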
Table 6: Removing spurious database hits of a non-globular protein by modifying SEG parameters
a - Low-complexity regions in MG123 were masked using the SEG program with the listed parameters. Each of the resulting proteins was compared to the non-redundant database (NCBI) using BLASTP v. 1.4.9, WUBLASTP v. 2.0a, and BLASTPGP v. 2.0.3.
b - Trigger window length, trigger complexity, and extension complexity, see Wootton and Federhen [1996]. The default parameter set is 12 2.2 2.5.
c - ARCA_MYCAR was the database hit chosen for the MG123 annotation by Ouzounis et al. [1996].
Multi-domain organization of proteins
Many proteins, perhaps the majority in the case of eukaryotes, are composed of several domains that may have different, sometimes unknown, functions [Doolittle, 1995; Mushegian et al., 1997]. A simple illustration of the effect that multi-domain organization of a protein may have on sequence-based protein function prediction is shown in Fig. 1. M. jannaschii protein MJ0252, annotated by GeneQuiz as UMP synthetase [Andrade et al., 1997], aligns only with the C-terminal part of UMP synthetase (PYR5_DICDI), which is responsible for the orotidine 5'-phosphate decarboxylase activity. Even though PYR5_DICDI is the best hit, MJ0252 cannot be predicted to possess UMP synthetase activity, as it does not contain the orotate phosphoribosyltransferase domain.
Figure 1: Alignment of MJ0252 with orotidine 5'-phosphate decarboxylases and UMP synthetases. The sequences are: 1, MJ0252; 2-4, orotidine 5'-phosphate decarboxylases; 2, DCOP_BACSU; 3, DCOP_ECOLI; 4, DCOP_YEAST; 5-6, UMP synthetases; 5, PYR5_DICDI; 6, PYR5_HUMAN. The alignment was generated by the MACAW program [Schuler et al., 1991]; shading indicates the mean similarity scores between the aligned segments.
Ignoring the domain structure of the best database hit may easily result in an obviously wrong functional annotation even in cases of striking sequence conservation. The M. genitalium gene MG262 product has been repeatedly annotated as DNA polymerase I [Fraser et al., 1995; Ouzounis et al., 1996; Frishman and Mewes, 1997]. While MG262 is clearly homologous to the N-terminal part of DPO1_ECOLI, it lacks the polymerase portion (Klenow fragment). The N-terminal domain of DNA polymerase I is responsible for its 5'-3' exonuclease activity, and this is the obvious functional prediction for MG262.
Figure 2: Alignment of MG262 with DNA polymerases I. The sequences are: 1, MG262; 2, Klenow fragment of the E. coli DNA polymerase I; 3-5, DNA polymerases I; 3, DPO1_ECOLI; 4, DPO1_BACCA; 5, DPO1_THEAQ. Other details as in Fig. 1.
A more complex and typical case is illustrated in Fig. 3. The M. genitalium protein MG449 has been annotated as Phe-tRNA synthetase [Ouzounis et al., 1996]. However, while an MG449-like domain is present in bacterial Phe-tRNA synthetases (e.g., SYFB_ECOLI), it is absent from archaeal, eukaryotic and chloroplast enzymes (Fig. 3). Thus, despite the highly statistically significant database hits of this protein with several Phe-tRNA synthetases, the proper annotation for MG449 should have stated that its function was unknown. Recent studies on related proteins in yeast and humans [Kleeman et al., 1997; Simos et al., 1996] indicate that MG449 is most likely an RNA-binding domain found in a variety of multi-domain and stand-alone proteins.
Figure 3: Alignment of MG449 with phenylalanyl-tRNA synthetases. The sequences are: 1, MG449; 2, MP179, a homolog from M. pneumoniae; 3, YtpR, a homolog from Bacillus subtilis; 4-5, bacterial Phe-tRNA synthetases; 4, SYFB_ECOLI; 5, SYFB_BACSU; 6-7, archaeal Phe-tRNA synthetases; 6, MJ1108, a Methanococcus jannaschii protein; 7, SS56KBFR, a Sulfolobus solfataricus protein; 8, SYFB_PORPU, a chloroplast protein; 9-10, eukaryotic Phe-tRNA synthetases; 9, SYFA_YEAST; 10, F22B5.9, a Caenorhabditis elegans protein. Archaeal, chloroplast, and eukaryotic Phe-tRNA synthetases lack the putative RNA-binding domain homologous to MG449. Other details as in Fig. 1.
An even more striking series of incorrect annotations due to the multi-domain structure of target proteins involves the cystathionine β-synthase (CBS) domain, described recently by Bateman [1997]. This domain of unknown function is found in many proteins, including E. coli IMP dehydrogenase (IMDH_ECOLI). As a result, each protein containing this domain shows statistically significant similarity with IMP dehydrogenase, causing widespread confusion among genome annotators. In the revision of the M. jannaschii genome, for example, Kyrpides et al. [1996] annotated 12 CBS domain-containing proteins as similar to IMP dehydrogenase. Remarkably, in 5 cases these misleading annotations were offered as revisions of the original, more appropriate annotations of these proteins as "hypothetical" [Bult et al., 1996]. GeneQuiz identified another protein from M. jannaschii (MJ0392, see Table 4) containing the CBS domain and duly annotated it as an IMP dehydrogenase. Even after the illuminating report of Bateman [1997] had been published, and CBS domains were clearly marked in SwissProt entries, ten proteins of Methanobacterium thermoautotrophicum containing this domain were annotated as IMP dehydrogenase-related [Smith et al., 1997]. In the recently published genome of Archaeoglobus fulgidus, some of these proteins are annotated simply as conserved hypothetical ones, while others (AF0847, AF1259) are still annotated as putative IMP dehydrogenases [Klenk et al., 1997].
The simplest (but not most reliable) way to circumvent the problem of multi-domain organization of proteins is to compare the length of a match for each database hit with the length of the query sequence, which could indicate possible conflicts. This method is implemented in the annotation engine of the WIT database. The new WWW interfaces for gapped BLAST and PSI-BLAST [Altschul et al., 1997] on the NCBI server present schematic graphical alignments, showing the location of the hit as compared to the query sequence. Another option is to compare the query protein with the Clusters of Orthologous Groups database [Tatusov et al., 1997], where multi-domain proteins are divided into separate domains whenever their single-domain orthologs are found in any of the completely sequenced genomes. Significant hits with proteins from more than one COG would indicate a likely multi-domain organization of the query.
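The length comparison and the multiple-COG check described above can be sketched as follows. This is an illustration under stated assumptions, not the logic of the WIT annotation engine or of the COG database: the `Hit` record, the function names, the 0.7 coverage cutoff, and the coordinates in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    """One database hit; coordinates are 1-based and inclusive."""
    subject_id: str
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    subject_length: int
    cog_id: str = ""  # cluster label of the subject, if known (assumed field)


def aligned_fraction(start: int, end: int, length: int) -> float:
    """Fraction of a sequence of the given length covered by start..end."""
    return (end - start + 1) / length


def coverage_warnings(hit: Hit, query_length: int,
                      min_fraction: float = 0.7) -> List[str]:
    """Flag hits whose alignment spans only part of the query or of the hit.

    The cutoff of 0.7 is an arbitrary illustrative value.
    """
    warnings = []
    if aligned_fraction(hit.query_start, hit.query_end,
                        query_length) < min_fraction:
        warnings.append("alignment covers only part of the query")
    if aligned_fraction(hit.subject_start, hit.subject_end,
                        hit.subject_length) < min_fraction:
        warnings.append("alignment covers only part of the database hit")
    return warnings


def likely_multidomain(hits: Iterable[Hit]) -> bool:
    """True if the significant hits fall into more than one cluster."""
    return len({h.cog_id for h in hits if h.cog_id}) > 1


if __name__ == "__main__":
    # Made-up coordinates: a short query matching only the C-terminal
    # half of a longer, two-domain database entry.
    hit = Hit("hypothetical_two_domain_enzyme", 1, 220, 250, 470, 480)
    print(coverage_warnings(hit, query_length=230))
```

A warning from such a check does not by itself decide the annotation; it only marks the hit as one whose domain organization should be inspected before its description is transferred to the query.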
Non-orthologous gene displacement
In different organisms, the same function can be performed by unrelated or distantly related proteins [Koonin and Mushegian, 1996; Koonin et al., 1996]. It appears that in many cases, these enzymes have evolved by shifting the substrate specificity of a related but distinct enzyme. Fig. 4 shows an alignment of gluconate kinases from E. coli and B. subtilis. It is clear that GNTK_BACSU is unrelated to GNTK_ECOLI and is a paralog of GLPK_BACSU. Were these activities not known from biochemical data [Fujita et al., 1986], GNTK_BACSU would be confidently annotated as glycerol kinase. Similar cases were discovered in all enzyme classes and appear to be more common than previously thought [M.Y.G., D.R. Walker and E.V.K., manuscript in preparation].
Figure 4: Non-orthologous gene displacement: two types of gluconate kinase in bacteria. The sequences are: 1, GNTK_BACSU; 2-4, gluconate kinases; 2, GNTK_ECOLI; 3, GNTV_ECOLI; 4, GNTK_SCHPO; 5-6, xylose kinases; 5, XYLB_ECOLI; 6, XYLB_STAXY; 7-8, glycerol kinases; 7, GLPK_BACSU; 8, GLPK_MYCGE.
In many protein families, enzymes and binding proteins with different specificities may be as similar to each other as those with the same specificity. Ignoring this easily results in overpredictions, as, for example, in the case of MG225 (Table 2). This membrane protein has significant sequence similarity to lysine, histidine, and arginine permeases; it can be confidently predicted that it mediates amino acid transport; the available data, however, are insufficient to predict the exact specificity.
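The reasoning applied to MG225 can be written down as a small rule: transfer the specificity only when all significant hits agree on it, and otherwise fall back to the shared family-level description. The sketch below is a minimal illustration of that rule, not a description of any existing annotation system; the function name `cautious_annotation` and the representation of each hit as a (family, specificity) pair, assumed to have been assigned beforehand by a curator, are hypothetical.

```python
from typing import Iterable, Tuple


def cautious_annotation(hits: Iterable[Tuple[str, str]]) -> str:
    """Derive an annotation from (family, specificity) labels of the hits.

    The hits are assumed to have been judged significant already, and the
    labels are assumed to come from prior expert curation.
    """
    hits = list(hits)
    if not hits:
        return "unknown"
    families = {family for family, _ in hits}
    specificities = {specificity for _, specificity in hits}
    if len(families) > 1:
        # Hits disagree even at the family level; leave unresolved.
        return "unknown"
    family = next(iter(families))
    if len(specificities) == 1:
        return f"{family} (specificity: {next(iter(specificities))})"
    return f"{family} (specificity not predicted)"


if __name__ == "__main__":
    # Labels follow the MG225 discussion in the text: similarity to
    # lysine, histidine, and arginine permeases.
    mg225_like = [
        ("amino acid permease", "lysine"),
        ("amino acid permease", "histidine"),
        ("amino acid permease", "arginine"),
    ]
    print(cautious_annotation(mg225_like))
```

Taking only the first element of such a list, as happens when the best hit alone is considered, would return a specific permease and thus an overprediction.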
An even more important case is MG137. Originally annotated by the TIGR team as dTDP-4-dehydro-rhamnose reductase RfbD, it was re-annotated by the GeneQuiz team as amine oxidase. We considered the TIGR annotation to be adequate, and were unable to find any justification for its correction by Ouzounis et al. [1996]. As the only readily identifiable functional motif in this protein was the glycine-rich loop typical of dinucleotide-binding enzymes, it was tentatively annotated as a dehydrogenase participating in lipopolysaccharide biosynthesis. Subsequent experimental studies on this operon finally identified the E. coli ortholog of MG137 as UDP-galactopyranose mutase [Nassau et al., 1997], which is now reflected in the SWISS-PROT description of MG137 (GLF_MYCGE); this is indeed a FAD-utilizing enzyme, so the conserved motif is a part of the FAD-binding site, as predicted. This example shows the inherent limitations of functional predictions, manual or automatic, based solely on protein sequence motif conservation; in a number of cases, such predictions may correctly identify certain important structural and functional properties of a protein but miss the specific activity.
Operon disruption
Comparison of genome organization of bacteria and archaea indicated that only very few operons are conserved across large phylogenetic distances [Mushegian and Koonin, 1996; Koonin and Galperin, 1997]. As a result of operon disruption, genes that belong to the same operon in one species are likely to be scattered in other species, which may complicate their identification. This is often observed even in closely related species [Watanabe et al., 1997]; hence, functional prediction based on gene position in operons sometimes leads to errors. Thus, RFAK_ECOLI was predicted to code for lipopolysaccharide 1,2-N-acetylglucosamine transferase solely on the basis of its position in the E. coli rfa operon, even though the products of these genes in E. coli and S. typhimurium showed very little similarity [Klena et al., 1992]. A recent study of a homologous protein in Haemophilus ducreyi identified it as D-glycero-D-manno-heptosyl transferase [Gibson et al., 1997].
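It appears that errors in genome annotation most frequently arise from the factors discussed above: non-critical use of existing database annotations, reliance on the best database hit alone, insufficient masking of low complexity regions, disregard for the multi-domain organization of proteins, non-orthologous gene displacement, and functional inferences based on operon organization.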
These pitfalls plague both manual and automatic prediction but are particularly difficult to avoid in automated systems for genome annotation. The solution lies both in the continuing involvement of expert biologists in genome annotation and in the incorporation of more sophisticated logic into automated methods.