Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption

Michael Y. Galperin1 and Eugene V. Koonin2

National Center for Biotechnology Information,
National Library of Medicine, National Institutes of Health,
Bethesda, Maryland 20894, USA
1galperin@ncbi.nlm.nih.gov
2koonin@ncbi.nlm.nih.gov
















ABSTRACT

Functional annotation of proteins encoded in newly sequenced genomes can be expected to meet two conflicting objectives: (i) provide as much information as possible, and (ii) avoid erroneous functional assignments and over-predictions. The continuing exponential growth of the number of sequenced genomes makes the quality of sequence annotation a critical factor in the efforts to utilize this new information. When dubious functional assignments are used as a basis for subsequent predictions, they tend to proliferate, leading to "database explosion". It is therefore important to identify the common factors that hamper functional annotation. As a first step towards that goal, we have compared the annotations of the Mycoplasma genitalium and Methanococcus jannaschii genomes produced in several independent studies. The most common causes of questionable predictions appear to be: (i) non-critical use of annotations from existing database entries; (ii) taking into account only the annotation of the best database hit; (iii) insufficient masking of low complexity regions (e.g. non-globular domains) in protein sequences, resulting in spurious database hits obscuring relevant ones; (iv) ignoring multi-domain organization of the query proteins and/or the database hits; (v) non-critical functional inferences on the basis of the functions of neighboring genes in an operon; (vi) non-orthologous gene displacement, i.e. involvement of structurally unrelated proteins in the same function. These observations suggest that case-by-case validation of functional annotation by expert biologists remains crucial for productive genome analysis.

Key words: Genome comparison; prediction of gene functions; systematic error; database annotation; automatic genome annotation; manual genome annotation; low complexity; non-globular domains; multi-domain proteins; non-orthologous gene displacement


INTRODUCTION

The growing list of completely sequenced genomes presents a unique possibility to extract a wealth of information on biochemistry, genetics, and evolutionary history of these organisms. The success of such analysis will largely depend upon our ability to predict protein functions encoded in each of these genomes.

The history of functional annotations for the first completed genomes shows that functions of 40-60% of the proteins encoded in each genome could be predicted based solely on high-level sequence similarity with known proteins [Fleischmann et al., 1995; Fraser et al., 1995; Bult et al., 1996]. On the other hand, detailed computer analysis using advanced methods for sequence comparison (gapped BLAST, PSI-BLAST), amino acid motif detection (MoST, HMMer), prediction of secondary structure elements, and delineation of families of paralogs (e.g., the ones listed in Table 1) results in an increase of the fraction of proteins with predicted functions to 75-85% for the bacterial genomes and ~70% for the first published archaeal genome, that of M. jannaschii [Tatusov et al., 1996; Koonin et al., 1997]. This analysis is a prerequisite for more detailed characterization of each particular organism based on the genome sequence, e.g. prediction of the existing and missing metabolic pathways and the transcription regulatory network, but it is time- and labor-consuming.

 

Table 1: Selected protein sequence analysis utilities
 
Type of analysis | Program used | Source (WWW access, availability by FTP) | Reference
Sequence similarity searches
Database search and sequence comparison | BLASTP | www.ncbi.nlm.nih.gov/BLAST/; ftp://ncbi.nlm.nih.gov/blast/ | Altschul et al. [1990]
Database search and sequence comparison | WU-BLASTP | www2.ebi.ac.uk/blast2/; ftp://blast.wustl.edu/ | Altschul and Gish [1996]
Database search and sequence comparison | BLASTPGP | www.ncbi.nlm.nih.gov/BLAST/; ftp://ncbi.nlm.nih.gov/blast/ | Altschul et al. [1997]
Database search and sequence comparison | PSI-BLAST | www.ncbi.nlm.nih.gov/BLAST/; ftp://ncbi.nlm.nih.gov/blast/ | Altschul et al. [1997]
Motif detection | MoST | ftp://ncbi.nlm.nih.gov/pub/most/ | Tatusov et al. [1994]
Motif detection | HMMer | http://genome.wustl.edu/eddy/HMMER/main.html | Eddy et al. [1995]
Motif detection | ScanProsite | www.expasy.ch/sprot/scnpsite.html | Appel et al. [1994]
Multiple alignment | CLUSTALW | ftp://ftp.ebi.ac.uk/pub/software/ | Thompson et al. [1994]
Multiple alignment | MACAW | ftp://ncbi.nlm.nih.gov/pub/macaw/ | Schuler et al. [1991]
Identification of conserved gene strings | GENESTRING | ftp://ncbi.nlm.nih.gov/pub/koonin/Complete_Genomes/utility/ | Tatusov et al. [1996]
Handling of large data sets
Batch mode | SEALS | www.ncbi.nlm.nih.gov/Walker/SEALS/ | Walker and Koonin [1997]
Taxonomy sorting of BLAST results | BLATAX | ftp://ncbi.nlm.nih.gov/pub/bla/blatax/ | Koonin et al. [1996]
Prediction of protein structure elements
Signal peptides | SIGNALP | www.cbs.dtu.dk/services/SignalP/ | Nielsen et al. [1997]
Transmembrane helices | PHDtopology | www.embl-heidelberg.de/predictprotein | Rost et al. [1995]
Coiled-coil regions | COILS2 | http://ulrec3.unil.ch/software/COILS_form.html | Lupas [1996]
Non-globular domains | SEG | ftp://ncbi.nlm.nih.gov/pub/seg/ | Wootton and Federhen [1996]
 

Accordingly, and in anticipation of the expected exponential growth of the number of sequenced genomes, considerable effort has been devoted to the automation of genome analysis at different levels. These projects range in scope from a set of simple programs for data handling intended to facilitate genome annotation by a biologist (e.g., SEALS [Walker and Koonin, 1997]) to a complete automated system that attempts annotation without any human intervention (GeneQuiz; see [Scharf et al., 1994]). While automation of the tedious process of sequence annotation is certainly an attractive possibility, the quality of automatically generated predictions remains suspect, given the many pitfalls that complicate even manual annotation by expert biologists [Bork and Bairoch, 1996; Bork and Gibson, 1996]. Thus, while the accuracy of functional assignments by GeneQuiz has been estimated at 95% or better [Ouzounis et al., 1996], few of the new functional predictions for M. genitalium proteins, even those specifically discussed by the authors, could be fully corroborated by manual analysis (see Koonin et al. [1997] for discussion). Several of these questionable predictions, listed in Table 2, suggest that the error rate of GeneQuiz is indeed far above the claimed 1% [Ouzounis et al., 1996].

 

Table 2: A comparison of the automatic and manual annotations of the M. genitalium genome
 
Protein name | Best hit to a characterized protein, E-value | GENEQUIZ annotation [Ouzounis et al., 1996] | Manual annotation [Koonin et al., 1997] | Type of error in GENEQUIZ annotation (a) | Possible reasons, comments
MG123 | None | Arginine deiminase | Periplasmic protein | W | Non-globular domains were not masked (see Table 6)
MG139 | None | Amps (fragment) | Zn-dependent hydrolase | W | Not clear
MG140 | SMB2_HUMAN, 4.6e-06 | DNA-binding protein Smbp2 | Superfamily I helicase | U | Only the best hit was considered; motif analysis was not performed
MG225 | LYSP_ECOLI, 0.019 | Histidine permease | Amino acid permease | O | Only the best hit was considered (see text)
MG237 | None | Ile-tRNA synthetase domain | Unknown | W | Not clear
MG294 | NARK_BACSU, 0.0016 | NarK, nitrate extrusion protein | Permease | O | Only the best hit was considered
MG377 | None | Zn protease | Unknown | O? | Zn protease motif is present, but similarity to other peptidases is low
MG449 | SYFB_ECOLI, 4.8e-13 | Phe-tRNA synthetase N-terminal | Putative RNA-binding protein | W | Multi-domain organization of Phe-tRNA synthetase was ignored (see Fig. 3)
MG464 | SP3J_BACSU, 0.014 | Stage III sporulation protein J | Highly conserved membrane protein | W | Only the best hit was considered; similar proteins in many bacteria
MG468 | DPO1_BACCA, 5.6e-39 | DNA polymerase I | 5'-3' exonuclease | W | Multi-domain organization of DNA polymerase I was ignored (see Fig. 2)

a - W, wrong prediction; U, underprediction; O, overprediction.

Likewise, several corrections to the original TIGR annotation [Fraser et al., 1995] offered by Ouzounis et al. [1996] turned out to be unjustified (Table 3). A similar contrast was found between the GeneQuiz predictions for the M. jannaschii genome [Andrade et al., 1997] and those obtained by mostly manual annotation [Koonin et al., 1997; Selkov et al., 1997]. Several of these discrepancies are listed in Table 4.

 

Table 3: Evaluation of the "corrections" provided to the original TIGR annotations by GeneQuiz analysis
 
Protein name | Best hit to a characterized protein, E-value | TIGR annotation [Fraser et al., 1995] | Status | GeneQuiz correction [Ouzounis et al., 1996] | Status | Latest annotation [Koonin et al., 1997]
MG061 | UHPT_SALTY, 0.23 | Hexose phosphate transport protein UhpT | A | False positive | F | Hexose phosphate transport protein
MG090 | RS6_HAEIN, 0.00024 | Ribosomal protein S6 | A | False positive | F | Ribosomal protein S6
MG120 | RBSC_HAEIN, 0.27 | Ribose permease rbsC | A, O? | False positive | F | Permease
MG406 | gi1209759, 1.6e-45 | Transport permease P69 | W | False positive | U | H+-ATPase I chain
MG006 | KTHY_BACSU, 1.1e-28 | Thymidylate kinase | A | Putative kinase | F | Thymidylate kinase
MG041 | PTHP_STAAU, 7.4e-9 | Phosphohistidinoprotein PtsH | A | ptsH gene, HPr | A | Phosphocarrier protein HPr
MG099 | HYIN_AGRRA, 1.2e-20 | Hydrolase Aux2 | A | Indoleacetamide hydrolase | O | Amidase
MG137 | GLF1_KLEPN, 9.5e-52 | dTDP-4-dehydrorhamnose reductase RfbD | A | Amine oxidase | F | Dehydrogenase; LPS biosynthesis (see text)
MG278 | SPOT_ECOLI, 1.8e-56 | Stringent response-like protein | A | Pyrophosphohydrolase | A | ppGpp synthetase/pyrophosphatase
MG310 | PIP_BACCO, 1.7e-6 | Proline iminopeptidase | O | Triacylglycerol lipase | O | Hydrolase
MG409 | PHOU_ECOLI, 0.00021 | Peripheral membrane protein PhoU | A | Pho negative regulator | A | Phosphate transport regulator PhoU

A, adequate prediction; O, overprediction; U, underprediction; W, wrong annotation; F, failure (correction of an adequate TIGR annotation).
 
 

 

Table 4: A comparison of the automatic and manual annotations of the Methanococcus jannaschii genome

Protein name | Best hit to a characterized protein, E-value | GeneQuiz annotation [Andrade et al., 1997] | Manual annotation [Koonin et al., 1997] | Type of error in GeneQuiz annotation | Possible reasons, comments
MJ0134 | MDMC_STRMY, 1e-18 | Protein beta-aspartate methyltransferase | SAM-dependent methyltransferase | O | Only the best hit was considered
MJ0226 | HAM1_YEAST, 7.2e-21 | HAM1, controls hydroxylaminopurine mutagenesis | Unknown ACR | O | Actual function is not known; similar proteins in many bacteria
MJ0252 | PYR5_DICDI, 1.0e-8 | UMP synthetase | Orotidine 5'-phosphate decarboxylase | W | Only the best hit was considered; multi-domain structure of UMP synthase was ignored (see Fig. 1)
MJ0392 | IMDH_PYRFU, 2.1e-9 | IMP dehydrogenase homolog | Zn-dependent protease | W | Multi-domain structure of the IMP-DH was ignored; it has only a 100 aa overlap with MJ0392; similar proteins in many bacteria (see text)
MJ0590 | SUCD_THEFL, 2.7e-9 | Succinyl-CoA ligase (GDP-forming), α chain | Succinyl-CoA ligase, α and β chains | W, O | The differences in lengths of the query and the best hit were ignored; GDP- and ADP-forming enzymes are similar, MJ0590 can be either one
MJ0682 | DPOL_THELI, 0.0043 | DNA polymerase B replication factor C | Unknown ACR, intein-containing | W | Intein region was not masked; only the best hit was considered
MJ0797 | FTSX_ECOLI, 0.0021 | Cell division protein FtsX homolog | Permease | W | Membrane-spanning regions were not masked; only the best hit was considered
MJ1079 | None | Spore germination protein B2 | Integral membrane protein | W | Not clear; similar proteins in many bacteria
MJ1129 | MRP_SYNY3, 4.4e-6 | MRP protein homolog | Unknown ACR | W | The difference in length of the query and the best hit was ignored; MJ1129 is shorter than the MRP proteins, and does not contain the conserved ATP-binding site
MJ1207 | PAIA_BACSU, 3.3e-9 | Protease synthase and sporulation negative regulatory protein | Acetyltransferase | W | Only the best hit was considered; motif search for HTH domain was not performed
MJ1310 | NULC_SYNY3, 0.012 | Na+/H+ antiporter system ORF3 | NADH-ubiquinone oxidoreductase chain 2 | O | Membrane-spanning regions were not masked; actual function of the best hit is unknown
MJ1336 | CC31_YEAST, 0.018 | ADP-heptose synthase | Unknown ACR | W | The difference in length of the query and the best hit was ignored; MJ1336 is shorter and lacks the conserved ATP-binding site; actual function of the best hit is unknown
MJ1375 | CAPF_STAAU, 2.0e-6 | Putative O-antigen transporter | Permease | O | Only the best hit was considered; actual function of the best hit is not known
MJ1452 | HMT1_YEAST, 0.0037 | rRNA adenine N-6-methyltransferase | SAM-dependent methyltransferase | O | Only the best hit was considered
MJ1533 | GSPE_ERWCA, 9.6e-5 | Mannose-sensitive hemagglutinin E | Glutamyl-tRNA transferase + KH domain + ATPase | W | Multi-domain structure of MJ1533 was ignored
MJ1618 | CURC_STRCN, 3.7e-8 | Polyketide synthase CurC | Mannose-6-phosphate isomerase | W | Only the best hit was considered; actual function of CurC is unknown; the name refers to the whole pathway, not just this enzyme (see text)

W, wrong prediction; O, overprediction.

When dubious functional assignments are used as a basis for further predictions, they tend to proliferate, which has been referred to as "database explosion" [Bhatia et al., 1997]. It is therefore important to delineate possible factors that lead to questionable functional assignments. To this end, we compared the sets of functional annotations for the genomes of M. genitalium and M. jannaschii, produced by different groups [Fraser et al., 1995; Bult et al., 1996; Kyrpides et al., 1996; Ouzounis et al., 1996; Koonin et al., 1997; Selkov et al., 1997; Andrade et al., 1997], and examined the likely reasons for apparently erroneous predictions. We believe that the problems that affect prediction of gene functions most seriously are the same for automatic and manual analysis (as will be demonstrated by some of the examples discussed below) but of course manual analysis has more immediate flexibility to handle them. The approach we take is not benchmarking of different systems and methods for genome annotation but examination of some typical cases that highlight inherent difficulties in this process.

 


Common problems in protein function prediction

Incorrect annotation in protein databases

The simplest reason for unjustified function predictions is an incorrect annotation of the database entry that happens to be the closest homolog of the protein in question. Indeed, an open reading frame (ORF) is often assumed to code for an enzyme when it complements a known mutation, resulting in an increase in the enzyme activity. Such an effect, of course, can be due to suppression of the mutation, provision of a missing cofactor, or a plethora of other mechanisms. Thus, the Pseudomonas aeruginosa ORF (KHSE_PSEAE) that complemented the thrB mutation in Escherichia coli was assumed to code for homoserine kinase, even though it lacked any detectable sequence similarity with the same enzyme from other sources (e.g., KHSE_ECOLI), its sequence did not contain any known ATP-binding motif, and inactivation of the chromosomal copy of the gene did not confer threonine auxotrophy [Clepet et al., 1992]. Several similar cases, in which a protein has been included in curated databases like SwissProt and/or PIR although the enzyme activity assigned to it has never been supported by either direct experiments or significant sequence conservation, are listed in Table 5. Most likely, the functions of most of these proteins have been misidentified.

 

 

Table 5: Questionable enzyme identifications in SwissProt and PIR databases
Enzyme activity (EC No.) | Protein in question: SwissProt entry | Protein in question: PIR entry | Protein in question: closest characterized homolog | Enzyme with demonstrated activity | Comment | Reference
Isopropylmalate dehydrogenase (EC 1.1.1.85) | LEU3_SCHOC | S55845 | BUD3_YEAST | LEU3_ECOLI | Changed to YLEU_SCHOC in SwissProt rel. 35, a warning added | Iserentant and Verachtert [1995]
Protoporphyrinogen oxidase (EC 1.3.3.4) | HEMG_ECOLI | JC2513 | FLAV_CLOAB, FLAV_DESGI | HEMG_BACSU | A flavodoxin component of PPO, not shown to have enzymatic activity | Sasarman et al. [1993]; Nishimura et al. [1995]
Dihydrofolate reductase (EC 1.5.1.3) | DYR_MYCTU | S21834 | - | DYRA_ECOLI, DYR2_ECOLI | | J. Dale, personal communication
Thymidylate synthase (EC 2.1.1.45) | TYSY_MYCTU | - | - | TYSY_ECOLI | Misannotated in SwissProt as a member of the thymidylate synthase family. Belongs to an uncharacterized protein family unrelated to thymidylate synthases. | J. Dale, personal communication
Uroporphyrin-III C-methyltransferase (EC 2.1.1.107) | HEMX_ECOLI | S02185 | - | CYSG_ECOLI | | Sasarman et al. [1988]
Lipopolysaccharide 1,2-N-acetylglucosaminetransferase (EC 2.4.1.56) | RFAK_ECOLI | C42981 | AF004712 | RFAK_SALTY | Misannotated based on the position of the gene in the rfa operon (see text) | Klena et al. [1992]
Queuine tRNA-ribosyltransferase (EC 2.4.2.29) | TGT_RABIT, TGT_HUMAN, TGT_CAEEL | S68430 | UBPF_YEAST, UBPD_MOUSE | TGT_ECOLI | Probable ubiquitin C-terminal hydrolases; similarity mentioned in SwissProt, not in PIR | Deshpande et al. [1996]
Homoserine kinase (EC 2.7.1.39) | KHSE_PSEAE | S27981 | - | KHSE_ECOLI | No known ATP-binding motifs (see text) | Clepet et al. [1992]
Acetylornithine deacetylase (EC 3.5.1.16) | ARGE_LEPBI | A31840 | RPOC_ECOLI | ARGE_ECOLI | Discussed by authors, warning given in SwissProt and PIR | Zuerner and Charon [1988]
Lactoylglutathione methylglyoxal lyase (EC 4.4.1.5) | LGUL_SOYBN | S47177 | GTXA_TOBAC | LGUL_HUMAN | Probable glutathione S-transferase | -
Chorismate mutase (EC 5.4.99.5) | PHEB_BACSU | D32804 | - | CHMU_BACSU | Warning given in PIR, not in SwissProt | Trach and Hoch [1989]
Folylpolyglutamate synthase (EC 6.3.2.17) | VG29_BPT4 | - | - | FOLC_ECOLI | | Ishimoto et al. [1988]
 

Another group includes cases where annotation of a protein, while technically correct, does not contain the biological information that can be used for assigning functions to its homologs. Thus, MJ1618, annotated by GeneQuiz as polyketide synthase CurC, indeed is homologous to CURC_STRCN, a product of the third ORF in an operon coding for the biosynthesis of an antibiotic, curamycin, in Streptomyces curacoi [Bergh and Uhlen, 1992]. However, such an annotation is flawed as M. jannaschii evidently does not produce this antibiotic. On the other hand, a detailed analysis of MJ1618 shows that it has statistically significant sequence similarity to several phosphomannose isomerases, such as ALGA_PSEAE, and, most probably, has phosphohexomutase activity [L. Aravind et al., manuscript in preparation].

 

Low sequence complexity

Low complexity regions, which are abundant in protein sequences, particularly eukaryotic ones, and typically correspond to non-globular domains, tend to produce spurious hits in database searches [Wootton, 1994]. While these regions are routinely masked using the SEG program prior to similarity searches with the BLAST family programs, the default settings of SEG are not suited for masking most of the non-globular protein domains [Wootton, 1994; Wootton and Federhen, 1996]. More stringent filtering, specifically adjusted for delineation of non-globular domains, is frequently needed for the detection of subtle but functionally relevant signals in the globular domains [Wootton, 1994]. The erroneous identification of an arginine deiminase homolog in M. genitalium, which became the basis for far-reaching conclusions on the existence of amino acid metabolism in this bacterium [Ouzounis et al., 1996], is a typical example of the misleading consequences of inadequate filtering of low complexity regions (Table 6). For a number of proteins, however, even masking with strict SEG parameters may be insufficient to detect all the non-globular domains that tend to produce spurious hits in database searches. Additional masking of coiled-coil domains or transmembrane helices may be required. Programs are now available for sequential sequence masking with a variety of methods [Walker and Koonin, 1997].

 

Table 6: Removing spurious database hits of a non-globular protein by modifying SEG parameters (a)

E-values in each group are listed in the order BLASTP / WUBLAST / BLASTPGP; a dash indicates that no hit was reported.

SEG parameters (b) | No. of masked residues | E-value with an ortholog (Y123_MYCPN) | E-value with arginine deiminase (ARCA_MYCAR) (c) | E-value with a typical non-globular protein (myosin)
No filtering | 0 | 1.7e-131 / 2.3e-99 / 1e-100 | 0.068 / 0.099 / 0.046 | 0.032 / 0.0073 / 0.016
12 2.2 2.5 | 28 | 2.2e-126 / 1.5e-93 / 1e-101 | 0.058 / 0.085 / 0.046 | 0.60 / 0.0089 / 0.016
12 2.3 2.6 | 105 | 6.6e-105 / 4.6e-76 / 4e-80 | 0.94 / 0.997 / - | 0.73 / 0.025 / 0.016
12 2.4 2.7 | 166 | 1.2e-90 / 3.1e-63 / 4e-67 | 0.85 / 0.97 / - | - / - / -
12 2.5 2.8 | 193 | 6.5e-72 / 3.5e-54 / 6e-51 | 0.76 / 0.91 / - | - / - / -
12 2.6 2.9 | 253 | 6.2e-55 / 1.1e-41 / 1e-26 | - / - / - | - / - / -

a - Low-complexity regions in MG123 were masked with the SEG program using the listed parameters. Each of the resulting masked sequences was compared to the non-redundant database (NCBI) using BLASTP v. 1.4.9, WUBLASTP v. 2.0a, and BLASTPGP v. 2.0.3.
b - Trigger window length, trigger complexity, and extension complexity, see Wootton and Federhen [1996]. The default parameters set is 12 2.2 2.5
c - ARCA_MYCAR was the database hit chosen for the MG123 annotation by Ouzounis et al. [1996].
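
The effect of the trigger complexity on the number of masked residues can be illustrated with a minimal sketch. This is not the SEG algorithm: it is a simplified stand-in that assumes a fixed window, uses plain Shannon entropy of the window composition as the complexity measure, and omits the extension stage that SEG applies after triggering. The function names are hypothetical.

```python
import math
from collections import Counter


def window_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, trigger: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    Simplified illustration only: SEG proper also has an extension
    complexity threshold, which is not modeled here.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= trigger:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


def count_masked(seq: str, window: int = 12, trigger: float = 2.2) -> int:
    """Number of residues that would be masked at the given threshold."""
    return mask_low_complexity(seq, window, trigger).count("X")
```

Because any window that falls below a lower threshold also falls below a higher one, raising the trigger value in this sketch can only increase the number of masked residues, which mirrors the trend in the second column of Table 6.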

 

Multi-domain organization of proteins

Many proteins, perhaps the majority in the case of eukaryotes, are composed of several domains that may have different, sometimes unknown, functions [Doolittle, 1995; Mushegian et al., 1997]. A simple illustration of the effect that multi-domain organization of a protein may have on sequence-based protein function prediction is shown in Fig. 1. M. jannaschii protein MJ0252, annotated by GeneQuiz as UMP synthetase [Andrade et al., 1997], aligns only with the C-terminal part of UMP synthetase (PYR5_DICDI), which is responsible for the orotidine 5'-phosphate decarboxylase activity. Even though PYR5_DICDI is the best hit, MJ0252 cannot be predicted to possess UMP synthetase activity as it does not contain the orotate phosphoribosyltransferase domain.

 

Figure 1: Alignment of MJ0252 with orotidine 5'-phosphate decarboxylases and UMP synthetases.
The sequences are: 1, MJ0252; 2-4, orotidine 5'-phosphate decarboxylases; 2, DCOP_BACSU; 3, DCOP_ECOLI; 4, DCOP_YEAST; 5-6, UMP synthetases; 5, PYR5_DICDI; 6, PYR5_HUMAN. The alignment was generated by the MACAW program [Schuler et al., 1991]; shading indicates the mean similarity scores between the aligned segments.

 

Ignoring the domain structure of the best database hit may easily result in an obviously wrong functional annotation even in the case of striking sequence conservation. The M. genitalium gene MG262 product has been repeatedly annotated as DNA polymerase I [Fraser et al., 1995; Ouzounis et al., 1996; Frishman and Mewes, 1997]. While MG262 is clearly homologous to the N-terminal part of DPO1_ECOLI, it lacks the polymerase portion (Klenow fragment). The N-terminal domain of DNA polymerase I is responsible for its 5'-3' exonuclease activity, and this is the obvious functional prediction for MG262.

 

Figure 2: Alignment of MG262 with DNA polymerases I.
The sequences are: 1, MG262; 2, Klenow fragment of the E. coli DNA polymerase I; 3-5, DNA polymerases I; 3, DPO1_ECOLI; 4, DPO1_BACCA; 5, DPO1_THEAQ. Other details as in Fig. 1.

 

A more complex and typical case is illustrated in Fig. 3. The M. genitalium protein MG449 has been annotated as Phe-tRNA synthetase [Ouzounis et al., 1996]. However, while an MG449-like domain is present in bacterial Phe-tRNA synthetases (e.g., SYFB_ECOLI), it is absent from archaeal, eukaryotic and chloroplast enzymes (Fig. 3). Thus, despite the highly statistically significant database hits of this protein with several Phe-tRNA synthetases, the proper annotation for MG449 should have stated that its function was unknown. Recent studies on related proteins in yeast and humans [Kleeman et al., 1997; Simos et al., 1996] indicate that MG449 is most likely an RNA-binding domain found in a variety of multi-domain and stand-alone proteins.
 

 

Figure 3: Alignment of MG449 with phenylalanyl-tRNA synthetases.
The sequences are: 1, MG449; 2, MP179, a homolog from M. pneumoniae; 3, YtpR, a homolog from Bacillus subtilis; 4-5, bacterial Phe-tRNA synthetases; 4, SYFB_ECOLI; 5, SYFB_BACSU; 6-7, archaeal Phe-tRNA synthetases; 6, MJ1108, a Methanococcus jannaschii protein; 7, SS56KBFR, a Sulfolobus solfataricus protein; 8, SYFB_PORPU, a chloroplast protein; 9-10, eukaryotic Phe-tRNA synthetases; 9, SYFA_YEAST; 10, F22B5.9, a Caenorhabditis elegans protein. Archaeal, chloroplast, and eukaryotic Phe-tRNA synthetases lack the putative RNA-binding domain homologous to MG449. Other details as in Fig. 1.

 

An even more striking series of incorrect annotations due to the multi-domain structure of target proteins involves the cystathionine β-synthase (CBS) domain, described recently by Bateman [1997]. This domain of unknown function is found in many proteins, including E. coli IMP dehydrogenase (IMDH_ECOLI). As a result, each protein containing this domain shows statistically significant similarity with IMP dehydrogenase, causing widespread confusion among genome annotators. In the revision of the M. jannaschii genome, for example, Kyrpides et al. [1996] annotated 12 CBS domain-containing proteins as similar to IMP dehydrogenase. Remarkably, in 5 cases these misleading annotations were offered as revisions of the original, more appropriate annotations of these proteins as "hypothetical" [Bult et al., 1996]. GeneQuiz identified another protein from M. jannaschii (MJ0392, see Table 4) containing the CBS domain and duly annotated it as an IMP dehydrogenase. Even after the illuminating report of Bateman [1997] had been published, and CBS domains were clearly marked in SwissProt entries, ten proteins of Methanobacterium thermoautotrophicum containing this domain were annotated as IMP dehydrogenase-related ones [Smith et al., 1997]. In the recently published genome of Archaeoglobus fulgidus, some such proteins are annotated simply as conserved hypothetical ones, while others (AF0847, AF1259) are still annotated as putative IMP dehydrogenases [Klenk et al., 1997].

 The simplest (but not most reliable) way to circumvent the problem of multi-domain organization of proteins is to compare the length of a match for each database hit with the length of the query sequence, which could indicate possible conflicts. This method is implemented in the annotation engine of the WIT database. The new WWW interfaces for gapped BLAST and PSI-BLAST [Altschul et al., 1997] on the NCBI server present schematic graphical alignments, showing the location of the hit as compared to the query sequence. Another option is to compare the query protein with the Clusters of Orthologous Groups database [Tatusov et al., 1997], where multi-domain proteins are divided into separate domains whenever their single-domain orthologs are found in any of the completely sequenced genomes. Significant hits with proteins from more than one COG would indicate a likely multi-domain organization of the query.
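
The length comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of WIT or of the BLAST interfaces: the `Hit` record, the `domain_conflict` function, and the 70% coverage cutoff are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment; coordinates are 1-based and inclusive."""
    name: str
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int
    hit_length: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence of the given length spanned by start..end."""
    return (end - start + 1) / length


def domain_conflict(hit: Hit, query_length: int, min_cov: float = 0.7) -> str:
    """Classify a hit by how much of the query and of the hit is aligned.

    The 0.7 cutoff is an arbitrary illustrative value, not a
    recommendation from the text.
    """
    q_cov = coverage(hit.query_start, hit.query_end, query_length)
    h_cov = coverage(hit.hit_start, hit.hit_end, hit.hit_length)
    if q_cov >= min_cov and h_cov >= min_cov:
        return "full-length match"
    if q_cov >= min_cov:
        return "query matches only part of the hit"
    if h_cov >= min_cov:
        return "hit matches only part of the query"
    return "partial match on both sides"


# Invented coordinates: a 230-residue query aligned to the C-terminal
# half of a 478-residue database protein.
example = Hit("example_hit", 5, 225, 250, 470, 478)
print(domain_conflict(example, query_length=230))
```

A result such as "query matches only part of the hit" does not by itself identify the function of the query; it only signals that the annotation of the database protein should not be transferred without examining which domain is shared.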

 

Non-orthologous gene displacement

In different organisms, the same function can be performed by unrelated or distantly related proteins [Koonin and Mushegian, 1996; Koonin et al., 1996]. It appears that in many cases, these enzymes have evolved by shifting the substrate specificity of a related but distinct enzyme. Fig. 4 shows an alignment of gluconate kinases from E. coli and B. subtilis. It is clear that GNTK_BACSU is unrelated to GNTK_ECOLI and is a paralog of GLPK_BACSU. Were these activities not known from biochemical data [Fujita et al., 1986], GNTK_BACSU would be confidently annotated as glycerol kinase. Similar cases have been discovered in all enzyme classes and appear to be more common than previously thought [M.Y.G., D.R. Walker and E.V.K., manuscript in preparation].
 

 

Figure 4: Non-orthologous gene displacement: two types of gluconate kinase in bacteria.
The sequences are: 1, GNTK_BACSU; 2-4, gluconate kinases; 2, GNTK_ECOLI; 3, GNTV_ECOLI; 4, GNTK_SCHPO; 5-6, xylose kinases; 5, XYLB_ECOLI; 6, XYLB_STAXY; 7-8, glycerol kinases; 7, GLPK_BACSU; 8, GLPK_MYCGE.

 

In many protein families, enzymes and binding proteins with different specificities may be as similar to each other as those with the same specificity. Ignoring this easily results in overpredictions, as, for example, in the case of MG225 (Table 2). This membrane protein has significant sequence similarity to lysine, histidine, and arginine permeases; it can be confidently predicted to mediate amino acid transport, but the available data are insufficient to predict the exact specificity.

An even more important case is MG137. Originally annotated by the TIGR team as dTDP-4-dehydrorhamnose reductase RfbD, it was re-annotated by the GeneQuiz team as amine oxidase. We considered the TIGR annotation to be adequate, and were unable to find any justification for its correction by Ouzounis et al. [1996]. As the only readily identifiable functional motif in this protein was the glycine-rich loop typical of dinucleotide-binding enzymes, it was tentatively annotated as a dehydrogenase participating in lipopolysaccharide biosynthesis. Subsequent experimental studies on this operon finally identified the E. coli ortholog of MG137 as UDP-galactose mutase [Nassau et al., 1997], which is now reflected in the SWISS-PROT description of MG137 (GLF_MYCGE); this is indeed a FAD-utilizing enzyme, so the conserved motif is a part of the FAD-binding site as predicted. This example shows the inherent limitations of functional predictions, manual or automatic, based solely on protein sequence motif conservation; in a number of cases, such predictions may correctly identify certain important structural and functional properties of a protein but miss the specific activity.
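
The reasoning applied to MG225 above, where several significant hits disagree on specificity and the annotation is therefore kept at the family level, can be written down as a simple rule. The sketch below is only an illustration: the function name, the E-value cutoff of 0.05, and the idea of passing in a curated family-level label are assumptions, and apart from the LYSP_ECOLI value from Table 2 the E-values in the usage example are invented.

```python
def consensus_annotation(hits, general_label, e_cutoff=0.05):
    """Choose an annotation from all significant hits, not only the best one.

    hits          -- list of (specific_function, e_value) tuples
    general_label -- family-level description to fall back on
                     (assumed to be supplied by a curator)
    e_cutoff      -- illustrative significance threshold
    """
    significant = {func for func, e_value in hits if e_value <= e_cutoff}
    if not significant:
        return "unknown"
    if len(significant) == 1:
        # All significant hits agree, so the specific label is kept.
        return significant.pop()
    # Hits disagree on specificity: report only what they have in common.
    return general_label


# Only the first E-value (LYSP_ECOLI, Table 2) comes from the text;
# the other two are invented for illustration.
mg225_hits = [
    ("lysine permease", 0.019),
    ("histidine permease", 0.03),
    ("arginine permease", 0.04),
]
print(consensus_annotation(mg225_hits, "amino acid permease"))
```

Agreement among hits is a necessary rather than a sufficient condition for a specific annotation: as the gluconate kinase example shows, even unanimous similarity can point to the wrong substrate.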

 

Operon disruption

Comparison of genome organization of bacteria and archaea indicated that only very few operons are conserved across large phylogenetic distances [Mushegian and Koonin, 1996; Koonin and Galperin, 1997]. As a result of operon disruption, genes that belong to the same operon in one species are likely to be scattered in other species, which may complicate their identification. This is often observed even in closely related species [Watanabe et al., 1997]; hence, functional prediction based on gene position in operons sometimes leads to errors. Thus, RFAK_ECOLI was predicted to code for lipopolysaccharide 1,2-N-acetylglucosamine transferase solely on the basis of its position in the E. coli rfa operon, even though the products of these genes in E. coli and S. typhimurium showed very little similarity [Klena et al., 1992]. A recent study of a homologous protein in Haemophilus ducreyi identified it as D-glycero-D-manno-heptosyl transferase [Gibson et al., 1997].
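
How rarely gene neighborhoods survive can be checked directly once genes in two genomes have been assigned to shared families. The sketch below is a minimal illustration and not the GENESTRING program listed in Table 1: it assumes each genome is given as a list of family labels in chromosomal order, and it ignores strand, circularity, and strings longer than two genes. The family labels in the usage example are invented.

```python
def adjacent_pairs(gene_order):
    """Unordered pairs of distinct gene families that are direct neighbors."""
    return {
        frozenset(pair)
        for pair in zip(gene_order, gene_order[1:])
        if pair[0] != pair[1]
    }


def conserved_pairs(genome_a, genome_b):
    """Neighboring gene pairs present in both genomes."""
    return adjacent_pairs(genome_a) & adjacent_pairs(genome_b)


# Invented gene orders: only the f2-f3 neighborhood is shared.
genome_a = ["f1", "f2", "f3", "f4"]
genome_b = ["f3", "f2", "f9", "f1", "f4"]
print(conserved_pairs(genome_a, genome_b))
```

A gene pair that is adjacent in only one of the two genomes gives no positional support for transferring a function, which is the situation that misled the RFAK_ECOLI annotation.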

 




CONCLUSIONS

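It appears that errors in genome annotation most frequently arise from the factors examined above: non-critical use of annotations from existing database entries, reliance on the annotation of the best database hit alone, insufficient masking of low complexity regions, neglect of the multi-domain organization of the query proteins or the database hits, non-critical functional inferences from the position of a gene in an operon, and non-orthologous gene displacement.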

These pitfalls plague both manual and automatic prediction but are particularly difficult to avoid in automated systems for genome annotation. The solution should lie both in the continuing involvement of expert biologists in genome annotation and in the incorporation of more sophisticated logic into automated methods.


REFERENCES