ISB Home

- Article -

Volume 1

Full article

In Silico Biology 1, 0007 (1998); ©1998, Bioinformation Systems e.V.  

Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption

Michael Y. Galperin1 and Eugene V. Koonin2

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA


Functional annotation of proteins encoded in newly sequenced genomes can be expected to meet two conflicting objectives: (i) provide as much information as possible, and (ii) avoid erroneous functional assignments and over-predictions. The continuing exponential growth of the number of sequenced genomes makes the quality of sequence annotation a critical factor in the efforts to utilize this new information. When dubious functional assignments are used as a basis for subsequent predictions, they tend to proliferate, leading to "database explosion". It is therefore important to identify the common factors that hamper functional annotation. As a first step towards that goal, we have compared the annotations of the Mycoplasma genitalium and Methanococcus jannaschii genomes produced in several independent studies. The most common causes of questionable predictions appear out to be: i) non-critical use of annotations from existing database entries; ii) taking into account only the annotation of the best database hit; iii) insufficient masking of low complexity regions (e.g. non-globular domains) in protein sequences, resulting in spurious database hits obscuring relevant ones; iv) ignoring multi-domain organization of the query proteins and/or the database hits; v) non-critical functional inferences on the basis of the functions of neighboring genes in an operon; vi) non-orthologous gene displacement, i.e. involvement of structurally unrelated proteins in the same function. These observations suggest that case by case validation of functional annotation by expert biologists remains crucial for productive genome analysis.

Key words: Genome comparison, prediction of gene functions, systematic error, database annotation, automatic genome annotation, manual genome annotation, low complexity, non-globular domains, multi-domain proteins, non-orthologous gene displacement