The next step in genome research is the structural and functional annotation of translated ORFs. These annotations can be used in comparative genomics, screening, drug design and pathway reconstruction. The first step in genome annotation is to search for homologous sequences in protein sequence databases. The state-of-the-art tool for such database searches is PSI-BLAST (Position Specific Iterated Basic Local Alignment Search Tool, [Altschul et al., 1997]). The performance of PSI-BLAST and other database search tools in identifying homologues of a given query in a sequence database has been measured by others [Park et al., 1998]. However, these benchmarks do not fit the requirements of genome annotation. Our benchmark is meant to measure how well PSI-BLAST annotates genomic protein sequences correctly. We used PSI-BLAST to produce and interpret structural annotations for the bacterial genomes of Mycoplasma genitalium and Mycobacterium tuberculosis.
To measure the performance of PSI-BLAST we first constructed a pseudo genome consisting of sequences of known structure from the SCOP database (Structural Classification Of Proteins, [Murzin et al., 1997]) that includes single and multi domain proteins. We call this the SCOP-genome. SCOP groups proteins into the same superfamily if there is strong evidence (from structure, function and sequence) that they diverged from a common ancestor. Proteins from the same superfamily are homologues, though homology is not always obvious from the sequence information alone. Another part of SCOP was taken as the database to search for homologues of the SCOP-genome. This database contains only sequences that share weak homology to a particular query, which focuses our benchmark on the twilight zone and the detection of remote homologies. PSI-BLAST was then tested on its ability to find, for each query of the SCOP-genome, the sequences in that database which are in the same superfamily. The `classical' success rate corresponds to the fraction of possible relationships correctly identified by PSI-BLAST (what we call the `one-to-one' success rate). For genome annotation only one relationship has to be identified, and the success rate is the fraction of ORFs in the SCOP-genome for which the correct annotation was found (this relationship we call `one-to-many'). It is obvious that `one-to-many' success must be at least as high as `one-to-one' success but will often be higher.
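The two success rates defined above can be illustrated with a minimal sketch. The function name `success_rates`, the representation of relationships as `(query, target)` pairs and the toy identifiers are assumptions made for illustration only; they are not taken from the benchmark itself.

```python
def success_rates(true_pairs, found_pairs):
    """Return (one_to_one, one_to_many) success rates.

    true_pairs:  set of (query, target) pairs in the same superfamily
                 (hypothetical representation of the reference data).
    found_pairs: set of (query, target) pairs reported by the search.
    """
    if not true_pairs:
        raise ValueError("true_pairs must not be empty")
    # Only reported pairs that are real homologues count as successes.
    correct = true_pairs & found_pairs
    # Queries that have at least one homologue in the target database.
    queries = {query for query, _ in true_pairs}
    # Queries for which at least one correct homologue was found.
    annotated = {query for query, _ in correct}
    one_to_one = len(correct) / len(true_pairs)
    one_to_many = len(annotated) / len(queries)
    return one_to_one, one_to_many


# Toy data: three queries with 3, 1 and 2 homologues in the database.
true_pairs = {
    ("q1", "t1"), ("q1", "t2"), ("q1", "t3"),
    ("q2", "t4"),
    ("q3", "t5"), ("q3", "t6"),
}
# The search finds three true pairs and one false pair ("q2", "t9").
found_pairs = {("q1", "t1"), ("q3", "t5"), ("q3", "t6"), ("q2", "t9")}

one_to_one, one_to_many = success_rates(true_pairs, found_pairs)
print(f"one-to-one:  {one_to_one:.2f}")
print(f"one-to-many: {one_to_many:.2f}")
```

In this toy case three of the six possible relationships are recovered, while two of the three queries receive at least one correct annotation, so the `one-to-many' rate is the higher of the two.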
The results of our benchmark are based on the domain level. PSI-BLAST results were used to construct a stacked multiple sequence alignment from which domain boundaries in the query sequence were identified. Domain boundaries (positions) were found with high accuracy: 65% of the boundaries are within an offset of 5 residues of the real domain boundaries as defined by SCOP. For 40% of the domains in our SCOP-genome PSI-BLAST found at least one sequence of the same superfamily (a homologue) in the target database; this corresponds to the `one-to-many' success rate. The success rate measured by `one-to-one' (finding all possible homologous pairs between the SCOP-genome and the target database) is less than two and a half times lower. One might expect that the more populated a superfamily is, the higher the `one-to-many' success rate, but this is not observed. This is explained by the decreasing success rate in remote `one-to-one' relationships.
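As a rough illustration of the boundary accuracy measure, the sketch below counts the fraction of predicted boundaries that lie within a given offset of the nearest reference boundary. The function name `fraction_within_offset`, the nearest-boundary matching rule and the example positions are assumptions; the text does not specify how predicted and SCOP boundaries were paired.

```python
def fraction_within_offset(predicted, reference, max_offset=5):
    """Fraction of predicted boundaries close to a reference boundary.

    predicted, reference: lists of residue positions (hypothetical data).
    A predicted boundary counts as correct if the nearest reference
    boundary is at most max_offset residues away (assumed matching rule).
    """
    if not predicted or not reference:
        raise ValueError("both boundary lists must be non-empty")
    hits = 0
    for position in predicted:
        # Distance to the closest reference boundary.
        nearest = min(abs(position - ref) for ref in reference)
        if nearest <= max_offset:
            hits += 1
    return hits / len(predicted)


# Toy example: offsets to the nearest reference boundary are 2, 3 and 20.
predicted_boundaries = [98, 207, 330]
reference_boundaries = [100, 204, 350]

fraction = fraction_within_offset(predicted_boundaries, reference_boundaries)
print(f"{fraction:.2f} of predicted boundaries within 5 residues")
```

Other pairing rules, for example matching boundaries one-to-one in sequence order, would give different numbers for the same data, so the rule chosen should be stated alongside any reported accuracy.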
From the benchmark we determined optimal parameters for using PSI-BLAST in database searches. Assigning homologous sequences of known structure to ORFs of the Mycoplasma genitalium and Mycobacterium tuberculosis genomes, we found a set of common folds in both genomes but also substantial differences in fold composition. About 30-40% of the residues in the two bacterial genomes could be assigned to a sequence of known structure.
The results of the benchmark show that 28% of the residues in the SCOP-genome can be assigned to a sequence of known structure. For 59% of the residues in the SCOP-genome PSI-BLAST was not able to find any homologue, 12% of the residues are unique (the only representative of a superfamily) and 1% are assigned to non-homologues (false assignments). From the ratio of correctly assigned to missing assignments we calculated the potential fraction of missing assignments in the two bacterial genomes; this is more than 30% of the residues in the genomes. This fraction may be reduced by even more sensitive search methods such as structure-enhanced fold recognition.
Our benchmark is a simulation of the situation in genome analysis and measures the performance of one of the most widespread database search methods. The benchmark can be used to interpret the results of such analyses and shows areas for future improvement in methodology. Our benchmark uses domain information from structure; therefore, domain information as provided by databases like ProDOM or PFAM is required when annotations are based on sequences without known structure.