In Silico Biology 3, 0037 (2003); ©2003, Bioinformation Systems e.V.
Göttingen Genomics Laboratory, Institut für Mikrobiologie und Genetik,
Grisebachstrasse 8, D-37077 Göttingen, Germany
Phone: ++49(0)551-39-3823, Fax: ++49(0)551-39-3805
1 Abteilung Allgemeine Mikrobiologie, Institut für Mikrobiologie und Genetik
2 Abteilung Molekulare Genetik und Präparative Molekularbiologie, Institut für Mikrobiologie und Genetik
* corresponding author
Edited by H. Michael; received June 23, 2003; revised and accepted August 22, 2003; published September 16, 2003
The performance of gene-predicting tools varies considerably if evaluated with respect to the parameters sensitivity and specificity or their capability to identify the correct start codon. We set out to validate tools for gene prediction and to implement a metatool named YACOP, which combines existing tools and achieves a higher performance. YACOP parses and combines the output of the three gene-predicting systems Critica, Glimmer and ZCURVE. It outperforms each of the programs tested, combining high sensitivity and specificity values with a larger number of correctly predicted gene starts. The performance of YACOP and the gene-finding programs was tested by comparing their output with a carefully selected set of annotated genomes. We found that the problem of identifying genes in prokaryotic genomes by means of computational analysis has been solved satisfactorily. In contrast, the correct localization of the start codon still appears to be a problem, as in all cases under test at least 7.8% and up to 32.3% of the positions given in the annotations differed from the locus predicted by any of the programs tested. YACOP can be downloaded from http://www.g2l.bio.uni-goettingen.de.
Key words: gene prediction, specificity, sensitivity
One of the first and most critical steps of genome annotation is the process of predicting genes that code for proteins. All subsequent modules of the software pipeline used to assign gene function rely on the initial selection of coding elements. This dependency explains why the identification of genes must be set up carefully. There are two algorithmic concepts appropriate for recognizing genes: 1.) A sequence can be classified as a gene if it shows significant similarity to a sequence that has been annotated as coding and deposited in a database. This method is reliable and robust; however, it fails to identify previously unknown gene sequences. Therefore, this approach is usually supplemented by other methods (as in Critica [Badger and Olsen, 1999]). 2.) A statistical analysis of a sequence may indicate its coding potential. It is known that the distributions of nucleotides in coding and non-coding sequences differ in a statistically significant manner [Fickett, 1982; Staden, 1984]. Gene prediction based on statistical models circumvents the limitation mentioned above and has the capability of identifying new gene sequences. This is why most tools developed for gene prediction [e. g. Besemer et al., 2001; Borodovsky and McIninch, 1993; Frishman et al., 1998; Guo et al., 2003; Salzberg et al., 1998] implement algorithms that rely on statistical concepts (reviewed e. g. in Fickett, 1996; Burge and Karlin, 1998).
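The statistical concept can be illustrated with a toy score: a per-codon log-likelihood ratio between a coding model and a uniform background. The codon frequencies below are invented for illustration only; they are not trained parameters of any of the tools discussed here.

```python
from math import log

# Hypothetical codon frequencies of a "coding" model (toy values, not trained).
CODING_FREQ = {"ATG": 0.04, "AAA": 0.05, "GCT": 0.03}
BACKGROUND_FREQ = 1.0 / 64  # uniform null model over all 64 codons

def coding_score(orf):
    """Sum of per-codon log-likelihood ratios; positive values favour 'coding'."""
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        p_coding = CODING_FREQ.get(codon, 0.01)  # small default for unseen codons
        score += log(p_coding / BACKGROUND_FREQ)
    return score
```

Real gene finders replace this toy model with Markov chains or more elaborate statistics, but the classification principle is the same: score each ORF and compare the score against a cut-off.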
In principle, gene prediction has to solve a classification problem. In a set of open reading frames (ORFs), those have to be identified that code for proteins. However, as for many other classification problems, it is not possible to separate the objects (ORFs) into two non-overlapping classes of positives (genes) and negatives (non-coding ORFs). Due to the limited decision strength of the statistical methods, caused by the noisy signals that have to be analyzed, the two classes overlap. There is no way to escape the following situation: any cut-off value, which is necessary to unambiguously separate the objects into the two classes of positives and negatives, implies a specific ratio of false positive and false negative assignments.
For the annotation process, it could be fatal to miss genes. This is why the cut-off value is usually set conservatively in order to guarantee a small number of false negatives (missed genes). However, there is a price to be paid for deciding in favour of safety: under these conditions, the classification system generates a larger number of false positive predictions. These assignments have to be refuted during the annotation process, which is a tedious and laborious task.
For many classification problems, it is possible to increase the decision strength by combining several parameters or statistical models supplementing each other. There is no reason to assume that this concept does not enhance the performance in gene finding. It was already demonstrated that merging predictions works as expected [Rogic et al. 2002; Guo et al., 2003]. A prerequisite for any rational approach to combining methods is some knowledge about the performance of the tools considered. This is why we first validated the publicly available gene finders Critica [Badger and Olsen, 1999], Glimmer [Salzberg et al., 1998; Delcher et al. 1999], Orpheus [Frishman et al., 1998] and ZCURVE [Guo et al., 2003], before we assessed combinations of these tools.
There is a second argument for considering the combination of gene finding tools: The prediction of the start codon is by no means a trivial task. All codons that might serve as a gene start may also occur within the coding sequence. For this reason the analysis of additional signals like ribosomal binding sites [Suzek et al., 2001] is often integrated into gene predicting systems. Again, a combination of several tools might show a higher performance than any stand-alone system.
The accuracy of gene prediction [Fickett and Tung, 1992] has been assessed both for eukaryotic [Guigó, 1997; Guigó et al., 2000] and prokaryotic [Guo et al., 2003] genomes. We focused on the improvement of predicting genes and the correct start codon in prokaryotic sequences. We found that the combination of the Critica output with a subset of genes predicted by Glimmer and ZCURVE performed best with respect to sensitivity without considerably decreasing the specificity. Our results determined the design of the metatool YACOP (Yet Another Combination Of Predictions), which we introduce here.
In order to perform a critical validation of gene finding tools, it is necessary to select a training set and to define the assessment criteria.
Ideally, a training set should consist of a sufficiently large number of DNA sequences covering the complete range of GC-content values occurring in prokaryotic genomes. Additionally, it would be necessary to know the exact localization of the genes embedded in these sequences. Unfortunately, such a training set does not exist. Therefore, we selected a number of annotated genomic data sets and compared the output produced by the tools with the entries of the respective annotation. We considered only genomes that we could positively confirm had not been annotated by means of the gene-finding programs we wanted to evaluate. The genomes comprising the test set were Buchnera sp. APS [Shigenobu et al., 2000], C. acetobutylicum [Nolling et al., 2001], L. lactis [Bolotin et al., 2001], H. pylori [Tomb et al., 1997], B. subtilis [Kunst et al., 1997], E. coli K-12 [Blattner et al., 1997] and M. tuberculosis [Cole et al., 1998]. The lowest GC-content we tested was found in Buchnera sp. APS (26%), the highest one in M. tuberculosis (65%).
Criteria for the evaluation of classification systems
For the evaluation of a classification system, the criteria defined by formulas (1) - (3) are frequently used. Let TP be the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. Then we can define the following terms:

selectivity = TN / (TN + FP)   (1)
sensitivity = TP / (TP + FN)   (2)
specificity = TP / (TP + FP)   (3)

The term selectivity quantifies the portion of negative cases correctly identified by the classification system. The term sensitivity gives the probability of correctly classifying a positive case. The specificity of a classification system is the fraction of correctly predicted positive cases among all cases predicted as positive.
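These criteria translate directly into code. As a worked example, Critica's 2006 correct predictions out of the 2254 annotated L. lactis genes > 50 codons (see Tab. 1) correspond to a sensitivity of about 0.890.

```python
def selectivity(tn, fp):
    # Fraction of negative cases correctly identified
    return tn / (tn + fp)

def sensitivity(tp, fn):
    # Fraction of annotated genes that were predicted
    return tp / (tp + fn)

def specificity(tp, fp):
    # Fraction of predictions that agree with the annotation
    return tp / (tp + fp)

# L. lactis example from Table 1: 2006 of 2254 annotated genes found,
# hence FN = 2254 - 2006 = 248 and sensitivity ≈ 0.890.
print(round(sensitivity(2006, 248), 3))
```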
Here we report on the results we achieved for Critica (version 105b [Badger and Olsen, 1999]), Glimmer (versions 2.02 and 2.10 [Delcher et al., 1999]), Orpheus (version 2 [Frishman et al., 1998]) and ZCURVE [Guo et al., 2003]. Orpheus was used in combination with DPS [Huang, 1996]; the minimal prediction length was set to 50 codons. Glimmer utilized rbsfinder [Suzek et al., 2001] to predict ribosomal binding sites.
Rating Gene-Finding Tools
For the evaluation of gene-finding programs it is difficult to unambiguously define the number of true negatives. In addition, it is more important to rate the tools with respect to their capability of identifying positive cases (genes). Therefore, we only compared sensitivity and specificity of the tools. In order to determine the numbers TP, FP and FN, we parsed the output produced by the gene-finding programs and compared the predictions with the annotation of the genomic data sets. We observed that the number of false positive predictions increased significantly for short genes. Therefore, for a first analysis, we did not consider genes and predictions comprising less than 50 codons. A compilation of the results is listed in Tab. 1.
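The bookkeeping described above can be sketched as follows. We assume here, as is common practice for prokaryotic gene finders (though the matching rule is not spelled out in the text), that a prediction counts as a true positive when it shares strand and stop position with an annotated gene, irrespective of the predicted start.

```python
def score_predictions(predicted, annotated, min_codons=50):
    """Each gene is a (strand, start, stop) tuple; coordinates in bp.
    Returns (TP, FP, FN) after discarding entries < min_codons."""
    keep = lambda genes: {g for g in genes
                          if abs(g[2] - g[1]) + 1 >= min_codons * 3}
    pred, anno = keep(predicted), keep(annotated)
    # Match on strand and stop position only, ignoring the start codon.
    anno_keys = {(strand, stop) for strand, _, stop in anno}
    tp = sum(1 for strand, _, stop in pred if (strand, stop) in anno_keys)
    fp = len(pred) - tp
    fn = len(anno) - tp
    return tp, fp, fn
```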
Table 1: Performance of gene finding tools and their combination.
For the programs Critica, Glimmer, Orpheus, ZCURVE and YACOP the numbers of false positive (FP) and true positive (TP) predictions were determined by comparing the output of the programs with the annotation of the tabulated genomic data sets. Predictions and entries shorter than 50 codons were ignored. The column TP_SC gives the number of predicted start codons that are in agreement with the annotation. For the combination Glimmer 2.10 ∩ ZCURVE (marked Gl 2.10 ∩ ZC), the start codons predicted by ZCURVE were assigned. These three parameters are given as absolute numbers and as percent values. Example: Critica correctly predicted 2006 (89.0%) of the 2254 genes > 50 codons listed in the annotation of L. lactis. GC is the GC-content of the genome in percent. The column Annot gives the number of annotated genes > 50 codons. The last row gives the mean of the above percent values.
The results can be summarized as follows: Critica was the program that predicted in all cases the lowest number of false positives and the lowest number of true positives. Glimmer and ZCURVE were the tools that predicted in most cases the highest number of true positives. ZCURVE and Orpheus generated in all cases a larger number of false positives than Glimmer. Version 2.10 of Glimmer predicted both fewer false positives and fewer true positives than version 2.02 (data not shown). For the application intended here, we preferred version 2.02, as it predicted a higher number of true positives.
Our first aim was to find a formula for combining the different predictions that results in an optimal pair of high sensitivity and specificity values. To reach this goal, we evaluated subsets created according to the rules of set theory. We obtained the best results when we merged for YACOP the output of three programs according to the following Boolean operation: Critica ∪ (Glimmer ∩ ZCURVE). In order to reduce the large number of false positives, YACOP accepts from Glimmer and ZCURVE only predictions comprising at least 50 codons.
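The combination rule can be expressed with Python set operations. Genes are represented here by opaque identifiers (e.g. strand/stop pairs); the length filter applies only to the Glimmer/ZCURVE contribution, as described above.

```python
def yacop_combine(critica, glimmer, zcurve, lengths, min_codons=50):
    """Critica ∪ (Glimmer ∩ ZCURVE), with the intersection restricted
    to predictions of at least min_codons. lengths maps gene id -> codons."""
    long_enough = {g for g in glimmer & zcurve if lengths[g] >= min_codons}
    return critica | long_enough
```

A short usage example: a 30-codon gene found only by Critica is kept, while a gene found by both Glimmer and ZCURVE enters the result only if it passes the length filter.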
The mean number of false positives created by YACOP was 3.9% higher than the lowest value of 1.7%, achieved by Critica. However, YACOP gained, if compared to Critica, on average 6.4% of true positive predictions. Compared to ZCURVE, which attained the highest mean value of true positives, 0.6% of true positive cases were lost. However, the mean number of false positives predicted by YACOP was 8.7% lower. In the worst case (genome of M. tuberculosis), the number of false positives increased from 2.2% (attained by Critica) to 10.0% (YACOP). The number of correctly predicted genes grew in the same genome from 91.0% (value for Critica) to 97.6%. This genome had the highest GC-value we tested. In all other cases, the increase of false positives was < 4.6%. As expected, YACOP and the combination (Glimmer ∩ ZCURVE) showed a similar performance with respect to the parameters discussed so far.
Up to now, we only considered predictions > 50 codons. YACOP performed even better if the tools were evaluated without a lower limit for gene length. The enhancement gained by YACOP can best be seen when specificity and sensitivity of the approaches are compared (see Tab. 2). Our combination outperforms each of the other tools due to its consistently high performance values. The comparison of mean values (see last two rows of Tab. 2) makes clear that most of the false positive predictions are short ORFs. Critica was the only tool not losing several percent of specificity when we included predictions < 50 codons. Therefore, we accepted for YACOP only those predictions < 50 codons that were generated by Critica. Using this configuration, YACOP did not predict 215 genes found in the annotation of the E. coli K-12 genome. 167 of these entries were annotated as "hypothetical" or "putative"; 209 of these genes were longer than 80 codons.
Table 2: Specificity, sensitivity and performance of gene start prediction for Critica, Glimmer, Orpheus, ZCURVE, Glimmer ∩ ZCURVE and YACOP.
GC gives in percent the GC-content of the genomic data sets analyzed. The column Spec gives the specificity calculated according to formula (3); sensitivity (Sens) was calculated according to formula (2). The column SC gives the fraction of predicted start codons that were in agreement with the annotation. For the determination of the mean value (genes > 50 codons) given in the last row, predicted genes and annotations shorter than 50 codons were ignored. For Glimmer, versions 2.02 and 2.10 were tested. Orpheus was only used with a lower limit of 50 codons for predictions. The start codons predicted by ZCURVE were assigned to the genes of the set (Glimmer 2.10 ∩ ZCURVE).
Start Codon Prediction
It is known that the identification of the correct start codon is a difficult task [Frishman et al., 1999]. The comparison of the predicted gene starts with the annotations showed that Critica identified start positions in all genomes most specifically: its ratio (TP_SC)/TP was maximal in all genomes, followed by ZCURVE. This is why YACOP adopts, for all genes predicted by Critica, Critica's predicted start position; it assigns the start codon predicted by ZCURVE to all other genes. Surprisingly, in all genomes we tested, at least 6.5% and up to 35.6% of the gene starts (genes > 50 codons) given in the annotations differed from the predictions made by any of the four tools. These findings suggest that the concepts used to identify the gene start are not yet sufficient, even for the simpler case of prokaryotic genomes. This especially applies to genomes with a high GC-content. Again, by combining the tools, the mean number of correctly predicted start positions could be increased by 2.2% (compared to ZCURVE).
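The start-codon assignment rule described above can be sketched in a few lines: take Critica's start position when Critica predicted the gene, otherwise fall back to the start predicted by ZCURVE.

```python
def assign_starts(genes, critica_starts, zcurve_starts):
    """Both *_starts map gene identifiers to predicted start positions.
    Critica's start takes precedence; ZCURVE's is the fallback."""
    return {g: critica_starts.get(g, zcurve_starts.get(g)) for g in genes}
```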
It was shown that the predictive quality of a gene-finding algorithm is correlated with its capability of identifying ribosomal binding sites [Frishman et al., 1999]. We wanted to find out whether the identification of ribosomal binding sites enhances the performance of localizing the gene start. Among the genomic data sets currently deposited in GenBank, only the annotation of S. typhimurium LT2 [McClelland et al., 2001] listed the positions of ribosomal binding sites (rbs). Again, we restricted the analysis to genes > 50 codons. By parsing the annotation, we created two sets: RBS, which contained those 3180 genes downstream of a predicted rbs, and NO_RBS, which contained the remaining 1217 genes. We determined the number of gene starts concordant with the annotation (see Tab. 3). For all tools, the number of start positions in agreement with the annotation was approximately 3% higher for the set RBS than for NO_RBS. Interestingly, this increase also held for ZCURVE, which does not explicitly model ribosomal binding sites. It might be that the sequence composition around the gene start downstream of a rbs differs from the non-rbs case and thus allows a more precise localization. This hypothesis relies, however, on the correct annotation of ribosomal binding sites in the Salmonella data set. Under this assumption, it seems that the identification of ribosomal binding sites does not, at least for the genome of S. typhimurium, have a strong effect on the prediction of genes and the localization of the gene start. None of the four programs predicted more than 82.3% of the gene starts found in the annotation. As Glimmer was used for gene finding in S. typhimurium and as Critica was trained on Salmonella data, we did not interpret the results in more detail and did not consider additional performance parameters derived from this data set for the evaluation of the tools presented above.
Table 3: Identification of genes and start positions in S. typhimurium.
The output of the programs Critica, Glimmer (version 2.10), Orpheus, ZCURVE and YACOP was compared with the entries > 50 codons given in the annotation. These 4397 genes were separated into two sets: set RBS consisted of 3180 genes, which follow an annotated ribosomal binding site (rbs); set NO_RBS consisted of 1217 genes having no rbs. For each entry, the number of predictions consistent with the annotation and the respective fraction in percent is given. Abbreviations: TP, number of true positives; TP RBS, true positives for the set RBS; TP NO_RBS, true positives identified for NO_RBS; TP_SC, number of start positions in agreement with the annotation; TP_SC RBS, ditto for genes downstream of an annotated rbs; TP_SC NO_RBS, ditto for genes having no rbs.
The metatool YACOP is written in Perl. Its source code plus additional documentation is available for download at http://www.g2l.bio.uni-goettingen.de. We have implemented several modes e. g. to control the composition of the final prediction via Boolean operators. These modes can be set by editing a configuration file. One goal of our design was to support the integration of additional tools to the greatest possible extent. We explained the architecture in a readme-file, which is part of the download. A prerequisite for the usage of YACOP is, of course, the correct installation of all tools introduced above.
It would be interesting to identify those algorithmic concepts or parameters that determine the performance of the tools tested and combined here. A prerequisite for such an in-depth analysis is a precise and up-to-date documentation of the algorithms and the parameter settings used, or the source code of the programs. In many cases, only the basic principles of the algorithms are published, and frequently the source code is not available. Therefore, we had to rely on the authors of the tools by assuming an implementation and parameter set trimmed for optimal performance. Critica, Glimmer and ZCURVE extract the training set needed for parameter optimization from the input data. However, as our results show, even this dynamic adaptation of the parameters does not guarantee a performance that is completely independent of the GC-content. In general, the authors of Critica selected a configuration that reduced the number of false positive predictions. Consequently, the number of true positives is lowered too. These settings are presumably responsible for Critica's better performance on short genes. Due to the reduced sample size, statistical parameters determined for short genes vary in a broader range, and therefore a classification may fail more frequently.
Our results demonstrate that each of the algorithmic concepts tested here predicts a different set of putative genes. This is why we combined tools with individual and dissimilar approaches: Critica implements a heuristic and rather simple statistical model based on codon frequencies inferred from subsequences homologous to entries deposited in databases. The statistical concepts utilized in Glimmer are Hidden Markov Models; ZCURVE relies on a specific combination of indicators deduced from frequencies of short subsequences. As mentioned above, YACOP allows - due to its flexible architecture - the integration of additional tools and the adaptation of the Boolean expression necessary to generate the output.
One might argue that the standards, i. e. the annotations we used to evaluate the tools, massively depend on the algorithmic concepts tested here. However, even from that point of view, the comparative analysis is valuable, as it highlights the specific properties of the gene-finding programs. It is possible that any gene-finding approach founded on computational methods misses a number of genes. However, if one assumes that the annotation of E. coli K-12, one of the best-studied species, contains all the biochemical information on E. coli genes, one can state that the tools we tested did not miss any relevant entry. This notion is in agreement with the findings of Guo et al., 2003. The authors tested the performance of gene finding and of predicting the start codon by using the database EcoGene [Rudd, 2000], which contains experimentally validated genes of E. coli, as a test bed for ZCURVE and Glimmer. Both programs performed better for this set than for the complete genome of E. coli analyzed here: the number of true positive predictions was higher. We did not consider the analysis of these data using Critica, which was optimized and evaluated on Salmonella genes. This species is taxonomically closely related to E. coli; therefore, training data and test set may overlap. The performance of predicting the start codon for EcoGene entries corresponds to our findings and supports our preference for ZCURVE over Glimmer. From these results we concluded that the approach used here and elsewhere [Guo et al., 2003] to evaluate gene-finding programs is valid and informative. It is unquestionable that all tools identify a high percentage of experimentally validated genes. The focus of our study was the reduction of false positive predictions in order to gain an optimal combination of sensitivity and selectivity.
Our results show that the combination of three gene-finding tools increases the performance significantly. In Guo et al., 2003 it was demonstrated that the intersection of the predictions made by Glimmer and ZCURVE contains less false positives than each of the two programs. The combination we propose here outperforms this approach with its higher sensitivity and better quality of predicting the gene start.
The performance of the tools decreased in genomes with a GC-content > 50%. This was observed in several genomic data sets not listed here. Of special importance is the deterioration in predicting the exact gene start. The results we presented were achieved with default settings. It is possible that a parameter set carefully tuned for each genome stabilizes the predictive power of the tools we tested. Nevertheless, we gained the impression that the problem of identifying the correct gene start has not yet been solved in a satisfying manner, even for prokaryotic genomes. For E. coli K-12, the tools under test predict at most 76.8% of the gene starts found in the annotation. In the worst case, more than 50% of the predicted start codons are questionable if annotators accept predictions generated by a single gene finder executed with default parameters. If we assume that the annotators had good reasons for selecting a differing gene start, we can deduce that additional efforts have to be made to enhance the tools. In addition, these findings support the notion that it is essential to critically survey the positions of gene starts deposited in the databanks in order to counteract transitive error propagation.
We consider two strategies to further improve YACOP: 1.) The evaluation and integration of novel methods for identifying translational initiation sites like support vector machines [Zien et al., 2000] to enhance the prediction of the gene start. 2.) A second round of fine-tuned gene prediction in order to assess suspiciously long and unassigned DNA-patches. Thus, it might be possible to further increase YACOP's performance.
The project was carried out within the framework of the Competence Network Göttingen "Genome research on bacteria" (GenoMik) financed by the German Federal Ministry of Education and Research (BMBF). We thank the authors of the programs for supplying us with software.