In Silico Biology 2, 0018 (2002); ©2002, Bioinformation Systems e.V.  
G C B ' 0 1

AGenDA: Gene prediction by comparative sequence analysis

Oliver Rinner1,2,3 and Burkhard Morgenstern1,4




1GSF Research Center, MIPS / Institute of Bioinformatics,
Ingolstädter Landstraße 1,
85764 Neuherberg, Germany
2Physiologisch-Chemisches Institut,
Universität Tübingen,
Hoppe-Seyler-Str. 4,
72076 Tübingen, Germany

3current address: Brain Research Institute, ETH Zurich,
Winterthurerstrasse 190,
CH-8057 Zurich, Switzerland
4current address: International Graduate School for Bioinformatics and Genome Research,
University of Bielefeld,
Postfach 10 01 31
33501 Bielefeld, Germany





Edited by E. Wingender; received November 30, 2001; revised and accepted January 7, 2002; published March 15, 2002


Abstract

Comparative sequence analysis is a powerful approach to identify functional elements in genomic sequences. Herein, we describe AGenDA (Alignment-based GENe Detection Algorithm), a novel method for gene prediction that is based on long-range alignment of syntenic regions in eukaryotic genome sequences. Local sequence homologies identified by the DIALIGN program are searched for conserved splice signals to define potential protein-coding exons; these candidate exons are then used to assemble complete gene structures. The performance of our method was tested on a set of 105 human-mouse sequence pairs. These test runs showed that sensitivity and specificity of AGenDA are comparable with the best gene- prediction program that is currently available. However, since our method is based on a completely different type of input information, it can detect genes that are not detectable by standard methods and vice versa. Thus, our approach seems to be a useful addition to existing gene-prediction programs.

Availability: DIALIGN is available through the Bielefeld Bioinformatics Server (BiBiServ) at http://bibiserv.techfak.uni-bielefeld.de/dialign/ The gene-prediction program AGenDA described in this paper will be available through the BiBiServ or MIPS web server at http://mips.gsf.de.

Keywords: gene prediction, sequence alignment, comparative genome analysis, cross-species sequence comparison, phylogenetic footprinting, genome annotation, dynamic programming



Introduction

A major goal of large-scale sequencing projects such as the human genome project is to identify all genes in a given organism. Consequently, the development of automated gene-finding procedures has become one of the most active areas of research in Bioinformatics. For prokaryotic organisms, the task of gene identification is relatively easy as prokaryotic genomes are rather small and genes are not interrupted by introns. Here, all open reading frames exceeding some threshold length are likely to code for proteins. The gene-finding problem is much more complicated for eukaryotic organisms where the density of genes in the genome is by about two orders of magnitude lower than in bacterial genomes and genes typically consist of multiple exons separated by introns of varying length. During the last years, a large number of computer programs have been developed to identify genes in eukaryotic genome sequences, see Claverie (1997) and Stormo (2000) for excellent reviews. Recent studies show, however, that the reliability of these methods is limited [Burset and Guigó, 1996]. Moreover, many gene-prediction programs have originally been tested on genomic sequences of only a few kb in length where each sequence contained only a single gene. The performance of standard gene-prediction methods drops significantly when tested under more realistic conditions, i.e. on BAC-sized sequences (200 kb in length) containing multiple genes [Guigó et al., 2000]. Thus, the problem of gene-prediction is far from solved and progress made in gene-prediction methodology is not only theoretically interesting but also of highest practical importance.

Practically all existing gene-prediction programs rely on information derived from known genes. Roughly spoken, these methods can detect new genes if they "look like" those known genes, and the major differences between existing methods are in how they assess if a stretch of genomic DNA "looks like" known genes. Here, two distinct approaches are used: ab-initio or intrinsic methods use content statistics such as ORF length or codon usage together with signals like splice junctions to distinguish coding from non-coding regions. GRAIL [Uberbacher and Mural, 1991], GENEID [Guigó et al., 1992], GENSCAN [Burge and Karlin, 1997] and GENEMARK [Lukashin and Borodovsky, 1998] are among the most popular ab-initio programs. By contrast, extrinsic methods work by comparing genomic sequence to ESTs or proteins in databases. That is, instead of asking if a stretch of genomic DNA looks like genes usually look, these methods ask if a piece of genomic sequence looks like one particular known gene or protein. This idea has been implemented, for example, in the GENEWISE [Birney and Durbin, 1997] and PROCRUSTES [Gelfand et al., 1996] computer programs.

With the huge amount of genomic data that are now available, a new way of predicting genes and other functional elements in genomic sequences is emerging: it is possible to identify functional regions in genomic DNA by comparing evolutionary related genomic sequences with each other. In the last few years, this approach has been successfully applied to identify new functional elements in genomic sequences [Ansari-Lari et al., 1998; Batzoglou et al., 2000; Göttgens et al., 2000 and 2001; Jareborg et al., 1999; Loots et al., 2000], and it is now widely accepted that comparative sequence analysis is a powerful and universally applicable tool for functional genomics. For gene prediction, this means that rather than asking if a given genomic sequence shows local sequence similarity to a known gene, one would ask if it exhibits statistically significant similarity to some arbitrary region in the genome sequence of an evolutionary related organism. The idea behind this approach is simple: during evolution, functional parts of DNA or protein sequences are under selective pressure, so they tend to evolve slower and are generally more highly conserved than non-functional sequence. Thus, local sequence conservation usually indicates biological functionality. Bafna and Huson (2000), Batzoglou et al. (2000), Wiehe et al. (2001) and Novichkov et al. (2001) utilized this fact and proposed gene-prediction methods that rely on comparing syntenic sequences from related organisms to each other, see also Miller (2001) for a review of related approaches. An interesting combination of intrinsic and comparative methods has recently been proposed by Korf et al. (2001).

The dependence of traditional approaches to gene prediction on features derived from known genes makes it difficult to detect new genes with different properties. Moreover, wrongly annotated genes are automatically used to build new models for gene prediction tools that will then, in turn, pick up even more false genes thereby leading to a vicious circle of wrong sequence annotation. In contrast, homology-based approaches such as the one that we propose in this study, do not rely on information on known genes and can therefore avoid the pitfalls of the more traditional approaches.

The approach that we propose in this article starts with a pair of syntenic sequences from evolutionary related organisms that may have been found with standard database search tools such as BLASTX [Altschul et al. 1994]. In a first step, we compare these and identify a chain of local sequence similarities. In a second step, our program searches for conserved splice signals and start or stop codons near the boundaries of the identified sequence similarities. Finally, local homologies that are bounded by conserved splice signals are chained together in a biologically consistent way using a straight-forward dynamic programming procedure.



Alignment of large genomic sequences

The first and most critical step in sequence comparison is to align the sequences in question and the results of any comparative sequence-analysis method can be only as good as the quality of the underlying alignments. Most standard alignment methods are either global methods that try to align sequences over their entire length or local methods that return isolated regions of local sequence similarity. These methods are not appropriate for alignment of large genomic sequences where local homologies are typically separated by large stretches of un-related "junk DNA". In these situations, global alignment methods would try to align even completely unrelated parts of the sequences. Local methods, on the other hand, would identify high-scoring local similarities, but would not give an overall picture of the homologies among the input sequences.

For these reasons, alternative alignment procedures have been proposed that combine local and global aspects of sequence alignment. The DNA Block Aligner [Jareborg et al., 1999] or the Wobble Aware Bulk Aligner [Kent and Zahler, 2000] are, for example, new alignment methods that integrate multiple local sequence similarities into one resulting output alignment. The most popular method of this type is PipMaker [Schwartz et al., 2000] that is based on the BLAST local alignment program [Altschul et al., 1997]. For our approach to gene prediction, we use a new version of the DIALIGN alignment program [Morgenstern et al., 1996; Morgenstern, 1999]. This program integrates local and global features by assembling pair-wise and multiple alignments from gap-free local segment alignments (so-called fragments). Each possible fragment is given a weight score based on the probability of random occurrence of a fragment of the corresponding length and sum of matches. The program then selects a consistent collection of fragments with maximum total weight score [Abdeddaïm and Morgenstern, 2001]. For pair-wise alignment, this means that the program returns a chain of fragments of maximum total weight, see Morgenstern (2000) for algorithmical details.


Figure 1: DIALIGN alignment of human and murine genomic sequences. Lines between the sequences (seq1 and seq2) and horizontal red bars represent fragments (segment pairs) identified by the program; vertical bars on the top line represent the weight scores of these fragments. Green bars correspond to known exons.


It has been shown that, for large genomic sequences, high-scoring fragments returned by DIALIGN are highly correlated to exons in genomic sequences [Morgenstern et al., 2001]. Fig. 1 shows a DIALIGN alignment of human and murine genomic sequences. Here, peaks of local sequence similarities identified by the program clearly correspond to annotated protein-coding regions, so alignments produced by DIALIGN can give a first hint as to where exons can be expected in anonymous genomic sequences. However, one cannot expect the extent of local sequence similarity to precisely coincide with protein-coding regions and it is clearly not possible to predict entire gene structures solely based on sequence similarity information. Moreover, if the evolutionary distance between the compared species is small as is the case in the human-mouse alignment shown in Fig. 1, even non-functional parts of the sequences may be conserved and it becomes more difficult to discriminate between functional and non-functional parts of the sequences. It is therefore necessary to take more information into account to identify conserved gene structures in syntenic genome sequences. In our approach, we adopted the following procedure to identify potential protein- coding exons (at present, we do not try to detect non-coding components of genes such as 3'- and 5'-UTRs or non-coding RNA genes).

  1. In a first step, high-scoring segment pairs (fragments) identified by DIALIGN are clustered by bridging small gaps between them. The difference in length between the gaps in both sequences is required to be a multiple of three in order to preserve the reading frame in both sequences.

  2. Conserved splice junctions and start/stop codons near the cluster boundaries are identified using standard models. For this purpose, we used consensus matrices proposed by Salzberg et al. (1997). Since these matrices are short (12 bp in length) and contain only two fully conserved positions, they are not very specific. In our approach, however, we search for pairs of conserved signals, i. e. we accept only those splice signals and start/stop codons that occur in both respective segments at the same relative position. This greatly reduces the noise generated by false positive splice signals.

  3. Candidate exons (CEs) are obtained by elongating or shortening fragment cluster such that they start with a conserved start codon or acceptor site and end with a conserved stop codon or donor site. In addition, we require that CEs are open reading frames (ORFs), i.e. they are not allowed to contain internal stop codons. Each candidate exon E is associated with exactly one reading frame fr(E) {0; 1; 2} so a region of local sequence similarity flanked by conserved splice signals or start/stop codons can correspond to up to three distinct CEs, depending on how many open reading frames it comprises. If no conserved signals can be found near the boundaries of a fragment cluster or if no ORF can be found, the cluster is discarded. Note that, though we have reduced the noise of false positive splice junctions, a cluster of fragments may still be flanked by multiple conserved splice signals, so each cluster can give rise to an arbitrary number of alternative CEs as illustrated in Fig. 3. Since we want to be able to identify genes on both DNA strands, we also consider candidate exons located on the inverse strand; these candidate exons start with donor sites or stop codons and end with acceptor sites or start codons.

  4. A candidate gene is a chain of CEs that is biologically consistent. This means that all CEs are located on the same strand; A candidate gene on the plus strand starts with a start codon, ends with a stop codon and each CE ending with a donor splice site is followed by one starting with an acceptor site (for candidates on the minus strand, start/stop codons and splice signals appear in reversed order). In addition, the total length of a candidate gene is required to be a multiple of three, no internal stop codons are allowed and gaps between candidate exons have to meet certain length restrictions.


A dynamic-programming procedure for finding optimal chains of candidate genes

Next, we try to find a chain of candidate genes that is most likely to correspond to the real genes in the input sequences. To do so, we define a scoring function that assigns a quality score to every possible chain of candidate genes. We start with defining a score for each candidate exon that takes into account (a) the degree of similarity among the respective segments of the sequences as measured by the DIALIGN weight scores of the underlying fragments, (b) the discrepancy between a CE and the corresponding fragment cluster that is caused by the requirement that a CE is bounded by start/stop codons or splice signals and (c) the quality of the splice signals. Consider a candidate exon E that is derived from a cluster C of fragments f1, ..., fk where w(fi) is the weight score of fragment fi and let lenC be the length of C. Moreover, let scSP be the score of the splice signals by which E is bounded and let ND be the length length of the regions where C and E do not coincide, i. e. ND is the number of residues that belong to C but not to E or vice versa. Scores for splice signals are defined based on consensus matrices [Salzberg et al. 1997]. The score sc(E) of E is then defined as

    

Next, we define the remainder re(E) of a candidate exon E to be the length of the last (possibly truncated) codon of E, so we have re(E) {1, 2, 3} for all candidate exons E.

With these definitions, the requirement that a chain C = (E1, ... , Ek) of candidate exons represents a biologically consistent chain of candidate genes is equivalent to the following conditions

(1a) if Ei is on the plus strand ending with a donor splice site then Ei+1 is on the plus strand and starts with an acceptor splice site and re(Ei) + fr(Ei+1) = 3.
(1b) if Ei is on the minus strand ending with an acceptor splice site then Ei+1 is on the minus strand and starts with a donor splice site and re(Ei+1) + fr(Ei) = 3.
(1c) if Ei is on the plus strand ending with a stop codon or on the minus strand ending with a start codon then Ei+1 is either on the plus strand starting with a start codon or on the minus strand starting with a stop codon.
(2a) if E1 is on the plus strand then E1 starts with a start codon; if E1 is on the minus strand then E1 starts with a stop codon
(2b) if Ek is on the plus strand Ek ends with a stop codon; if Ek is on the minus strand, Ek ends with a start codon.

Consider a set = {E1, ... , EN} of candidate exons together with a scoring function sc assigning a score sc(E) to every E . The algorithm outlined in Fig. 2 is a modification of a standard one-dimensional interval chaining algorithm as described, for example, in Gusfield (1997, pp. 325 ff.); it returns a chain of candidate exons ( 1, ... ,k), i , satisfying conditions (1a) - (2b) with maximal total score.


Figure 2: Pseudo code description of an algorithm that calculates a biologically consistent chain of candidate exons (CEs) with maximum total score. Input data are a set = {E1, ... , EN} of CEs together with a score sc(Ej ) for each Ej . I is a list containing the starting and end points of all Ej . After sorting the list I, the algorithm calculates for each Ej the total score SC(Ej ) of an optimal chain ending in Ej together with the predecessor PR(Ej ) of E in this chain. is the maximal total score so far of a chain ending with a CE Ej on the plus strand with a remainder re(Ej ) = i, is the last CE Ej in this chain; and are defined accordingly. max* is the maximal total score so far of an optimal chain ending with a start or stop codon, pr* is the last CE in this chain.





Results

To evaluate our method, we used a set of 117 pairs of genomic sequences from human and mouse compiled by Batzoglou et al. (2000). The sequence lengths are between 0.6 kb and 35 kb. According to the authors, these sequences are carefully annotated so they can be considered as a standard of truth. We used 12 sequence pairs as training data to optimize the parameters used in our program; the remaining 105 sequence pairs were used for our test runs. We compared our results with the output of GenScan [Burge and Karlin, 1997], the most successful software tool for gene prediction currently available. Standard measures of prediction accuracy were used namely sensitivity and specificity at the exon level. That is, a predicted exon is considered a true positive if its boundaries precisely coincide with the boundaries of an annotated exon; predicted exons that only partially overlap with annotated exons are counted as false positives. Fig. 3 shows an example of gene prediction on a 4-kb sequence stretch. DIALIGN fragments are clustered and boundaries of candidate exons are determined by conserved signals as described above. This example demonstrates the filter effect of conserved signals. Although there is significant conservation outside the coding sequence parts, only few candidate exons are found in this area because of the absence of conserved splice signals.

Figure 3: Example for gene prediction in a 4-kb stretch of human DNA with AGenDA. Green lines below the ruler indicate gene- coding regions. Above the ruler are the DIALIGN fragments. Blue lines indicate alignment based on DNA level, red lines indicate alignment on peptide level. Fragments with thin lines have low alignment scores. Above the DIALIGN fragments the orange lines show the candidate exons constructed by AGenDA. Black candidate exons are actually chosen by the dynamic programming routine.


As shown in Fig. 4, the results of AGenDA were comparable with the results of GenScan in terms of both sensitivity and specificity. The sensitivity of our method was 76 % (GenScan: 82 %) while our specificity was 78 % (GenScan: 77 %). 64 % of all annotated exons were correctly identified by both methods. Our method could identify an additional 12 % of the annotated exons not identified by GenScan while GenScan identified 17 % of the exons that were not found by our approach (Fig. 3).

Figure 4: Sensitivity and specificity of GenScan and AGenDA applied to a set of 105 human and murine sequence pairs.

Figure 5: Percentage of exons correctly predicted by GenScan and by our alignment-based gene-prediction method. The two methods rely on different types of input information, so exons not detectable by one method can be detected by the respective other method.



Discussion

We have developed a novel method for gene prediction that is based on comparative sequence analysis. In a first step, syntenic regions from evolutionary related genomes are aligned using the DIALIGN program. Local homologies identified by the alignment procedure are searched for conserved splice junctions to obtain potential exons. These candidate exons are then assembled to consistent gene models via a dynamic programming approach. In terms of sensitivity and specificity, our method is about as eficient as GenScan. It is remarkable that a method based solely on the evolutionary concept of selective conservation yields results comparable to the best intrinsic methods that is currently available. The crucial difference between these two methods is, however, that while GenScan uses sophisticated species-dependent statistical models to distinguish coding from non-coding regions, our method is based on a simple and universally applicable measure of local sequence similarity and on basic models for splice junctions. These two approaches therefore complement each other in that they use completely different types of input information. Consequently, AGenDA could detect exons that were overlooked by GenScan and vice versa (Fig. 3).

Except for the splice site detection, our method does not depend on statistical content measures. Therefore it can be applied to genome sequences from newly sequenced organisms without training data - provided syntenic sequences are available from reference species at an appropriate evolutionary distance. Appropriate in this context means high divergence of non-coding genomic regions but preservation of the coding parts. Our study and also others [Batzoglou et al., 2000; Hardison et al., 1997] suggest that the distance between human and mouse is well suited for comparative studies, although recent results indicate that larger evolutionary distances may lead to even better results [Novichkov et al., 2001].

We should remark that the performance of both GenScan and AGenDA depends on the composition of the data that they are applied to, so one should be cautious in generalizing the test results that we have reported. In particular, the sequences in our test set are much shorter than average BAC fragments that are typically used in sequencing projects. An assessment of gene prediction accuracy in large genomic sequences [Guigó et al., 2000] indicates that prediction specificity of GenScan drops significantly if long sequences are analyzed while sensitivity of the program remains quite high. For methods based on comparative sequence analysis like AGenDA, conserved non-genic parts in the sequences [Clark, 2001] will probably limit the accuracy in long genomic stretches, see also Novichkov et al. (2001) where this problem is discussed.

Because intrinsic and comparative methods are based on different information sources, a natural way of improving gene prediction accuracy is to combine the prediction power of both approaches. Comparative sequence analysis could help to improve the selectivity of more traditional methods in finding candidate exons. Statistical models, on the other hand, could help to discriminate between gene-coding and non-gene coding conservation in syntenic genomic regions. The potential effectiveness of such a combined approach has been demonstrated by Korf et al. (2001) with their TWINSCAN program. In short, they re-implemented GenScan but also included homology information obtained from high-scoring BLAST alignments of syntenic sequences. This resulted in markedly increased sensitivity and specificity. To gain profit from the information source of comparative sequence analysis, syntenic sequences from a reference genome must be available. This is still the bottleneck of homology- based approaches, but with the increasing number of whole-genome sequencing projects, this restriction will become less severe in the near future.



Acknowledgments

We would like to thank Dr. Serafim Batzoglou for providing his data base of annotated test sequences. B. M. was supported by BMBF grant FKZ 01KW9703.



References