Gene prediction by comparative sequence analysis

Oliver Rinner1,2 and Burkhard Morgenstern1




1GSF Forschungszentrum, Institut für Bioinformatik,
Ingolstädter Landstraße 1,
85764 Neuherberg, Germany
2Universität Tübingen, Physiologisch-Chemisches Institut,
Hoppe-Seyler-Str. 4,
72076 Tübingen, Germany







ABSTRACT

Comparative sequence analysis is a powerful approach for detecting functional regions in genomic sequences. Herein, we propose a novel method for gene prediction that is based on the DIALIGN sequence alignment program. Local similarities identified by DIALIGN are combined with conserved splice signals to predict gene structures in pairs of evolutionarily related sequences. The performance of this method has been tested on a set of 105 human-mouse sequence pairs. These test runs showed that the sensitivity and specificity of our method are comparable with those of the best gene-prediction program currently available.



AVAILABILITY

DIALIGN is available through the Bielefeld Bioinformatics Server (BiBiServ) at http://bibiserv.techfak.uni-bielefeld.de/dialign/. The gene-finding program described in this paper will be available through the BiBiServ or the MIPS web server at http://mips.gsf.de.



INTRODUCTION

Traditionally, there are two approaches to computational gene prediction. Ab initio (intrinsic) methods use statistical features such as ORF length and codon usage to distinguish coding from non-coding regions. By contrast, extrinsic methods search for similarities between genomic sequences and known proteins. A limitation of both approaches is that they rely critically on information derived from already known genes, so they tend to be biased towards finding genes that resemble known ones.

With the huge amount of genomic data now available, a third way of predicting genes and other functional elements in genomic sequences is emerging: functional regions in genomic DNA can be identified by comparing evolutionarily related genomic sequences with each other. The rationale behind this approach is simple. During evolution, functional parts of sequences tend to be more highly conserved than non-functional parts, so local sequence conservation usually indicates biological function. Bafna and Huson (2000) and Batzoglou et al. (2000) exploited this fact and proposed gene-prediction methods that rely on comparing genomic sequences from related organisms; see Miller (2001) for a review of these and related approaches. An interesting combination of intrinsic and comparative methods has been proposed by Korf et al. (2001).



METHOD

The first and most critical step in sequence comparison is to align the sequences in question; the results of any comparative method can only be as good as the underlying alignment. Most standard alignment methods are either global methods, which try to align sequences over their entire length, or local methods, which return only the most highly conserved region of local similarity. Neither is appropriate for aligning large genomic sequences, where local homologies may be separated by long stretches of unrelated 'junk DNA'.

Our approach to gene prediction is therefore based on the DIALIGN alignment program [Morgenstern et al., 1996; Morgenstern, 1999], which combines local and global aspects of sequence alignment by assembling pairwise and multiple alignments from locally conserved gap-free segment pairs (so-called fragments). Each possible fragment is given a weight score based on the probability of random occurrence of a fragment of the corresponding length and sum of matches. The program then selects a consistent collection of fragments with maximum total weight score [Morgenstern, 1999; Abdeddaim and Morgenstern, 2001]. For pairwise alignment, this means that the program returns a chain of fragments of maximum total weight; see Morgenstern (2000) for algorithmic details.
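
To illustrate the pairwise case, the following is a minimal sketch of maximum-weight fragment chaining by quadratic dynamic programming. It is not DIALIGN's actual implementation: the `Fragment` type and the `best_chain` function are hypothetical names, the weight scores are assumed to be precomputed and positive, and coordinates are assumed to be 0-based.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    start1: int    # 0-based start in sequence 1
    start2: int    # 0-based start in sequence 2
    length: int    # gap-free, so the length is the same in both sequences
    weight: float  # assumed precomputed, positive weight score


def best_chain(fragments: List[Fragment]) -> List[Fragment]:
    """Return a chain of collinear, non-overlapping fragments of maximum total weight."""
    frags = sorted(fragments, key=lambda f: (f.start1, f.start2))
    n = len(frags)
    if n == 0:
        return []
    best = [f.weight for f in frags]  # best[i]: weight of the best chain ending in frags[i]
    prev = [-1] * n                   # back-pointers for reconstructing the chain
    for i, g in enumerate(frags):
        for j in range(i):
            f = frags[j]
            # f may precede g only if it ends before g starts in BOTH sequences
            precedes = (f.start1 + f.length <= g.start1
                        and f.start2 + f.length <= g.start2)
            if precedes and best[j] + g.weight > best[i]:
                best[i] = best[j] + g.weight
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(frags[i])
        i = prev[i]
    return chain[::-1]
```

Sorting by start position guarantees that every admissible predecessor of a fragment has already been processed when that fragment is reached. Fragments that overlap or cross a better-scoring chain are simply left out of the result.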

It has been shown that high-scoring fragments returned by DIALIGN are highly correlated with exons in genomic sequences [Morgenstern et al., 2001]. However, the extent of local sequence conservation cannot be expected to coincide exactly with protein-coding regions, and whole gene structures cannot be predicted from sequence similarity alone. Moreover, if the evolutionary distance between the compared species is small, even non-functional parts of the sequences may be conserved, and it becomes difficult to distinguish functional from non-functional regions. It is therefore necessary to take additional information into account to identify conserved gene structures in syntenic genome sequences. In our approach, we adopted the following procedure to identify potential protein-coding exons (at present, we do not try to detect non-coding components of genes such as 3' and 5' UTRs).

(1) In a first step, high-scoring segment pairs (fragments) identified by DIALIGN are clustered by bridging small gaps between them. The difference in length between the gaps in both sequences is required to be a multiple of three in order to preserve the reading frame in both sequences.

(2) Conserved splice junctions and start/stop codons near the cluster boundaries are identified using standard procedures. Only those signals are selected that occur in both respective segments at the same relative position; this greatly reduces the noise generated by false positive splice signals and start/stop codons.

(3) Potential exons (PEs) are obtained by elongating or shortening the clustered segments such that they start with a conserved start codon or acceptor site and end with a conserved stop codon or donor site. If no conserved signals can be found near the boundaries of a fragment cluster, the cluster is discarded. Note that a cluster of fragments may be flanked by multiple conserved splice signals, so each cluster can give rise to several alternative PEs per sequence.

(4) A potential gene is a chain of PEs that is biologically consistent in that it begins with a start codon, ends with a stop codon and each PE ending with a donor splice site is followed by a PE starting with an acceptor site. In addition, the total length of a potential gene must be divisible by three, internal stop codons must be excluded and gaps between potential exons are required to meet certain length restrictions.

(5) We defined an objective function on the set of all potential genes: each potential exon is given a quality score by summing the weight scores of the underlying DIALIGN fragments and subtracting a penalty for elongating or shortening the clustered fragments. The score of a potential gene is the sum of the scores of its exons. A recursive algorithm is used to find a potential gene with maximum score.
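
Steps (1), (4) and (5) can be sketched as follows. This is an illustrative simplification under stated assumptions, not the program described here: all names (`Cluster`, `PE`, `cluster_fragments`, `best_gene`) and the default thresholds (`max_gap`, `min_intron`, `max_intron`) are hypothetical, each PE score is assumed to already include the elongation or shortening penalty, and the check for internal stop codons is omitted because it requires the sequence itself.

```python
from functools import lru_cache
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    start1: int
    start2: int
    length: int
    weight: float


class Cluster(NamedTuple):
    start1: int
    end1: int    # half-open end in sequence 1
    start2: int
    end2: int    # half-open end in sequence 2
    weight: float


def cluster_fragments(chain: List[Fragment], max_gap: int = 30) -> List[Cluster]:
    """Step (1): bridge small gaps whose length difference is a multiple of three."""
    clusters: List[Cluster] = []
    for f in chain:
        end1, end2 = f.start1 + f.length, f.start2 + f.length
        if clusters:
            c = clusters[-1]
            gap1, gap2 = f.start1 - c.end1, f.start2 - c.end2
            small = 0 <= gap1 <= max_gap and 0 <= gap2 <= max_gap
            if small and (gap1 - gap2) % 3 == 0:  # reading frame preserved
                clusters[-1] = Cluster(c.start1, end1, c.start2, end2,
                                       c.weight + f.weight)
                continue
        clusters.append(Cluster(f.start1, end1, f.start2, end2, f.weight))
    return clusters


class PE(NamedTuple):
    start: int    # half-open interval [start, end) in one sequence
    end: int
    left: str     # "start" (start codon) or "acceptor"
    right: str    # "stop" (stop codon) or "donor"
    score: float  # assumed to include the boundary-adjustment penalty


def best_gene(pes: List[PE], min_intron: int = 50,
              max_intron: int = 20000) -> Tuple[float, List[PE]]:
    """Steps (4)-(5): best-scoring consistent chain of PEs (memoized recursion)."""
    pes = sorted(pes)
    neg = float("-inf")

    @lru_cache(maxsize=None)
    def extend(i: int, phase: int) -> Tuple[float, Tuple[int, ...]]:
        # phase: coding length accumulated before PE i, modulo 3
        pe = pes[i]
        phase = (phase + pe.end - pe.start) % 3
        if pe.right == "stop":  # gene ends here; total length must be divisible by 3
            return (pe.score, (i,)) if phase == 0 else (neg, ())
        best: Tuple[float, Tuple[int, ...]] = (neg, ())
        for j in range(i + 1, len(pes)):
            nxt = pes[j]
            gap = nxt.start - pe.end
            if nxt.left == "acceptor" and min_intron <= gap <= max_intron:
                score, tail = extend(j, phase)
                if pe.score + score > best[0]:
                    best = (pe.score + score, (i,) + tail)
        return best

    starts = [extend(i, 0) for i, pe in enumerate(pes) if pe.left == "start"]
    score, idx = max(starts, default=(neg, ()))
    return score, [pes[i] for i in idx]
```

If no consistent chain exists, `best_gene` returns a score of negative infinity and an empty list. Carrying the accumulated coding length modulo three through the recursion is one simple way to enforce the divisibility constraint of step (4) without enumerating all chains.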




RESULTS AND DISCUSSION

To evaluate our method, we used a set of 117 pairs of genomic sequences from human and mouse compiled by Batzoglou et al. (2000). According to the authors, these sequences are carefully annotated, so they can be considered a standard of truth. We used 12 sequence pairs as training data to optimise the parameters used in our program; the remaining 105 sequence pairs were used for our test runs. We compared our results with the output of GenScan [Burge and Karlin, 1997], the most successful software tool for gene prediction currently available. Standard measures of prediction accuracy were used, namely sensitivity and specificity at the exon level. That is, a predicted exon is considered a true positive only if both of its boundaries coincide exactly with the boundaries of an annotated exon. Predicted exons that only partially overlap with annotated exons are counted as false positives.
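
The exon-level measures can be made concrete with a short sketch. It assumes the convention common in gene-prediction benchmarks, in which sensitivity is the fraction of annotated exons predicted exactly and specificity is the fraction of predicted exons that are exactly correct; the function name `exon_level_accuracy` and the representation of exons as `(start, end)` tuples are illustrative choices.

```python
from typing import Iterable, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates of one exon


def exon_level_accuracy(predicted: Iterable[Exon],
                        annotated: Iterable[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    Only exact boundary matches count as true positives; a prediction
    that partially overlaps an annotated exon is a false positive.
    """
    pred, ann = set(predicted), set(annotated)
    true_positives = len(pred & ann)
    sensitivity = true_positives / len(ann) if ann else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity
```

For example, with four annotated exons and three predictions, of which two match exactly and one is off by a few bases at one boundary, the sensitivity is 2/4 and the specificity is 2/3.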

As shown in Figure 1, the results of our method were comparable with the results of GenScan in terms of both sensitivity and specificity. The sensitivity of our method was 76 % (GenScan: 82 %) while our specificity was 78 % (GenScan: 77 %).


Figure 1: Sensitivity and specificity of GenScan and of the alignment-based method for gene prediction proposed in this paper.


The crucial difference between the two methods, however, is that GenScan uses sophisticated species-dependent statistical models to distinguish coding from non-coding regions, whereas our method is based on a simple and universally applicable measure of local sequence similarity and on basic models for splice junctions. The two approaches therefore complement each other in that they use completely different types of input information. Consequently, our DIALIGN-based method could detect exons that were overlooked by GenScan and vice versa. Of all annotated exons, 64 % were correctly identified by both methods. Our method identified an additional 12 % of the annotated exons that GenScan missed, while GenScan identified a further 17 % of the annotated exons that were not found by our approach.
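
The breakdown into exons found by both methods, by only one, or by neither is a simple set computation, sketched below. The function name `overlap_breakdown` is hypothetical, and the two input sets are assumed to contain only correctly predicted annotated exons. As a consistency check on the figures above, 64 % + 12 % reproduces the 76 % sensitivity reported for our method, while 64 % + 17 % agrees with the 82 % reported for GenScan to within one percentage point, which may reflect rounding.

```python
from typing import Dict, Hashable, Set


def overlap_breakdown(found_a: Set[Hashable], found_b: Set[Hashable],
                      n_annotated: int) -> Dict[str, float]:
    """Percentages of annotated exons found by both methods, one only, or neither.

    found_a and found_b are assumed to be subsets of the annotated exons.
    """
    if n_annotated <= 0:
        raise ValueError("n_annotated must be positive")
    counts = {
        "both": len(found_a & found_b),
        "only_a": len(found_a - found_b),
        "only_b": len(found_b - found_a),
    }
    counts["neither"] = n_annotated - sum(counts.values())
    return {key: 100.0 * value / n_annotated for key, value in counts.items()}
```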


Figure 2: Percentage of exons correctly predicted by GenScan and by our alignment-based gene-prediction method. The two methods rely on different types of input information, so exons not detectable by one method can be detected by the respective other method.


Moreover, since comparative gene-prediction approaches do not rely on statistical models derived from known genes of a given species, they can be applied to genome sequences from newly sequenced organisms for which no training data are available, provided that syntenic sequences are available from a second species at an appropriate evolutionary distance. With the increasing number of whole-genome sequencing projects, it will become easy to find syntenic sequence pairs from related organisms. We therefore think that our method should be a useful addition to existing gene-prediction methods.


REFERENCES