Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites

A. Zien 1, G. Rätsch 2, S. Mika 2, B. Schölkopf 2, C. Lemmen 1, A. Smola 2, T. Lengauer 1 and K.­R. Müller 2




1GMD.SCAI, Schloss Birlinghoven
53754 Sankt Augustin, Germany
E-mail: {zien, lemmen, lengauer}@gmd.de
2GMD.FIRST, Rudower Chaussee 5
12489 Berlin, Germany
E-mail: {raetsch, mika, bs, smola, klaus}@first.gmd.de






ABSTRACT

In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points from which regions encoding pro­ teins start, the so­called translation initiation sites (TIS). This can be modeled as a classification prob­ lem. We demonstrate the power of support vector machines (SVMs) for this task, and show how to suc­ cessfully incorporate biological prior knowledge by engineering an appropriate kernel function.


INTRODUCTION

Living systems are determined by the proteins they produce according to their genetic informa­ tion. But only parts of the nucleotide sequences carrying this information encode proteins (CDS, for coding sequence), while other parts (UTR, for untranslated regions) do not. Given a piece of DNA or mRNA sequence, it is most interest­ ing to know whether it contains CDS, and, if so, which protein it encodes.

In principle, both CDS and the encoded protein can be characterized using alignment methods. Programs capable of aligning nu­ cleotide sequences to protein databases include FASTX/FASTY, SearchWise and BLASTX. However, this approach is hampered by two se­ vere problems: First, there are several sources of noise making the task more difficult and error­ prone than pure protein alignment: (i) The cor­ rect strand and reading frame have to be found. (ii) Additional false hits may result from UTR sequences. (iii) Sequencing errors may disrupt the correct reading frame. Second, alignment based approaches rely on homologous proteins being known. This renders them unapt to find novel genes. Thus, a method to identify CDS in nucleotide sequences is desirable, both in order to ease the task for alignment­based approaches as well as to find new genes.

Since living cells are able to distinguish be­ tween CDS and other nucleotide sequence parts without utilizing any homology information, this should in principle also be possible for computer programs. In fact, there are algo­ rithms that identify CDS merely relying on properties intrinsic to nucleotide sequences. The most successful programs include GENSCAN [5] for genomic DNA and ESTScan [6] for ESTs. ESTs are single­read partial sequences derived from mRNA that are particularly error­ prone. ESTScan implements a fifth­order hid­ den Markov model that simultaneously recog­ nizes CDS by typical oligo­nucleotide frequen­ cies and corrects sequencing errors. It does not incorporate a model of translation initiation site (TIS) sequences, that mark the beginning of CDS. GENSCAN employs generalized hidden Markov models to capture the structure of an en­ tire genome. Despite its overall sophistication, GENSCAN uses a relatively crude TIS model: a piece of sequence is assigned a probability for being a TIS, based on the positional relative frequencies of individual nucleotides observed around a true TIS.

There exists a number of more elaborate mod­ els of TIS themselves. Salzberg extends the po­ sitional probabilities (as used by GENSCAN) to first order Markovian dependencies [13]. This is essentially a proper probabilistic modeling of positional di­nucleotides, and leads to a signifi­ cant increase in recognition performance. There also exist methods to explicitly capture correla­ tions between non­adjacent positions near TIS or other signals [1], possibly aiding insight into the structural functioning of these signals. How­ ever, since few such correlations can be proved to be significant in TIS sequences, there is little gain for TIS recognition.

All models discussed so far can be called generative, as they can be used to generate poten­ tial TIS sequences with approximately the true probability distribution. Applying such models, a sequence is considered a TIS if the probability it is assigned exceeds some threshold. The more closely the true distribution is approximated, the better this approach works. By using so­called discriminative methods, in general a superior distinction between true and pseudo TIS can be achieved. These methods aim at learning to discriminate certain objects from others, without explicitly considering probability distributions.

For example, the program ATGpr [12] uses a linear discriminant function that combines several statistical measures derived from the se­ quence. Each of those features is designed to be distinctive for true versus pseudo TIS. Learning allows to find a (linear) weighted combination of features that achieves a good discrimination.

A radically different approach to learning a discriminating function is taken by Pedersen and Nielsen [11]. They train an artificial neural net­ work (NN) to predict TIS from a fixed­length sequence window around a potential start codon (ATG). The input of the NN consists of a binary encoding of the sequence; no higher­level fea­ tures are supplied. The intriguing idea is that the NN learns by itself which features derived from the sequence are indicative of a true TIS.

Of the described methods, only ATGpr makes use of the ribosome scanning model [9]. According to this model, the translation starts at the first occurrence of an appropriate signal sequence in the mRNA, and thus sequences fur­ ther downstream that resemble typical TIS are inactive (pseudo sites). The scanning model can be combined with any TIS recognition method, and is confirmed by the resulting improvements of recognition [2]. Since the model is orthog­ onal to TIS signal sequence recognition itself, and is limited to complete mRNA sequences, we will not consider it in the following.

In this paper, we show that we can top the performance of established methods for TIS recognition by applying support vector machines (SVMs) [4]. Like NNs, SVMs are a discrimi­ native supervised machine learning technology, i.e. they need training with classified data in order to learn the classification. For the task of TIS recognition, we show that SVMs can be superior to NNs. To achieve this performance gain we use a particularly desirable property of SVMs: the ability to adapt them to the problem at hand by including prior knowledge into the so­called kernel function. Here, we demonstrate how to make use of rather vague biological knowledge. The paper is structured as follows: we first give a brief description of the SVM technique, then present experiments and finally discuss results and potential applications.


METHODS

Support Vector Machines

Formally, SVMs like any other classification method aim at estimating a classification func­ tion using classified train­ ing data from such that f will cor­ rectly classify unseen examples (test data). In our case, will contain simple representations of sequence windows, while +1 corresponds to true TIS and pseudo sites, respectively.

In order to be successful, two conditions have to be respected. First, the training data must be an unbiased sample from the same source as the test data will be. This concerns the experimental setup. Second, the size of the class of functions that we choose our estimate f from, the so­called capacity of the learning machine, has to be sensibly restricted. If the capacity is too small, complex discriminant functions can not be sufficiently well approximated by any selectable function f -- the learning machine is too dumb to learn well. On the other hand, too large a capacity bears the risk of loosing the ability to learn a function that generalizes well to unseen data. The reason lies in the existence of in­ finitely many functions that are consistent with the training examples, but disagree on unseen (test) examples. Most of those functions perfectly memorize the particular examples used for training, but do not reflect general properties of the classification. Picking such a function is called overfitting.

In neural network training, overfitting is avoided by early stopping, regularization or asymptotic model selection [3; 10]. In contrast, the capacity of SVMs is limited according to the statistical theory of learning from small samples [17]. For learning machines implement­ ing linear decision functions this corresponds to finding a large margin separation of the two classes. The margin is the minimal distance of training points to the separation surface (cf. Fig­ ure 1). Finding the maximum margin separation can be cast as a convex quadratic program­ ming (QP) problem [4]. The time complexity of solving such a QP scales approximately between quadratic and cubic in the number of training patterns (see [14]), making the SVM technique computationally comparably expensive.


Figure 1: A binary classification toy problem: se­ parate dots from crosses. The shaded region consists of training examples, the other regions of test data (spatial separation for illustration clarity only). The data can be separated with a margin indicated by the slim dotted lines, implicating the slim solid line as decision function. Misclassifying one training point (circled cross) leads to a considerable extension (arrows) of the margin (fat lines) and thereby to the correct classification of two test examples (circled dots).

With respect to good generalization, it often is profitable to misclassify some outlying training data points in order to achieve a larger margin between the other training points. See Figure 1 for an example. This 'neglectful' learning strategy also masters inseparable data [16], which is frequent in real­world applications. The trade­ off between margin size and number of misclassified training points is controlled by a parame­ ter of the SVM, which therefore can be used to control its capacity. This extension still permits optimization via QP [4].

It is tempting to think that linear functions can be insufficient to solve complex classification tasks. A little thought reveals that this in fact depends on the representation of the data points. Canonical representations, as frequently used to define input space, tend to minimize dimension­ ality and avoid redundancy. Then, linearity may easily be too restrictive. However, one is free to define (possibly redundant) features that non­ linearly derive from any number of input space dimensions. Even for complex problems, well chosen features could ideally be related to the respective classification by rather simple means, e.g. by a linear function (cf. Figure 2).
Figure 2: Three different views on the same dot versus cross separation problem. Linear separation of input points (a) does not work well: a reasonably sized margin requires misclassifying one point. A better separation is permitted by non­linear functions in input space (b), which corresponds to a linear function in a feature­space (c). Input space and feature space are related by the kernel function (see main text)..

Any linear learning machine can be extended to functions non­linear in input space by explicitly transforming the data into a feature space F using a map (see Figure 2). SVMs can do so implicitly, thanks to their mathematical niceness: all that SVMs need to know in order to both train and classify are dot products of pairs of data points in feature space. Thus, we only need to sup­ ply a so­called kernel function that computes these dot products. This kernel function k implicitly defines the feature space (Mercer's Theorem, e.g. [4]) via

Note that neither the SVM nor we need to know , because the mapping is never performed explicitly. Therefore, we can computationally afford very large (e.g. 10 10 ­ dimensional) feature spaces. SVMs can still avoid overfitting thanks to the margin maximization mechanism. Simultaneously, they can learn which of the features implied by k are dis­ tinctive for the two classes. So, instead of having to design well­suited features by ourselves (which can often be difficult), we can use the SVM to select them from a sufficiently rich feature space. Of course, it well be helpful if the kernel supplies a type of features related to the correct classification. In the next sections, we will show how to boost the process of learning by choosing appropriate kernel functions.

Data sets

Little experience exists in the application of SVMs to biomolecular problems (to our know­ ledge, only work by Jaakkola and Haussler [8]). Therefore, we compare the performance of SVMs to that of the most popular alternative general purpose machine learning technology, neural networks (NNs). In order to do so, we use the vertebrate data set provided by Pedersen and Nielsen [11]. We take care to only replace the learning machinery while retaining the set­ ting: the definition of training and test data sets as well as the definition of input space.

The sequence set is built from high quality nu­ clear genomic sequences of a selected set of ver­ tebrates taken from GenBank. All introns are removed, in analogy to the splicing of mRNA sequences. The set is thoroughly reduced for re­ dundancy, to avoid over­optimistic performance estimates resulting from biased data. This pro­ tocol leaves 3312 sequences (see [11]).

From these sequences, the data set for TIS recognition is built as follows. For each potential start codon (the nucleotide sequence ATG) on the forward strand, one data point is generated. This leads to 13503 data points, of which 3312 (24.5%) correspond to true TIS and the other 10191 (75.5%) correspond to pseudo sites. Each data point is represented by a sequence window of 200 nucleotides centered around the respective ATG triplet. Pedersen and Nielsen di­ vide the data into six parts of nearly equal size ( 2200 points) and fraction of true TIS. Each part is in turn reserved for testing the classification learned from the other five parts.

Engineering the kernel function

Given the data sets, we have to choose n the data sets, we have to choose a kernel function k for training. A standard kernel func­ tion is the simple polynomial kernel . This kind of kernel takes two parameters: the degree d and an additional constant . Here, we use homogenous ( = 0) poly­ nomials of first to fifth degree (d = 1,...,5). Degree one corresponds to a linear separation in input space. The input space is defined by a sparse bit­encoding scheme as used by Pedersen and Nielsen (personal communication): each nucleotide is coded by five bits, exactly one of which is set. The position of the set bit indicates whether the nucleotide is A, C, G or T, or if it is unknown. Thus, the dot product x * y simply counts the number of nucleotides that coincide in the two sequences represented by x and y. If the degree d is set to two, the feature space reflects all pairwise correlations of the nucleotide frequencies at any two sequence positions. A degree of three would correspond to all correlations of (possibly scattered) triplets, and so on. With this simple kernel function we already achieve results competitive to those of the NN devised by Pedersen and Nielsen (see Table 1).


Table 1: Comparison of classification errors (first three columns: on all, on positive and on negative data points). All results (except for preliminary codon­improved figures) are averages over the six data partitions. SVMs are trained on 8000 data points, leaving the remaining training data (ß 3300 points) for the deter­ mination of suitable parameter values. The NN results are those achieved by Pedersen and Nielsen [11], personal communication). Here, model selection seems to have involved test data, which might lead to slightly overoptimistic figures. Positional preference scores are calculated analogously to Salzberg [13], but extended to the 200 nucleotides around the ATG triplet also supplied to the other methods. All values shown correspond to the optimal overall performance, though the true vs. pseudo TIS (or equivalently sensitivity vs. specificity) trade­off can be controlled by varying the classification function threshold.


We design an improved kernel function by incorporating basic biological knowledge. We make use of only one observation: While certain local correlations are typical for TIS, depen­ dencies between distant positions are of minor importance or do not even exist. We want the feature space to reflect this fact. Thus, we modify the kernel utilizing a technique is described in [15]: At each sequence position, we compare the two sequences locally, within a small window of length 2l+1 around that position. Again, we count matching nucleotides, this time multiplied with weights w increasing from the bor­ ders to the center of the window. The resulting weighted counts are taken to the d th 1 power. d 1 reflects the order of local correlations (within the window) we expect to be of importance.

Here, matchp p+j (X,Y) is one for matching nucleotides at position p + j and zero other­ wise. The window scores computed with win p are added up over the whole length of the se­ quence. Correlations between up to d 2 windows are taken into account by applying potentiation with d 2 to the resulting sum.

We call this kernel locality­improved. In Ta­ ble 2 its TIS recognition performance is compared to that of the polynomial kernel for differently sized training sets.

Our kernel function poses the problem of how to set a number of parameters (in addition to the general parameter for SVM capacity control described in section 2.1). Since we consider long distance correlations unimportant we set d 2 to one. For each of the remaining parameters, we select a small number of values in an appropriate range. SVMs are trained with all combina­ tions of these values, while excluding a part of the training set. This part is then used to mea­ sure the performance of the trained SVM and to select the corresponding parameters. With respect to window size (2l + 1), we consider nucleotide composition (two and more), inter­ actions between neighboring amino acids (six) and the assumed length of the ribosomal bind­ ing site (up to 14). For the degree of local cor­ relations (d 1 ), we consider values up to five. Table 2 shows that the optimal parameterization of the kernel depends on the training set size. The more data available to the SVM, the more complex features it can reliably learn. The table also shows that the performance improvements over the polynomial kernel are very substantial for small numbers of training vectors, but decrease for larger training sets. Again, this is consis­ tent with statistical learning theory. It is known that SVMs (as well as other learning algorithms) are asymptotically optimal with whatever kernel one chooses, i.e. they will perform well when supplied with enough training data (e.g. [17]).

In an attempt to further improve performance we try to incorporate another piece of knowledge into the kernel, that again is rather diffuse: the codon­structure of coding sequence. By def­ inition the difference between a true TIS from pseudo sites is that downstream of a TIS there is CDS (which shows codon structure), while upstream there is not. CDS and UTR show statistically different compositions. It is likely that the SVM exploits this difference for classification. We could hope to improve the kernel by reflecting the fact that CDS shifted by three nucleotides still looks like CDS. Therefore, we further modify the locality­improved kernel function to account for this translation­ invariance. In addition to counting matching nucleotides on corresponding positions, we also count matches that are shifted by three positions. We call this kernel codon­improved.


Table 2: Comparison kernel functions using differently sized subsets of the training data set, averaged over the six partitions. The percentages denote the overall classification errors that are achieved using the indicated supposedly optimal parameter settings. Note that the windows consist of 2l + 1 nucleotides.

Tables 2 and 1 suggest that this modifica­ tion actually decreases performance. On the other hand, a similar modification to the simple polynomial kernel leads to a significant increase of recognition accuracy (data not shown). We therefore conclude that the process of learning some relevant features (e.g. subtile local correlations) is distorted by the modification. In contrast to the simple polynomial kernel, the locality­improved kernel seems to be rich enough to easily learn translation­invariance by itself, wherever this proves advantageous.

Nevertheless, both engineered kernel func­ tions clearly outperform the NN as devised by Pedersen and Nielsen by reducing the overall number of misclassifications by about 20% (see Table 1). The SVM also beats the performance of positional conditional probabilities, which work surprisingly well when applied to larger windows than suggestest by Salzberg.


DISCUSSION

First, we will briefly discuss applications of our TIS recognition method. This leads to potential paths of improvement. Finally, we suggest promising fields of application of techniques similar to those presented above.

TIS recognition can be used to improve reliability and accuracy of amino acid prediction from nucleotide sequences. The two main fields of application are ESTs and genomic sequences. For EST data, the program ESTScan aims at identifying CDS as accurately as possible. In a slight misuse of ESTScan, we investigate how well it finds the correct TIS within the set of 3312 mRNA­like sequences described in section 2.2. Results are shown in Table 3. On average, the program misses the true TIS position by 41.6 nucleotides. Both this figure and the table indicate that ESTScan could profit from a TIS recognition module. For genomic sequences and programs like GENSCAN, a similar situation could be expected.


Table 3: Application of ESTScan for TIS recognition on the original set of 3312 sequences (cf. section 2.2). For each predicted CDS, an ATG triplet near the supposed start point of the CDS is selected as predicted TIS. Evaluation is shown for three different selection strategies. (Column labels are as in Table 1.)


However, caution would be necessary in order to use our method within a rigorous probabilistic framework like those of GENSCAN or ESTScan. The SVM (as well as Pedersen and Nielsen's NN) seems to exploit the different oligo­nucleotide preferences of CDS in comparison to UTRs. These preferences are already incorporated in GENSCAN and ESTScan, leading to probability distribution dependencies that must be taken into account. In order to avoid these dependencies, it would be easiest to restrict the sequence window presented to the SVM to the ribosome binding site. This would most probably lead to a decrease of classification performance. In addition, it would be desirable that the TIS recognition method computes probability values for potential TIS to be true TIS. Meanwhile, it should be useful to heuristically combine our TIS recognition with GENSCAN or ESTScan output. We plan to devote work to this area.

There are far too many interesting classification tasks in bioinformatics too be covered here, so we restrict ourselves to two of the most popular problems. First, we could imagine that the excellent protein classification performance of the Fisher kernel method developed by Jaakkola and Haussler [7] could still be improved by considering local amino acid correlations in a manner similar to our locality­improved kernel. Sec­ ond, we believe that SVMs will prove successful for exploiting the information gathered with DNA chips. Here, kernel functions could be engineered that reflect the structure of expression data as collection of unrelated time series. These are fields of further future work.

In summary, we have compared the performance of important methods for sequence classification on a bio­molecular problem of practical relevance. We show that SVMs are competitive to other, more frequently used machine learning methods and offer the unique advantage of an easy way to include prior knowledge to improve performance.


ACKNOWLEDGEMENTS

This work was supported by the BMBF (TargId, 0311615), by the DFG (JA 379/5­2,7­1), and by the EC STORM project 25387. We thank Pedersen and Nielsen for providing their data sets and sharing unpublished data. We thank M. Schwan for help with the experiments.


REFERENCES


  1. P. Agarwal and V. Bafna. Detecting non­ adjoining correlations within signals in DNA. In RECOMB'98, 1998.
  2. P. Agarwal and V. Bafna. The Ribosome Scan­ ning Model for Translation Initiation: Impli­ cations for Gene Prediction and Full­Length cDNA Detection. In ISMB'98, pages 2--7, 1998.
  3. C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
  4. B. Boser, I. Guyon, and V. N. Vapnik. A train­ ing algorithm for optimal margin classifiers. In COLT, pages 144--152, 1992.
  5. C. Burge and S. Karlin. Prediction of Complete Gene Structures in Human Genomic DNA. JMB, 268(1):78--94, 1997.
  6. C. Iseli, C. V. Jongeneel, and P. Bucher. ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. In ISMB'99, pages 138--147, 1999.
  7. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies, 1998.
  8. T. S. Jaakkola and D. Haussler. Exploiting Generative Models in Discriminative Classi­ fiers. In NIPS, 1998.
  9. M. Kozak. Interpreting cDNA sequences: some insights from studies on translation. Mammalian Genome, 7:563--574, 1996.
  10. J. Orr and K.­R. Müller, editors. Neural Net­ works: Tricks of the Trade. Springer, 1998.
  11. A. G. Pedersen and H. Nielsen. Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives for EST and Genome analysis. In ISMB'97, pages 226--233, 1997.
  12. A. A. Salamov, T. Nishikawa, and M. B. Swindells. Assessing protein coding region in­ tegrity in cDNA sequencing projects. Bioinfor­ matics, 14(5):384--390, 1998.
  13. S. L. Salzberg. A method for identifying splice sites and translational start sites in eukaryotic mRNA. CABIOS, 13(4):365--376, 1997.
  14. B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods -- Support Vector Learning. MIT Press, 1999.
  15. B. Schölkopf, P. Simard, A. Smola, and V. Vap­ nik. Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems 10, pages 640--646. MIT Press, 1998.
  16. B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. Technical report, NeuroColt2, 1998. to appear in Neural Computation.
  17. V. N. Vapnik. The nature of statistical learning theory. Springer­Verlag, 1995.