In Silico Biology 4, 0022 (2004); ©2004, Bioinformation Systems e.V.  


Using amino acid patterns to accurately predict translation initiation sites

Huiqing Liu, Hao Han, Jinyan Li and Limsoon Wong




Institute for Infocomm Research,
21 Heng Mui Keng Terrace, Singapore, 119613
Email: huiqing@i2r.a-star.edu.sg
hanhao@i2r.a-star.edu.sg
jinyan@i2r.a-star.edu.sg
limsoon@i2r.a-star.edu.sg





Edited by E. Wingender; received December 11, 2003; revised and accepted February 19, 2004; published March 24, 2004



Abstract

The translation initiation site (TIS) prediction problem is about how to correctly identify the TIS in mRNA, cDNA, or other types of genomic sequences. High prediction accuracy is helpful for determining protein-coding regions from nucleotide sequences, an important step in genomic analysis. In this paper, we present an in silico method to predict translation initiation sites in vertebrate cDNA or mRNA sequences. The method consists of three sequential steps. In the first step, candidate features are generated using k-gram amino acid patterns. In the second step, a small number of top-ranked features are selected by an entropy-based algorithm. In the third step, a classification model is built to recognize true TISs by applying support vector machines or ensembles of decision trees to the selected features. We have tested our method on several independent data sets, including two public ones and our own extracted sequences. The experimental results achieved are better than those previously reported on the same data sets. This high accuracy not only demonstrates the feasibility of our method, but also indicates that there might be "amino acid" patterns around the TIS in cDNA and mRNA sequences.

Key words: translation initiation site, feature generation, k-gram amino acid patterns, feature selection, classification



Introduction

The selection of the start site for translation is an important step in the initial phase of protein synthesis. In eukaryotic mRNA, the context of the start codon (normally "AUG") and the sequences around it are crucial for recruitment of the small ribosomal subunit. Thus, characterizing the features around the translation start site will be helpful for a better understanding of translation regulation and of mRNA/cDNA sequences. However, like much biomedical data, DNA and protein sequences do not possess explicit signals or features: a genomic sequence is simply represented by a string over the letters "A", "C", "G", and "T". Therefore, when applying traditional machine learning techniques to this recognition problem, good methodologies are needed for generating explicit features underlying the translation initiation site.

Since 1987, the recognition of TIS has been extensively studied using biological approaches, data mining techniques, and statistical models [Hatzigeorgiou, 2002; Kozak, 1987; Kozak, 1989; Kozak, 1996; Kozak, 1999; Nishikawa et al., 2000; Pedersen and Nielsen, 1997; Peri and Pandey, 2001; Salamov et al., 1998; Zeng et al., 2002; Zien et al., 2000]. In one of these works, Pedersen and Nielsen, 1997 directly fed DNA sequences into an artificial neural network (ANN) trained to recognize true TIS. On a vertebrate data set they achieved 78% sensitivity and 87% specificity on start ATGs (true TISs), giving an overall accuracy of 85%. Zien et al., 2000 studied the same vertebrate data set, but replaced the ANN with support vector machines (SVMs) using different kinds of kernel functions, showing that carefully designed kernel functions are useful for achieving higher TIS prediction accuracy. One of their kernels, the "locality-improved" kernel, emphasizes correlations between sequence positions that are close together; a span of 3 nucleotides up- and down-stream was empirically determined as optimal. More recently, Hatzigeorgiou [Hatzigeorgiou, 2002] built a multi-step ANN system named "DIANA-TIS" for this recognition problem. The system combines a consensus ANN and a coding ANN with the ribosome scanning model, and obtained an overall accuracy of 94% on a data set containing full-length human cDNA sequences. All of these methods used nucleotide sequence data directly; none of them generated new, explicit features for the differentiation between true and false TISs.

There are some related works that make use of statistical features. The program ATGpr [Salamov et al., 1998] used a linear discriminant function combining several statistical features derived from the sequence, each proposed to distinguish true from false TIS. In a more recent work [Nishikawa et al., 2000], an improved version of ATGpr called ATGpr_sim was developed, which used both statistical information and similarities with other known proteins to obtain higher accuracy in predicting the fullness (5'-end completeness) of fragment sequences of cDNA clones. In our previous studies [Liu and Wong, 2003; Zeng et al., 2002], features were generated from nucleotide patterns, and feature selection was conducted to choose significant features for building classification models with machine learning techniques.

Our proposed method consists of the following three steps:

  1. generating new features from k-gram amino acid patterns, with the value of each feature given by the frequency of the corresponding pattern in the amino acid sequences coded from the original cDNA or mRNA sequences;
  2. ranking the newly generated features via their entropy value and selecting important ones;
  3. integrating the selected features by machine learning techniques - support vector machines (SVMs) or an ensemble of decision trees - to recognize true TIS.

Our idea differs from traditional methodologies because it generates new features by transforming the original nucleotide sequence data into amino acid sequence data, and finally into k-gram frequency vectors.

We apply our method to three independent data sets. The first set (data set I) contains 3312 vertebrate sequences [Pedersen and Nielsen, 1997], while the second (data set II) contains 480 completely sequenced and annotated human cDNA sequences [Hatzigeorgiou, 2002]. Our cross-validation accuracy on the first data set is 92.45%, which is better than 89.4%, the best previously reported result on this data set. Our cross-validation accuracy on the second data set is 98.16%; we found no cross-validation results in the literature on this data set for comparison. Note that the results of Hatzigeorgiou [Hatzigeorgiou, 2002] are not directly comparable owing to the use of the ribosome scanning model and a different way of splitting the data. We also conduct experiments testing the accuracy on one data set of a model trained on another data set, and this kind of cross-data-set validation is further applied to genomic data: we formed data set III by extracting a number of well-characterized and annotated human genes of Chromosome X and Chromosome 21 from Human Genome Build30. We highlight this validation because all previously reported results were based on a single data set; the performance of those methods on other independent data was not reported. Our good accuracy in these different experiments not only demonstrates the feasibility of our method, but also indicates that there might be "amino acid" patterns around the TIS in cDNA and mRNA sequences.



Materials and methods


Methods

Our method comprises three steps: (1) generating candidate features from the original sequence data; (2) selecting relevant features using an entropy-based algorithm; and (3) integrating the selected features by a classification algorithm to build a system that correctly recognizes true TIS. An important component of our method is the generation of a new feature space; after that, we follow standard feature selection and feature integration ideas to make TIS predictions.

Feature generation

We generate the new feature space using k-gram (k = 1, 2, 3, ...) amino acid patterns. In this study, a k-gram pattern is defined as k consecutive letters, which can be amino acid symbols or nucleotide symbols [Liu et al., 2003; Liu and Wong, 2003; Zhang, 1998]. Each k-gram amino acid pattern is then used as a new feature. For example, "AR" is a 2-gram pattern consisting of an alanine followed by an arginine. Our aim is to recognize a TIS among a large number of candidate ATGs by analyzing the k-gram amino acid patterns around it. In our study, up-stream and down-stream k-gram amino acid patterns of an ATG (every ATG is a TIS candidate) are treated as different features. Since there are 20 standard amino acids plus 1 stop symbol, there are 2 × 21^k possible k-gram patterns for each k.

Next, we follow the same approach as that presented in Liu et al., 2003 to generate the feature values, using the frequency of the k-gram amino acid patterns. For example: (1) UP-X (DOWN-X) counts the number of times the amino acid letter X appears in the up-stream (down-stream) part of an ATG in its amino acid sequence, for X ranging over the 20 standard amino acid letters and the special stop symbol; (2) UP-XY (DOWN-XY) counts the number of times the two amino acid letters XY appear as a substring in the up-stream (down-stream) part of an ATG in its amino acid sequence, for X and Y ranging over the 20 standard amino acid letters and the special stop symbol. In this paper, we use 1-gram and 2-gram patterns only. Thus, there are 924 (= (21 + 21^2) × 2) possible amino acid patterns, i.e. new features.
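As an illustration, the counting of 1-gram and 2-gram frequencies can be sketched in Python (the function and variable names are ours, not from the paper; "*" stands for the stop symbol):

```python
from itertools import product

# 20 standard amino acids plus "*" for the stop symbol
ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"

def kgram_features(upstream, downstream, k_values=(1, 2)):
    """Count k-gram frequencies separately in the up-stream and
    down-stream amino acid strings of a candidate ATG."""
    feats = {}
    for k in k_values:
        for gram in ("".join(p) for p in product(ALPHABET, repeat=k)):
            feats["UP-" + gram] = sum(
                upstream[i:i + k] == gram
                for i in range(len(upstream) - k + 1))
            feats["DOWN-" + gram] = sum(
                downstream[i:i + k] == gram
                for i in range(len(downstream) - k + 1))
    return feats
```

For k = 1 and 2 this yields exactly (21 + 21^2) × 2 = 924 features per candidate ATG.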

In the framework of the new feature space, the initial nucleic acid sequences are transformed into frequency data following a work flow similar to that given in Figure 4 of Liu et al., 2003. In this TIS study, we set both the up-stream and down-stream windows to 99 bases. Thus, for each ATG in a sequence, a 201-base sequence segment is extracted with the ATG in the center. As such, for data set I we get 3312 sequence segments containing a true TIS and 10063 containing a false TIS; for data set II, 480 sequence segments containing a true TIS and 13581 containing a false TIS. For ease of discussion, each position in an extracted sequence segment is numbered relative to the target ATG of that segment. The "A" in the target ATG is numbered +1, and consecutive down-stream positions - that is, to the right - of the target ATG are numbered from +4 onwards. The first up-stream position - that is, to the left - adjacent to the target ATG is -1, and the numbers decrease for consecutive positions towards the 5' end - that is, the left end of the sequence segment [Liu and Wong, 2003].
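A minimal sketch of the segment extraction and the relative position numbering described above (padding short flanks with "N" is our own illustrative choice, not from the paper):

```python
def extract_windows(seq, flank=99):
    """Extract a (2*flank + 3)-base window around every ATG in seq,
    with the ATG in the centre; short flanks are padded with 'N'."""
    windows = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "ATG":
            up = seq[max(0, i - flank):i].rjust(flank, "N")
            down = seq[i + 3:i + 3 + flank].ljust(flank, "N")
            windows.append((i, up + "ATG" + down))
    return windows

def to_relative(pos_in_window, flank=99):
    """Relative numbering: the 'A' of the target ATG is +1, the first
    down-stream position after the 'G' is +4, the first up-stream
    position is -1, decreasing towards the 5' end."""
    offset = pos_in_window - flank          # 0, 1, 2 for the ATG itself
    return offset + 1 if offset >= 0 else offset
```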

Next, for both the up-stream and down-stream parts of the extracted sequence segments, we code every nucleotide triplet into an amino acid using the standard codon table. Besides these k-gram amino acid patterns, we also encode some biological knowledge as 3 new features. In the original work on the identification of the TIS in cDNA sequences, Kozak developed the first weight matrix from an extended collection of data [Kozak, 1987]. The consensus motif from this matrix is GCC[AG]CCATGG, where (1) a G residue tends to follow a true TIS, i.e. a "G" appears in position +4 of the original sequence window; and (2) a purine (A or G) tends to be found 3 nucleotides up-stream of a true TIS, i.e. an "A" or a "G" appears in position -3 of the original sequence window. Also, according to the ribosome scanning model [Agarwal and Bafna, 1988; Cigan et al., 1988; Kozak, 1989], an mRNA sequence is scanned from left (5') to right (3'), and the scanning stops as soon as an ATG is recognized as the TIS; the remaining ATGs in the mRNA sequence to the right of this ATG are then treated as non-TIS. To incorporate this knowledge into our feature space, we add three boolean features: "DOWN4-G", "UP3-AorG" and "UP-ATG" (whether an in-frame up-stream ATG exists). Thus, there are 927 features in the new feature space.
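The codon translation and the three bio-knowledge features can be sketched as follows (the standard codon table is encoded compactly in NCBI's TCAG order; function names are illustrative):

```python
from itertools import product

BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AA[i] for i, c in enumerate(product(BASES, repeat=3))}

def translate(dna):
    """Code every in-frame nucleotide triplet into an amino acid
    ('*' marks a stop codon; unknown triplets become 'X')."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def knowledge_features(window, flank=99):
    """The three boolean bio-knowledge features, computed from a
    window whose candidate ATG sits at positions flank..flank+2
    (flank is assumed to be a multiple of 3, as with 99 bases)."""
    up, down = window[:flank], window[flank + 3:]
    return {
        "DOWN4-G": down[:1] == "G",              # "G" at position +4
        "UP3-AorG": up[-3:-2] in ("A", "G"),     # purine at position -3
        "UP-ATG": any(up[i:i + 3] == "ATG"       # in-frame up-stream ATG
                      for i in range(0, len(up) - 2, 3)),
    }
```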

After this process of feature generation and data transformation, we get 3312 true TIS samples and 10063 false TIS samples from data set I, 480 true TIS samples and 13581 false TIS samples from data set II. Each sample is a vector of 924 integers and three boolean values. Figure 1 presents a diagram for the data transformation with respect to our new feature space.



Figure 1: A diagram of the data transformation into the new feature space.



Feature selection

To find significant amino acid patterns that efficiently separate true from false TIS, we next conduct feature selection to reduce the dimensionality. In this study, we adopt a method that selects features based on a class entropy measure. With this method, all features are first ranked according to their class entropy values in ascending order, and then a certain number of top-ranked features are chosen to build the model. Please refer to Fayyad and Irani, 1993 for the calculation of class entropy, and to our publications [Liu et al., 2003; Liu and Wong, 2003] for applications of the method to biological data.
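A sketch of the entropy-based ranking, assuming a Fayyad-Irani style binary-cut class entropy (our reconstruction; see the cited papers for the exact criterion):

```python
from math import log2

def class_entropy(labels):
    """Shannon entropy of the class distribution in labels."""
    n = len(labels)
    counts = [list(labels).count(c) for c in set(labels)]
    return -sum((c / n) * log2(c / n) for c in counts)

def best_split_entropy(values, labels):
    """Minimum class-weighted entropy over all binary cuts of one
    feature; lower values indicate a more discriminative feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = class_entropy(labels)                 # entropy with no cut
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                             # cut only between distinct values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) / n) * class_entropy(left) + \
            (len(right) / n) * class_entropy(right)
        best = min(best, e)
    return best

def rank_features(X, y):
    """Rank feature indices in ascending order of class entropy."""
    scored = sorted((best_split_entropy(col, y), j)
                    for j, col in enumerate(zip(*X)))
    return [j for _, j in scored]
```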

Feature integration

In this step, we use support vector machines (SVMs) and our own method of constructing ensembles of decision trees [Li and Liu, 2000] to build classification models that distinguish true TISs from false TISs. SVMs are representative of learning algorithms using kernel functions [Burges, 1998] and have achieved good classification performance in the biological domain [Furey et al., 2000; Liu et al., 2003; Zien et al., 2000]. However, SVM prediction models are not easy to interpret and understand. In contrast to SVMs, decision tree methods output learning models in a comprehensible and simple format (i.e. trees/rules). Therefore, from our decision tree classifier, we expect to obtain some interesting rules for TIS prediction.

SVMs:   An SVM separates samples of different classes by constructing a linear discriminant function built on so-called support vectors, which are a number of critical boundary samples selected from the training data of each class. In the case that no linear separation is possible, the kernel technique is used to implicitly map the training samples into a higher-dimensional space and to learn a separator there [Witten and Frank, 2000]. The discriminant function G(T) for a test sample T is a linear combination of kernels computed at the training data points (i.e. support vectors) and is constructed as

    G(T) = Σ_i a_i y_i k(x_i, T) + b

where x_i are the training data points, y_i are the class labels of these data points (assumed to have been mapped to +1 or -1), k(·, ·) is the kernel function, and b and a_i are parameters that determine the discriminant function and are learned from the training data. The training of an SVM is a quadratic programming problem, and there are several ways to solve it. One of the fastest algorithms was developed by Platt [Keerthi et al., 1999; Platt, 1998]; it solves the quadratic programming problem by sequential minimal optimization (SMO). In our experiments, we use the implementation of SMO in Weka (version 3.2), a free machine learning software package written in Java and developed at the University of Waikato in New Zealand [http://www.cs.waikato.ac.nz/ml/weka/]. The kernel is a polynomial function, and the transformation of the SVM output into probabilities is done by a standard sigmoid function. In most cases in this paper, we present results obtained with linear and quadratic polynomial kernels.
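The discriminant function itself is easy to state in code; the following is a sketch of evaluating G(T) with a polynomial kernel (not the Weka SMO training procedure, which learns the a_i and b):

```python
def poly_kernel(x, z, degree=1):
    """Polynomial kernel k(x, z) = (x . z + 1)^degree; degrees 1 and 2
    correspond to the linear and quadratic kernels used here."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** degree

def discriminant(T, support_vectors, y, alpha, b, degree=1):
    """G(T) = sum_i a_i * y_i * k(x_i, T) + b, with labels y_i in
    {+1, -1}; T is predicted as a true TIS when G(T) > 0."""
    return sum(a_i * y_i * poly_kernel(x_i, T, degree)
               for x_i, y_i, a_i in zip(support_vectors, y, alpha)) + b
```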


Ensembles of decision trees:   Compared with SVMs, decision tree approaches organize what they learn from data in a more comprehensible way - a tree format. Furthermore, every branch of a decision tree (from root to leaf) can easily be translated into a rule of the form "if ..., then ..." [Witten and Frank, 2000]. Figure 2 shows a decision tree for disease diagnosis (taken from http://www.cs.utah.edu/classes/cs5350/slides/d-trees4.pdf). This tree has seven leaves and thus embeds seven rules. For example, its second branch from the left indicates that "if pain is throat and fever is yes, then the patient has flu". Here, pain and fever are features, while throat and yes are values of those features.



Figure 2: A sample decision tree.


In general, decision trees try to find an optimal partitioning of the space of possible observations, mainly by means of successive recursive splits. Most algorithms implement this induction process in a top-down manner [Witten and Frank, 2000]: (1) determine the root feature that is most discriminatory with regard to the entire training data; (2) use the root feature to split the data into non-overlapping subsets; (3) select a significant feature for each of these subsets to recursively partition them until one of the stopping criteria is reached. The most widely used decision tree induction method is C4.5 [Witten and Frank, 2000], which uses the information gain ratio as the measure for choosing node features. C4.5 also serves as the base classifier of our ensemble method.

Ensembles of decision trees have shown significant effectiveness in improving the accuracy of single base classifiers by constructing committees of decision trees [Breiman, 1996; Freund and Schapire, 1996; Li and Liu, 2000]. The main idea of our classification algorithm is to use different top-ranked features as the root nodes of the member trees. Unlike Bagging or Boosting [Breiman, 1996; Freund and Schapire, 1996], which use bootstrapped data, we always build the decision trees from exactly the same set of training samples. In detail, to construct k decision trees (k < n, where n is the number of features describing the data), we take the following steps:

  1. Rank all n features according to a certain criterion, with the best feature in the first position.
  2. Set i = 1.
  3. Use the ith feature as the root node to construct the ith decision tree using the C4.5 algorithm.
  4. If i < k, increase i by 1 and go to (3); otherwise stop.
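The construction above can be sketched as follows; for brevity the sketch grows one-level stumps rooted at each forced feature, whereas the paper grows full C4.5 trees below the root:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(j, X, y):
    """A one-level 'tree' whose root is forced to be feature j: pick
    the cut on feature j with the fewest training errors and predict
    the majority class on each side."""
    best = None
    for t in sorted(set(x[j] for x in X))[:-1]:
        left = [yi for x, yi in zip(X, y) if x[j] <= t]
        right = [yi for x, yi in zip(X, y) if x[j] > t]
        err = sum(l != majority(left) for l in left) + \
              sum(r != majority(right) for r in right)
        if best is None or err < best[0]:
            best = (err, t, majority(left), majority(right))
    if best is None:                       # feature is constant: fall back
        lab = majority(y)
        return lambda x: lab
    _, t, lab_left, lab_right = best
    return lambda x: lab_left if x[j] <= t else lab_right

def build_ensemble(ranked_features, X, y, k=20):
    """Steps (1)-(4) above: the ith member tree is rooted at the ith
    top-ranked feature; every tree sees the same training samples."""
    return [fit_stump(j, X, y) for j in ranked_features[:k]]
```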

In this paper, the number of decision trees k is set to 20. For classification, we define the coverage of a rule in a tree as the percentage of the samples in its class that satisfy the rule. Suppose we have discovered k decision trees from our training set containing true TIS and false TIS samples. Then all the rules derived from the k trees can be categorized into two groups: one containing only rules for true TIS samples, the other containing only rules for false TIS samples. In each group, we rank the rules in descending order of their coverage, such as

    t_1, t_2, t_3, ...   with coverage(t_1) ≥ coverage(t_2) ≥ coverage(t_3) ≥ ...

and

    f_1, f_2, f_3, ...   with coverage(f_1) ≥ coverage(f_2) ≥ coverage(f_3) ≥ ...

where the t's denote the true TIS rules and the f's the false TIS rules.
Given a test sample T, each of the k trees has a rule that fits this sample and therefore gives a prediction for it. Suppose that T satisfies the following k1 true TIS rules and k2 false TIS rules:

    t_{i_1}, t_{i_2}, ..., t_{i_k1}

and

    f_{j_1}, f_{j_2}, ..., f_{j_k2}

where 0 ≤ k1, k2 ≤ k and k1 + k2 = k; the order of these rules is again based on their coverage. When we make a prediction for T, two scores are calculated as follows:

    Score(T)_true = coverage(t_{i_1}) + ... + coverage(t_{i_k1})
    Score(T)_false = coverage(f_{j_1}) + ... + coverage(f_{j_k2})

If Score(T)_true > Score(T)_false, then T is predicted as a true TIS; otherwise, T is predicted as a false TIS. In practice, the tie-score case occurs rarely [Li and Liu, 2000].
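The scoring rule can be sketched as follows (each rule is represented as a (coverage, predicate) pair; this is our reading of the coverage-based scores):

```python
def predict(true_rules, false_rules, T):
    """Coverage-based scoring: sum the coverage of the rules in each
    group that the test sample T satisfies, then compare the scores."""
    score_true = sum(cov for cov, fits in true_rules if fits(T))
    score_false = sum(cov for cov, fits in false_rules if fits(T))
    return "true TIS" if score_true > score_false else "false TIS"
```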


Model evaluation

We adopt standard performance measures defined as follows. Sensitivity measures the proportion of true TIS that are correctly recognized as TIS. Specificity measures the proportion of false TIS that are correctly recognized as false TIS. Precision measures the proportion of the claimed TIS that are indeed TIS. Accuracy measures the proportion of predictions, both for true TIS and false TIS, that are correct. Let TP be the true positives, TN the true negatives, FP the false positives, and FN the false negatives. Then the above measures are defined as: sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), precision = TP/(TP + FP), and accuracy = (TP + TN)/(TP + FN + TN + FP). In addition, we plot an ROC (receiver operating characteristic) curve for the testing on genomic sequences, where we select the same number of true TIS and false TIS samples. An ROC curve clearly illustrates the tradeoff between sensitivity and specificity.
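The four measures translate directly into code:

```python
def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision and accuracy from the
    confusion-matrix counts, as defined in the text."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```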


Data

The first data set (data set I) is provided by Dr. Pedersen. It consists of vertebrate sequences extracted from GenBank (release 95). The sequences were further processed by removing possible introns and joining the remaining exon parts to obtain the corresponding mRNA sequences [Pedersen and Nielsen, 1997]. From these sequences, only those with an annotated TIS, and with at least 10 upstream nucleotides as well as 150 downstream nucleotides, are considered in our studies. The sequences were then filtered to remove homologous genes from different organisms, sequences added multiple times to the database, and those belonging to the same gene families. Since the data are processed DNA, the TIS site is ATG - that is, a place in the sequence where "A", "T", and "G" occur in consecutive positions in that order. We are aware that some TIS sites may be non-ATG; however, this is reported to be rare in eukaryotes [Kozak, 1999] and is not considered in this paper.

An example entry from this data set is given in Figure 3. There are 4 ATGs in this example. The second ATG is the TIS; the other 3 ATGs are non-TIS (false TIS). ATGs to the left of the TIS are termed up-stream ATGs, so the first ATG in the figure is an up-stream ATG. ATGs to the right of the TIS are termed down-stream ATGs, so the third and fourth ATGs in the figure are down-stream ATGs. The entire data set contains 3312 sequences, with a total of 13375 ATGs, of which 3312 (24.76%) are true TIS and 10063 (75.24%) are false. Of the false TISs, 2077 (15.5%) are up-stream ATGs.



Figure 3: An example annotated sequence from data set I. The 4 occurrences of ATG are underlined. The second ATG is the TIS. The other 3 ATGs are non-TIS. The 99 nucleotides up-stream of the TIS are marked by an overline. The 99 nucleotides down-stream of the TIS are marked by a double overline. The ".", "i", and "E" are annotations indicating whether the corresponding nucleotide is up-stream (.), TIS (i), or down-stream (E).


The second data set (data set II) is provided by Dr. Hatzigeorgiou. The data collection was first made on the protein database Swissprot: all the human proteins whose N-terminal sites are sequenced at the amino acid level were collected and manually checked [Hatzigeorgiou, 2002]. Then the full-length mRNAs for these proteins, whose TIS had been indirectly experimentally verified, were retrieved. The data set consists of 480 human cDNA sequences in standard FASTA format. In these sequences, there are as many as 13581 false TIS, accounting for 96.59% of the total number of ATGs; however, only 241 (1.8%) of them are up-stream ATGs.

To reduce the similarity between the training and testing data, a BLAST search between data sets I and II was performed. Two sequences are considered similar if they produce a BLAST hit with an identity > 75%. We found 292 similar sequences and removed them from data set II. As a result, after the removal of similar sequences, data set II contains 188 true TIS among a total of 5111 candidate ATGs.

Besides these two data sets that have been analyzed by others, we also formed our own genomic ATG data set (data set III) by extracting a number of well-characterized and annotated human genes of Chromosome X and Chromosome 21 from Human Genome Build30. Note that we eliminated genes whose annotations were generated by prediction tools. The resulting set consists of 565 sequences from Chromosome X and 180 sequences from Chromosome 21. These 745 sequences containing true TIS are used as positive data in experiment-d. For negative data, we extracted a set of sequences around all ATGs in these two chromosomes, excluding the annotated ones.



Results

To verify the effectiveness of our method, we designed a series of experiments on three data sets:

  1. Conducting computational cross validations on data set I and data set II separately. In k-fold cross validation, the data set is divided randomly into k disjoint subsets of approximately equal size, in each of which the classes are represented in approximately the same proportions as in the full data set [Witten and Frank, 2000]. We train the model k times; each time, one of the subsets is held out from training, feature selection and classification model building are conducted on the remaining k-1 subsets, and the resulting model is evaluated on the holdout set. After all subsets have been tested, an overall performance is produced.
  2. Selecting features and building classification model using data set I. Applying the well-trained model to data set II to obtain a blind testing accuracy.
  3. Incorporating the idea of ribosome scanning into the classification model.
  4. Applying the model built in experiment-b to genomic sequences.
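The cross-validation protocol of experiment-a, with feature selection performed inside the training folds only, so the holdout set never influences the selection, can be sketched as follows (helper names are ours):

```python
import random

def stratified_folds(y, k=3, seed=0):
    """Split sample indices into k folds, keeping the class
    proportions close to those of the full data set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def cross_validate(X, y, select, fit, k=3):
    """Each round: feature selection and model building on k-1 folds,
    evaluation on the held-out fold; returns overall accuracy.
    select(X, y) returns feature indices; fit(X, y) returns a
    classifier function."""
    folds = stratified_folds(y, k)
    correct = total = 0
    for held_out in folds:
        train_idx = [i for fold in folds if fold is not held_out
                     for i in fold]
        feats = select([X[i] for i in train_idx], [y[i] for i in train_idx])
        model = fit([[X[i][j] for j in feats] for i in train_idx],
                    [y[i] for i in train_idx])
        for i in held_out:
            correct += model([X[i][j] for j in feats]) == y[i]
            total += 1
    return correct / total
```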

Validation in different data sets

For a strict comparison with our previous results presented in Liu and Wong, 2003 and Zeng et al., 2002, we conduct the same 3-fold cross validation. Table 1 shows our results on data sets I and II when the top 100 features are selected by the entropy-based algorithm. Using the simple linear kernel function, SVMs achieve an accuracy of 92.45% at 80.19% sensitivity and 96.48% specificity on data set I. This is better than the accuracy of 89.4% at 74.0% sensitivity and 94.4% specificity, the best previously reported result on the same data set [Zeng et al., 2002]. On data set II, SVMs (with the linear kernel) achieve an accuracy of 98.16% at 63.75% sensitivity and 99.67% specificity. Note that we could not find previously reported results on this data set under a similar cross validation.

Table 1: Results of 3-fold cross validation on the two data sets when the 100 top-ranked features are considered (experiment-a). SVMs(linear/quad) means the classification model is built with the linear/quadratic polynomial kernel function.
Data  Algorithm       Sensitivity  Specificity  Precision  Accuracy
I     SVMs(linear)    80.19%       96.48%       88.24%     92.45%
I     SVMs(quad)      80.19%       96.17%       87.34%     92.22%
I     Ensemble Trees  76.18%       96.14%       86.67%     91.20%
II    SVMs(linear)    63.75%       99.67%       87.18%     98.16%
II    SVMs(quad)      71.25%       99.42%       81.24%     98.46%
II    Ensemble Trees  83.54%       97.67%       55.93%     97.19%


Validation across two data sets

The good cross-validation results within the individual data sets encouraged us to extend our study across the two data sets. In this experiment, we use the whole of data set I as training data to select features and build the classification model; we then test the trained model on data set II to obtain a test accuracy. Before this validation test, we removed from data set II the 292 sequences that are similar to the training data. Using the classification model learnt from the 100 top-ranked features of data set I, we obtained a test accuracy of 89.42% at 96.28% sensitivity and 89.15% specificity on data set II using SVMs with the linear kernel function. The training accuracy is 92.77% at 80.68% sensitivity and 96.75% specificity (on data set I). We note that the testing accuracy on the original data set II (without the removal of the similar sequences) is quite similar. See Table 2 for a summary of these two results. Remarkably, this validation across the two data sets shows a much better sensitivity on data set II than that achieved in the 3-fold cross validation on this data set. A reason may be that only 3.41% of the ATGs in data set II are true TIS, which leads to extremely unbalanced numbers of samples between the two classes. This bias is rectified significantly by the model built on data set I, where the population sizes of true vs. false TIS are more balanced.

Table 2: Classification accuracy when using data set I for training and data set II for testing (experiment-b). The rows labelled II** give the testing accuracy on data set II before the similar sequences were removed.
Data         Algorithm       Sensitivity  Specificity  Precision  Accuracy
I (train)    SVMs(linear)    80.68%       96.75%       89.10%     92.77%
I (train)    SVMs(quad)      86.05%       98.14%       93.84%     95.15%
I (train)    Ensemble Trees  85.54%       97.91%       93.10%     94.85%
II (test)    SVMs(linear)    96.28%       89.15%       25.31%     89.42%
II (test)    SVMs(quad)      94.14%       90.13%       26.70%     90.28%
II (test)    Ensemble Trees  92.02%       92.71%       32.52%     92.68%
II** (test)  SVMs(linear)    95.21%       89.74%       24.69%     89.92%
II** (test)  SVMs(quad)      94.38%       89.51%       24.12%     89.67%
II** (test)  Ensemble Trees  87.70%       93.26%       28.60%     92.11%

Incorporation of scanning model

Hatzigeorgiou [Hatzigeorgiou, 2002] reported a high accuracy on data set II with an integrated method that combines a consensus ANN and a coding ANN with a ribosome scanning model. The model scans from the 5' end of a cDNA sequence and predicts the TIS at the first ATG in a good context [Agarwal and Bafna, 1988; Cigan et al., 1988; Kozak, 1989]. The rest of the ATGs in the cDNA sequence to the right of this ATG are then automatically classified as non-TIS. Thus, one and only one ATG is predicted as the TIS per cDNA sequence. We also incorporate this scanning model into our experiment. This time, in a sequence, we test the ATGs in turn from left to right, until one of them is classified as a TIS. A prediction on a sequence is correct if and only if the true TIS itself is predicted as the TIS. Since the scanning model indicates that the first ATG in an optimal nucleotide context will be the TIS, a higher prediction accuracy is expected if only up-stream ATGs and true TIS are used. Thus, we ignore all down-stream ATGs in data set I and obtain a new training set containing only true TISs and their up-stream ATGs. Feature selection and classification model learning are then based on this new training data. Table 3 shows our results with the scanning model in use. Under this scanning model, Hatzigeorgiou reported that 94% of the TIS were correctly predicted on data set II [Hatzigeorgiou, 2002]. Since in that paper [Hatzigeorgiou, 2002] the data set was split into training and testing parts in a different way, the results reported there are not directly comparable with ours.
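The scanning procedure can be sketched as follows (classify stands for any of the trained models above; the candidate list is assumed ordered from the 5' end):

```python
def scan_predict(atg_candidates, classify):
    """Ribosome-scanning prediction: test candidate ATGs from the 5'
    end (left) to the 3' end (right) and stop at the first one the
    classifier accepts; all ATGs to its right are then non-TIS."""
    for pos, window in atg_candidates:      # assumed ordered 5' -> 3'
        if classify(window):
            return pos                      # one and only one TIS per sequence
    return None                             # no ATG classified as TIS
```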

Table 3: Classification accuracy under the scanning model when using data set I (3312 sequences) for training and data set II (188 sequences) for testing (experiment-c). The rows labelled II** give the testing accuracy on data set II before the similar sequences were removed (480 sequences). NoCorPred is the number of sequences whose TIS is correctly predicted.
Data         Algorithm       NoCorPred  Accuracy
I (train)    SVMs(linear)    3161       95.44%
I (train)    SVMs(quad)      3156       95.29%
I (train)    Ensemble Trees  3083       93.09%
II (test)    SVMs(linear)    174        92.55%
II (test)    SVMs(quad)      172        91.49%
II (test)    Ensemble Trees  176        93.62%
II** (test)  SVMs(linear)    453        94.38%
II** (test)  SVMs(quad)      450        93.75%
II** (test)  Ensemble Trees  452        94.17%

Testing on genomic sequences

In order to further evaluate the feasibility and robustness of our method, we apply the model built in experiment-b to our own prepared data (data set III), which contains gene sequences from Chromosome X and Chromosome 21. Using the simple linear kernel function, SVMs give 397 correct predictions out of a total of 565 true TISs found on Chromosome X, and 132 correct predictions out of a total of 180 true TISs on Chromosome 21. The prediction rates are 70.27% and 73.33%, respectively. One point that needs to be addressed here is that in this validation we removed the feature built on the ribosome scanning model, since that model does not hold for genomic data. To illustrate the tradeoff between prediction sensitivity and specificity, we randomly selected the same number of sequences containing non-start ATGs (false TIS) from our own extracted negative data set. Figure 4 gives the ROC curve showing the changes in prediction accuracy on true and false TIS samples.



Figure 4: ROC curve of our model for predicting TIS in the genomic data of Chromosome X and Chromosome 21 (experiment-d). Our SVM model is built on the linear kernel function.



Discussion

In this study, we proposed a method for the recognition of TIS in cDNA or mRNA sequences via generating amino acid patterns. We designed a series of experiments by applying our method to public data sets as well as our own extracted sequences. Our testing accuracies are better than the best previously reported ones (where available) on the same data sets. Most importantly, we not only conducted cross validation within each individual data set, but also validated across different data sets, including genomic data. The success of such validation may indicate that there are common amino acid patterns around true TISs in cDNA or mRNA sequences.


Significant features

"What are the key features for predicting TIS?" To answer this question, let us look at an interesting observation on the features selected in the 3-fold cross validation on data set I in our experiment-a. Table 4 shows the ranking positions of the 10 top-ranked features selected by the entropy-based algorithm for each fold. Observe that they are the same features in every fold, though their ordering varies slightly from one fold to another. This suggests that these features, or more precisely these amino acid patterns, are likely patterns around true or false TISs. Furthermore, "UP-ATG" can be explained by the ribosome scanning model [Agarwal and Bafna, 1988; Cigan et al., 1988]: seeing such an up-stream ATG makes the candidate ATG less likely to be the TIS. "DOWN-STOP" counts the in-frame stop codons down-stream of the target ATG; it is consistent with the biological process in which the translation of in-frame codons into amino acids stops upon encountering an in-frame stop codon, so seeing such a down-stream stop codon makes the candidate protein improbably short. "UP3-AorG" corresponds to the well-known Kozak consensus sequence [Kozak, 1987]. Remarkably, these amino acid patterns, except "DOWN-L", all contain the "G" residue. Note also that "UP-M" is one of the top features in each fold, but we exclude it as redundant, since UP-ATG is true if and only if UP-M > 0. The significance of these features is further verified by the observation that both sensitivity and specificity drop greatly if all of them are excluded from prediction. However, we do not observe an obvious decrease when we remove any single one of them from the classification model. In addition to the results with the 100 top-ranked features, we also obtained cross-validation results using the whole feature space (i. e. without feature selection) and other numbers (such as 5, 10, 20, 50, 200 and 300) of top-ranked features.
We found that (all the following results are based on SVMs with the linear kernel function): (1) using the whole feature space did not give the best results; in fact, we obtained an accuracy of only 90.94%, at 79.86% sensitivity and 94.58% specificity, when running 3-fold cross validation on data set I, which is not as good as the result on the 100 top-ranked features; (2) using a very small number of features did not achieve good results either; for example, the accuracy is only 87.44% when only the top 5 features are used on data set I; (3) the results with the top 200, 300 or more features show little difference from the result with the top 100 features. All these observations indicate that the results achieved using the top 100 features are reasonable. They also suggest that in the real biological process of translation there are factors other than the Kozak consensus that may regulate the recognition of TIS.

Table 4: Ranking of the top 10 features selected by the entropy-based algorithm as relevant in each of the 3 folds of data set I. Feature "UP-ATG" indicates whether an in-frame up-stream ATG exists (boolean type). Feature "UP3-AorG" tests whether purine A or G tends to be found 3 nucleotides up-stream of a true TIS (boolean type). Feature "UP(DOWN)-X" counts the occurrences of an in-frame (relative to the candidate ATG) triplet coding for the amino acid letter X in the up-stream (down-stream) part of a candidate ATG. Feature "DOWN-STOP" is the occurrence of in-frame stop codons down-stream of a candidate ATG.
Fold UP-ATG DOWN-STOP UP3-AorG DOWN-A DOWN-V UP-A DOWN-L DOWN-D DOWN-E UP-G
1 1 2 4 3 6 5 8 9 7 10
2 1 2 3 4 5 6 7 8 9 10
3 1 2 3 4 5 6 8 9 7 10
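The features of Table 4 can be computed directly from the nucleotide context of a candidate ATG. The following is an illustrative sketch only; the window length of 99 nucleotides up-stream and down-stream is an assumption, not necessarily the setting used in our experiments.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def tis_features(seq, pos, window=99):
    """Features of the candidate ATG at seq[pos:pos+3] (cf. Table 4).
    `window` is an assumed context length for illustration."""
    up = seq[max(0, pos - window):pos]
    down = seq[pos + 3:pos + 3 + window]
    # in-frame codons relative to the candidate ATG
    up_codons = [up[i:i + 3] for i in range(len(up) - 3, -1, -3)]
    down_codons = [down[i:i + 3] for i in range(0, len(down) - 2, 3)]
    feats = {
        "UP-ATG": "ATG" in up_codons,                 # boolean
        "UP3-AorG": len(up) >= 3 and up[-3] in "AG",  # Kozak position -3
        "DOWN-STOP": sum(CODON_TABLE[c] == "*" for c in down_codons),
    }
    # 1-gram amino acid counts: UP-X and DOWN-X
    for prefix, codons in (("UP-", up_codons), ("DOWN-", down_codons)):
        for c in codons:
            aa = CODON_TABLE[c]
            if aa != "*":
                feats[prefix + aa] = feats.get(prefix + aa, 0) + 1
    return feats
```

Note that the redundancy discussed above is visible here: any in-frame up-stream ATG contributes both to the boolean UP-ATG and to the count UP-M, so UP-ATG is true exactly when UP-M > 0.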
 

Classification algorithms

For the classification methods, generally speaking, SVMs perform slightly better than our ensembles of decision trees in terms of prediction accuracy. However, our tree committee achieves very good sensitivity in the 3-fold cross validation on data set II, where the number of true TISs is much smaller than the number of false TISs. Besides, decision trees can output comprehensible rules that reveal the logic behind learning and prediction. Some interesting and biologically sensible rules with large coverage that we discovered are listed below.

  1. If UP-ATG = 'Y' and DOWN-STOP > 0, then prediction is false TIS.
  2. If UP3-AorG = 'N' and DOWN-STOP > 0, then prediction is false TIS.
  3. If UP-ATG = 'N' and DOWN-STOP = 0 and UP3-AorG = 'Y', then prediction is true TIS.
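These three rules can be read as a small predicate over the features named above. The sketch below is hypothetical (the dictionary-based feature representation is an assumption); since DOWN-STOP is a count and cannot be negative, rule 3's stop-codon condition is encoded as DOWN-STOP = 0.

```python
def apply_rules(f):
    """Apply the three decision-tree rules to a feature dict `f` with keys
    "UP-ATG" (bool), "DOWN-STOP" (count) and "UP3-AorG" (bool).
    Returns False (false TIS), True (true TIS), or None if no rule fires."""
    if f["UP-ATG"] and f["DOWN-STOP"] > 0:
        return False          # rule 1: false TIS
    if not f["UP3-AorG"] and f["DOWN-STOP"] > 0:
        return False          # rule 2: false TIS
    if not f["UP-ATG"] and f["DOWN-STOP"] == 0 and f["UP3-AorG"]:
        return True           # rule 3: true TIS
    return None               # no rule covers this candidate
```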

On the other hand, in our series of experiments, SVMs built on quadratic polynomial kernels show no advantage over those built on simple linear kernel functions. Note also that quadratic kernels require much more training time.


Comparison with a model built on nucleotide patterns

In [Zeng et al., 2002], we studied data set I using k-gram nucleotide patterns and several classification methods, including SVMs, Naive Bayes, neural networks and decision trees. In that study, feature selection was also conducted, but with a method called CFS (Correlation-based Feature Selection), which selects a set of features that are highly correlated with the classes but little correlated with each other. The best accuracy achieved in the 3-fold cross validation was 89.4%, at 74.0% sensitivity and 94.4% specificity, when some 3-gram nucleotide patterns were used. This result is not as good as the one presented in this paper: 92.45% accuracy at 80.19% sensitivity and 96.48% specificity. However, the good features selected in the two experiments are highly consistent. Besides the 3 features built on biological knowledge, CFS picked out down-stream TAA (stop codon), TAG (stop codon), TGA (stop codon), CTG (amino acid L), GAC (D), GAG (E) and GCC (A). If we recode these 3-gram nucleotide patterns into 1-gram amino acid patterns, they are all among the best features listed in Table 4 above. On the other hand, although there are no 2-gram amino acid patterns among the 10 best features in Table 4, some of them are indeed included in the top 100 feature set used to achieve the better results in this study. Note that our previous study [Zeng et al., 2002] also showed that using 4-gram or 5-gram nucleotide patterns did not help improve the prediction performance.


Comparison with ATGpr

As mentioned earlier, ATGpr [Nishikawa et al., 2000; Salamov et al., 1998] is a TIS prediction program that makes use of a linear discriminant function, several statistical measures derived from the sequence, and the ribosome scanning model. It can be accessed via http://www.hri.co.jp/atgpr/. When searching for TIS in a given sequence, the system outputs several (5 by default) ATGs in decreasing order of confidence; we always take the ATG with the highest confidence as the predicted TIS. For the 3312 sequences in our data set I, ATGpr correctly predicts the true TIS in 2941 (88.80%) of them. This accuracy is 6.64% lower than the one we achieved. For our data set II, the true TIS in 442 (92.0%) of 480 sequences is properly recognized, which is about 2.38% lower than the accuracy we obtained. Our results quoted here are based on the SVM model using the linear kernel function.

When we fed the genomic data used in our experiment-d to ATGpr, the program gave correct TIS predictions for 128 (71.11%) of 180 Chromosome 21 gene sequences and 417 (73.81%) of 565 Chromosome X gene sequences, for an overall sensitivity of 73.15%. On the other hand, ATGpr achieved 70.47% specificity on the same number of negative sequences used in our experiment-d. From the ROC curve in Figure 4, our prediction specificity is around 80% when the sensitivity is set to 73.15%, about 9.5% higher than that of ATGpr. This indicates that our program may also outperform ATGpr when dealing with genomic sequences.



Acknowledgement

We wish to thank Dr. Anders G. Pedersen and Dr. Artemis G. Hatzigeorgiou for providing their data sets.



References