In Silico Biology 2, 0043 (2002); ©2002, Bioinformation Systems e.V.
1 Science of Bioresources Program, United Graduate School of Agricultural Sciences (UGAS)
Iwate University, 18-8, Ueda 3-chome, Morioka, Iwate 020-8550, Japan
2 Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University
Hirosaki 036-8561, Japan; Tel./Fax: +81-172-39-3638
E-mail: slsimi@si.hirosaki-u.ac.jp
* corresponding author
Edited by E. Wingender; received May 11, 2002; revised and accepted July 16, 2002; published July 30, 2002
The reported performance of existing transmembrane (TM) topology prediction methods has often been based on evaluations that neglected the risk of signal peptides (SPs) being predicted as putative TM segments. Here, we evaluated 12 selected TM topology prediction methods (TMpred, TopPred II, DAS, TMAP, MEMSAT 2, SOSUI, PRED-TMR2, TMHMM 2.0, HMMTOP 2.0, SPLIT 3.5, TM Finder, and MPEx) for the effect of SP on prediction performance under three SP treatments: "remain" (untreated), "removed first", and "removed later". The results showed that the presence of SP significantly reduced the prediction accuracy of the 12 selected methods for all three predicted attributes (the number of transmembrane segments (TMSs), the number of TMSs plus position, and the N-tail location) and for the predicted topology (the combined prediction of the three attributes). In particular, prediction accuracies were lower when the SP was left untreated ("remain") and increased significantly when the SP was removed either first or later. The difference between the "removed first" and "removed later" treatments, however, was not statistically significant. In addition, we found that machine learning-based prediction methods were less affected by the presence of SP than hydropathy-based methods, although the risk of degraded prediction performance remains, to a lesser degree. Thus, when performing genome-wide analysis, the SP issue should be addressed during TM topology prediction.
Key words: Signal peptide, transmembrane, evaluation, transmembrane topology prediction, prediction analysis
In the two decades since Kyte and Doolittle (1982) proposed a simple hydropathy-based method to predict the location of transmembrane (TM) segments in a protein sequence, several TM topology prediction methods have been put forward for use by the research community. These methods range from enhancements of the original Kyte and Doolittle approach, either by additional processing of the simple hydrophobicity scales [Klein et al., 1985; Claros and von Heijne, 1994; Hirokawa et al., 1998] or by the use of more refined indices [Hofmann and Stoffel, 1993; Juretic et al., 1993; Deber et al., 2001; Pasquier et al., 1999; Jayasinghe et al., 2001], to implementations of complex algorithms such as dynamic programming [Jones et al., 1994; McGuffin et al., 2000] that determine the best topology models. Other methods use additional information from multiple sequence alignments [Persson and Argos, 1997]. Recent methods have adopted machine learning approaches, such as neural networks [Rost et al., 1996] and hidden Markov models [Krogh et al., 2001; Tusnady and Simon, 2001], to predict TM topology.
However, an often-overlooked aspect in evaluating the performance of these methods is the possibility that a signal peptide (SP) region in the protein sequence is predicted as a putative TM segment, which is not remote considering that the SP possesses a hydrophobic h-region [von Heijne, 1986b; Nakai, 2000; Rusch and Kendall, 1995]. Apparently, developers of these methods have assumed that their methods would likely miss the SP region during TM prediction analysis for one reason or another. Whether this implicit assumption has a sound basis or is merely speculative is what we set out to determine in this study; if the latter is true, we further aim to identify the appropriate treatment for the SP issue during prediction analysis.
In genome-wide analyses, the likely SP region has been treated in several ways. It was either masked from topological calculations [Jones, 1998] or omitted [Arkin et al., 1997; Stevens and Arkin, 2000] when picked up as the first TM segment by the TM topology prediction method within a specified region from the N-terminus. One treatment simply avoided the SP by requiring a minimum number of predicted TM segments for further analysis [Wallin and von Heijne, 1998], while another treated the SP directly by removing it from the query sequence before running the TM topology prediction method [Kihara and Kanehisa, 2000]. In a recent paper by Krogh et al. (2001), however, the SP was removed only after the prediction of TM helices.
A recent study by Möller et al. (2001), which conducted an evaluation of methods for predicting membrane-spanning regions, showed that a majority of the existing prediction methods have the tendency to predict the hydrophobic region of the SP as the first TM segment. However, in their study, it was not shown to what degree the SP affected the prediction accuracy of the different methods, except to show that most of the methods do not discriminate well between a SP and a membrane-spanning region.
Even if predicting the SP as a membrane-spanning region means an overprediction of only one TM segment, it has serious implications for inferring protein function from the predicted topology [Lao and Shimizu, 2001a]. In addition, it has a deleterious effect on the overall prediction accuracy, especially when accuracy is measured on a per-sequence-entry basis, where an overprediction of just one TM segment results in a wrong prediction for the whole protein sequence entry.
In our previous study [Lao et al., 2002], we used three SP treatments ("remain", "removed first", and "removed later") to demonstrate the influence of SP on TM topology prediction analysis. We also looked into the possible effect of transit peptides but, unlike SP, found that they do not pose any potential problem during TM prediction analysis. In both cases, however, we used the actual SP and transit peptide annotations of the protein sequence entries. Here, we first used an SP prediction method, SignalP [Nielsen et al., 1997a; 1997b; Nielsen and Krogh, 1998], to predict the probable SP region and used this predicted region when applying the three SP treatments during TM topology prediction analysis. Moreover, we increased the number of TM topology prediction methods in the present evaluation, including two relatively new prediction methods. We then compared the results with those obtained in the previous study to determine whether there are any differences in the prediction accuracies and in the observed statistical significance. We decided not to include transit peptides in the present evaluation since no existing method for predicting transit peptides is recommended. Besides, we have already shown that they are not a pitfall in TM prediction analysis.
As in the previous study, when performing prediction analysis, the SP region (predicted by SignalP-HMM) of the query TM sequence was subjected to the following treatments: "remain", "removed first", and "removed later". Three attributes were considered in assessing prediction performance: the number of predicted TM segments (TMSs), the number of predicted TMSs plus TMS-position, and the predicted N-tail location. The combination of the three predicted attributes determines the predicted TM topology. Prediction accuracy was measured on a per-sequence-entry basis. The results show that the prediction performance of the selected TM topology prediction methods hardly changes if the predicted SP region does not deviate significantly from the actual SP region. Furthermore, newer prediction methods are still likely to predict the SP as a TM segment. Thus, to assume that existing TM prediction methods are unlikely to predict the SP as a TM segment is a dangerous proposition.
Although model-based prediction methods tend to be less affected by the presence of SP than hydropathy-based methods, the risk of degraded prediction performance remains, to a lesser degree. We further report that removing the SP either first or later does not result in a significant difference in prediction accuracy. What is clear is that the SP issue should be resolved either before or after TM prediction analysis, depending on the prediction method used.
Data set
TM protein sequences with SP annotation and with high experimental evidence were collected from the SWISS-PROT database [Bairoch and Apweiler, 2000], Möller's TM data set [Möller et al., 2000], and TMPDB [Shimizu and Nakai, 1994; Ikeda et al., 2002]. The collected sequence entries were checked for sequence similarity (<30% for every pair of sequence entries) using CLUSTALW [Thompson et al., 1994]. Twenty-three entries were retained after filtering and used as the TM SP data set. The selected sequences are A4_HUMAN, AMD2_XENLA, CD7_HUMAN, GHR_HUMAN, GLPA_HUMAN, GP21_RAT, HA12_MOUSE, MPRD_BOVIN, OSTB_YEAST, PGDR_MOUSE, RIB1_HUMAN, RIB2_HUMAN, RMP1_HUMAN, RMP2_HUMAN, RMP3_HUMAN, STS_HUMAN, GLK2_RAT, FRIZ_DROME, and LSHR_RAT for eukaryotes, and COAB_BPFD, COAB_BPPF1, COX2_PARDE, and CYOA_ECOLI for prokaryotes.
Selection of the TM topology prediction methods
Twelve TM topology prediction methods that can be accessed publicly through the Internet were selected for assessment of SP effect in prediction performance as well as sensitivity in predicting the SP region as TM segment, namely: TMpred [Hofmann and Stoffel, 1993], TopPred II [Claros and von Heijne, 1994], DAS [Cserzo et al., 1997], TMAP [Persson and Argos, 1997], MEMSAT 2 [Jones et al., 1994; McGuffin et al., 2000], SOSUI [Hirokawa et al., 1998], PRED-TMR2 [Pasquier et al., 1999], TMHMM 2.0 [Krogh et al., 2001], HMMTOP 2.0 [Tusnady and Simon, 2001], SPLIT 3.5 [Juretic et al., 1993; Juretic and Lucin, 1998], TM Finder [Deber et al., 2001], and MPEx (http://blanco.biomol.uci.edu/mpex). TMpred uses a combination of several statistical preference matrices, derived from an expert-compiled data set of membrane proteins, for scoring to predict TM helices. TopPred II introduces a sliding trapezoid window to detect hydrophobic segments and evaluates the generated topology models using the "positive-inside rule" [von Heijne, 1986a]. DAS compares low-stringency dot-plots of the query sequence against a collection of non-homologous TM proteins using the RreM scoring matrix. TMAP utilizes the extra information coming from multiple sequence alignments of homologous proteins to determine membrane-spanning segments. MEMSAT 2 employs statistical tables (log-likelihoods) compiled from well-characterized TM protein data and uses a dynamic programming algorithm to recognize TM topology models by expectation maximization. SOSUI combines hydrophobicity, relative and net charges, and protein length to detect TM helices. PRED-TMR2 is an extension of PRED-TMR that compiles scores for the termini of each putative segment using propensities of amino acid residues at the termini of TM helices collected by the authors. 
TMHMM 2.0 is the latest version of TMHMM, which implements a cyclic hidden Markov model with seven states for TM-helix core, TM-helix caps on the N- and C-terminal sides, loop region on the cytoplasmic side, two loop regions on the non-cytoplasmic side, and a globular domain state in the middle of each loop region. Likewise, HMMTOP 2.0 is an updated version of HMMTOP that also uses a hidden Markov model, distinguishing the following five structural states: inside loop region, inside TM-helix cap, TM helix, outside TM-helix cap, and outside loop region. SPLIT 3.5 is the newer version of SPLIT, which utilizes integrated multiple scales for amino acids to predict TM regions. TM Finder uses a combination of HPLC-derived hydrophobicity and nonpolar phase helical propensity scales to detect TM segments. MPEx makes use of a whole-residue hydropathy scale derived from the Wimley-White experiments that includes the backbone constraints. These methods were included in the evaluation because they are available on the web and are therefore likely to be used by the research community. Prediction analysis was performed using the default settings for all methods; TMAP was run in single-sequence mode, and only the first predicted topology model was taken for TopPred II.
Predicting the SP region
The SP prediction system SignalP V2.0.b2 (http://www.cbs.dtu.dk/services/SignalP-2.0) was used to predict the probable SP region of each of the 23 protein sequence entries contained in the SP data set. Specifically, the output of the hidden Markov model method was selected as the reference. The choice of SignalP V2.0.b2 over other SP prediction systems was based on the results of a recent evaluation of currently available SP prediction systems, which found it to offer the best performance [Menne et al., 2000].
Evaluating for the effect of SP
During TM topology prediction analyses by the 12 selected prediction methods, three treatments of the predicted SP regions were applied. First, the predicted SP region of the query sequence was not removed ("remain") and prediction analysis was carried out immediately. Second, the predicted SP region was removed before applying the TM topology prediction methods ("removed first"). Third, prediction was carried out as in the first treatment, but if the predicted first TM segment overlapped with the predicted SP region, then the predicted first TM segment was excluded from the number of predicted TM segments and the predicted N-tail location was automatically changed to the opposite side ("removed later"). Correctness of the prediction was then assessed based on the remaining predicted TM segments and the altered N-tail location. If, on the other hand, the predicted first TM segment did not overlap any part of the predicted SP region, no changes were made to the prediction outcome.
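The "removed first" and "removed later" treatments described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the study: the names `Prediction`, `remove_first`, and `remove_later` are hypothetical, segments are assumed to be 1-based inclusive residue ranges, and the predicted SP is assumed to span residues 1 to `sp_end`.

```python
from dataclasses import dataclass
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end) residue positions


@dataclass
class Prediction:
    """Hypothetical container for the output of a TM topology predictor."""
    segments: List[Segment]  # predicted TM segments, ordered from the N-terminus
    n_tail: str              # predicted N-tail location: "in" or "out"


def remove_first(sequence: str, sp_end: int) -> str:
    """'Removed first': cut the predicted SP (residues 1..sp_end) before prediction."""
    return sequence[sp_end:]


def remove_later(pred: Prediction, sp_end: int) -> Prediction:
    """'Removed later': post-process a prediction made on the untreated sequence.

    If the first predicted TM segment shares at least one residue with the
    predicted SP (residues 1..sp_end), drop that segment and flip the N-tail
    location; otherwise return the prediction unchanged.
    """
    if sp_end > 0 and pred.segments and pred.segments[0][0] <= sp_end:
        flipped = "out" if pred.n_tail == "in" else "in"
        return Prediction(pred.segments[1:], flipped)
    return pred
```

With "removed first", segment positions reported by the predictor refer to the truncated sequence, so they would need to be shifted by `sp_end` before being compared with annotations given on the full-length sequence; that bookkeeping is omitted here.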
Three attributes were considered in determining the predicted topology, namely: the number of predicted TMSs, the number of predicted TMSs plus TMS-position, and the predicted N-tail location. Prediction performance on each of the three predicted attributes as well as the predicted topology was assessed in terms of the prediction accuracy ratings obtained by applying the three SP treatments.
A predicted TM segment was counted as correct if the distance between its center position and that of the actual TM segment was less than or equal to 11 residues. The number of predicted TMSs and the predicted N-tail location were counted as correct if they matched the annotated number of TMSs and the annotated N-tail location of the query sequence, respectively. Since prediction accuracies were measured on a per-sequence-entry basis, a wrong prediction for at least one TM segment of a multi-spanning TM sequence results in a wrong prediction for the whole sequence entry.
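The correctness criteria above can be written out as a short check. The sketch below is illustrative only: the function names are hypothetical, segments are assumed to be (start, end) residue ranges, and predicted segments are assumed to be paired with actual segments in N- to C-terminal order, which the text does not spell out.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start, end) residue positions, inclusive


def center(seg: Segment) -> float:
    """Center position of a segment."""
    return (seg[0] + seg[1]) / 2


def segment_correct(pred: Segment, actual: Segment, max_dist: float = 11) -> bool:
    """A predicted segment is correct if its center lies within 11 residues."""
    return abs(center(pred) - center(actual)) <= max_dist


def assess_entry(pred_segs: List[Segment], actual_segs: List[Segment],
                 pred_n_tail: str, actual_n_tail: str) -> Dict[str, bool]:
    """Per-entry correctness of the three attributes and the combined topology."""
    number_ok = len(pred_segs) == len(actual_segs)
    # Assumption: segments are compared pairwise in sequence order.
    position_ok = number_ok and all(
        segment_correct(p, a) for p, a in zip(pred_segs, actual_segs)
    )
    n_tail_ok = pred_n_tail == actual_n_tail
    return {
        "number": number_ok,
        "number_plus_position": position_ok,
        "n_tail": n_tail_ok,
        "topology": position_ok and n_tail_ok,  # all three attributes correct
    }
```

Because accuracy is counted per entry, a single misplaced segment makes `number_plus_position` and `topology` false for the whole entry.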
To verify statistically the observed differences in the prediction accuracy ratings for each of the three predicted attributes and the predicted TM topology under the three SP treatments, a non-parametric statistical test, the Friedman test, was used. Since this test begins by transforming the prediction accuracy ratings into ranks for each TM topology prediction method, the resulting significance should not be interpreted directly in terms of the observed level of the prediction ratings. Instead, it should be interpreted in terms of the randomness of the rank-transformed prediction accuracy ratings among the three SP treatments across the 12 selected TM topology prediction methods. Furthermore, if the Friedman test was significant, the non-parametric Wilcoxon Signed-Ranks test was employed to test whether there is a significant difference between removing the SP first or later. The free website VassarStats (http://faculty.vassar.edu/lowry/VassarStats.html) was used to perform all statistical calculations in the study.
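To make the rank transformation concrete, the following sketch computes the standard Friedman chi-square statistic for a table with one row per prediction method and one column per SP treatment. It is an illustration, not the VassarStats implementation: it omits the correction for ties, the function names are hypothetical, and the p-value helper relies on the fact that, for three treatments, the reference chi-square distribution has two degrees of freedom, whose upper tail is exp(-x/2).

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_chi_square(table: Sequence[Sequence[float]]) -> float:
    """Friedman statistic for n rows (methods) and k columns (treatments)."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:  # ranks are assigned within each method
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def p_value_three_treatments(chi_square: float) -> float:
    """Upper-tail chi-square probability for df = 2 (three treatments only)."""
    return math.exp(-chi_square / 2)
```

For example, with four hypothetical methods whose accuracies always rank "remain" lowest and "removed later" highest, the statistic reaches its maximum of n(k - 1) = 8.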
Distribution of the length difference between the predicted and actual SP regions
The graph in Figure 1 shows the distribution of the length difference, in number of amino acid residues, between the predicted and actual SP regions. Apparently, SignalP V2.0.b2-HMM tends to predict longer SP regions (black bars to the right of zero), with slightly over 50% of the SP data set entries predicted to have an SP one residue longer than the actual length. Surprisingly, not one SP region among the 23 protein sequence entries was predicted correctly (i.e., with a length difference of zero). This makes the data set a good test of whether using an erroneously predicted SP region to address the SP issue during topology prediction analysis has implications for the prediction performance of TM topology prediction methods. We applied a t-test for two correlated samples to verify whether the predicted SP regions differ significantly from the actual SP regions. Since the observed p-value of the paired t-test (p = 0.415) is greater than the standard minimum acceptable level of significance of 0.05, we conclude that the observed deviation in the predicted SP regions is not statistically significant; in other words, the distribution of the lengths of the predicted SPs is not statistically different from that of the actual SPs.
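The t-test for two correlated samples mentioned above reduces to a one-sample test on the per-entry differences. The sketch below computes only the t statistic and its degrees of freedom, assuming the compared quantity is the SP length of each entry; the function name is hypothetical, and the p-value, which requires the t distribution, is left to a statistics package.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(predicted: Sequence[float],
                       actual: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples of equal length."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [p - a for p, a in zip(predicted, actual)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

A small |t| relative to the critical value for the given degrees of freedom corresponds to a non-significant difference between predicted and actual lengths.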
Effect of SP in prediction performance
Table 1 reveals that ignoring the presence of SP, that is, leaving the SP untreated, resulted in very low prediction accuracies for all three predicted attributes and the predicted TM topology. In contrast, removing the SP either before or after prediction analysis markedly improved the prediction accuracies, to a degree that depended on the TM topology prediction method used. Furthermore, except for TopPred II and MEMSAT 2, the remaining methods (excluding SPLIT 3.5, TM Finder, and MPEx) obtained prediction accuracies identical to those of the previous study, in which we used the actual SP annotation. This implies that TopPred II and MEMSAT 2 are sensitive to over- and/or under-prediction of the SP region when the "removed first" treatment is applied, which translates into a slight reduction in prediction accuracy. Compared with MEMSAT 2, however, TopPred II is less exposed to this downside, since it is affected only when predicting the N-tail location attribute. This observation suggests that the presence of SP is a good indicator of the sidedness of the TM protein (i.e., outside), since cleavage by the signal peptidase occurs on the exoplasmic side of the membrane. However, extra care should be exercised in applying this concept during topology prediction: if the SP prediction method mispredicts a real TM helix as SP in a TM protein, the predicted N-tail location is also affected and may not be correct. For the other methods, using a mispredicted SP region during topology prediction analysis has no negative effect on prediction performance as long as the deviation is not statistically significant (Figure 1).
Table 1: Prediction accuracies for three predicted attributes and predicted TM topology by 12 selected TM topology prediction methods applied to a signal peptide data set considering three treatments ("remain", "removed first", and "removed later") of the SignalP-HMM predicted signal peptide region.
[Table 1 body not recovered: the cell values are missing from this copy. Rows: TMpred, TopPred II, DAS, TMAP, MEMSAT 2, SOSUI, PRED-TMR2, TMHMM 2.0, HMMTOP 2.0, SPLIT 3.5, TM Finder, MPEx, and Mean. Columns: the three predicted attributes and the predicted TM topology, each under the "remain", "removed first", and "removed later" treatments.]
To confirm the observed differences in prediction accuracy statistically, the prediction accuracy ratings for each of the three predicted attributes and the predicted TM topology under the three SP treatments were analyzed using the Friedman test. This test calculates a version of the Chi-square statistic, which is compared with a Chi-square critical value to decide on statistical significance. Table 2 shows the statistical significance of the Friedman test for each of the predicted attributes and the predicted TM topology. The significance of the test was compared with the standard acceptable level of 1%. All three predicted attributes and the predicted TM topology yielded statistically significant results, with estimated p-values of the Chi-square statistic below this level. This implies that the observed differences in prediction accuracy ratings among the three SP treatments are real and that removing the SP gives higher prediction accuracy. Furthermore, unless the SP is treated properly, its presence significantly influences the outcome of prediction analysis.
Table 2: The observed significance of the Friedman¹ and Wilcoxon Signed-Ranks² tests for each of the three predicted attributes and the predicted TM topology of the signal peptide data set.
[Table 2 body not recovered: the cell values are missing from this copy.]
¹ Re = RF = RL (Re, "remain"; RF, "removed first"; RL, "removed later")
² RF = RL
Although the mean prediction accuracies for the "removed first" and "removed later" SP treatments differ by only around 1-7 percentage points, these differences were further tested for statistical significance. The test results for the three predicted attributes and the predicted TM topology indicated that the differences are not statistically significant (Table 2). Thus, regardless of whether the SP is removed first or later from the query sequence, the same effect on prediction accuracy can be achieved. However, some of the selected TM topology prediction methods, such as TMpred, TopPred II, TMHMM 2.0, TM Finder, and possibly MPEx, showed higher prediction accuracies when the SP was removed later, while the others showed no difference at all or the opposite, except for SPLIT 3.5, which shows a somewhat ambiguous trend.
Sensitivity of prediction methods in detecting the SP region
Regarding the sensitivity of the 12 selected TM topology prediction methods to the SP region, Figure 2 shows the number of hits as a percentage of all entries in the SP data set. A hit is defined as at least one overlapping residue between the actual SP region and the predicted first TM segment. The DAS method is apparently highly sensitive to SP, with 100% hits, while TMHMM 2.0 is the least affected, with the lowest percentage of hits and the highest percentage of misses (~74%). This corresponds well with the prediction accuracies for the number of predicted TMSs and the number of predicted TMSs plus position attributes presented in Table 1 when the SP is left untreated ("remain"). Even the two relatively new prediction methods, TM Finder and MPEx, are susceptible to predicting the SP region as a TM segment, with sensitivity rates above 80%, which shows that applying refined hydrophobicity scales and other similar approaches does not help much in discriminating between an SP and a putative TM segment. The majority of the methods sensitive to SP, i.e., those with around 70% hits or more, are based on either hydropathy analysis (TopPred II and SOSUI) or some sort of refined hydropathy approach (SPLIT 3.5, TM Finder, and MPEx). In contrast, except for PRED-TMR2, which uses a neural network in the pre-processing stage, all model-based methods (HMMTOP 2.0, MEMSAT 2, and TMHMM 2.0) are less affected by SP. These findings highlight once again the tendency of TM topology prediction methods, especially those using a hydropathy-based approach, to predict the SP region as a TM segment, brought about by the similar hydrophobic characteristics of the SP's h-region and the putative first TM segment [Lao and Shimizu, 2001a; 2001b].
Although neural network and model-based methods are less likely to predict any part of the SP region as a TM segment, they remain prone to doing so, to a lesser degree (~26-39%).
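The hit definition used in this section can be expressed as a simple overlap test. The sketch below is illustrative: the function names are hypothetical, segments are assumed to be 1-based inclusive residue ranges, the actual SP is assumed to span residues 1 to `sp_end`, and `None` stands for an entry with no predicted TM segment.

```python
from typing import Optional, Sequence, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end) residue positions


def is_hit(first_tm: Optional[Segment], sp_end: int) -> bool:
    """True if the first predicted TM segment shares a residue with the SP (1..sp_end)."""
    return first_tm is not None and first_tm[0] <= sp_end


def hit_percentage(first_segments: Sequence[Optional[Segment]],
                   sp_ends: Sequence[int]) -> float:
    """Hits as a percentage of all entries in the data set."""
    if len(first_segments) != len(sp_ends) or not sp_ends:
        raise ValueError("need one first segment (or None) per SP annotation")
    hits = sum(is_hit(seg, end) for seg, end in zip(first_segments, sp_ends))
    return 100.0 * hits / len(sp_ends)
```

Entries without a hit count as misses, so the two percentages sum to 100 for each method.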
In summary, the presence of SP in TM protein sequences significantly affected the prediction performance of the 12 selected TM topology prediction methods, reducing prediction accuracy when the SP was not treated properly during topology prediction analysis. Furthermore, the majority of the publicly available TM topology prediction methods tend to detect the SP as a membrane-spanning region. Thus, the implicit assumption that existing prediction methods would likely miss the SP region is mere speculation.
In addition, higher prediction accuracy can be expected if the SP is removed either before or after prediction analysis. As to which of the two SP treatments, "removed first" or "removed later", is appropriate, the choice depends on the TM topology prediction method used, although statistically the improvement in prediction accuracy is the same. Furthermore, model-based TM topology prediction methods tend to be less sensitive to SP than non-model-based methods.
Finally, it is imperative that the SP issue be properly addressed when performing genome-wide analyses of TM protein topology. This study has also demonstrated that the SP, if present, should be removed, which can be done either before or after prediction analysis. Moreover, as long as the predicted SP region does not deviate significantly from the actual one, using the predicted rather than the actual SP region in the SP treatment does not affect the prediction performance of most of the TM topology prediction methods evaluated here. Hence, slight under- and/or over-prediction of the cleavage site of the predicted SP is of less concern when the appropriate SP treatment is applied during topology prediction analysis.