Transmembrane Topology Prediction Methods: A Re-assessment and Improvement by a Consensus Method Using a Dataset of Experimentally-Characterized Transmembrane Topologies

Masami Ikeda1, 2, Masafumi Arai1, Demelo M. Lao1 and Toshio Shimizu1,*




1Department of Electronic Information System Engineering, Faculty of Science and Technology,
Hirosaki University, Hirosaki 036-8561, Japan
2Present address:
Science of Bioresources Program, The United Graduate School of Agricultural Sciences,
Iwate University, Morioka 020-8550, Japan
*Corresponding author.
Phone/Fax: +81-172-39-3638.
E-mail: slsimi@si.hirosaki-u.ac.jp





Edited by E. Wingender; received June 21, 2001; revised September 11, 2001 and accepted September 22, 2001



ABSTRACT

We selected 10 transmembrane (TM) prediction methods (KKD, TMpred, TopPred II, DAS, TMAP, MEMSAT 2, SOSUI, PRED-TMR2, TMHMM 2.0, and HMMTOP 2.0) and re-assessed their prediction performance using a reliable dataset of 122 entries with experimentally-characterized TM topologies. We then improved prediction performance with a consensus prediction method. Performance in both the re-assessment and the consensus prediction was evaluated on four attributes: (i) the number of transmembrane segments (TMSs), (ii) the number of TMSs plus TMS-position, (iii) N-tail location and (iv) TM topology.

We noted that hidden Markov model-based methods outperformed the other methods in individual prediction performance for all four attributes. In addition, the top-performing methods were generally model-based. Among prokaryotic sequences, HMMTOP 2.0 alone ranked first, with prediction accuracies ranging from 64% to 86% across all attributes. Among eukaryotic sequences, however, prediction performance for all attributes was relatively poor compared with that for prokaryotic sequences.

On the other hand, our results showed that our proposed consensus prediction method significantly improved prediction performance, by at least nine percentage points, particularly among prokaryotic sequences for the number of TMSs (84%), the number of TMSs plus TMS-position (80%), and TM topology (74%). Although our consensus prediction method also improved the prediction performance among eukaryotic sequences, the accuracies obtained for all attributes were lower than those obtained for their prokaryotic counterparts, particularly for TM topology.

Keywords: consensus prediction, prediction performance assessment, transmembrane proteins, transmembrane protein database, transmembrane topology prediction



INTRODUCTION

Over the past years, various transmembrane (TM) topology prediction methods have been proposed, each claiming the same or better prediction performance than previously reported ones.

We usually tend to judge the performance of a method by its reported accuracy, which may be problematic [King, 1996]. In particular, these methods were tested on different protein datasets, which makes a fair comparison difficult, if not impossible.

In view of this situation, it is relevant to re-assess the performance of these prediction methods using a single dataset of highly reliable protein entries, so as to arrive at a fair assessment of their respective performance. Moreover, with the advent of genome-scale analyses, a fair assessment of these methods becomes more important, both to determine how much confidence we can attach to the results when applying them to genomic sequences and to give us a better estimate of the protein functions in a particular genome.

From another perspective, such a re-assessment is also one way to determine whether the current level of prediction accuracy needs to be improved.

So far, we have found only a few published papers reporting topology prediction performance based on reliable datasets, and all of them used only a small number of entries [Fasman and Gilbert, 1990; Jayasinghe et al., 2001]. To address this drawback, work to collect well-characterized TM topology data on a larger scale has recently been initiated [Shimizu and Nakai, 1994; Kihara et al., 1998; Ikeda et al., 2000; Moeller et al., 2000].

Now, the next step is to use these reliable datasets with a larger number of entries as a benchmark for re-assessing the performance of the different TM topology prediction methods in one setting. For this study, we used our TransMembrane Protein Database (TMPDB) [Shimizu and Nakai, 1994; Ikeda et al., 2000].

Quite recently, a paper has been published by Moeller et al. (2001), on the evaluation of TM topology prediction methods using their well-characterized TM topology data. We would like to compare our re-assessed results with their results in this paper.

We have selected 10 TM topology prediction methods for re-assessment, namely: KKD [Klein et al., 1985], TMpred [Hofmann and Stoffel, 1993], TopPred II [Claros and von Heijne, 1994], DAS [Cserzo et al., 1997], TMAP [Persson and Argos, 1997], MEMSAT 2 [Jones et al., 1994; McGuffin et al., 2000], SOSUI [Hirokawa et al., 1998], PRED-TMR2 [Pasquier and Hamodrakas, 1999], TMHMM 2.0 [Krogh et al., 2001] and HMMTOP 2.0 [Tusnady and Simon, 2001]. These were chosen because we believed them to be the most often used TM topology prediction methods at the time we conceptualized this study, except for the latest versions of TMHMM and HMMTOP, which we deliberately included in this re-assessment.

At this point, we briefly review the 10 selected TM prediction methods. KKD uses the Kyte and Doolittle hydrophobicity index (1982) and allocates the boundaries of transmembrane segments (TMSs) by a discriminant function. TMpred employs, for scoring, a combination of several weight matrices based on the statistical analysis of TMbase, a database of TM proteins from SWISS-PROT (Release 25). TopPred II applies the "positive-inside rule" [von Heijne, 1986] to evaluate the validity of topology models derived from hydropathy analysis. DAS is based on low-stringency dot-plots of the query sequence against a collection of non-homologous TM proteins using a previously derived scoring matrix. TMAP utilizes the extra information coming from multiple sequence alignments of homologous proteins. MEMSAT 2 uses a set of statistical tables and a dynamic programming algorithm to recognize TM topology models by expectation maximization, and makes use of multiple sequence alignments generated by PSI-BLAST [Altschul et al., 1997]. SOSUI utilizes physicochemical properties of amino acid sequences such as hydrophobicity, charges, and sequence length. TMHMM 2.0 is the latest version of TMHMM [Sonnhammer et al., 1998], which is based on a cyclic hidden Markov model (HMM) with seven types of states: helix core, helix caps on either side, loop on the cytoplasmic side, two loops on the non-cytoplasmic side, and a globular domain state in the middle of each loop. Likewise, HMMTOP 2.0 is an updated version of HMMTOP [Tusnady and Simon, 1998], which determines five structural parts of TM proteins using an HMM formalism. Lastly, PRED-TMR2 is an extension of PRED-TMR [Pasquier et al., 1999], which incorporates a pre-processing stage that uses a simple hierarchical feed-forward artificial neural network to classify proteins as either membrane or non-membrane proteins.

On the other hand, we have also explored the possibility of improving further the TM topology prediction accuracy through a "simple majority voting" or consensus using the individual results for the methods in combination. Promponas et al. (1999) combined the results of seven TM prediction methods (DAS, ISREC-SAPS [Brendel et al., 1992], PHD [Rost et al., 1996], SOSUI, TMpred, TopPred II, and PRED-TMR) to predict the location of TMS by using a joint prediction histogram. They simply predicted an amino acid residue to be inside a TM region if three or more methods predicted it as part of a TMS domain. However, in our case, we used "special" criteria to predict the segment by consensus. Nilsson et al. (2000) reported that a considerable improvement in prediction accuracy of the number of TMSs is achieved by consensus prediction among four or five methods (TopPred II, PHD, MEMSAT, HMMTOP 1.1, and TMHMM 1.0). However, they used only a limited number of Escherichia coli TM proteins.

In this paper, we report the re-assessed TM prediction performance of the selected 10 TM topology prediction methods using our reliable dataset with a larger number of entries than the dataset originally used by these methods, and the improvement of prediction performance using our proposed consensus prediction method. We also highlight here that the prediction accuracies reported are entry-based and not the usual segment-based as commonly used to assess the prediction performance among the 10 TM topology prediction methods.


MATERIALS AND METHODS

Dataset Construction

We searched MEDLINE [Wheeler et al., 2001] for the keywords "transmembrane" and "topology". From the search results, we identified 794 published papers as relevant and collected a copy of each. Then, by manually reviewing the contents of each collected copy, we extracted 145 articles that reported TM topology models based on experimental evidence.

Since the articles lack the information necessary for a complete annotation of the sequences, we crosschecked the sequences in question against the SWISS-PROT [Bairoch and Apweiler, 2000], PIR [Barker et al., 2001], PRF (Protein Research Foundation), or PDB [Berman et al., 2000] databases, from which the remaining information was extracted. By combining this extracted information with the information contained in the published articles, we constructed our transmembrane protein database (TMPDB) [Shimizu and Nakai, 1994; Kihara et al., 1998; Ikeda et al., 2000], following the SWISS-PROT format.

Next, we subjected TMPDB (145 entries) to a sequence similarity check (<30%) using CLUSTALW [Thompson et al., 1994], and finally obtained a non-redundant dataset of 122 entries, of which 70 are prokaryotic and 52 are eukaryotic. Tables 1 and 2 list the transmembrane proteins we used in this study for prokaryotes and eukaryotes, respectively.


Table 1: List of the 70 prokaryotic entries in our non-redundant dataset.

 
ID            Accession No.   #TMS   N-tail

CVAA_ECOLI    P22519          1      in
DIVB_BACSU    P16655          1      out
EXBD_ECOLI    P18784          1      in
FTSL_ECOLI    P22187          1      in
HOKC_ECOLI    P22982          1      in
LCND_LACLA    Q00565          1      in
LHB5_RHOAC    P26790          1      in
MOTB_ECOLI    P09349          1      in
PBPB_ECOLI    P02919          1      in
PTND_ECOLI    P08188          1      in
TOLR_ECOLI    P05829          1      in
TONB_SALTY    P25945          1      in
TOXR_VIBCH    P15795          1      in
VG1_BPFD      P03655          1      in
CPXA_ECOLI    P08336          2      in
CYOA_ECOLI    P18400          2      out
ENVZ_ECOLI    P02933          2      in
FTSH_ECOLI    P28691          2      in
IMM_BPT4      P08986          2      in
LEP4_ECOLI    P25960          2      out
MCP1_ECOLI    P02942          2      in
PHOR_ECOLI    P08400          2      in
CYOD_ECOLI    P18403          3      in
EXBB_ECOLI    P18783          3      out
KDGL_ECOLI    P00556          3      in
SECE_ECOLI    P16920          3      in
TOLQ_ECOLI    P05828          3      out
BLAR_BACLI    P12287          4      out
DSBB_ECOLI    P30018          4      in
FIXL_RHIME    P10955          4      in
IMMA_CITFR    P05701          4      in
KDPD_ECOLI    P21865          4      in
LSPA_ECOLI    P00804          4      in
MOTA_ECOLI    P09348          4      in
VRXB_LAMBD    P03759          4      in
YSCU_YERPS    P40300          4      in
ATP6_ECOLI    P00855          5      out
CYOC_ECOLI    P18402          5      in
DHG_ECOLI     P15877          5      in
HISM_SALTY    P02912          5      out
HISQ_SALTY    P02913          5      out
MALG_ECOLI    P07622          6      in
OPPB_SALTY    P08005          6      in
OPPC_SALTY    P08006          6      in
PTMA_ECOLI    P00550          6      in
PTNC_ECOLI    P08187          6      in
SECD_ECOLI    P19673          6      in
A39512*       A39512          7      in
BACR_HALHA    P02945          7      out
CYDA_ECOLI    P11026          7      in
CYOE_ECOLI    P18404          7      in
PROW_ECOLI    P14176          7      out
CYB_RHOSH     Q02761          8      in
CYDB_ECOLI    P11027          8      in
DMSC_ECOLI    P18777          8      out
LCRD_YERPE    P31487          8      in
2221300A**    2221300A        9      in
CITN_KLEPN    P31602          9      in
RHAT_ECOLI    P27125          10     out
SECY_ECOLI    P03844          10     in
MTR_ECOLI     P22306          11     in
ARB1_ECOLI    P08691          12     in
CODB_ECOLI    P25525          12     in
GLPT_ECOLI    P08194          12     in
KGTP_ECOLI    P17448          12     in
LACY_ECOLI    P02920          12     in
LYSP_ECOLI    P25737          12     in
MELB_ECOLI    P02921          12     in
PUCC_RHOCA    P23462          12     in
TCR2_ECOLI    P02980          12     in

 
* PIR (Release 68.0) database entries.
** PRF database (Release 77) entry.
The remaining entries are from SWISS-PROT database (Release 39.0).




Table 2: List of the 52 eukaryotic entries in our non-redundant dataset.

 
ID            Accession No.   #TMS   N-tail

A41766*       A41766          1      in
AMD2_XENLA    P12890          1      out
BCS1_YEAST    P32839          1      out
COX4_BOVIN    P00423          1      out
COXD_BOVIN    P07471          1      out
COXH_BOVIN    P04038          1      out
COXK_BOVIN    P07470          1      out
COXO_BOVIN    P00430          1      out
COXQ_BOVIN    P10175          1      out
CP5A_CANTR    P10615          1      in
CYB5_RAT      P00173          1      in
GHR_HUMAN     P10912          1      out
JQ2019*       JQ2019          1      out
MPRD_BOVIN    P11456          1      out
NRAM_IAPUE    P03468          1      in
OCH1_YEAST    P31755          1      in
OSTB_YEAST    P33767          1      out
PGDR_MOUSE    P05622          1      out
RIB1_HUMAN    P04843          1      out
RIB2_HUMAN    P04844          1      out
VNB_INBLE     P06817          1      out
COX2_BOVIN    P00404          2      in
SCAA_RAT      P37089          2      in
STS_HUMAN     P08842          2      out
DHSD_BOVIN    Q95123          3      out
CXA1_RAT      P08050          4      in
IM17_YEAST    P39515          4      out
IM23_YEAST    P32897          4      out
MYPR_HUMAN    P06905          4      in
GLK2_RAT      P42260          5      out
ADT2_YEAST    P18239          6      out
CNG1_BOVIN    Q00194          6      in
MPCP_BOVIN    P12234          6      out
NO26_SOYBN    P08995          6      in
UCP1_RAT      P04633          6      in
COX3_BOVIN    P00415          7      out
LSHR_RAT      P16235          7      out
V2R_HUMAN     P30518          7      out
A27717*       A27717          8      in
ATN1_PIG      P05024          8      in
HMDH_CRIGR    P00347          8      in
SE12_CAEEL    P52166          8      in
TAP1_HUMAN    Q03518          8      in
G6PT_HUMAN    P35575          9      out
CLC1_HUMAN    P35523          10     in
FUR4_YEAST    P05316          10     in
COX1_BOVIN    P00396          12     out
GTR1_HUMAN    P11166          12     out
MOT1_RAT      P53987          12     in
NNTM_BOVIN    P11024          14     out
MRP1_HUMAN    P33527          17     out
CINA_ELEEL    P02719          24     in

 
* PIR (Release 68.0) database entries.
The remaining entries are from SWISS-PROT database (Release 39.0).


Of these 122 non-redundant entries, the topology models of 78 were determined by gene fusion analysis, while those of the remaining entries were determined by X-ray diffraction (16 entries), fluoroimmunoassay (7), two-dimensional crystals (1) or other experimental methods (11).

Based on the distribution of the number of TMS shown in Figure 1, we could claim that our dataset of experimentally characterized TM topology roughly covers all the TM proteins having different numbers of TMSs in the proteome.


Figure 1: Distribution of the number of transmembrane segments in our non-redundant and reliable transmembrane protein dataset. Grey bars represent prokaryotic, white bars eukaryotic entries. The inset chart shows the numbers of prokaryotic and eukaryotic entries in total.



Re-assessment of Transmembrane Topology Prediction Methods

We re-assessed 10 TM topology prediction methods, namely: KKD, TMpred, TopPred II, DAS, TMAP, MEMSAT 2, SOSUI, PRED-TMR2, TMHMM 2.0, and HMMTOP 2.0. Except for KKD, these are accessible through the Internet and can therefore analyze query sequences online. We used default settings for all the methods, single-sequence mode for TMAP, and selected only the first predicted topology model for TopPred II. For KKD, we wrote a program in C following the algorithm described in the original paper of Klein et al. (1985).

With regard to the re-assessment parameters, we considered four attributes, namely: the number of TMSs, the number of TMSs plus TMS-position, the location of the N-terminal loop region/N-tail (cytoplasm or periplasm), and TM topology (i.e., the number of TMSs plus TMS-position, and N-tail location).

Since each of these methods has a different output format, we devised a scheme to standardize the output and thereby facilitate checking of the prediction accuracy. Specifically, we made another copy of our dataset and appended to each sequence entry the prediction results of each method. The appended lines for each sequence entry give the name of the TM topology prediction method, the number of predicted TMSs, the positions of each predicted TMS, and the predicted N-tail location. The prediction accuracy was then determined automatically by a program written in C.

Furthermore, to decide whether a TMS-position prediction is correct, we formulated a criterion in which the center position of the predicted segment is compared with the center position of the corresponding actual segment. If the distance between the two center positions is less than or equal to 11 residues, the prediction is counted as correct; otherwise it is counted as wrong. For multi-spanning sequences, a wrong prediction for at least one of the predicted segments results in a wrong prediction for the TMS-position attribute, since accuracy is measured on a per-sequence basis. The predicted number of TMSs and the predicted N-tail location are counted as correct if they match the corresponding actual attributes.
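The criterion above can be sketched as a short Python function. This is only an illustrative sketch, not the C program used in the study: the function names are hypothetical, segments are assumed to be given as 1-based inclusive (start, end) pairs, and predicted segments are assumed to be paired with actual segments in sequence order.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive (assumed format)


def center(segment: Segment) -> float:
    """Center position of a segment."""
    return (segment[0] + segment[1]) / 2.0


def entry_position_correct(predicted: List[Segment],
                           actual: List[Segment],
                           max_distance: int = 11) -> bool:
    """Per-entry check for the 'number of TMSs plus TMS-position' attribute.

    The entry is correct only if the number of segments matches and every
    predicted center lies within max_distance residues of the actual center.
    Pairing segments in sequence order is an assumption of this sketch.
    """
    if len(predicted) != len(actual):
        return False
    pairs = zip(sorted(predicted), sorted(actual))
    return all(abs(center(p) - center(a)) <= max_distance for p, a in pairs)
```

For example, with actual segments at (10, 30) and (50, 70), a prediction of (12, 34) and (58, 80) passes (center distances 3 and 9), whereas a single misplaced segment makes the whole entry wrong.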

On the other hand, 11 entries with cleaved signal peptides and eight entries with mitochondria-targeting peptides were treated differently. The signal peptide region or the mitochondria-targeting peptide was removed from these sequences prior to prediction by the 10 TM topology prediction methods.



Consensus Transmembrane Topology Prediction Method

We first tried several odd-numbered combinations of the 10 selected TM topology prediction methods, using a minimum of three and a maximum of nine methods per combination. We selected the best combinations based on the highest accuracies obtained for the number of TMSs plus TMS-position attribute. In predicting the N-tail location, however, we tested only combinations of six methods (TMpred, TopPred II, TMAP, MEMSAT 2, TMHMM 2.0, and HMMTOP 2.0), since the other four are incapable of predicting the TM orientation. We then determined the optimum combinations for each attribute separately for eukaryotic and prokaryotic sequences.
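As a rough illustration of the size of this search, the odd-sized combinations can be enumerated with Python's standard library. The function name is hypothetical, and the sketch assumes that every combination of three, five, seven, or nine methods is a candidate.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple

METHODS = ["KKD", "TMpred", "TopPred II", "DAS", "TMAP",
           "MEMSAT 2", "SOSUI", "PRED-TMR2", "TMHMM 2.0", "HMMTOP 2.0"]


def odd_sized_combinations(methods: Sequence[str],
                           min_size: int = 3,
                           max_size: int = 9) -> Iterator[Tuple[str, ...]]:
    """Yield every combination with an odd number of methods (3, 5, 7, 9)."""
    for size in range(min_size, max_size + 1, 2):
        yield from combinations(methods, size)


n_candidates = sum(1 for _ in odd_sized_combinations(METHODS))
```

Under this assumption, the 10 methods give 120 + 252 + 120 + 10 = 502 candidate combinations, each of which would be scored on the dataset to find the one with the highest accuracy.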

For predicting the number of TMSs plus TMS-position, we scanned the standardized outputs of the methods comprising a particular combination simultaneously, from the N-terminus towards the C-terminus (Figure 2a). The predicted segment first encountered during scanning serves as the reference segment. From the center position of this segment, a window of 12 residues for prokaryotic sequences, or eight residues for eukaryotic sequences, is extended towards the C-terminus. Simple majority voting (by count) is then performed among all predicted segments (including the reference segment) with center positions within this window. If the vote results in a majority (>50%), TMS prediction proceeds as described below; otherwise, no prediction is made (Figure 2b). When a majority is obtained, we compute the average of the center positions of all the predicted segments (including the reference segment) that voted. This average becomes the center position of the segment predicted by consensus. We then extend from this position by 10 residues towards both the N- and C-termini to determine the start and end positions of the predicted segment (Figure 2c). These steps are repeated until all the predicted segments have been scanned up to the C-terminus, with the segments already used in the voting masked to prevent them from being included again in the next round of voting (Figure 2d).
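The voting procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name is hypothetical; segments are 1-based inclusive (start, end) pairs; the reference segment is taken to be the unmasked segment with the smallest start position; a majority means that more than half of the methods in the combination have a segment in the window; voting segments are masked whether or not a majority is reached; and the averaged center is rounded to the nearest residue.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive (assumed format)


def consensus_segments(predictions: List[List[Segment]],
                       window: int = 12,
                       half_width: int = 10) -> List[Segment]:
    """Consensus TMS prediction by simple majority voting.

    predictions: one list of predicted segments per method.
    window: 12 residues (prokaryotes) or 8 residues (eukaryotes).
    """
    n_methods = len(predictions)
    # Pool all segments as (start, center, method index), ordered N to C.
    pool = sorted((start, (start + end) / 2.0, m)
                  for m, segs in enumerate(predictions)
                  for start, end in segs)
    masked = [False] * len(pool)
    consensus: List[Segment] = []
    for i, (_, ref_center, _) in enumerate(pool):
        if masked[i]:
            continue  # already used in an earlier round of voting
        # Voters: unmasked segments whose center lies in the window that
        # extends from the reference center towards the C-terminus.
        voters = [j for j, (_, c, _) in enumerate(pool)
                  if not masked[j] and ref_center <= c <= ref_center + window]
        for j in voters:
            masked[j] = True
        agreeing_methods = {pool[j][2] for j in voters}
        if 2 * len(agreeing_methods) > n_methods:  # majority (>50%)
            mean_center = sum(pool[j][1] for j in voters) / len(voters)
            mid = int(mean_center + 0.5)  # round to nearest residue
            consensus.append((max(1, mid - half_width), mid + half_width))
    return consensus
```

With three methods predicting, say, [(10, 30), (60, 80)], [(12, 32), (100, 120)] and [(15, 35), (62, 84)], the first two rounds reach a majority and yield consensus segments of 21 residues each, while the isolated segment at (100, 120) receives only one vote and is dropped.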

When two segments predicted by the consensus method overlap, the overlapping ends of both segments are shrunk by half the number of overlapping residues. If the number of overlapping residues is odd, this leaves a single-residue loop region between the two segments; otherwise, the two segments lie in tandem.
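One way to read this rule is sketched below in Python. The function name is hypothetical, and the sketch assumes that "half the number of overlapping residues" is rounded up when the overlap is odd, which is what produces the single-residue loop.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive (assumed format)


def resolve_overlaps(segments: List[Segment]) -> List[Segment]:
    """Shrink the facing ends of overlapping neighboring segments."""
    fixed = [list(seg) for seg in sorted(segments)]
    for left, right in zip(fixed, fixed[1:]):
        overlap = left[1] - right[0] + 1  # number of shared residues
        if overlap > 0:
            trim = (overlap + 1) // 2  # half the overlap, rounded up
            left[1] -= trim    # shorten the C-terminal end of the first
            right[0] += trim   # shorten the N-terminal end of the second
    return [(start, end) for start, end in fixed]
```

For instance, segments (1, 21) and (17, 37) overlap by five residues and become (1, 18) and (20, 37), leaving residue 19 as a loop; segments (1, 21) and (18, 38) overlap by four residues and become (1, 19) and (20, 38), which are in tandem.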


Figure 2: Illustration of the consensus prediction algorithm using simple majority voting (a combination of three methods in this example). The prediction outputs are scanned simultaneously from the N-terminus using a sliding column window one residue long to find the reference segment, i.e. the predicted segment first encountered (a). A window of 12 residues for prokaryotic or eight for eukaryotic sequences is extended towards the C-terminus from the center position of the reference segment, and majority voting is performed among predicted segments (including the reference segment) with center positions within this window (b). When a majority vote is obtained, the average of the center positions of the predicted segments that voted is computed, and the edges of the segment predicted by consensus are determined by extending 10 residues in both the N- and C-terminal directions from this average center position (c). Then, the segments used in the voting are masked, and the outputs are scanned again for the next round of voting (d).

The prediction for the number of TMSs is performed by using the best combinations determined for predicting the number of TMSs plus TMS-position in the two sets of sequence entries separately.

For predicting the N-tail location, simple majority voting is employed: the number of methods agreeing on each location is counted, and the location with the highest count becomes the location predicted by consensus. By applying the two consensus combinations used for the two attributes (in no particular order), number of TMSs plus TMS-position and N-tail location, we can predict the TM topology.
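The N-tail vote can be sketched in a few lines of Python. The function name and the "in"/"out" labels are assumptions of this sketch (the labels follow the N-tail column of Tables 1 and 2), and returning None on a tie is a choice made here; with an odd number of methods and two possible locations, a tie cannot occur.

```python
from collections import Counter
from typing import List, Optional


def consensus_n_tail(votes: List[str]) -> Optional[str]:
    """Return the N-tail location predicted by most methods.

    votes: one label per method, e.g. "in" or "out".
    Returns None if there are no votes or the top count is tied.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent locations
    return ranked[0][0]
```

For example, three methods voting "in", "out", "in" give a consensus N-tail location of "in"; the consensus topology then combines this location with the segments obtained from the separate TMS-position consensus.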

The same procedures, as in the evaluation section, are applied to determine whether the prediction is correct or wrong.

Furthermore, to test objectively the stability of both the prediction performance and the optimal combination of prediction methods of the consensus method, we applied a jack-knife procedure to our dataset. We divided our non-redundant dataset into five parts: two parts with 15 entries and three parts with 14 entries for prokaryotes, and three parts with 10 entries and two parts with 11 entries for eukaryotes. Training is carried out on four of the five parts, and the remaining part is used as a test set to assess the prediction performance. This process is repeated five times for both prokaryotic and eukaryotic sequences.


RESULTS AND DISCUSSION


Re-assessment of Transmembrane Topology Prediction Methods

All prediction accuracies reported in this study are calculated on a per-entry basis. We believe that, to assess the actual performance of a prediction method, the query sequence should be treated as a whole unit rather than as separate segments. Thus, we consider our re-assessment stringent, particularly for the number of TMSs plus TMS-position attribute, since a wrong prediction of even one segment results in a wrong prediction for the whole sequence.

Table 3 shows the overall prediction performance of each of the 10 selected TM prediction methods applied to our reliable non-redundant dataset, while Tables 4 and 5 show the prediction performance specifically for prokaryotes and eukaryotes, respectively.



Table 3: Overall prediction accuracies of the 10 selected methods.

Prediction accuracies (%)

Methods       #TMS                 #TMS&position        N-tail location      TM topology

KKD           55.7 (88.6, 42.5)    50.8 (88.6, 35.6)    -                    -
TMpred        55.7 (71.4, 49.4)    50.0 (71.4, 41.4)    61.5 (51.4, 65.5)    34.4 (40.0, 32.3)
TopPred II    59.8 (71.4, 55.2)    52.5 (71.4, 44.8)    72.1 (65.7, 74.7)    41.0 (45.7, 39.1)
DAS           37.7 (45.7, 34.5)    32.0 (45.7, 26.4)    -                    -
TMAP          54.1 (77.1, 44.8)    45.1 (74.3, 33.3)    54.1 (54.3, 54.0)    27.9 (42.9, 21.8)
MEMSAT 2      62.3 (82.9, 54.0)    58.2 (82.9, 48.3)    67.2 (60.0, 70.1)    45.1 (54.3, 41.4)
SOSUI         56.6 (74.3, 49.4)    52.5 (74.3, 43.7)    -                    -
PRED-TMR2     49.2 (71.4, 40.2)    46.7 (71.4, 36.8)    -                    -
TMHMM 2.0     63.9 (80.0, 57.5)    59.0 (80.0, 50.6)    73.8 (65.7, 77.0)    48.4 (62.9, 42.5)
HMMTOP 2.0    68.0 (77.1, 64.4)    62.3 (77.1, 56.3)    77.9 (68.6, 81.6)    54.1 (57.1, 52.9)
(a , b): prediction accuracies for 35 single-spanning entries (a) and 87 multi-spanning entries (b), respectively.
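Because accuracy is computed per entry, each overall figure in Table 3 is the entry-weighted combination of the single-spanning and multi-spanning figures. The following Python sketch illustrates this relation; the function name is hypothetical, and the correct-entry counts are recovered by rounding, which assumes the tabulated percentages are rounded to one decimal place.

```python
def overall_accuracy(acc_single: float, n_single: int,
                     acc_multi: float, n_multi: int) -> float:
    """Combine per-class accuracies (%) into a per-entry overall accuracy (%)."""
    # Recover the whole number of correctly predicted entries in each class.
    correct_single = round(acc_single / 100.0 * n_single)
    correct_multi = round(acc_multi / 100.0 * n_multi)
    return 100.0 * (correct_single + correct_multi) / (n_single + n_multi)


# KKD, number of TMSs: 88.6% of 35 single-spanning entries and
# 42.5% of 87 multi-spanning entries.
kkd_overall = round(overall_accuracy(88.6, 35, 42.5, 87), 1)
```

For KKD this corresponds to 31 + 37 = 68 correct entries out of 122, i.e. 55.7%, in agreement with the overall value in the table.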



Table 4: Prediction accuracies of the 10 selected methods for prokaryotic entries.

Prediction accuracies (%)

Methods       #TMS                 #TMS&position        N-tail location      TM topology

KKD           61.4 (92.9, 53.6)    55.7 (92.9, 46.4)    -                    -
TMpred        60.0 (85.7, 53.6)    54.3 (85.7, 46.4)    57.1 (42.9, 60.7)    35.7 (35.7, 35.7)
TopPred II    67.1 (78.6, 64.3)    60.0 (78.6, 55.4)    84.3 (85.7, 83.9)    55.7 (64.3, 53.6)
DAS           41.4 (57.1, 37.5)    34.3 (57.1, 28.6)    -                    -
TMAP          57.1 (92.9, 48.2)    45.7 (85.7, 35.7)    55.7 (42.9, 58.9)    25.7 (28.6, 25.0)
MEMSAT 2      62.9 (64.3, 62.5)    57.1 (64.3, 55.4)    72.9 (57.1, 76.8)    50.0 (42.9, 51.8)
SOSUI         58.6 (78.6, 53.6)    51.4 (78.6, 44.6)    -                    -
PRED-TMR2     51.4 (78.6, 44.6)    48.6 (78.6, 64.3)    -                    -
TMHMM 2.0     68.6 (71.4, 67.9)    62.9 (71.4, 60.7)    74.3 (50.0, 80.4)    52.9 (50.0, 53.6)
HMMTOP 2.0    75.7 (85.7, 73.2)    70.0 (85.7, 66.1)    85.7 (57.1, 92.9)    64.3 (57.1, 66.1)
(a , b): prediction accuracies for 14 single-spanning entries (a) and 56 multi-spanning entries (b), respectively.



Table 5: Prediction accuracies of the 10 selected methods for eukaryotic entries.

Prediction accuracies (%)

Methods       #TMS                 #TMS&position        N-tail location      TM topology

KKD           48.1 (85.7, 22.6)    44.2 (85.7, 16.1)    -                    -
TMpred        50.0 (61.9, 41.9)    44.2 (61.9, 32.3)    67.3 (57.1, 74.2)    32.7 (42.9, 25.8)
TopPred II    50.0 (66.7, 38.7)    42.3 (66.7, 25.8)    55.8 (52.4, 58.1)    21.2 (33.3, 12.9)
DAS           32.7 (38.1, 29.0)    28.8 (38.1, 22.6)    -                    -
TMAP          50.0 (66.7, 38.7)    44.2 (66.7, 29.0)    51.9 (61.9, 45.2)    30.8 (52.4, 16.1)
MEMSAT 2      61.5 (95.2, 38.7)    59.6 (95.2, 35.5)    59.6 (61.9, 58.1)    38.5 (61.9, 22.6)
SOSUI         53.8 (71.4, 41.9)    53.8 (71.4, 41.9)    -                    -
PRED-TMR2     46.2 (66.7, 32.3)    44.2 (66.7, 29.0)    -                    -
TMHMM 2.0     57.7 (85.7, 38.7)    53.8 (85.7, 32.3)    73.1 (76.2, 71.0)    42.3 (71.4, 22.6)
HMMTOP 2.0    57.7 (71.4, 48.4)    51.9 (71.4, 38.7)    67.3 (76.2, 61.3)    40.4 (57.1, 29.0)
(a , b): prediction accuracies for 21 single-spanning entries (a) and 31 multi-spanning entries (b), respectively.

Comparing the performance of the 10 selected methods in predicting the number of TMSs, HMMTOP 2.0 consistently ranked first with an overall accuracy of 68%, except on eukaryotic entries, where it tied with TMHMM 2.0 for second place behind MEMSAT 2. However, this accuracy level is lower than we had expected. In addition, prokaryotic sequences are predicted better (76%) than their eukaryotic counterparts (62%). TMHMM 2.0 was the second-best method with an overall accuracy of 64%, and tied with HMMTOP 2.0 among eukaryotic entries. DAS appears to be the lowest-performing method for this attribute, both in overall accuracy and in accuracy for each type of sequence. A probable reason for its poor performance is that it tends to overpredict the number of segments by predicting more short segments than usual; since we computed accuracy on a per-entry basis, the likelihood of a wrong prediction is then high.

For correctly predicting the number of TMSs plus TMS-position, the same pattern is observed: HMMTOP 2.0 is the top performer, although its overall accuracy decreases to 62%, and DAS is the lowest performer. Among eukaryotic entries, the top spot is occupied by MEMSAT 2 with an accuracy of 60%.

In predicting the TM topology (i.e., the number of TMSs plus TMS-position, and N-tail location), only six of the 10 selected methods are capable of predicting the TM orientation: TMpred, TopPred II, TMAP, MEMSAT 2, TMHMM 2.0, and HMMTOP 2.0. Among the six methods, HMMTOP 2.0 still shows the best performance with an overall accuracy of 54%. Again, among eukaryotic entries, HMMTOP 2.0 ranks second to TMHMM 2.0, and the accuracy level falls below 50%.

Now, the figures in parentheses in Tables 3, 4 and 5 show the prediction accuracies for single- and multi-spanning sequences. Overall, the performance for single-spanning sequences is apparently superior to the multi-spanning ones both for correctly predicting the number of TMSs and TMSs plus TMS position attributes. This observation is more obvious among the eukaryotic sequences. However, in the case of the N-tail location attribute, the opposite is true, especially among the prokaryotic sequences.

Comparing our results with those obtained by Moeller et al. (2001), HMMTOP 2.0, TMHMM 2.0 and MEMSAT 2 still have higher prediction accuracies than the other methods in both studies, although the accuracies of DAS and TopPred II are higher by 0.7% and 34%, respectively, even when the number of TMSs plus TMS-position attribute is evaluated on a per-sequence basis. In another case, for the number of TMSs plus TMS-position attribute, Moeller et al. report overall prediction accuracies of only 26% and 36% for TopPred II and SOSUI, respectively, while we obtain 53% for both methods. These distinct differences may be due to the different datasets used; our non-redundant set is approximately two thirds the size of the dataset of Moeller et al. Although it is difficult to make a direct comparison since their evaluation dataset is not available, we surmise that the proportion of prokaryotic and eukaryotic sequences differs considerably between the two datasets. The dataset of Moeller et al. probably contains more eukaryotic proteins than prokaryotic ones, resulting in lower combined prediction accuracies. As Tables 4 and 5 show, the TM topology of eukaryotic sequences appears more difficult to predict than that of prokaryotic ones.

In summary, HMM-based algorithms seem to be superior in prediction performance among the 10 selected methods. When predicting the number of TMSs, and the number of TMSs plus TMS-position among eukaryotic entries, however, MEMSAT 2 prevails over the HMM-based methods. Generally speaking, MEMSAT 2 also falls within the model-based algorithm category since it utilizes topological models during prediction. Thus, it can be said that model-based methods perform better than non-model based ones.

In addition, we have noted the following prediction errors among some of the 10 selected methods. MEMSAT 2 failed to perform prediction among seven entries in our dataset (5 prokaryotic and 2 eukaryotic). SOSUI and PRED-TMR2 predicted eight entries (2 prokaryotic and 6 eukaryotic) in our dataset as soluble protein, and six entries for TMHMM 2.0 (3 prokaryotic and 3 eukaryotic). HMMTOP 2.0, TMAP, and KKD predicted zero TMS for one, two, and three entries (all eukaryotic), respectively.


Consensus Transmembrane Topology Prediction

Table 6 shows the results of our consensus prediction method (the best combinations and their prediction accuracies) together with the results of the best-performing methods among the 10 TM topology prediction methods re-assessed. The table shows a considerable improvement in prediction performance, as indicated by the increase in percentage points of the prediction accuracies for all the attributes considered in this study, except for the N-tail location attribute in eukaryotic sequences. We would also like to point out that the results of the re-assessment of the individual methods had no bearing on deciding which methods a particular combination would comprise. Instead, the best combination of methods for a particular attribute was selected purely as the combination, among all the possible combinations of the 10 selected methods, that gives the highest accuracy. Thus, we were surprised to see DAS, the lowest-performing prediction method, as part of the consensus prediction (the best combination) for prokaryotic sequences. It is possible that the tendency of DAS to predict more short segments than usual caused these segments to be easily included in the voting within the allowed window. As regards the other TM prediction methods included in the consensus prediction, aside from TMHMM 2.0 and HMMTOP 2.0, these methods are either within the top five in prediction performance or have prediction accuracies close to the accuracy obtained by the best-performing method.


Table 6: Comparison of prediction performance between the top-performing individual method and the proposed consensus prediction method for prokaryotic (70 sequences, comprising 14 single-spanning and 56 multi-spanning) and eukaryotic (52 sequences, comprising 21 single-spanning and 31 multi-spanning) sequences.

Prediction accuracies (%)

Methods                     #TMS                  #TMS&position         N-tail location       TM topology
Prokaryotes
  Consensus                 84.3a (92.9, 82.1)    80.0a (92.9, 76.8)    90.0c (85.7, 91.1)    74.3* (78.6, 73.2)
  Individual method (best)  75.7e (85.7, 73.2)    70.0e (85.7, 66.1)    85.7e (57.1, 92.9)    64.3e (57.1, 66.1)
Eukaryotes
  Consensus                 71.2b (90.5, 58.1)    63.5b (90.5, 45.2)    73.1d (76.2, 71.0)    46.2** (66.7, 25.8)
  Individual method (best)  61.5f (95.2, 38.7)    59.6f (95.2, 35.5)    73.1g (76.2, 71.0)    42.3g (71.4, 22.6)

a combination for prokaryotes: TMpred, DAS, MEMSAT 2, TMHMM 2.0, and HMMTOP 2.0; voting window size = 12 residues.
b combination for eukaryotes: KKD, TMAP, MEMSAT 2, SOSUI, and HMMTOP 2.0; voting window size = 8 residues.
c combination for prokaryotes (N-tail location): TopPred II, MEMSAT 2, and HMMTOP 2.0.
d combination for eukaryotes (N-tail location): TMpred, TMHMM 2.0, and HMMTOP 2.0.
e HMMTOP 2.0.
f MEMSAT 2.
g TMHMM 2.0.
* combination of both a and c.
** combination of both b and d.
(i , j): prediction accuracies for single-spanning entries (i) and multi-spanning entries (j), respectively
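
As a quick consistency check on Table 6, each overall accuracy should equal the entry-weighted combination of the single- and multi-spanning accuracies given in parentheses. The sketch below illustrates this for the consensus #TMS column; the function name `pooled_accuracy` is introduced here for illustration only.

```python
def pooled_accuracy(acc_single, n_single, acc_multi, n_multi):
    """Combine subset accuracies (%) into the overall accuracy (%)."""
    correct = acc_single / 100 * n_single + acc_multi / 100 * n_multi
    return 100 * correct / (n_single + n_multi)


# Prokaryotes, consensus, #TMS: (92.9, 82.1) over 14 single- and 56 multi-spanning entries.
prok = pooled_accuracy(92.9, 14, 82.1, 56)
# Eukaryotes, consensus, #TMS: (90.5, 58.1) over 21 single- and 31 multi-spanning entries.
euk = pooled_accuracy(90.5, 21, 58.1, 31)

print(round(prok, 1), round(euk, 1))  # 84.3 71.2, matching the overall values in Table 6
```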


For the number of correctly predicted TMSs, prediction accuracy increases by nine percentage points, from 76% to 84% (75.7% to 84.3% in Table 6), for prokaryotic sequences (combination: TMpred, DAS, MEMSAT 2, TMHMM 2.0, and HMMTOP 2.0). Likewise, an increase of 10 percentage points, from 62% to 71%, is noted for eukaryotic sequences (combination: KKD, TMAP, MEMSAT 2, SOSUI, and HMMTOP 2.0).

For the number of TMSs plus TMS-position attribute, a significant increase of 10 percentage points in prediction accuracy, from 70% to 80%, is observed for prokaryotic sequences (combination: TMpred, DAS, MEMSAT 2, TMHMM 2.0 and HMMTOP 2.0). For eukaryotic sequences (combination: KKD, TMAP, MEMSAT 2, SOSUI and HMMTOP 2.0), an increase of four percentage points, from 60% to 64%, is obtained; the resulting accuracy is slightly higher than the best individual prediction accuracy for the number of TMSs alone (62%).

In predicting the TM topology, we observed a significant increase of 10 percentage points for prokaryotic sequences, from 64% to 74%, whereas an increase of merely four percentage points (from 42% to 46%) was obtained for eukaryotic sequences.
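
The percentage-point gains quoted above are computed from the unrounded accuracies in Table 6, which is why, for example, 76% to 84% is reported as nine points (75.7% to 84.3% is 8.6 points). A minimal sketch of that arithmetic, with the dictionary layout and names chosen here for illustration:

```python
# Accuracies (%) from Table 6: (consensus, best individual method).
TABLE6 = {
    ("prokaryotes", "#TMS"): (84.3, 75.7),
    ("prokaryotes", "#TMS&position"): (80.0, 70.0),
    ("prokaryotes", "N-tail location"): (90.0, 85.7),
    ("prokaryotes", "TM topology"): (74.3, 64.3),
    ("eukaryotes", "#TMS"): (71.2, 61.5),
    ("eukaryotes", "#TMS&position"): (63.5, 59.6),
    ("eukaryotes", "N-tail location"): (73.1, 73.1),
    ("eukaryotes", "TM topology"): (46.2, 42.3),
}


def gain_points(consensus, individual):
    """Gain of the consensus method in percentage points, to one decimal place."""
    return round(consensus - individual, 1)


gains = {key: gain_points(*pair) for key, pair in TABLE6.items()}
for (group, attribute), gain in gains.items():
    print(f"{group:12s} {attribute:16s} {gain:+.1f}")
```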

The figures in parentheses in Table 6 show the prediction accuracies for single- and multi-spanning sequences obtained by the consensus method and by the best of the 10 re-assessed prediction methods. The trend is the same as in the re-assessment of the individual prediction methods: performance on single-spanning sequences is clearly superior to that on multi-spanning ones for both the number of TMSs and the number of TMSs plus TMS-position attributes. This observation is more pronounced among the eukaryotic sequences than the prokaryotic ones, while the opposite is true for the N-tail location attribute.

Table 7 shows the results obtained by the consensus prediction method in the jack-knife test. Among prokaryotic sequences, the prediction accuracy ranges from a minimum of 64% to a maximum of 93% (average 79%) for the number of TMSs plus position attribute, and from 80% to 100% (average 87%) for the N-tail location attribute. The average prediction accuracies and the combinations of prediction methods differ only slightly between the self-consistency test and the jack-knife test. In particular, the combination of prediction methods in test set 3 is KKD, TMpred, DAS, MEMSAT 2, SOSUI, TMHMM 2.0 and HMMTOP 2.0 for the number of TMSs and the number of TMSs plus TMS-position attributes, while in test set 4 the combination is TopPred II, TMHMM 2.0 and HMMTOP 2.0 for the N-tail location attribute. These results suggest that the consensus prediction method is quite robust for prokaryotic sequences.


Table 7: Results of jack-knife test.

Prediction accuracies (%)

Methods             #TMS             #TMS&position    N-tail location  TM topology
Prokaryotes
    Test dataset 1  86.7 (83.6)      73.3 (81.8)      80.0 (92.7)      66.7 (76.4)
    Test dataset 2  93.3 (81.8)      93.3 (76.4)      86.7 (90.9)      80.0 (72.7)
    Test dataset 3  76.9 (86.0)      76.9 (78.9)      84.9 (91.2)      76.9 (73.7)
    Test dataset 4  71.4 (87.5)      64.3 (83.9)      85.7 (89.3)      57.1 (76.8)
    Test dataset 5  84.6 (84.2)      84.6 (78.9)      100.0 (87.7)     84.6 (71.9)
    Average         82.9 (84.6)      78.6 (80.0)      87.1 (90.4)      72.9 (74.3)
Eukaryotes
    Test dataset 1  60.0 (73.8)      50.0 (66.7)      40.0 (81.0)      10.0 (50.0)
    Test dataset 2  60.0 (73.8)      50.0 (66.7)      60.0 (76.2)      30.0 (42.9)
    Test dataset 3  72.7 (65.9)      72.7 (61.0)      54.5 (78.0)      45.5 (43.9)
    Test dataset 4  70.0 (69.0)      70.0 (61.9)      70.0 (73.8)      40.0 (38.1)
    Test dataset 5  54.5 (73.2)      54.5 (65.9)      72.7 (73.2)      45.5 (46.3)
    Average         63.5 (71.2)      59.6 (64.4)      59.6 (76.4)      34.6 (46.3)

(  ) : prediction accuracies for training dataset.
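
The jack-knife procedure behind Table 7 can be sketched as a five-fold loop in which the best combination is chosen on the training entries and then scored on the held-out test entries. This is an illustration under stated assumptions, not the authors' code: `is_correct` is a hypothetical callable that says whether a candidate combination predicts an entry correctly, the interleaved fold assignment is an arbitrary choice, and the toy data at the end are synthetic.

```python
def accuracy(candidate, data, is_correct):
    """Percentage of entries in data that the candidate predicts correctly."""
    return 100.0 * sum(is_correct(candidate, entry) for entry in data) / len(data)


def jackknife(entries, candidates, is_correct, n_folds=5):
    """Return one (chosen_candidate, test_accuracy, training_accuracy) per fold."""
    # Interleaved split into n_folds test sets (an arbitrary choice for this sketch).
    folds = [entries[i::n_folds] for i in range(n_folds)]
    results = []
    for i, test in enumerate(folds):
        # Training set: every entry that is not in the current test fold.
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        # Select the candidate with the highest training accuracy.
        chosen = max(candidates, key=lambda c: accuracy(c, train, is_correct))
        results.append((chosen,
                        accuracy(chosen, test, is_correct),
                        accuracy(chosen, train, is_correct)))
    return results


# Synthetic example: candidate "B" is correct on 75% of entries, "A" on 50%.
toy_entries = list(range(20))
toy_rule = lambda cand, e: (e % 2 == 0) if cand == "A" else (e % 4 != 0)
toy_results = jackknife(toy_entries, ["A", "B"], toy_rule)
for chosen, test_acc, train_acc in toy_results:
    print(chosen, test_acc, f"({train_acc})")
```

Printing the training accuracy in parentheses next to the test accuracy mirrors the layout of Table 7.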


Among eukaryotic sequences, however, only two of the five combinations for the number of TMSs plus position attribute, and one of the five for the N-tail location attribute, are the same as the optimal combination of prediction methods. The remaining combinations differ only subtly. For instance, in test set 4, the combination comprises DAS, TMAP, MEMSAT 2, SOSUI and HMMTOP 2.0 for the number of TMSs plus position attribute, and TMpred, TopPred II and HMMTOP 2.0 for the N-tail location attribute. In the other cases, the combination differs from the optimal one by only one method. Unfortunately, the prediction accuracies obtained are lower than those of the self-consistency test. This is probably related to the small number of entries (only about 10) in each test set.

Except for TM topology prediction among eukaryotic sequences, where accuracy remains below 50%, we were able to show that our proposed consensus prediction method successfully improves TM prediction performance, increasing the prediction accuracies by at least nine percentage points, particularly among prokaryotic sequences. Although the consensus prediction accuracies are somewhat lower than what we would like to attain (say, above the 90% level, particularly for the number of TMSs plus TMS-position attribute), we show here that the consensus prediction approach yields better results than using the different TM topology prediction methods independently, except for the N-tail location attribute among eukaryotic sequences. This finding is consistent with the recently published report on the use of consensus prediction to predict membrane topology with high reliability among E. coli inner membrane proteins [Nilsson et al., 2000]. One reason for the success of our consensus prediction might be that a specified voting window increases the likelihood that the majority votes from the individual methods converge towards the "correct" location of the predicted segment, as identified by our matching criterion.
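
To make the idea of majority voting within a window concrete, the sketch below shows one possible reading of it. The exact voting rule and matching criterion are defined elsewhere in the paper, so everything here is an assumption: segments are compared by their centres, centres within `window` residues of the first centre in a group are pooled, and a consensus segment is accepted when more than half of the methods contribute to the group. The method names and coordinates in the example are invented.

```python
def consensus_segments(predictions, window):
    """Return consensus segment centres from per-method TMS predictions.

    predictions: dict mapping a method name to a list of (start, end) segments.
    window: maximum distance (in residues) between grouped segment centres.
    The grouping rule is an assumption made for this sketch.
    """
    n_methods = len(predictions)
    # One vote per predicted segment: (centre position, method name), sorted by centre.
    votes = sorted(((start + end) / 2, method)
                   for method, segments in predictions.items()
                   for start, end in segments)
    consensus = []
    i = 0
    while i < len(votes):
        anchor = votes[i][0]
        group = []
        # Pool all votes whose centre lies within the window of the group's first centre.
        while i < len(votes) and votes[i][0] - anchor <= window:
            group.append(votes[i])
            i += 1
        supporters = {method for _, method in group}
        if len(supporters) > n_methods / 2:  # simple majority of the methods
            consensus.append(sum(centre for centre, _ in group) / len(group))
    return consensus


# Invented predictions from three hypothetical methods.
example = {
    "method_A": [(10, 30), (50, 70)],
    "method_B": [(12, 32)],
    "method_C": [(8, 28), (52, 72), (100, 120)],
}
print(consensus_segments(example, window=8))  # [20.0, 61.0]
```

In this example the segment near residue 110 is predicted by only one of the three methods and is therefore dropped, while the other two locations receive a majority and are kept.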

Furthermore, we also noticed that prokaryotic sequences have much higher prediction accuracies than their eukaryotic counterparts. One plausible explanation for this observation is that prokaryotic TM sequences have more conserved features than eukaryotic TM sequences owing to their simpler cellular structure, and hence are less likely to be wrongly predicted.

In addition, we would like to highlight the consensus prediction result for the number of TMSs plus TMS-position among prokaryotic sequences, which reaches 80% accuracy, 10 percentage points above the highest prediction accuracy obtained among the 10 methods (70%). This increase might not be enough to reach the 90% level of accuracy, but it makes a big difference when the correctly predicted sequences are used to predict the function of the prokaryotic proteins in a proteome, of which only around 50% have a known function [Bork et al., 1992]. In other words, the increase in accuracy achieved by the consensus prediction method can translate into more protein functions being identified than are presently known, and many more than can be identified by using the best-performing of the 10 selected TM topology prediction methods on its own.

As a next step, we plan to apply this consensus prediction method to genome-scale analyses. Using the results of these analyses, we will then try to perform comparative genomic analyses and to analyze topology patterns [Poluliakh et al., 2000] in order to elucidate protein functions from information on the number of TMSs and loop length.


ACKNOWLEDGMENTS

We would like to thank Dr. Kenta Nakai for his useful comments and assistance in gathering the experimentally characterized TM topology data. Likewise, we are grateful to Dr. Daisuke Kihara for sharing with us his well-characterized TM topology dataset.


REFERENCES