In Silico Biology 6, 0037 (2006); 2006, Bioinformation Systems e.V.  


MemO: A consensus approach to the annotation of a protein's membrane organization


Melissa J. Davis, Fasheng Zhang, Zheng Yuan and Rohan D. Teasdale*




Institute for Molecular Bioscience and ARC Centre in Bioinformatics, University of Queensland, St. Lucia, 4072, Australia

* Corresponding author

   Email: R.Teasdale@imb.uq.edu.au
   Phone: +61-7-3346 2056; FAX: +61-7-3346 2101





Edited by E. Wingender; received May 24, 2006; revised July 07, 2006; accepted July 10, 2006; published August 07, 2006



Abstract

Membrane organization describes the relationship of proteins to the membrane, that is, whether the protein crosses the membrane or is integral to the membrane and its orientation with respect to the membrane. Membrane organization is determined primarily by the presence of two features which target proteins to the secretory pathway: the endoplasmic reticulum signal peptide and the α-helical transmembrane domain. In order to generate membrane organization annotation of high quality, confidence and throughput, the Membrane Organization (MemO) pipeline was developed, incorporating consensus feature prediction modules with integration and annotation rules derived from biological observations. The pipeline classifies proteins into six categories based on the presence or absence of predicted features: Soluble, intracellular proteins; Soluble, secreted proteins; Type I membrane proteins; Type II membrane proteins; Multi-span membrane proteins and Glycosylphosphatidylinositol anchored membrane proteins. The MemO pipeline represents an integrated strategy for the application of state-of-the-art bioinformatics tools to the annotation of protein membrane organization, a property which adds biological context to the large quantities of protein sequence information available.

Keywords: integral-membrane protein, transmembrane domain, secreted protein, signal peptide, subcellular localization



Introduction

The eukaryotic cell is structured internally into a number of compartments, or organelles, each with distinctive protein and lipid compositions which enable their correct and specialized function. The location of a protein is therefore one key determinant of its potential function. Early in the translation of nascent proteins, the cell must make a decision regarding the fate of the protein and the way it is to be post-translationally processed and trafficked to its eventual destination in the cell: the nascent protein-ribosome complex either remains in the cytoplasm, or is targeted to the membrane of the Endoplasmic Reticulum (ER) for entry to the secretory pathway by one of two features: a signal peptide or an α-helical transmembrane domain [1]. By considering these two features in combination for a given sequence, it is possible to determine the protein's membrane organization, a property we define as the relationship of the protein to cellular membranes, that is, whether the protein crosses a membrane to be secreted as opposed to remaining in the cytoplasm, is embedded in a membrane, as opposed to soluble, and the orientation of the protein with respect to the membrane.

The signal peptide (SP) is located at the amino terminus (N-terminus) of newly synthesized proteins, and is cleaved from the mature protein after translocation into the lumen of the ER. Signal peptides are variable in length, but generally occur within 70 residues of the N-terminus, and are usually around 20-30 residues in length. They contain a hydrophobic core, frequently with charged residues on the N-terminal side, followed by large, helix-disrupting residues that break the structure of the signal peptide and delineate the cleavage site [2]. These properties, and the amino acid propensities that reflect them, are recognized by publicly available methods that predict the presence of signal peptides in protein sequences [3-5].

The α-helical transmembrane domain (TMD) is a feature that may be located anywhere in a protein sequence, is not cleaved in the ER lumen, and anchors the protein in the membrane. A TMD in a mature protein therefore exists in the hydrophobic environment of the lipid membrane, and typically consists of 17-25 predominantly hydrophobic residues that penetrate the membrane [6]. Proteins which contain one or more TMDs are referred to as transmembrane proteins. Many applications that predict the presence of TMDs in protein sequences are available; they use a variety of computational methods and were recently reviewed [7].

While prediction of the SP and TMD features is possible using available computational methods, problems exist with the automatic application of these methods to large data sets and in the practice of combining predicted features. While most TMD prediction methods are reported with very high accuracies, evaluation of the methods on previously unseen data has revealed that performance has been overestimated [8, 9]. Also, resolution of the conflicts that occur when combining feature predictions is particularly problematic for SPs and TMDs, both of which present a similar hydrophobic profile to prediction methods [10]. To address these issues and provide high-throughput, automated annotation of membrane organization in whole proteome data sets, we have developed the Membrane Organization (MemO) annotation pipeline. The pipeline uses individually optimized consensus prediction modules to predict the presence of SPs and TMDs, reporting confidence scores for each feature to assist in the interpretation of prediction strength. Resolution of conflicts between these features is carried out through the application of a discrimination program designed for this task [11]. Annotation of membrane organization across large sets of protein sequences using this strategy thus provides an important biological context for selection of appropriate candidates for targeted experimental and bioinformatic investigation.



Material and methods


Evaluation data sets

Testing and training of components of the pipeline was done using data extracted from the following datasets: TMPDB-alpha non-redundant [12], signal peptide positive and negative eukaryotic data sets [5], signal anchor data set [11], and a negative control set [9]. While the signal peptide sets are exclusively eukaryotic, the other data sets include prokaryotic proteins. Since MemO is being used to annotate eukaryotic protein sequences, eukaryotic subsets were created for the TMPDB and signal anchor data sets. From the TMPDB set, 85 eukaryotic proteins were extracted to form the set EUK-TM85. To evaluate the GPI prediction method, we extracted 35 known GPI proteins from Swiss-Prot release 41.15. Proteins where the lipid attachment was annotated as potential, probable or known by similarity were not included. To evaluate false positive GPI predictions, a set of 645 known soluble protein chains from the Protein Data Bank (PDB) previously used to evaluate transmembrane domain prediction [13] was selected.

Redundancy within the sets was reduced such that no two proteins shared greater than 30% identity. For this redundancy reduction, pairwise identities were collected using ClustalW 1.8 [14] and non-redundant sets were generated [15]. For training and testing of the signal peptide prediction module, the redundancy-reduced SP positive and negative sets consisted of 566 and 549 proteins respectively. The proteins in these sets were randomly assigned to five sets for training and validation using five-fold cross-validation. The GPI set was reduced to 23 proteins in the redundancy reduction step, and the set of soluble protein chains contained 602 sequences after redundancy reduction.
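
The two set-preparation steps described above can be sketched roughly as follows. This is a minimal illustration only: the greedy, order-dependent selection is an assumption made here (the procedure actually used is that of [15]), and the `identity` lookup table, the function names and the random seed are hypothetical.

```python
import random

def reduce_redundancy(ids, identity, max_identity=30.0):
    """Greedy filter: keep a protein only if its identity (percent) to every
    protein already kept is <= max_identity. The greedy, order-dependent
    selection is an illustrative assumption, not the procedure of [15]."""
    kept = []
    for pid in ids:
        if all(identity.get(frozenset((pid, k)), 0.0) <= max_identity for k in kept):
            kept.append(pid)
    return kept

def assign_folds(ids, n_folds=5, seed=0):
    """Shuffle the identifiers and deal them into n_folds groups."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

# Hypothetical pairwise identities (percent), keyed by unordered pair.
identity = {
    frozenset(("P1", "P2")): 85.0,
    frozenset(("P1", "P3")): 12.0,
    frozenset(("P2", "P3")): 14.0,
}
non_redundant = reduce_redundancy(["P1", "P2", "P3"], identity)
folds = assign_folds(non_redundant, n_folds=5)
```

In this toy input, P2 is dropped because it shares 85% identity with P1, which was kept first.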

For the evaluation of membrane organization category annotations, we created a data set, EUK-MO, which contained eukaryotic proteins of known membrane organization to represent each class. We used the EUK-TM85 set for the membrane classes, following the topology annotated in TMPDB [12] to determine the membrane organization of each protein in the set; this resulted in 12 Type I membrane proteins, 21 Type II membrane proteins and 52 Multi-span membrane proteins. For the soluble secreted and soluble intracellular sets, we randomly extracted 122 eukaryotic proteins from the signal-peptide-positive/transmembrane-domain-negative set (Soluble, secreted proteins), and 122 eukaryotic proteins from the signal-peptide-negative/transmembrane-domain-negative set (Soluble, intracellular proteins) from the Phobius evaluation sets [16].


Prediction methods

Transmembrane domain predictions were generated from five currently available predictors. HMMTOP [17], TMHMM v2.0 [13] and SVMTM v3.0 [18] were run using the program defaults. MEMSAT [19] was run using the program defaults and transmembrane domain predictions with a score below 0.80 were discarded, and DAS [20] was run using the .32 config file and the –s –u options in the command line interface.

A local implementation of SignalP v2.0 [4, 21] was used to generate signal peptide predictions. This program was run using all default parameters. An implementation of the von Heijne weight matrix signal peptide prediction method [3] was developed using the eukaryotic scoring matrix published by Nielsen and others [4]. For signal peptide prediction methods, only predictions in the first 70 residues of sequence were collected.

Determination of N-terminal feature identity was carried out using a program developed to differentiate between transmembrane domains and signal peptides at the N-terminus of a protein sequence [11]. The first 45 residues of sequences with a conflict were submitted, along with the prediction output of SignalP v2.0 on the first 45 residues, and the program was run using all other defaults.

Predictions of GPI attachment signals were generated using the GPI Modification Site Prediction server [22] big-PI, available at http://mendel.imp.univie.ac.at/sat/gpi/gpi_server.html.



Results

The membrane organization of individual proteins must be determined in order to place the protein in the appropriate context and understand its function. We have developed the MemO annotation pipeline to generate high-confidence membrane organization annotation on data sets representing entire eukaryotic proteomes. The pipeline uses consensus methods optimized for eukaryotic data to predict signal peptides and transmembrane domains, and incorporates specialized prediction methods to resolve conflict or annotate additional features such as GPI attachment signals. The structure of the prediction pipeline is illustrated in Fig. 1. First, input sequences are pre-filtered to remove non-full-length sequences, then SP and TMD feature predictions are generated using consensus methods. These features are combined, conflicts resolved, and the membrane organization classified. Additional predictions, such as GPI attachment signals, may be generated for specific classes of protein. The pipeline output is a table of protein annotation, where the predicted features and membrane organization of each protein are recorded.



Figure 1: Structure of the MemO Pipeline.



Full-length filtering

Accurate determination of membrane organization requires the entire protein sequence. Non-full-length sequences are removed from the input set in order to minimize noise in the final results. We remove sequences shorter than 30 residues or containing non-standard amino acid symbols, as these create problems for some of the prediction methods. While the extent of pre-filtering required depends on the quality of the input data set, we typically remove sequences lacking an initial methionine or where the coding sequence is clearly annotated as truncated, a partial sequence, or a fragment.
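
The sequence-level part of this filter can be sketched as below. This is a minimal sketch that assumes the 20 one-letter standard amino acid codes; the function name and its parameters are hypothetical, and the annotation-based checks (truncated, partial or fragment records) are not covered because they depend on the format of the input database.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def passes_prefilter(seq, min_length=30, require_met=True):
    """Return True if a sequence survives the sequence-level pre-filter."""
    seq = seq.upper()
    if len(seq) < min_length:
        return False          # shorter than 30 residues
    if not set(seq) <= STANDARD_AA:
        return False          # contains a non-standard symbol such as X or B
    if require_met and not seq.startswith("M"):
        return False          # lacks an initial methionine
    return True

sequences = {
    "ok": "M" + "A" * 29,
    "too_short": "M" + "A" * 28,
    "non_standard": "M" + "A" * 28 + "X",
    "no_met": "A" * 30,
}
kept = [name for name, seq in sequences.items() if passes_prefilter(seq)]
```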


Signal peptide prediction

Others have observed that publicly available signal peptide prediction methods have relatively high false positive prediction rates [5]. In order to address this problem, we developed a consensus based method to predict the presence of signal peptides at the N-terminus of eukaryotic protein sequences. The resulting prediction method demonstrates a reduction in the false positive prediction rate without sacrificing sensitivity.

For input, the method takes raw predictions of signal peptide status from three prediction methods: SignalP neural network (NN) [4], SignalP hidden Markov model (HMM) [21], and a weight matrix based method [3]. All three primary methods use several parameters to discriminate signal peptides, and we evaluated these parameters to select from each individual predictor the parameter that provided the best discrimination between SP positive and negative proteins (data not shown). We selected the maximal y score (ymax) from SignalP-NN, the probability score (sprob) from SignalP-HMM and the raw score from the implementation of the von Heijne weight matrix method. Each of these three parameters was normalized by scaling the observed values to the range [0,1]. We calculated a parameter called SPscore, defined as the average of these three normalized values, and evaluated the ability of this value to discriminate between signal peptide positive and negative proteins. Detection of the cleavage site was not considered.

Optimization and evaluation of the consensus signal peptide prediction module was conducted using SP positive and negative test sets of eukaryotic proteins using a five-fold cross-validation strategy. A threshold of 0.63 was observed to maximize the proportions of true positive (TP) and true negative (TN) results in the positive and negative training sets. At this threshold, the percentage of TP in the positive set was 96.3% (false negatives (FN) = 3.7%) and the percentage of TN in the negative set was 96.2% (false positives (FP) = 3.8%).
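
A minimal sketch of the SPscore calculation is given below. It assumes min-max scaling against the minimum and maximum observed for each parameter; the numeric ranges, the example scores and the function names are hypothetical, and treating a score equal to the 0.63 threshold as positive is also an assumption.

```python
def min_max_scale(value, lo, hi):
    """Scale value into [0, 1] using the observed minimum (lo) and maximum (hi),
    clipping values that fall outside the observed range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))

def sp_score(ymax, sprob, matrix_score, ranges):
    """Average of the three normalized signal peptide parameters."""
    scaled = [
        min_max_scale(ymax, *ranges["ymax"]),
        min_max_scale(sprob, *ranges["sprob"]),
        min_max_scale(matrix_score, *ranges["matrix"]),
    ]
    return sum(scaled) / len(scaled)

SP_THRESHOLD = 0.63  # threshold reported in the text

def has_signal_peptide(score, threshold=SP_THRESHOLD):
    """Assumption: scores at or above the threshold are called SP positive."""
    return score >= threshold

# Hypothetical observed ranges for each parameter (not taken from the paper).
ranges = {"ymax": (0.0, 1.0), "sprob": (0.0, 1.0), "matrix": (-10.0, 15.0)}
score = sp_score(ymax=0.8, sprob=0.9, matrix_score=5.0, ranges=ranges)
call = has_signal_peptide(score)
```

Because the three parameters are averaged with equal weight, a weak score from one predictor can be offset by strong scores from the other two.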

The discrimination performance of the SPscore parameter may be compared with the contributing parameters across a range of thresholds. A ROC curve of sensitivity (TP/total positives) plotted against the FP prediction rate (FP/total negatives) provides information regarding the performance of a continuous parameter across a range of threshold values. Fig. 2 shows the ROC curves of ymax from SignalP-NN, sprob from SignalP-HMM, the normalized output score from SPScan, and the consensus parameter SPscore. It can be seen from these curves that all methods perform well overall, and that the parameters considered show good discrimination ability. Closer examination of the curves in the area of optimal performance (where the TP and TN prediction rates approach 1, see Fig. 2, inset) reveals that the SPscore exceeds the performance of the parameters that contribute to it across TP prediction rates ranging from 0.88 to 0.98.
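
The construction of such a ROC curve can be sketched as follows, using the definitions of sensitivity (TP/total positives) and FP rate (FP/total negatives) given above. The scores are invented for illustration and the function name is hypothetical.

```python
def roc_points(pos_scores, neg_scores):
    """Return (fp_rate, sensitivity) pairs, one per candidate threshold.
    A protein is called positive when its score is >= the threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for t in thresholds:
        tp = sum(s >= t for s in pos_scores)
        fp = sum(s >= t for s in neg_scores)
        points.append((fp / len(neg_scores), tp / len(pos_scores)))
    return points

# Invented scores for four SP positive and four SP negative proteins.
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3, 0.2, 0.1]
points = roc_points(pos, neg)
```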


Figure 2: ROC curve demonstrating the comparative performance of signal peptide prediction parameters and the consensus signal peptide prediction method.


Comparison of SPscore with the individual signal peptide predictors used to calculate it revealed that using this parameter to predict SP status reduced FP prediction in the negative data set from 12.0%, 11.9% and 6.4% (for SignalP-NN, -HMM and the weight matrix method respectively) to only 3.8%. The trade-off effect on FN predictions in the positive set was small: the FN rate for the consensus SPscore is 3.7%, while SignalP-NN and -HMM both have an FN rate of 2.0%. In order to gauge overall performance of the methods, a performance statistic Qs may be calculated as the average of the percentage of TP and the percentage of TN [23]. For SignalP-NN, Qs = 93.0, for SignalP-HMM, Qs = 93.0, and for the weight matrix method, Qs = 91.5, compared with Qs = 96.3 for the consensus SPscore parameter. The Qs scores for the SignalP methods are reduced by the comparatively low rates of TN prediction, even though these methods both achieve a ~98% TP prediction rate when run using program defaults. The higher Qs score for our method demonstrates that the high TP prediction rate (96.3%) we have achieved has not come at the cost of a lower TN prediction rate, as was observed for SignalP-NN and -HMM (96.2% TN rate compared with ~88% for the SignalP methods).
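
The Qs statistic can be reproduced from the error rates quoted above with a few lines of Python. The function name is hypothetical; the input percentages are those reported in the text.

```python
def qs_from_error_rates(fn_percent, fp_percent):
    """Qs = average of %TP in the positive set and %TN in the negative set."""
    tp_percent = 100.0 - fn_percent
    tn_percent = 100.0 - fp_percent
    return (tp_percent + tn_percent) / 2.0

# Error rates quoted in the text.
qs_signalp_nn = qs_from_error_rates(fn_percent=2.0, fp_percent=12.0)
qs_consensus = qs_from_error_rates(fn_percent=3.7, fp_percent=3.8)
```

With these inputs the SignalP-NN value is 93.0, and the consensus value is 96.25, which corresponds to the 96.3 reported above after rounding.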


Transmembrane domain prediction

Very small data sets are available for the training and testing of transmembrane domain prediction methods [7, 24], making it difficult to evaluate or objectively compare the performance of methods. Uncertainty also surrounds the actual performance of transmembrane domain prediction methods on unknown data [7-9]. It has been observed however that when methods agree on the TMD status of a region, that prediction is more likely to be correct than in cases where methods do not agree [25], and consensus approaches applied to families of proteins have demonstrated improved prediction capability over single methods [26]. We have developed a consensus method to predict transmembrane domains in protein sequences. This method collects TMD status predictions from a number of predictors and generates a consensus prediction and confidence score for each putative TMD region. By exploiting the tendency of various predictors to make uncorrelated errors, improved prediction accuracy is achieved by this approach.

For input to the consensus TMD method, predictions for each sequence were collected from five programs: HMMTOP [17], TMHMM [13], SVMTM [18], DAS [20] and MEMSAT [19]. For each protein sequence, a consensus parameter, TMfrequency, was calculated for each amino acid, a, such that:

$$\mathrm{TMfrequency}_a = \sum_{i=1}^{n} p_{a,i}$$

where n is the number of transmembrane domain prediction methods used, and p_{a,i} is the prediction status of amino acid a from predictor i, and has a value of 1 if method i predicts a TMD at that residue, and 0 otherwise. The TMfrequency profile for each sequence indicates the tendency of given regions of sequence to be predicted as transmembrane domains. Three criteria are then applied in the consensus algorithm to find putative transmembrane domain regions in the TMfrequency profile: the TMfrequency score must be greater than or equal to a threshold (here set at 3, representing agreement of at least three of the five predictors); the region above the threshold must be at least 5 residues long; and the region may not contain sub-threshold gaps exceeding four consecutive residues within the putative TMD region.
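
The three criteria can be sketched in Python as follows. This is one possible reading rather than the authors' implementation: TMfrequency is taken as the number of predictors calling a residue transmembrane (consistent with a threshold of 3 out of 5), above-threshold runs separated by at most four sub-threshold residues are merged, and the length criterion is applied to the merged region. Function names, the 0-based inclusive coordinates and the example predictions are hypothetical.

```python
def tm_frequency(predictions):
    """predictions: one 0/1 list per predictor, all of sequence length.
    Returns the per-residue count of predictors calling a TMD."""
    return [sum(column) for column in zip(*predictions)]

def consensus_regions(freq, threshold=3, min_length=5, max_gap=4):
    """Return consensus TMD regions as 0-based inclusive (start, end) pairs."""
    # 1. Collect runs of residues at or above the threshold.
    runs, start = [], None
    for i, f in enumerate(freq):
        if f >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(freq) - 1))
    # 2. Merge runs separated by a sub-threshold gap of at most max_gap residues.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # 3. Discard regions shorter than min_length.
    return [(s, e) for s, e in merged if e - s + 1 >= min_length]

# Hypothetical output of five predictors on a 20-residue sequence.
raw = [
    "000" + "1" * 10 + "0" * 7,
    "0000" + "1" * 9 + "0" * 7,
    "000" + "11111" + "00" + "1111" + "000000",
    "00000" + "1111111" + "00000000",
    "0" * 20,
]
predictions = [[int(c) for c in row] for row in raw]
freq = tm_frequency(predictions)
regions = consensus_regions(freq)
```

In this example three or more predictors agree on residues 4 to 12 (0-based), which is returned as a single consensus region.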

After determining which regions of the sequence are predicted as transmembrane domains, a confidence score for each predicted domain is calculated by averaging the TMfrequency value of each residue predicted in the consensus TMD. This score, the TMscore, represents the degree of support a given consensus TMD region has from individual prediction methods, and can be used to estimate confidence in a predicted region. The TMscore for known transmembrane domains in the EUK-TM85 set can also be calculated from TMfrequency profiles by averaging TMfrequency across the residues of the real TMD region. The confidence values of the real TMDs support the observation by others [27] that very long TMDs are not as well predicted by TMD prediction methods as those in the canonical length range of 15-25 residues. Very long TMDs (>30 residues in the domain) show an average TMscore of 3.0, while those of canonical length show an average TMscore of 3.8.
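
As a small worked example, the TMscore of a region is simply the mean TMfrequency over its residues. The profile below is hypothetical, TMfrequency is assumed to be the per-residue count of agreeing predictors, and the function name and 0-based inclusive coordinates are illustrative choices.

```python
def tm_score(freq, start, end):
    """Mean TMfrequency over a region given as 0-based inclusive coordinates."""
    region = freq[start:end + 1]
    return sum(region) / len(region)

# Hypothetical TMfrequency profile (number of agreeing predictors per residue).
freq = [0, 0, 0, 2, 3, 4, 4, 4, 3, 3, 4, 4, 3, 1, 0, 0, 0, 0, 0, 0]
score = tm_score(freq, 4, 12)
```

Here the nine residues of the region sum to 32, giving a TMscore of about 3.56 on the 0 to 5 scale implied by five predictors.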

For evaluation purposes, a predicted region was considered to be a true positive prediction if it overlapped a known region by at least 5 residues [28]. Each prediction is only counted once. A comparison of results for this consensus method, the five individual prediction methods contributing to it, and two existing consensus-based applications, ConPred II [29] and BPROMPT [30], is presented in Tab. 1. These results demonstrate that the consensus prediction method creates a more even balance between false positive and false negative performance rates than use of a single method can achieve. It also addresses the issue of confidence in predictions, because presentation of the TMscore immediately informs the user of the degree of support for any predicted domain.


Table 1: Evaluation of transmembrane domain prediction methods on the eukaryotic transmembrane protein set EUK-TM85.
Method       TMD predicted   TP    FN    FP   Total errors (FN+FP)
Consensus    366             346   62    20   82
HMMTOP       404             361   47    43   90
TMHMM        343             329   79    14   93
SVMTM        369             340   68    29   97
DAS          393             326   82    67   149
MEMSAT       300             275   133   25   158
ConPred II   370             345   63    25   88
BPROMPT      320             273   135   47   182
Methods are evaluated according to the total errors approach applied by Möller et al. [9] in a recent comparison of transmembrane prediction methods. Briefly, a predicted TMD is classed as a TP if it overlaps by five residues with a real TMD. If a predicted TMD overlaps with two real TMDs, it is only counted once as a TP, and if no other predicted TMD overlaps with the second real TMD then it is counted as a FN (i.e. missed prediction). There are a total of 408 observed TMDs in this data set. Results obtained from web-accessible consensus methods ConPred II and BPROMPT are included for benchmarking purposes. ConPred II and BPROMPT are known to have been trained on the dataset from which EUK-TM85 is extracted. The FN column also indicates what proportion of the 408 TMDs in the set was missed by each predictor.
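
The counting scheme described in this note can be sketched as below. The one-to-one matching of each prediction to the first unmatched real TMD along the sequence is an assumption made for illustration; the function names and coordinates are hypothetical.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def score_predictions(predicted, known, min_overlap=5):
    """Count TP, FN and FP, using each prediction and each real TMD at most once."""
    matched = set()
    tp = fp = 0
    for pred in sorted(predicted):
        hit = next(
            (i for i, real in enumerate(known)
             if i not in matched and overlap(pred, real) >= min_overlap),
            None,
        )
        if hit is None:
            fp += 1            # no real TMD overlapped by at least min_overlap
        else:
            matched.add(hit)   # counted once, even if it spans two real TMDs
            tp += 1
    fn = len(known) - len(matched)
    return {"TP": tp, "FN": fn, "FP": fp, "total_errors": fn + fp}

# One long prediction spans two real TMDs; a second prediction matches nothing.
known = [(10, 30), (40, 60), (80, 100)]
predicted = [(12, 58), (120, 140)]
result = score_predictions(predicted, known)
```

With this scheme TP + FN always equals the number of real TMDs and TP + FP equals the number of predictions, which matches the row totals in Tab. 1.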



Amino-terminal feature differentiation

It has been recognized that signal peptide and transmembrane domain prediction methods make errors in assigning the identity of hydrophobic signal sequences at the N-terminus [31]. For the MemO pipeline, we use an N-terminal feature differentiation program specifically designed to address this problem by classifying hydrophobic features at the N-terminus as either transmembrane domains (signal anchors) or signal peptides [11]. Sequences where a transmembrane domain is predicted to start in the first 25 residues and a signal peptide has been predicted are submitted for conflict resolution. Application of the N-terminal filter improves the accuracy of transmembrane domain prediction. Compared with the performance of the consensus method alone, on the EUK-TM85 set this step lowers the total error score by 2, although it should be noted that only 7 proteins contained conflicting predictions in this set, and in 6 of the 7 cases the conflict was correctly resolved.
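
The rule that selects sequences for conflict resolution can be sketched as follows. The discrimination program of [11] itself is not reproduced here; the function names are hypothetical, TMD positions are assumed to be 1-based, and the 45-residue window follows the Material and methods.

```python
def needs_conflict_resolution(has_sp, tmd_regions, max_tmd_start=25):
    """True when a signal peptide is predicted and a predicted TMD starts within
    the first max_tmd_start residues. tmd_regions holds 1-based (start, end)."""
    return bool(has_sp) and any(start <= max_tmd_start for start, _ in tmd_regions)

def n_terminal_window(seq, length=45):
    """Portion of the sequence submitted to the discrimination program."""
    return seq[:length]

conflict = needs_conflict_resolution(True, [(8, 28), (120, 140)])
no_conflict_late_tmd = needs_conflict_resolution(True, [(60, 80)])
no_conflict_no_sp = needs_conflict_resolution(False, [(8, 28)])
window = n_terminal_window("M" + "L" * 99)
```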


Glycosylphosphatidylinositol anchor prediction

GPI lipid anchors are attached to the C-terminal end of proteins in the ER, and are known to anchor extracellular proteins to the membrane. In order for the cleavable C-terminal attachment signal to be available for cleavage and modification, this end of the protein sequence must be accessible in the lumen of the ER. The GPI attachment site displays a motif that may be predicted computationally through a combined approach of recognizing amino acid preferences at the site and evaluating the physical properties of the residues in the C-terminal signaling sequence [22].

We evaluated the performance of GPI predictions on a set of 23 known GPI-anchored proteins and 602 sequences of known soluble chains. Of the 23 proteins, positive predictions were obtained for 18 (78%), while no false positive predictions were generated in the negative set. The GPI prediction step is executed on those proteins which have a predicted membrane organization that is permissive of this modification, that is, have a topology that appropriately exposes the C-terminus of the protein in the ER lumen. Exclusion of proteins which do not have permissive topology from this prediction step lowers the potential for false positive predictions of this feature in proteins which will never be available for modification.


Annotation rules

Features were combined to give six categories of membrane organization: Soluble, intracellular proteins; Soluble, secreted proteins; Type I membrane proteins; Type II membrane proteins; Multi-span membrane proteins and Glycosylphosphatidylinositol anchored membrane proteins. Classification is made according to the schema for combining features described in Tab. 2. The Soluble, intracellular protein classification may be considered a default prediction as this category has no predicted features. The Multi-span membrane protein categorization is applied to all proteins with two or more predicted transmembrane domains regardless of whether that protein contains a signal peptide.


Table 2: Classification rules for the five major membrane organization classes.
Membrane organization class       Signal peptides      Transmembrane domains
Soluble, intracellular protein    Absent               Absent
Soluble, secreted protein         Present              Absent
Type I membrane protein           Present              Single domain
Type II membrane protein          Absent               Single domain
Multi-span membrane protein       Present or absent    Multiple domains
The five major classes are defined by the presence or absence of signal peptide and transmembrane domain features. The sixth class, Glycosylphosphatidylinositol anchored proteins, are a subset of the proteins with their C-terminus orientated to the extracellular or luminal space, including the Soluble, secreted proteins and Type II membrane proteins which are predicted to contain this post-translational modification, and Multi-span membrane proteins whose topology indicates an extracellular location for the C-terminus.
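
The rules in Tab. 2 translate directly into a small decision function, sketched below. The function names are hypothetical; the inputs are assumed to be the final feature calls after any N-terminal conflict has been resolved, and the C-terminus orientation flag for Multi-span proteins is an illustrative parameter.

```python
def classify(has_sp, n_tmd):
    """Assign one of the five major membrane organization classes (Tab. 2)."""
    if n_tmd >= 2:
        return "Multi-span membrane protein"      # signal peptide present or absent
    if n_tmd == 1:
        return "Type I membrane protein" if has_sp else "Type II membrane protein"
    return "Soluble, secreted protein" if has_sp else "Soluble, intracellular protein"

def gpi_permissive(mo_class, c_terminus_luminal=False):
    """True if the class exposes the C-terminus to the lumen or extracellular
    space, so that GPI attachment signal prediction is worth running."""
    if mo_class in ("Soluble, secreted protein", "Type II membrane protein"):
        return True
    if mo_class == "Multi-span membrane protein":
        return c_terminus_luminal
    return False

examples = {
    (False, 0): classify(False, 0),
    (True, 0): classify(True, 0),
    (True, 1): classify(True, 1),
    (False, 1): classify(False, 1),
    (True, 7): classify(True, 7),
}
```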



Evaluation of membrane organization predictions

The ability of MemO to correctly classify the membrane organization of proteins was evaluated against the EUK-MO set of proteins with established membrane organization classes. This evaluation (see Tab. 3) shows that all classes achieve high specificity, ranging between 0.97 and 0.99, as a result of low false positive prediction rates across the classes. Prediction accuracy is also high (0.98-0.99) across all the membrane organization classes. The sensitivity of the prediction shows more fluctuation, and is lowest for the two single transmembrane domain classes (Type I membrane proteins and Type II membrane proteins), where the most common cause of a FN classification is the inclusion of an additional, weakly predicted transmembrane domain that changes the predicted classification to the Multi-span membrane protein class. Another important aspect of membrane organization is the orientation of the N-terminus of a protein with respect to the membrane. This information is important to the design of some experimental methods used to investigate the location or function of a protein [32]. The orientation of the N-terminus is predicted correctly by MemO in 97.9% of all proteins in this test.


Table 3: Evaluation of MemO classification for each major membrane organization class.
Membrane organization (MO) class       Predicted MO   TP    FP   TN    FN   Accuracy   Sensitivity   Specificity   Correct N-terminus orientation
Soluble, intracellular protein (122)   120            117   3    207   5    0.98       0.96          0.99          119 (97.5%)
Soluble, secreted protein (122)        120            114   6    207   8    0.98       0.93          0.97          120 (98.4%)
Type I membrane protein (12)           15             10    5    317   2    0.99       0.83          0.98          12 (100%)
Type II membrane protein (21)          21             16    5    308   5    0.98       0.76          0.98          19 (90.5%)
Multi-span membrane protein (52)       53             48    5    277   4    0.99       0.92          0.98          52 (100%)
The number of proteins in each MO class is shown in brackets in the first column. Of the 329 proteins examined in this test, 305 (92.7%) are predicted in the correct class, and 322 (97.9%) have the orientation of the N-terminus correctly predicted. Prediction accuracy ([TP+TN]/[TP+TN+FP+FN]), sensitivity (TP/[TP+FN]) and specificity (TN/[FP+TN]) are calculated.
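
The three statistics defined in this note can be written out as below. The function names are hypothetical and the first set of counts is invented purely to exercise the formulas; the last line applies the sensitivity formula to the TP and FN counts of the Type II membrane protein row.

```python
def accuracy(tp, fp, tn, fn):
    """(TP + TN) / (TP + TN + FP + FN)"""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """TP / (TP + FN)"""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (FP + TN)"""
    return tn / (fp + tn)

# Invented counts for a single class, for illustration only.
tp, fp, tn, fn = 45, 5, 140, 10
acc = accuracy(tp, fp, tn, fn)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

# Sensitivity for the Type II membrane protein row of Tab. 3 (TP = 16, FN = 5).
type_ii_sensitivity = round(sensitivity(16, 5), 2)
```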




Discussion

The MemO annotation pipeline classifies proteins into classes of membrane organization that correspond to the biological context of proteins with respect to their entry into and orientation within the secretory pathway. The pipeline was developed for use in areas of molecular, cell and developmental biology where a demand exists for high quality, high confidence annotations of membrane organization to guide and target experimental investigation. For example, this strategy has proved valuable for targeting selected sets of proteins which show membrane organization characteristics desirable in a given experimental context, such as selection of Type II membrane proteins for targeted subcellular localization [33] or identification of over-expressed Soluble, secreted proteins in expression array experiments [34-37]. We have also applied this methodology to the mouse transcriptome as part of the RIKEN Functional Annotation of Mouse 3 project [38], and used the pipeline to identify signal peptide and transmembrane domain features that are differentially included in the variable protein output of transcriptional units of the mouse genome [39].

With respect to examination of the subcellular localization of individual proteins, MemO's membrane organization prediction facilitates a number of downstream research activities. Firstly, knowledge of the orientation of the amino and carboxy termini of a protein directs the choice of experimental methods suitable for localizing individual proteins [32], thus improving the outcomes of localization experiments. Secondly, many protein motifs which function as subcellular targeting motifs are required to be within the cytoplasmic regions of individual proteins [40] and consideration of a protein's topology enables false positive motifs not exposed to the cytoplasm to be discarded. Defining the membrane organization of individual proteins also facilitates the productive incorporation of predictions of GPI attachment signals which require a particular orientation in the ER [41].

Within the MemO pipeline, development of automated consensus methods in signal peptide and transmembrane domain prediction provides increased confidence in the prediction of those features by drawing together evidence from multiple sources. Creation of a single numeric parameter, SPscore, for signal peptide prediction enables simple interpretation of the strength of the result. Co-prediction of transmembrane domains by a number of predictors is frequently taken as an indication of prediction accuracy [42]. Reporting the consensus TMscore, which represents the degree of co-prediction for each predicted domain, allows increased confidence in predictions with strong agreement, while indicating those predictions with a lower degree of consensus. It is easy to understand and interpret, and enables individual domain regions within a protein to be assessed independently.

A major issue for the prediction of membrane proteins is the differentiation of N-terminal transmembrane domains from cleavable signal peptides [31]. To address this issue, we follow the approach of Yuan et al. [11], which can be applied to sequences where conflict between prediction of a signal peptide and transmembrane domain is detected. An alternative strategy is followed in the transmembrane protein prediction method ConPred II [29], where the authors recommend the removal of signal peptides from sequences prior to transmembrane domain prediction to avoid the incorrect prediction of these peptides as transmembrane domains. ConPred II also includes an optional discrimination method, DetecSig [10] to detect signal peptides in transmembrane proteins and establish topology, while the membrane organization predictor, Phobius [16], incorporates signal peptide prediction and transmembrane domain prediction in one method. Both of these strategies are less sensitive at detecting signal peptides than dedicated signal peptide prediction methods [10, 16]. In contrast, our strategy establishes putative signal peptide and transmembrane domain predictions separately using specialist predictors, and then only applies the discrimination method to those proteins with conflicting predictions.

The signal peptide and transmembrane domain consensus prediction modules both present improvements over the application of single methods. Their improved accuracy relies both on the accuracy of the individual methods and on the tendency of those methods to produce uncorrelated errors. For example, the individual methods used to generate the consensus predictions all perform well, but use different computational strategies (neural networks, hidden Markov models and weight matrices) and generate different incorrect predictions in the test sets. In the consensus signal peptide prediction method, combination of three predictions compensates for errors by any single predictor, and improves performance by reducing the FP rate to 3.8% without unduly sacrificing sensitivity. The advantages of applying consensus methods to transmembrane domain prediction have been noted previously for families of proteins [26], and this is now regarded as a productive strategy for improving prediction quality. The consensus method reported here shows improvement over individual methods that contribute to it when applied to published evaluation sets, in keeping with the observation that domains predicted by many different methods are more likely to be correctly predicted [25]. When compared with ConPred II [29] and BPROMPT [30], two online servers that use consensus strategies to predict transmembrane domains, we observe that our consensus TMD prediction method is highly competitive and performs better at the task of identifying transmembrane domain segments. Application of our N-terminal filtering method improves this performance by further reducing FP transmembrane domain predictions.

MemO annotations of membrane organization have been used as a framework for protein classification within the LOCATE subcellular location database [43] available at http://locate.imb.uq.edu.au/. This database provides access to the MemO annotations in the mouse Isoform Protein Sequence set, and enriches the biological context of the annotations through integration with Pfam [44] and SCOP [45] domain predictions, subcellular localization data collected from both the literature and high-throughput experiments, targeting motif predictions, and links to other database resources. We have used the MemO pipeline to annotate a number of protein sequence sets from higher eukaryotes. The annotated data sets and the scripts to implement the consensus transmembrane domain and signal peptide prediction methods are available for download from http://locate.imb.uq.edu.au/downloads.shtml.



Acknowledgements

This work was supported by funds from the Australian Research Council and the National Health and Medical Research Council of Australia; R.D.T. is supported by an NHMRC R. Douglas Wright Career Development Award. We thank colleagues in the Teasdale laboratories for their helpful discussions. We also thank the authors of the transmembrane domain and signal peptide prediction programs used in this study for providing us with the programs. This work was performed as part of the Renal Regeneration Consortium, and was supported by the National Institutes of Health (DK63400) as part of the Stem Cell Genome Anatomy Project [http://www.scgap.org/].



Abbreviations

Signal peptide (SP); Transmembrane domain (TMD); Endoplasmic reticulum (ER); Glycosylphosphatidylinositol (GPI); Amino terminus (N-terminus); Neural network (NN); Hidden Markov model (HMM)




References


  1. van Vliet, C., Thomas, E. C., Merino-Trigo, A., Teasdale, R. D. and Gleeson, P. A. (2003). Intracellular sorting and transport of proteins. Prog. Biophys. Mol. Biol. 83, 1-45.

  2. Martoglio, B. and Dobberstein, B. (1998). Signal sequences: more than just greasy peptides. Trends Cell Biol. 8, 410-415.

  3. von Heijne, G. (1986). A new method for predicting signal sequence cleavage sites. Nucleic Acids Res. 14, 4683-4690.

  4. Nielsen, H., Engelbrecht, J., Brunak, S. and von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10, 1-6.

  5. Menne, K. M. L., Hermjakob, H. and Apweiler, R. (2000). A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics 16, 741-742.

  6. Sabatini, D. D., Kreibich, G., Morimoto, T. and Adesnik, M. (1982). Mechanisms for the incorporation of proteins in membranes and organelles. J. Cell Biol. 92, 1-22.

  7. Chen, C. P. and Rost, B. (2002). State-of-the-art in membrane protein prediction. Appl. Bioinformatics 1, 21-35.

  8. Ikeda, M., Arai, M. and Shimizu, T. (2000). Evaluation of transmembrane topology prediction methods by using an experimentally characterized topology data set. Genome Informatics 11, 426-427.

  9. Möller, S., Croning, M. D. R. and Apweiler, R. (2001). Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17, 646-653.

  10. Lao, D. M. and Shimizu, T. (2001). Methods for detecting the signal peptide in transmembrane and globular proteins. Genome Informatics 12, 340.

  11. Yuan, Z., Davis, M. J., Zhang, F. and Teasdale, R. D. (2003). Computational differentiation of N-terminal signal peptides and transmembrane domains. Biochem. Biophys. Res. Commun. 312, 1278-1283.

  12. Ikeda, M., Arai, M., Okuno, T. and Shimizu, T. (2003). TMPDB: a database of experimentally-characterized transmembrane topologies. Nucleic Acids Res. 31, 406-409.

  13. Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E. L. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567-580.

  14. Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G. and Thompson, J. D. (2003). Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 31, 3497-3500.

  15. Hobohm, U., Scharf, M., Schneider, R. and Sander, C. (1992). Selection of representative protein data sets. Protein Sci. 1, 409-417.

  16. Käll, L., Krogh, A. and Sonnhammer, E. L. L. (2004). A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 338, 1027-1036.

  17. Tusnady, G. E. and Simon, I. (2001). The HMMTOP transmembrane topology prediction server. Bioinformatics 17, 849-850.

  18. Yuan, Z., Mattick, J. S. and Teasdale, R. D. (2004). SVMtm: Support vector machines to predict transmembrane segments. J. Comput. Chem. 25, 632-636.

  19. Jones, D. T., Taylor, W. R. and Thornton, J. M. (1994). A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33, 3038-3049.

  20. Cserzö, M., Wallin, E., Simon, I., von Heijne, G. and Elofsson, A. (1997). Prediction of transmembrane α-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng. 10, 673-676.

  21. Nielsen, H. and Krogh, A. (1998). Prediction of signal peptides and signal anchors by a hidden Markov model. In: Proc. Int. Conf. Intell. Syst. Mol. Biol., Glasgow, J. (ed.), AAAI Press, Menlo Park, CA, pp. 122-130.

  22. Eisenhaber, B., Bork, P. and Eisenhaber, F. (1999). Prediction of potential GPI-modification sites in proprotein sequences. J. Mol. Biol. 292, 741-758.

  23. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. F. and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16, 412-424.

  24. Möller, S., Kriventseva, E. V. and Apweiler, R. (2000). A collection of well characterised integral membrane proteins. Bioinformatics 16, 1159-1160.

  25. Nilsson, J., Persson, B. and von Heijne, G. (2000). Consensus predictions of membrane protein topology. FEBS Lett. 486, 267-269.

  26. Bertaccini, E. and Trudell, J. R. (2002). Predicting the transmembrane secondary structure of ligand-gated ion channels. Protein Eng. 15, 443-453.

  27. Chen, C. P. and Rost, B. (2002). Long membrane helices and short loops predicted less accurately. Protein Sci. 11, 2766-2773.

  28. Sonnhammer, E. L. L., von Heijne, G. and Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequence. In: Proc. Int. Conf. Intell. Syst. Mol. Biol., Glasgow, J. (ed.), AAAI Press, Menlo Park, CA, pp. 175-182.

  29. Arai, M., Mitsuke, H., Ikeda, M., Xia, J.-X., Kikuchi, T., Satake, M. and Shimizu, T. (2004). ConPred II, a consensus prediction method for obtaining transmembrane topology models with high reliability. Nucleic Acids Res. 32, W390-W393.

  30. Taylor, P. D., Attwood, T. K. and Flower, D. R. (2003). BPROMPT, a consensus server for membrane protein prediction. Nucleic Acids Res. 31, 3698-3700.

  31. Lao, D. M., Arai, M., Ikeda, M. and Shimizu, T. (2002). The presence of signal peptide significantly affects transmembrane topology prediction. Bioinformatics 18, 1562-1566.

  32. Stow, J. L. and Teasdale, R. D. (2005). Expression and localization of proteins in mammalian cells. In: Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics, Jorde, L. B., Little, P. F. R., Dunn, M. J. and Subramaniam, S. (eds.), John Wiley & Sons.

  33. Aturaliya, R. N., Fink, J. L., Davis, M. J., Teasdale, M. S., Hanson, K. A., Miranda, K. C., Forrest, A. R., Grimmond, S. M., Suzuki, H., Kanamori, M., Kai, C., Kawai, J., Carninci, P., Hayashizaki, Y. and Teasdale, R. D. (2006). Subcellular localization of mammalian type II membrane proteins. Traffic 7, 613-625.

  34. Challen, G. A., Martinez, G., Davis, M. J., Taylor, D. F., Crowe, M., Teasdale, R. D., Grimmond, S. M. and Little, M. H. (2004). Identifying the molecular phenotype of renal progenitor cells. J. Am. Soc. Nephrol. 15, 2344-2357.

  35. Martinez, G., Georgas, K., Challen, G. A., Rumballe, B., Davis, M. J., Taylor, D., Teasdale, R. D., Grimmond, S. M. and Little, M. H. (2006). Definition and spatial annotation of the dynamic secretome during early kidney development. Dev. Dyn. 235, 1709-1719.

  36. Caruana, G., Cullen-McEwen, L., Nelson, A. L., Kostoulias, X., Woods, K., Gardiner, B., Davis, M. J., Taylor, D. F., Teasdale, R. D., Grimmond, S. M., Little, M. H. and Bertram, J. F. (2006). Spatial gene expression in the T-stage mouse metanephros. Gene Expr. Patterns, in press.

  37. Mitchell, E. K. L., Taylor, D. F., Woods, K., Davis, M. J., Nelson, A. L., Teasdale, R. D., Grimmond, S. M., Little, M. H., Bertram, J. F. and Caruana, G. (2006). Differential gene expression in the developing mouse ureter. Gene Expr. Patterns 6, 519-538.

  38. Carninci, P., et al.; FANTOM Consortium; RIKEN Genome Exploration Research Group and Genome Science Group (Genome Network Project Core Group) (2005). The transcriptional landscape of the mammalian genome. Science 309, 1559-1563.

  39. Davis, M. J., Hanson, K. A., Clark, F., Fink, J. L., Zhang, F., Kasukawa, T., Kai, C., Kawai, J., Carninci, P., Hayashizaki, Y. and Teasdale, R. D. (2006). Differential use of signal peptides and membrane domains is a common occurrence in the protein output of transcriptional units. PLoS Genetics 2, 554-563.

  40. Bonifacino, J. S. and Traub, L. M. (2003). Signals for sorting of transmembrane proteins to endosomes and lysosomes. Annu. Rev. Biochem. 72, 395-447.

  41. Udenfriend, S. and Kodukula, K. (1995). How glycosylphosphatidylinositol-anchored membrane proteins are made. Annu. Rev. Biochem. 64, 563-591.

  42. Käll, L. and Sonnhammer, E. L. L. (2002). Reliability of transmembrane predictions in whole-genome data. FEBS Lett. 532, 415-418.

  43. Fink, J. L., Aturaliya, R. N., Davis, M. J., Zhang, F., Hanson, K., Teasdale, M. S., Kawai, J., Carninci, P., Hayashizaki, Y. and Teasdale, R. D. (2006). LOCATE: A mouse protein subcellular localization database. Nucleic Acids Res. 34, D213-D217.

  44. Finn, R. D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S. R., Sonnhammer, E. L. and Bateman, A. (2006). Pfam: clans, webtools and services. Nucleic Acids Res. 34, D247-D251.

  45. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J. P., Chothia, C. and Murzin, A. G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 32, D226-D229.