In Silico Biology 4, 0012 (2004); ©2004, Bioinformation Systems e.V.
Department of Cell Biology and Biophysics, Faculty of Biology,
University of Athens, Panepistimiopolis, Athens 157 01, Greece
Phone: +30-210-7274 931
Fax: +30-210-7274 742
* corresponding author
Email: shamodr@cc.uoa.gr
Edited by E. Wingender; received October 08, 2003; accepted December 31, 2003; published January 05, 2004
waveTM is a web tool for the prediction of transmembrane segments in α-helical membrane proteins. Prediction is performed by a dynamic programming algorithm on wavelet-denoised 'hydropathy' signals. Users submit a protein sequence and receive the results interactively. Topology prediction can also be obtained in conjunction with the algorithm OrienTM. A web server that implements the waveTM algorithm is freely available at http://bioinformatics.biol.uoa.gr/waveTM.
Key words: α-helical transmembrane proteins, prediction, discrete wavelet transform, hydropathy signal, denoising, threshold
A plethora of hypothetical protein sequences awaiting characterization is presently available from genome sequencing projects. However, only a limited number of membrane proteins have a determined structure, since crystals of membrane proteins are difficult to grow [Rost, 2003]. Because membrane proteins are abundant in the proteome and functionally important, mainly in translocation and signalling, the prediction of membrane-spanning segments in amino acid sequences has been an important field of research for over two decades [Chen and Rost, 2002]. Consequently, a variety of algorithms have been proposed for transmembrane segment prediction of α-helical proteins [Chen and Rost, 2002].
Recently, both types of wavelet transform, continuous (CWT) and discrete (DWT), have shown promise in bioinformatics. The CWT allows a one-dimensional signal to be viewed in a more discriminative two-dimensional time-scale representation (scalogram). A rough classification of proteins based on the study of three levels of such scalograms of hydropathy data has been proposed [Mandell et al., 1997]. The DWT has been applied to hydrophobicity signals in order to predict hydrophobic cores in proteins [Hirakawa et al., 1999]. Protein sequence similarity has also been studied using the DWT of a signal associated with the average energy states of all valence electrons of each amino acid [de Trad et al., 2002]. The detection of repeats of particular secondary or supersecondary structural units is another application of continuous wavelet transforms [Murray et al., 2002]. A non-parametric method based on a wavelet data-dependent threshold technique for change-point analysis has been applied to predict transmembrane helices in membrane proteins [Liò and Vannucci, 2000].
In this work we present the waveTM server, freely available on the Web, which uses wavelet-based denoising of 'hydropathy' signals to determine membrane-spanning segments in amino acid sequences of α-helical membrane proteins.
A wavelet is a waveform that decays quickly and has an average value of zero [Daubechies, 1988]. Wavelet analysis is the breaking up of a signal into approximating functions (shifted and dilated versions of the wavelet) contained in finite domains [Daubechies, 1988]. There are two types of wavelet transform, discrete (DWT) and continuous (CWT) [Mallat, 1989]. The CWT is calculated by continuously shifting a continuously scalable wavelet over the signal [Daubechies, 1988]. In the DWT a subset of scales and positions (based on powers of two) is chosen, at which the correlation between the signal and the shifted and dilated waveforms is calculated. Consequently, the signal is decomposed into several groups of coefficients, each containing signal features corresponding to a range of frequencies [Daubechies, 1988]. Small scales refer to compressed wavelets, characterized by rapid variations and appropriate for extracting high-frequency features of the signal. Conversely, large scales capture low-frequency, coarse features of the signal. An important attribute of wavelet methods is that, due to the limited duration of every wavelet, local variations of the signal are better extracted and information on the location of these local features is retained in the constituent waveforms [Graps, 1995].
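As an illustration of the decomposition described above, the following minimal Python sketch computes one level of a DWT and its inverse. It uses the Haar wavelet only because its two-tap filters are easy to read; this is not the Daubechies least asymmetric wavelet used by waveTM, and the function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT of an even-length signal.

    Returns (approximation, detail): the approximation holds the coarse,
    low-frequency content and the detail holds the fine, high-frequency content.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = []
    detail = []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        approx.append((a + b) / SQRT2)  # scaled local average
        detail.append((a - b) / SQRT2)  # scaled local difference
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt: rebuild the signal from its two coefficient groups."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


if __name__ == "__main__":
    x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    approx, detail = haar_dwt(x)
    print(approx)
    print(detail)
    print(haar_idwt(approx, detail))
```

Applying `haar_dwt` again to the approximation coefficients yields the next, coarser level, which is how the groups of coefficients at successive scales arise.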
In our approach, a sliding window of 20 residues is used to calculate an average residue hydrophobicity signal along a protein sequence. The 'hydrophobicity' scale is the one originally used by Pasquier et al., 1999, which is derived from the transmembrane propensity of each amino acid as calculated from the entire SwissProt database, release 35 [Pasquier et al., 1999]. The window slides across the sequence and at each position an average hydrophobicity value for the twenty residues is calculated and assigned to the middle residue. For the first and the last 19 residues of the sequence the average is calculated from smaller windows. The length of the hydrophobicity signal produced across the protein sequence is therefore equal to the number of residues of the protein.
The DWT of this "noisy" hydrophobicity signal produces a number of "noisy" wavelet coefficients. The waveform used for the DWT is the Daubechies least asymmetric wavelet, using filter number 4, DaubLeAssym 4 [Daubechies, 1992]. In order to reduce the noise in the hydrophobicity signal, these wavelet coefficients are altered according to the SureShrink procedure [Donoho and Johnstone, 1995]. SureShrink is an adaptive thresholding method, which means that coefficients are treated in a level-by-level fashion. At each level, if the wavelet representation of that level is judged not to be sparse, a threshold that minimizes Stein's unbiased risk estimate (SURE) is applied; otherwise a universal-type threshold is used [Ogden and Parzen, 1996]. In waveTM, SureShrink was selected because it shows the lowest mean square error (MSE) among the conventional wavelet shrinkage denoising methods, although its output is not visually appealing. It also performs well in situations of sparsity of wavelet coefficients [Ogden and Parzen, 1996]. SureShrink can be used with two types of thresholding, soft or hard [Donoho and Johnstone, 1995]; in waveTM we use hard thresholding. Hard thresholding sets to zero every wavelet coefficient whose absolute value is below the level-dependent threshold. Subsequently, a denoised signal of average residue hydrophobicity is reconstructed from the thresholded coefficients. All the above calculations were made using the WaveThresh III package [Nason and Silverman, 1994], which works within the statistical language R.
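The construction of the averaged signal and the two thresholding rules can be sketched in Python as follows. This is a minimal illustration and not the WaveThresh code used by the server. It assumes that the 20-residue window assigned to position i spans residues i-10 to i+9 and is truncated at the sequence ends; `TOY_SCALE` is a hypothetical two-letter scale standing in for the propensity scale of Pasquier et al., 1999, whose values are not reproduced here. The threshold is supplied by the caller, since the SureShrink selection of the threshold is not implemented.

```python
import math

# Hypothetical scale for illustration only (not the Pasquier et al. values).
TOY_SCALE = {"L": 1.0, "K": -1.0}


def windowed_hydropathy(sequence, scale, window=20):
    """Average hydrophobicity per residue over a centred sliding window.

    The window is truncated at the sequence ends, so the output has exactly
    one value per residue.
    """
    half = window // 2
    values = [scale[aa] for aa in sequence]
    n = len(values)
    signal = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        chunk = values[lo:hi]
        signal.append(sum(chunk) / len(chunk))
    return signal


def hard_threshold(coeffs, t):
    """Set to zero every coefficient whose absolute value is below t."""
    return [c if abs(c) >= t else 0.0 for c in coeffs]


def soft_threshold(coeffs, t):
    """Zero small coefficients and shrink the remaining ones towards zero by t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


if __name__ == "__main__":
    seq = "KKKKK" + "L" * 20 + "KKKKK"
    signal = windowed_hydropathy(seq, TOY_SCALE)
    print(len(signal) == len(seq))
    print(hard_threshold([0.5, -2.0, 1.0], 1.0))
    print(soft_threshold([0.5, -2.0, 1.0], 1.0))
```

Hard thresholding leaves the surviving coefficients unchanged, whereas soft thresholding also reduces their magnitude, which is the practical difference between the two options mentioned above.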
This denoised hydrophobicity signal can be saved and viewed by the users of the waveTM algorithm as a PostScript file. Although peaks in the denoised hydrophobicity signal can by themselves be considered predictions of membrane-spanning segments, an algorithm based on dynamic programming [Jones et al., 1994] is incorporated to optimize the prediction of their number and location. Briefly, the dynamic programming algorithm produces models of the number, the length and the location of membrane-spanning segments. The model with the optimal score is selected among those satisfying the determined constraints. As hydrophobic residues were often observed at both sides of the predicted membrane-spanning segments, it was tested whether the inclusion of these residues in the predicted membrane-spanning segments would improve the per residue accuracy of the prediction. Since this test gave positive results, the end points of the predicted segments are extended to include any flanking hydrophobic residues. After this modification, results are presented to the user.
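A dynamic programming step of this general kind, followed by the extension of the segment ends, can be sketched in Python as follows. The sketch is not the algorithm of Jones et al., 1994, as used by waveTM: the scoring function (the sum of signal minus a cutoff over the segment), the length limits, the minimum loop length and the set of residues treated as hydrophobic are all assumptions made for illustration.

```python
def best_segments(signal, cutoff, min_len=15, max_len=30, min_gap=2):
    """Choose non-overlapping segments that maximize the total score.

    Each residue in a segment scores (signal - cutoff). Segments are
    half-open (start, end) index pairs, at least min_gap residues apart.
    """
    n = len(signal)
    prefix = [0.0]
    for v in signal:
        prefix.append(prefix[-1] + (v - cutoff))
    best = [0.0] * (n + 1)     # best[i]: optimal score using residues [0, i)
    choice = [None] * (n + 1)  # segment ending at i in that optimum, if any
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option 1: residue i-1 lies outside any segment
        for length in range(min_len, max_len + 1):
            start = i - length
            if start < 0:
                break
            prev = max(0, start - min_gap)
            score = best[prev] + prefix[i] - prefix[start]
            if score > best[i]:  # option 2: a segment ends exactly at i
                best[i] = score
                choice[i] = (start, i)
    segments = []
    i = n
    while i > 0:  # trace back the chosen segments
        if choice[i] is None:
            i -= 1
        else:
            start, end = choice[i]
            segments.append((start, end))
            i = max(0, start - min_gap)
    segments.reverse()
    return segments


def extend_flanks(segments, sequence, hydrophobic="AILMFVWC"):
    """Extend each segment over directly flanking hydrophobic residues.

    The hydrophobic residue set is an assumption; neighbouring segments
    are not checked for overlap in this sketch.
    """
    extended = []
    for start, end in segments:
        while start > 0 and sequence[start - 1] in hydrophobic:
            start -= 1
        while end < len(sequence) and sequence[end] in hydrophobic:
            end += 1
        extended.append((start, end))
    return extended


if __name__ == "__main__":
    toy = [0.0] * 5 + [1.0] * 6 + [0.0] * 5 + [1.0] * 6 + [0.0] * 5
    print(best_segments(toy, cutoff=0.5, min_len=4, max_len=8))
    print(extend_flanks([(2, 4)], "KLAAVK"))
```

The length limits and the minimum gap play the role of the constraints that candidate models must satisfy; the traceback recovers the model with the optimal score.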
The results obtained can be processed further by running the OrienTM algorithm [Liakopoulos et al., 2001] to predict the topology of the transmembrane segments of the protein. The software runs at http://bioinformatics.biol.uoa.gr/waveTM and is freely available through the Internet. It is installed on a Silicon Graphics O2 machine with a 300 MHz R5000 processor.
The prediction accuracy of the algorithm was tested on several data sets: the set of 101 non-homologous transmembrane proteins from Pasquier et al., 1999; the high-resolution (36 transmembrane proteins) and low-resolution (165 transmembrane proteins) sets from Chen et al., 2002; and the sets A, AB, ABC and non-redundant from Möller et al., 2000. The results obtained from waveTM on these sets were compared with those of PRED-TMR [Pasquier et al., 1999] and of the popular methods HMMTOP [Tusnády and Simon, 2001] and TMHMM [Krogh et al., 2001], which are used for the prediction of transmembrane α-helical segments. The results are shown in Table 1 and are also available at the waveTM server site. Table 1 shows that the prediction accuracy of waveTM is comparable, though not superior, to that of several other prediction algorithms for membrane α-helices that do not use evolutionary information. Users of the software should be aware that signal peptides are sometimes predicted as transmembrane segments. However, this is a common problem in algorithms predicting α-helices in transmembrane proteins and cannot easily be overcome.
Table 1: Results obtained running waveTM, PRED-TMR, HMMTOP 2.0 and TMHMM 2.0 on various test sets.
| Test set | Number of proteins | Method used | Qa | Ca | Qp | TP | FP | FN |
| 101 PRED-TMR | 101 | waveTM | 0.90 | 0.77 | 0.95 | 409 | 38 | 26 |
| | | PRED-TMR | 0.88 | 0.79 | 0.95 | 406 | 16 | 29 |
| | | HMMTOP 2.0 | 0.91 | 0.82 | 0.97 | 433 | 23 | 2 |
| | | TMHMM 2.0 | 0.92 | 0.83 | 0.98 | 426 | 14 | 9 |
| ROST (HIGH-RESOLUTION) | 36 | waveTM | 0.73 | 0.58 | 0.92 | 100 | 2 | 23 |
| | | PRED-TMR | 0.77 | 0.58 | 0.92 | 106 | 2 | 17 |
| | | HMMTOP 2.0 | 0.82 | 0.67 | 0.95 | 114 | 5 | 9 |
| | | TMHMM 2.0 | 0.83 | 0.68 | 0.92 | 110 | 7 | 13 |
| ROST (LOW-RESOLUTION) | 165 | waveTM | 0.90 | 0.75 | 0.94 | 663 | 63 | 61 |
| | | PRED-TMR | 0.86 | 0.77 | 0.93 | 647 | 33 | 77 |
| | | HMMTOP 2.0 | 0.90 | 0.80 | 0.96 | 700 | 42 | 24 |
| | | TMHMM 2.0 | 0.89 | 0.79 | 0.95 | 670 | 28 | 54 |
| MÖLLER (A) | 37 | waveTM | 0.82 | 0.64 | 0.95 | 114 | 7 | 5 |
| | | PRED-TMR | 0.80 | 0.65 | 0.94 | 109 | 4 | 10 |
| | | HMMTOP 2.0 | 0.82 | 0.67 | 0.96 | 115 | 8 | 4 |
| | | TMHMM 2.0 | 0.83 | 0.68 | 0.92 | 110 | 8 | 9 |
| MÖLLER (A, B) | 60 | waveTM | 0.85 | 0.68 | 0.95 | 250 | 18 | 23 |
| | | PRED-TMR | 0.83 | 0.69 | 0.94 | 248 | 15 | 25 |
| | | HMMTOP 2.0 | 0.85 | 0.70 | 0.95 | 266 | 21 | 7 |
| | | TMHMM 2.0 | 0.86 | 0.72 | 0.94 | 253 | 13 | 20 |
| MÖLLER (A, B, C) | 188 | waveTM | 0.88 | 0.73 | 0.94 | 790 | 69 | 74 |
| | | PRED-TMR | 0.85 | 0.73 | 0.93 | 776 | 47 | 88 |
| | | HMMTOP 2.0 | 0.88 | 0.76 | 0.96 | 847 | 66 | 17 |
| | | TMHMM 2.0 | 0.89 | 0.77 | 0.95 | 811 | 39 | 53 |
| MÖLLER (NON-REDUNDANT) | 148 | waveTM | 0.88 | 0.72 | 0.95 | 651 | 55 | 60 |
| | | PRED-TMR | 0.85 | 0.72 | 0.93 | 641 | 41 | 70 |
| | | HMMTOP 2.0 | 0.88 | 0.75 | 0.96 | 695 | 51 | 16 |
| | | TMHMM 2.0 | 0.88 | 0.75 | 0.94 | 654 | 36 | 57 |
101 PRED-TMR: The test set of the 101 non-homologous transmembrane proteins used by Pasquier et al., 1999
ROST: The set of transmembrane proteins described by Chen et al., 2002
MÖLLER: The set of transmembrane proteins described by Möller et al., 2000
Qa: Per residue accuracy (for definition see Pasquier et al., 1999)
Ca: Correlation coefficient (for definition see Pasquier et al., 1999)
Qp: Per segment accuracy (for definition see Pasquier et al., 1999)
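Assuming that TP, FP and FN in Table 1 count correctly predicted, over-predicted and missed transmembrane segments respectively, segment-level sensitivity and precision can be derived from them as in the Python sketch below. These are standard ratios and are not the Qp measure, whose definition is given in Pasquier et al., 1999; the function name is illustrative.

```python
def segment_rates(tp, fp, fn):
    """Return (sensitivity, precision) from segment counts.

    sensitivity = TP / (TP + FN): fraction of observed segments that were found.
    precision   = TP / (TP + FP): fraction of predicted segments that are correct.
    """
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return sensitivity, precision


if __name__ == "__main__":
    # Counts for the 101-protein set, taken from Table 1.
    rows = {
        "waveTM": (409, 38, 26),
        "PRED-TMR": (406, 16, 29),
        "HMMTOP 2.0": (433, 23, 2),
        "TMHMM 2.0": (426, 14, 9),
    }
    for method, (tp, fp, fn) in rows.items():
        sens, prec = segment_rates(tp, fp, fn)
        print(f"{method}: sensitivity={sens:.3f} precision={prec:.3f}")
```

For every method on this set TP + FN equals 435, the number of observed segments under the stated assumption, so the sensitivities are directly comparable across methods.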
waveTM differs from the algorithm proposed by Liò and Vannucci, 2000, in several important respects: (a) it utilizes a different thresholding technique, (b) it uses a different hydrophobicity scale, (c) the hydrophobicity signal which is 'denoised' is an average hydrophobicity signal as described in the 'Methods' section, (d) a dynamic programming algorithm produces models of the number, the length and the location of the membrane-spanning segments in an objective way, and (e) it is freely available through the Internet. It would be useful to compare the results of waveTM with those produced by the algorithm of Liò and Vannucci, 2000. However, this cannot be done, since that algorithm does not have a server freely available on the Internet.
The potential of waveTM lies in the fact that it is the only interactive Internet server world-wide for transmembrane α-helical segment prediction that uses wavelet denoising of hydropathy signals and dynamic programming, with accuracy similar to that of other popular and well-known tools. It is freely available to all users at the address mentioned in the Materials and Methods section, so users can experiment freely with it. Its main advantage is its interactivity. Users can easily compare its output with that of other existing methods by supplying the output of any other method as the observed structure. Furthermore, it can easily be incorporated into joint prediction schemes, like CoPreThi [Promponas et al., 1999], which often produce higher quality results than individual prediction schemes [Chen and Rost, 2002].
We thank the University of Athens for financial support.