In Silico Biology 7, 0037 (2007); ©2007, Bioinformation Systems e.V.  


BETTY: Prediction of β-strand type from sequence


Olav Zimmermann1*, Longhui Wang1 and Ulrich H. E. Hansmann1,2




1 John v. Neumann Institute for Computing, FZ Jülich, 52425 Jülich, Germany
2 Dept. of Physics, Michigan Technological University, Houghton, MI 49931-1295, USA



* Corresponding author

   Email: olav.zimmermann@fz-juelich.de





Edited by H. Michael; received April 18, 2007; revised July 01, 2007; accepted July 05, 2007; published August 28, 2007



Abstract

Most secondary structure prediction programs do not distinguish between parallel and antiparallel β-sheets. However, such knowledge would constrain the available topologies of a protein significantly, and therefore aid existing fold recognition algorithms. For this reason, we propose a technique which, in combination with existing secondary structure programs such as PSIPRED, allows one to distinguish between parallel and antiparallel β-sheets. We propose the use of a support vector machine (SVM) procedure, BETTY, to predict parallel and antiparallel sheets from sequence. We found that there is a strong signal difference in the sequence profiles which SVMs can efficiently extract. With strand type assignment accuracies of 90.7% and 83.3% for antiparallel and parallel strands, respectively, our method adds considerably to existing information on current 3-class secondary structure predictions.

BETTY has been implemented as an online service which academic researchers can access from our website http://www.fz-juelich.de/nic/cbb/service/service.php.

Keywords: SVM, support vector machine, structure prediction, secondary structure prediction, tertiary structure prediction, beta-sheets, beta-strands, parallel beta-sheets, antiparallel beta-sheets, long range constraints



Introduction

Recent decades have led to remarkable success in performing 3-class secondary structure predictions, i. e. classifying whether a residue belongs to an α-helix, β-sheet or to other structures [Rost, 2001]. Most programs, however, do not distinguish between parallel and antiparallel β-sheets. This is an important limitation as sheets, containing at least two β-strands, represent a supersecondary structure and shape the overall architecture of a protein by connecting distant parts of the polypeptide chain. In contrast to antiparallel sheets the strands in parallel β-sheets are not adjacent, i. e. they constrain a stretch of the polypeptide chain between the strands. Hence, knowledge of whether a sheet is parallel or antiparallel significantly reduces the number of feasible topologies, and therefore will improve existing secondary structure alignment [Przytycka et al., 1999], fold recognition [Liu et al., 2006] and constraint folding algorithms [Kolinski et al., 2005].

In this study we develop a support vector machine (SVM) based method that can distinguish between parallel and antiparallel β-sheets. This method, which we call BETTY (for BETa TYpe prediction), thus complements existing secondary structure predictions at the super-secondary and tertiary structure level. We analyze the performance of our approach in cross-validated experiments with both known β-sheet positions and those suggested by PSIPRED [Jones, 1999]. We also measure the performance of our approach as a function of the position of a residue within a strand, and test some simple heuristic methods to derive a strand type label from results obtained for individual residues. The article concludes with a discussion of possible applications, limitations and further developments.



Materials and methods


Definitions

The existence of β-sheets in proteins was predicted back in 1951 by Linus Pauling [Pauling and Corey, 1951], more than a decade before they were actually observed in the X-ray structure of lysozyme [Blake et al., 1965]. In 1970 the IUPAC-IUB defined protein secondary structure types, but the de facto standard is the implementation of the β-sheet definition in the DSSP software [Kabsch and Sander, 1983]. For the sake of comparability and simplicity we use this definition throughout this study. A parallel β-bridge is defined as an elementary H-bond pattern between two nonoverlapping stretches with residues i−1, i, i+1 and j−1, j, j+1, if either the H-bond pair [CO(i−1)···NH(j), CO(j)···NH(i+1)] or the pair [CO(j−1)···NH(i), CO(i)···NH(j+1)] exists. Correspondingly, the antiparallel β-bridge is defined by the existence of either the H-bond pair [CO(i)···NH(j), CO(j)···NH(i)] or the pair [CO(i−1)···NH(j+1), CO(j−1)···NH(i+1)]. Note that the DSSP definition of β-strands uses a generous definition of H-bonds and has no restrictions regarding dihedral angles, so that it also fits structural irregularities such as β-bulges.
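The two bridge definitions above can be written down as a small lookup over a set of H-bonds. The following is a minimal sketch, not the DSSP implementation: the function name `bridge_type`, the representation of an H-bond as a tuple `(a, b)` meaning CO(a)···NH(b), and the choice to report "parallel" first when both patterns match are assumptions made for illustration.

```python
from typing import Set, Tuple

# Assumed representation: (a, b) means the C=O of residue a is H-bonded
# to the N-H of residue b, i.e. CO(a)...NH(b).
HBond = Tuple[int, int]


def bridge_type(i: int, j: int, hbonds: Set[HBond]) -> str:
    """Classify the residue pair (i, j) as 'parallel', 'antiparallel' or 'none'."""
    # The stretches i-1..i+1 and j-1..j+1 must not overlap.
    if abs(i - j) < 3:
        return "none"

    def hb(a: int, b: int) -> bool:
        return (a, b) in hbonds

    parallel = (hb(i - 1, j) and hb(j, i + 1)) or (hb(j - 1, i) and hb(i, j + 1))
    antiparallel = (hb(i, j) and hb(j, i)) or (hb(i - 1, j + 1) and hb(j - 1, i + 1))

    # Illustrative tie-break only: report 'parallel' if both patterns match.
    if parallel:
        return "parallel"
    if antiparallel:
        return "antiparallel"
    return "none"
```

For example, the H-bond set `{(4, 20), (20, 6)}` makes residues 5 and 20 a parallel bridge, while `{(5, 20), (20, 5)}` makes them an antiparallel bridge.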


Datasets

We used a non-redundant dataset from the Protein Data Bank (PDB) [Berman et al., 2000] with pairwise identities of < 25% containing 2686 protein chains. Secondary structure was assigned from the experimentally determined tertiary structure by DSSP [Kabsch and Sander, 1983] although we are aware that in certain cases the DSSP definition can give rise to both over- and under-assignments of β-strands and sheets. DSSP defines eight secondary structure classes: H (α-helix), G (3/10-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend) and space (all others). Of 410,995 residues in our dataset, 89,497 were defined by DSSP as being located in β-strands (E). The residues were labeled as parallel and antiparallel in 17.15% and 73.16% of the cases respectively. 5.13% of the residues were indicated by DSSP to be connected at the same time to both a parallel and an antiparallel strand (mixed strands). 4.56% of the residues are not assigned, i. e. they have no label of a connected strand.


Methods

In order to classify residues into different β-strand types, we used support vector machines (SVMs), a supervised machine learning algorithm. Based on +1/−1 labeled training examples, SVMs find a representation of the high-dimensional hyperplane which separates the two classes with the largest margin. The resulting model contains the training examples that are closest to this hyperplane (= support vectors) together with their Lagrange multipliers and a bias term. This model of the separating hyperplane is then used to classify previously unseen examples. For a detailed introduction to the theory of SVMs see [Schölkopf and Smola, 2002]. In our experiments, we used the C-SVM algorithm as implemented in the LIBSVM library (http://www.csie.ntu.edu.tw/~cjlin/libsvm).
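To make the role of the support vectors, Lagrange multipliers and bias term concrete, the sketch below evaluates a trained SVM model on a new example. It is a generic textbook form, not LIBSVM code: the names `svm_decision`, `svm_predict` and `dot` are hypothetical, the coefficients are assumed to be the products of Lagrange multiplier and class label, and the bias is added with a plus sign (sign conventions differ between implementations).

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(x: Vector, z: Vector) -> float:
    """Linear kernel, used here only to keep the example easy to follow."""
    return sum(a * b for a, b in zip(x, z))


def svm_decision(x: Vector,
                 support_vectors: Sequence[Vector],
                 coeffs: Sequence[float],
                 bias: float,
                 kernel: Callable[[Vector, Vector], float] = dot) -> float:
    """f(x) = sum_k coeffs[k] * K(sv_k, x) + bias, with coeffs[k] = alpha_k * y_k."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + bias


def svm_predict(x: Vector, *model_args, **model_kwargs) -> int:
    """Return the class label +1 or -1 from the sign of the decision value."""
    return 1 if svm_decision(x, *model_args, **model_kwargs) >= 0 else -1


# Toy model with one support vector per class.
svs = [(1.0, 0.0), (-1.0, 0.0)]
alphas_times_labels = [1.0, -1.0]
print(svm_predict((2.0, 0.0), svs, alphas_times_labels, 0.0))   # class +1
print(svm_predict((-3.0, 1.0), svs, alphas_times_labels, 0.0))  # class -1
```

Only the support vectors enter the sum, which is why the model can discard all other training examples after training.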

Input data for SVMs are given as vectors of numerical values (features), with each vector labeled by its class, +1 or −1. We defined the positive class as overlapping sequence fragments where the central residue is in an antiparallel β-strand. A fragment length of 15 has been reported previously as optimal for SVM-based secondary structure prediction [Kim and Park, 2003]. The negative class was defined as those fragments where the central residue has a parallel, mixed or unassigned β-strand label. We only used those residues which are labeled as sheet (E) in DSSP and disregarded isolated β-bridges (B). To capture information from remote homologues we encoded each residue with its position specific scoring matrix (PSSM) row, which we obtained from 3 iterations of standard PSIBLAST runs against the non-redundant NCBI protein sequence database (NR) [Altschul et al., 1997]. The raw values x of the PSSM are scaled to [0,1] by a mapping function proposed in [Kim and Park, 2003]:

(1)
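The windowed PSSM encoding described above can be sketched as follows. Equation (1) is not reproduced in this text, so the scaling function used here, a clamped linear map `0.5 + 0.1 x`, is only a stand-in chosen for illustration and may differ from the mapping actually used. The function names, the 20-column PSSM row and the zero padding beyond the chain ends are likewise assumptions.

```python
from typing import Callable, List, Sequence

WINDOW = 15  # fragment length given in the text


def scale_clamped_linear(x: float) -> float:
    """ASSUMED stand-in for equation (1): map a raw PSSM score to [0, 1]."""
    return min(1.0, max(0.0, 0.5 + 0.1 * x))


def encode_window(pssm: Sequence[Sequence[float]],
                  center: int,
                  scale: Callable[[float], float] = scale_clamped_linear,
                  window: int = WINDOW) -> List[float]:
    """Concatenate the scaled PSSM rows of the window centred on `center`."""
    half = window // 2
    width = len(pssm[0])
    features: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(scale(v) for v in pssm[pos])
        else:
            # Assumption: positions beyond the chain ends are zero-padded.
            features.extend([0.0] * width)
    return features


# Toy PSSM: 30 residues, 20 amino acid columns, all raw scores 0.
toy_pssm = [[0.0] * 20 for _ in range(30)]
vector = encode_window(toy_pssm, center=10)
print(len(vector))  # 15 positions x 20 columns = 300 features
```

Each β-strand residue yields one such vector, labeled +1 if the central residue is antiparallel and −1 otherwise.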

We employed the Gaussian radial basis function kernel (RBF kernel) in our SVM and used a grid-based search to optimize the regularization parameter C and the hyperparameter gamma, which controls the width of the RBF kernel (gamma = 1/(2σ²) for a Gaussian with standard deviation σ).
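A grid search over C and gamma can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the study: the power-of-two grid ranges are a common default rather than values taken from the text, and `evaluate` is a hypothetical callback that would train the SVM with the given parameters and return a cross-validated score.

```python
import math
from itertools import product
from typing import Callable, Iterable, Sequence, Tuple


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def gamma_from_sigma(sigma: float) -> float:
    """Relation between gamma and the standard deviation of the Gaussian."""
    return 1.0 / (2.0 * sigma ** 2)


def grid_search(evaluate: Callable[[float, float], float],
                log2_c: Iterable[int] = range(-5, 16, 2),
                log2_gamma: Iterable[int] = range(-15, 4, 2)
                ) -> Tuple[float, float, float]:
    """Return (best_score, best_C, best_gamma) over an exponential grid."""
    best = None
    for lc, lg in product(list(log2_c), list(log2_gamma)):
        c, gamma = 2.0 ** lc, 2.0 ** lg
        score = evaluate(c, gamma)  # e.g. cross-validated accuracy or MCC
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


# Toy score surface with its maximum at C = 2^3, gamma = 2^-5.
def toy_score(c: float, gamma: float) -> float:
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


best_score, best_c, best_gamma = grid_search(toy_score)
print(best_c, best_gamma)
```

An exponential grid is a common choice because classifier performance usually varies smoothly on a logarithmic scale of both parameters.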

With the exception of Tab. 4, where positive and negative class are explicitly described, we use the definitions in Tab. 1 for prediction outcome types:


Table 1: Definition of prediction outcome types.
Prediction           Observation
                     +1 (antiparallel)      −1 (parallel)
+1 (antiparallel)    TP (True Positive)     FP (False Positive)
−1 (parallel)        FN (False Negative)    TN (True Negative)


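We assessed the classifier performance by the following measures: accuracy (Acc), positive predictive value (PPV), negative predictive value (NPV), sensitivity (Sens), specificity (Spec) and the Matthews correlation coefficient (MCC).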

Prediction performance was only evaluated where the true type of the β-sheet could be determined unequivocally. Hence, we did not evaluate those residues or strands in mixed sheets with both parallel and antiparallel neighbors. For residue-based measures, we also excluded those residues, that had no sheet label assigned in DSSP, even if they were within a strand of definite type.
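The performance measures reported in the tables (Acc, PPV, NPV, Sens, Spec, MCC) can be computed from the four outcome counts of Tab. 1. The sketch below uses the standard textbook definitions of these measures; the formulas themselves are not reproduced in this text, so the code is an assumption about how they were computed, and the function name and the counts in the example are made up for illustration.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Guarded division: undefined measures are reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def binary_measures(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary classification measures from TP, FP, FN and TN."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": _ratio(tp + tn, tp + fp + fn + tn),
        "PPV": _ratio(tp, tp + fp),    # fraction of predicted positives that are correct
        "NPV": _ratio(tn, tn + fn),    # fraction of predicted negatives that are correct
        "Sens": _ratio(tp, tp + fn),   # fraction of observed positives detected
        "Spec": _ratio(tn, tn + fp),   # fraction of observed negatives detected
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts, not data from this study.
measures = binary_measures(tp=90, fp=10, fn=5, tn=20)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With antiparallel as the positive class, Spec is the fraction of parallel residues that are recognized as parallel, which is why a shortage of parallel training examples shows up in this column.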



Results and discussion

We performed a range of tests to assess the performance of the classifier alone and its performance together with PSIPRED as a β-sheet prediction algorithm. Unless otherwise mentioned, all results from our SVM-classifier were 7-fold cross-validated.


β-Type classifier performance

We first checked the capability of the classifier to distinguish between residues with a parallel sheet label in DSSP and those with an antiparallel sheet label. Our dataset contained 89,497 β-sheet residues, of which 80,821 remained to be evaluated after filtering out mixed and unassigned residues. Tab. 2 shows the raw classifier performance using a range of metrics on true and predicted β-sheet residues.


Table 2: Classifier performance on β-sheet residues.
β-Residue dataset                 Acc      PPV    NPV    Sens   Spec   MCC
true β-residues                   88.2%    0.91   0.73   0.95   0.60   0.593
correctly predicted β-residues    88.7%    0.92   0.75   0.94   0.66   0.631
underpredicted β-residues         86.6%    0.89   0.61   0.96   0.34   0.386


We achieve an overall performance of 88% classification accuracy and a Matthews correlation coefficient of 0.59 (row 1). Due to the low abundance of parallel training examples the classifier tended to underpredict parallel β-sheet residues as can be seen from the comparatively low specificity, i. e. low probability to detect a parallel residue. On the subset of β-residues correctly identified by PSIPRED, the β-type classification algorithm performed slightly better (row 2). PSIPRED failed to detect 22% of the sheet residues and for those the β-type prediction was far less accurate (row 3).

We also checked how the strand type performance depends on the proximity to the DSSP-defined ends of the strands. As shown in Tab. 3, positions at the ends of β-strands are less accurately classified than more interior positions. This is in agreement with the behavior of secondary structure prediction algorithms and may reflect several different effects. One possible explanation is the arbitrary definition of the β-strand borders. For overlapping secondary structure assignments, DSSP gives preference to α-helical and single-strand assignments over β-sheet assignments. Another reason is that the ends could be less conserved and hence the alignments and scoring matrices which are used in both PSIPRED and BETTY are less accurate. The poorer performance in classifying terminal residues is reflected in a 7% lower classification accuracy for very short strands (length < 3 aa), which contain only near-terminal residues (data not shown). Low NPV and specificity indicate that the performance penalty towards both ends of the strand is more pronounced for the parallel class. As parallel strands are shorter on average, their fraction of residues near the termini is larger. A major negative effect on the β-type classifier performance was due to the underprediction of parallel residues at the N- and C-termini.


Table 3: β-type classification performance in dependency of position in strand.
Pos.       Acc      PPV    NPV    Sens   Spec   MCC
N-ter      86.6%    0.89   0.76   0.95   0.56   0.574
N-ter+1    88.5%    0.91   0.77   0.94   0.67   0.643
N-ter+2    89.7%    0.93   0.75   0.94   0.70   0.664
C-ter−2    89.2%    0.92   0.75   0.94   0.68   0.645
C-ter−1    87.1%    0.90   0.75   0.94   0.61   0.601
C-ter      85.3%    0.87   0.74   0.95   0.52   0.536


Performance in concert with secondary structure prediction algorithms (PSIPRED/BETTY)

In order to maximize specificity our algorithm implements a parallel vs. antiparallel classifier, instead of a parallel vs. rest and an antiparallel vs. rest classifier. It is therefore incapable of detecting the location of β-strands directly, but builds on the a priori assignment of β-sheets. In this study we have used PSIPRED for β-sheet prediction as it is among the best-performing secondary structure algorithms currently known.


Table 4: Prediction performance of PSIPRED (P) and PSIPRED/BETTY (P/B) on the full dataset.
Pos. class        Neg. class               Program   Acc      PPV    NPV    Sens   Spec   MCC
β                 not β                    P         90.2%    0.77   0.94   0.78   0.94   0.713
parallel-β        antiparallel-β, not β    P/B       97.0%    0.63   0.98   0.54   0.99   0.568
antiparallel-β    parallel-β, not β        P/B       90.0%    0.68   0.95   0.73   0.93   0.642


The results in Tab. 4 show that PSIPRED made correct β-sheet assignments for 90% of the residues in our test set and achieved an excellent MCC of 0.71 (row 1). However, PSIPRED underpredicts edge strands of β-sheets more than 3 times as often as center strands: in our test set, 8% of the center strands and 29% of the edge strands were underpredicted. A similar finding was reported earlier in [Siepen et al., 2003]. For assessing the overall performance of the PSIPRED/BETTY tandem in detecting and identifying sheets, there are three outcome classes: parallel-β, antiparallel-β and not-β. To be able to use the binary class assessment measures in this case we report the performance in Tab. 4 as two derived binary classifiers: parallel-β against the rest (row 2) and antiparallel-β against the rest (row 3). Their high specificity, NPV and accuracy mainly reflect the accurate detection of β-sheets by PSIPRED and their overall low abundance. Most importantly, the more fine-grained classification into 4 classes has only negligible impact on the performance, as indicated by the slightly lower PPV and MCC when compared to PSIPRED.

Multi-class performance can be inferred from the confusion matrix in Tab. 5. 88.8% of all residues are classified correctly into the 3 classes parallel-β, antiparallel-β and not-β by the combined algorithm. This is similar to the two-class accuracy rate achieved on known β-sheets. Due to the high correlation between PSIPRED errors and BETTY errors (see Tab. 2), the imperfect prediction of PSIPRED does not have a negative influence on the overall performance of the combined algorithm. Splitting the "not-β" class into helix and coil, we obtain a 4-class accuracy, Q4 (par-β, antipar-β, helix, coil), of 79.3%, which is only 1.8 percentage points less than the PSIPRED Q3 (helix, β-sheet, coil) of 81.1% but 9.5 percentage points lower than the Q3 (par-β, antipar-β, non-β) reported above. This demonstrates that the main source of error in the 4-class prediction is the confusion of helix and coil by PSIPRED and not the identification of β-sheets by PSIPRED, nor their subclassification by BETTY.


Table 5: Confusion matrix of PSIPRED/BETTY predictions.
Observation       Prediction
                  parallel-β   antiparallel-β   helix      coil
parallel-β        8299         4333             343        2371
antiparallel-β    2818         47,581           2390       12,686
mixed-β           1315         2874             89         317
unassigned-β      185          2173             179        1544
helix             108          1723             117,678    10,796
coil              1979         16,458           27,240     145,516
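The multi-class accuracies quoted in the text can be recomputed from the confusion matrix. The sketch below is an illustration: the counts are our reading of Tab. 5, the mixed-β and unassigned-β rows are left out because the text excludes them from evaluation, and the function and variable names are hypothetical.

```python
from typing import Dict, Tuple

PREDICTED = ("parallel", "antiparallel", "helix", "coil")

# Rows: observed class; columns: predicted class in the order of PREDICTED.
CONFUSION: Dict[str, Tuple[int, int, int, int]] = {
    "parallel": (8299, 4333, 343, 2371),
    "antiparallel": (2818, 47581, 2390, 12686),
    "helix": (108, 1723, 117678, 10796),
    "coil": (1979, 16458, 27240, 145516),
}


def accuracy(confusion: Dict[str, Tuple[int, ...]], merge: Dict[str, str]) -> float:
    """Fraction of residues whose observed and predicted class agree
    after the classes listed in `merge` have been renamed (merged)."""
    correct = 0
    total = 0
    for observed, row in confusion.items():
        for predicted, count in zip(PREDICTED, row):
            total += count
            if merge.get(observed, observed) == merge.get(predicted, predicted):
                correct += count
    return correct / total


q4 = accuracy(CONFUSION, {})
q3_beta_types = accuracy(CONFUSION, {"helix": "not-beta", "coil": "not-beta"})
q3_psipred = accuracy(CONFUSION, {"parallel": "beta", "antiparallel": "beta"})
print(f"Q4 = {q4:.1%}, Q3 (beta types) = {q3_beta_types:.1%}, Q3 (PSIPRED) = {q3_psipred:.1%}")
```

Run as written, this prints Q4 = 79.3%, Q3 (beta types) = 88.8% and Q3 (PSIPRED) = 81.1%, in agreement with the values given in the text.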


In order to perform the actual assignment of a type label to a strand suggested by PSIPRED we use a simple threshold: if more than 40% of the residues of a strand are classified as residues of a parallel strand, we assign the strand to the parallel class, otherwise to the antiparallel class. As we have no separate class for mixed strands we exclude them from the performance assessment. Tab. 6 shows that the threshold-based smoothing improves the prediction over the pure residue assignment. Additional constraints, for instance the occurrence of two adjacent parallel residues, do not improve the performance. The last experiment in Tab. 6 indicates that on the subset of the β-strands which is detected by PSIPRED the β-type classification performs better than it does for β-strands in general. This confirms the correlation between the predictability of a β-strand by PSIPRED and the predictability of its β-type by our algorithm, which is not surprising, since both algorithms use the same PSIBLAST scoring matrices as the basis for their calculations. Even with a simple threshold-based approach the performance is already very close to that of PSIPRED. This in turn implies that more elaborate strand assignment strategies (e. g. a second SVM layer) could not improve the result much further, because the performance of any combined classifier that uses PSIPRED as a prefilter is bounded from above by the performance of PSIPRED itself.
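The threshold rule for strand type assignment can be sketched as follows. This is a minimal illustration of the rule as stated in the text (more than 40% parallel residues gives a parallel strand); the function names, the use of an 'E' string as the secondary structure prediction and the toy input are assumptions.

```python
import re
from typing import List, Sequence, Tuple


def assign_strand_type(residue_labels: Sequence[int], threshold: float = 0.40) -> str:
    """Label a strand from per-residue predictions (+1 antiparallel, -1 parallel)."""
    if not residue_labels:
        raise ValueError("a strand must contain at least one residue")
    parallel_fraction = sum(1 for label in residue_labels if label == -1) / len(residue_labels)
    return "parallel" if parallel_fraction > threshold else "antiparallel"


def label_strands(secondary_structure: str,
                  residue_labels: Sequence[int]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) for every run of 'E' in the prediction string.

    `end` is exclusive; labels at non-strand positions are ignored.
    """
    strands = []
    for match in re.finditer(r"E+", secondary_structure):
        start, end = match.start(), match.end()
        strands.append((start, end, assign_strand_type(residue_labels[start:end])))
    return strands


# Toy example: two predicted strands with per-residue β-type predictions.
ss = "CEEEEECCEEEC"
labels = [0, -1, -1, -1, +1, +1, 0, 0, +1, +1, -1, 0]
print(label_strands(ss, labels))
```

In the toy example the first strand has 3 of 5 residues predicted as parallel and is labeled parallel, while the second has 1 of 3 and is labeled antiparallel.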


Table 6: Performance of strategies to correct residue classifications on the secondary structure level.
Algorithms                Performance w/o mixed sheets
Sec. Struc.   β-Type      Acc      PPV     NPV     Sens    Spec    MCC
DSSP          pp          82.0%    0.869   0.576   0.912   0.464   0.409
DSSP          pp_m        89.6%    0.916   0.799   0.957   0.658   0.663
DSSP          pp_p        83.4%    0.936   0.570   0.849   0.774   0.561
DSSP          pp_mpp      89.6%    0.907   0.833   0.968   0.616   0.657
PSI/DSSP      pp_m        90.4%    0.924   0.819   0.956   0.715   0.706
pp: no correction, pp_m: strands with > 40% predicted parallel residues are parallel, pp_p: strands with > 1 predicted parallel residue are parallel, pp_pp: strands with > 1 duplet of predicted parallel residues are parallel, PSI/DSSP: only the subset of residues correctly predicted by PSIPRED is evaluated. We consider assignments of a strand type as correct if 80% of the residues of a strand defined by DSSP are correctly predicted.


To illustrate the performance of our classifier and the improvement made by the threshold-based heuristics for strand type assignment we took the example of the structure of PDB file 1k92, chain A. We chose this example not because it presents an average case but because it contains all classes of mispredictions. Fig. 1a shows three larger β-sheets, of which one is parallel and the other two antiparallel. Clearly one of the antiparallel β-sheets is severely underpredicted by PSIPRED (yellow), while the other two sheets are correctly detected with the exception of some C-terminal residues and a tiny parallel strand. In the non-β parts there are a couple of loops which were overpredicted by PSIPRED as sheets (orange). Notably this did not occur for any helical residue. This confirms that the overall helix prediction capabilities of PSIPRED are better than those for β-sheets. Regarding the β-type classification of those parts which have been correctly identified by PSIPRED, there is only one misclassified antiparallel residue but many misclassified parallel residues. In particular there is one short parallel strand where the majority of residues were predicted as antiparallel and one long parallel edge strand with about half the residues misclassified (red). These two strands illustrate the limitations of our simple heuristic method to assign strand type labels. While the short strand remains incorrectly assigned on the strand level, the fraction of correctly classified residues on the long parallel strand is above the threshold of 40% and hence the strand is correctly labeled as parallel (Fig. 1b).


Figure 1: PDB file 1k92 (a) with mapped β-strand residue predictions and (b) with threshold-based corrections. Legend: correct predictions: green – parallel, blue – antiparallel, white – no strand; wrong predictions: red – missed parallel, purple – missed antiparallel; PSIPRED errors: orange – strand overprediction, yellow – strand underprediction.



Conclusion and outlook

We have presented a straightforward SVM based algorithm to distinguish parallel and antiparallel strands from the amino acid sequence. The good performance of the algorithm indicates that there is a strong signal difference in the sequence profiles of parallel and antiparallel protein sequences which SVMs can efficiently extract.

With strand type assignment accuracies of 90.7% and 83.3% for antiparallel and parallel strands, respectively, our method adds a considerable amount of information to current 3-class secondary structure predictions, which are used here as a prefilter. The information on parallel sheets in particular reduces the number of feasible topologies significantly. In contrast to antiparallel sheets, parallel sheets are composed of strands which are not immediately adjacent, i. e. they constrain a stretch of the polypeptide chain between the strands. For most applications the parallel/antiparallel attribute can be treated as an augmented secondary structure alphabet. Consequently, our method improves existing secondary structure alignment and fold recognition algorithms. It also yields additional input for methods aiming to derive the full β-sheet topology of a protein from strand-strand interaction scoring [Cheng and Baldi, 2005]. Another possible application is constraint folding, where our approach gives non-local constraints that help to reduce the search space [Kolinski et al., 2005].

Similar in concept to our dihedral prediction algorithm, DHPRED [Zimmermann and Hansmann, 2006], which provides information on coil regions in particular, BETTY represents another building block for a bottom-up approach to 3D structure prediction. In the future, we plan to merge DHPRED and BETTY, thus augmenting the PSIPRED prefilter with detailed dihedral angle ranges and improving the strand assignment by employing the iterative 2-layer SVM procedure implemented in DHPRED. We will also incorporate a continuous probability score for each strand and each residue within a strand in BETTY, to provide more fine-grained constraints and handle mixed strands. Finally, we will use BETTY in concert with other methods as part of an automated 3D structure prediction pipeline under development.



Acknowledgement

This work is supported in part by a research grant (GM62838) of the National Institutes of Health (USA). The computations were performed on computers at the John von Neumann Institute for Computing in Jülich, Germany.




References