In Silico Biology 2, 0030 (2002); ©2002, Bioinformation Systems e.V.  
G C B ' 0 1

Improving fold recognition of protein threading by experimental distance constraints

Mario Albrecht*, Daniel Hanisch, Ralf Zimmer and Thomas Lengauer

Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
* corresponding author

Edited by E. Wingender; received November 30, 2001; accepted December 28, 2001; published April 03, 2002


We present a comprehensive analysis of methods for improving the fold recognition rate of the threading approach to protein structure prediction by the utilization of few additional distance constraints. The distance constraints between protein residues may be obtained by experiments such as mass spectrometry or NMR spectroscopy. We applied a post-filtering step with new scoring functions incorporating measures of constraint satisfaction to ranking lists of 123D threading alignments. The detailed analysis of the results on a small representative benchmark set show that the fold recognition rate can be improved significantly by up to 30% from about 54%-65% to 77%-84%, approaching the maximal attainable performance of 90% estimated by structural superposition alignments. This gain in performance adds about 10% to the recognition rate already achieved in our previous study with cross-link constraints only. Additional recent results on a larger benchmark set involving a confidence function for threading predictions also indicate notable improvements by our combined approach, which should be particularly valuable for rapid structure determination and validation of protein models.

Key words: protein threading, fold recognition, structure prediction, experimental data, distance constraints, cross-linking reagents, mass spectrometry, NOE restraints, NMR


The threading approach predicts the three-dimensional protein structure by aligning representative template protein structures with an amino acid sequence called the target sequence [Eisenberg, 1997; Finkelstein, 1997; Jones, 1997; Koehl and Levitt, 1999; Rost, Schneider and Sander, 1997; Sternberg et al., 1999; Zhang et al., 1997]. The alignment computation and evaluation usually gives a sequence-structure similarity score for each alignment as the result of applying a scoring function. According to the fold recognition protocol, the alignments obtained are then sorted by their score, which yields a ranking list of target-template alignments. The best-scoring alignment should identify the template structure and its corresponding fold class that is most compatible with the target sequence and thus constitute a meaningful model for the yet unknown structure of the target sequence.

However, the problem of developing an accurate scoring function is still unsolved for distantly related target and template proteins sharing the same fold. Especially, making the scoring scheme reflect diverse biological constraints seems to be a difficult task. Thus threading methods based solely on sequence information of the target protein often fail. To remedy the inherent shortcomings of the scoring function and, at the same time, to enhance the credibility of the proposed models, it becomes necessary to exploit more biological knowledge on the target in the prediction process. This additional information gives rise to constraints on the alignments and can be obtained from experimental data, for instance, distances between atoms of protein residues as measured by mass spectrometry (MS) [Chapman, 1996], which uses protein cross-linking reagents as molecular rulers (see Fig. 1), or by NOE (nuclear Overhauser effect) restraints of NMR spectroscopy [Branden and Tooze, 1998]. Their application should lead to a more accurate fold recognition and an improvement of the prediction and alignment quality.

Figure 1: Schematic view of the determination of experimental distance constraints by mass spectrometry.

This combined approach can be particularly beneficial for structural genomics projects that intend to determine many protein structures in short time. Experimental results that would give insufficient data for the complete structure determination if taken by themselves may already yield enough constraints to support the structure prediction by the threading method. The constraints applied to the set of predicted structures can help in selecting and validating the more plausible models of the target protein. This procedure is expected to accelerate the overall structure determination because the amount of data that is usually required can be reduced. Additionally, the threading models may render feasible the structure determination of proteins of larger size, for instance, by NMR spectroscopy, which is still limited to protein sizes up to about 30kD.

There have been few publications on threading with experimental constraints. Recently, Xu et al. described how to incorporate NOE restraints as distance constraints in their threading program PROSPECT in order to compute better protein models and to support the NMR structure determination [Xu et al., 2000]. To this end, they use a larger number of constraints than in our study as stated below. They also give some examples how the threading performance can be improved by the incorporation of partial structural information about the position of disulfide bonds and active sites in the target protein [Xu and Xu, 2000]. In contrast, Young et al. demonstrated the feasibility of rapid structure determination by the combination of intramolecular cross-links measured by mass spectrometry with threading results, using solely the target FGF-2 as model protein [Young et al., 2000]. Distance constraints derived from the measured cross-links are used to sort the set of computed threading models by the constraint error, the sum of the distance deviations between the target structure and the models.

In the following, we present a comprehensive analysis of methods for improving the fold recognition rates of the threading approach by the utilization of only few additional distance constraints from experiments. Here, in contrast to our previous paper focusing solely on simulated cross-link constraints [Hoffmann et al., 2001], we included sets of NOE constraints and used different scoring functions.


Protein benchmark set

To evaluate the performance of our methods, we used our standard template library of representative protein structures. This comprehensive benchmark set (available on request) taken from the Hobohm96-25 database [Hobohm and Sander, 1994] consists of 251 single-domain proteins, whose pairwise sequence identity is below 25%, which makes it hard to compute biologically reasonable alignments based solely on sequence information. The SCOP annotation [Murzin et al., 1995] of the proteins is used to divide the library into structural fold classes, which results in 11 classes containing at least 5 up to 11 members; the minimum of 5 members was chosen to allow the analysis of the ranking of members of the identical fold. These 11 folds contain altogether 81 target proteins (see Table 1), which are threaded against the complete template library. As described in [Thiele et al., 1999], this benchmark set represents a demanding task for threading methods because fewer than 50% of the residues of proteins of identical fold can be superposed within 3.0Å in most cases.

Table 1: Each target fold class shown with the protein count, the /-type class, the SCOP class name, the minimum and maximum sequence length of the target proteins in the class.

fold count type SCOP name min max

1 8 4-helical cytokines 119 172
2 7 Four-helical up-and-down bundle 106 154
3 5 - Cystine-knot cytokines 85 112
4 6 + Ferredoxin-like 81 143
5 5 / Flavodoxin-like 128 302
6 11 Globin-like 136 172
7 6 / /-Hydrolases 265 534
8 5 Lipocalins 131 176
9 7 OB-fold 98 280
10 11 / TIM /-barrel 228 490
11 10 Viral and capsid proteins 175 548

Distance constraint filter

The constraint filter that checks the violation of the target distance constraints requires that both target residues related by the constraint are aligned to template residues that are within the given distance (cf. Fig. 2).

Figure 2: Alignments of a target sequence s with selected distance constraints to two templates t* and to annotated with the corresponding distances in their structures. Assuming a distance tolerance = 2.5Å and a position tolerance = 0, all distance constraints in the left alignment are satisfied, whereas two distance constraints are violated in the right alignment.

Formally, let the target sequence be represented by s = s1, s2,..., sn and the sequence of the template structure by t = t1, t2,..., tm. We are given a set of distance constraints C = {(si, sj, dsisj)| si, sj s, dsisj +}, where dsi,sj is the atomic distance between the residues si and sj. We then regard some alignment A(s, t) of length z that maps the target segment s' = s'1, s'2,..., s'z onto the template segment t' = t'1, t'2,..., t'z. Every residue of s contained in s' is either mapped to some residue of t in t' or to a gap. A distance constraint c = (si, sj, dsisj) is said to be aligned to two template positions t'k and t'l if the two target residues si and sj are aligned to t'k and t'l, respectively. If t'k or t'l represents an alignment gap, the distance dt'kt'l is defined to be +.

The stringency of the distance constraint filter in terms of specificity and sensitivity can be adjusted by two tolerance parameters:
   1.Distance tolerance: the maximum deviation from the given distance value.
   2.Position tolerance: the number of sequence positions searched left and right of the aligned template position for matching distances.

The distance tolerance accounts mainly for the inaccuracy of the experimental measurements, but also for small local structural deviations between the target and the template. Because the alignments computed by the threading method do usually not reflect structural similarity perfectly and may be shifted with respect to the standard-of-truth alignments, we set the position tolerance to = 4, which gives a search interval of size 9 around the aligned residue positions. This particular parameter setting allows -helices to be misaligned by one turn, and -strands by one or two shifts of two consecutive residues. Generally, the position tolerance can be set to small values for high-quality alignments, while larger values should be chosen for alignments of lower quality.

Now we define that some distance constraint C C aligned to t'k and t'l is satisfied if the distance between the mapped residues in the template structure closely matches the measured distance constraint in the target structure:

Otherwise the constraint C is said to be violated.

Experimental distance constraints

We constructed three different sets of distance constraints for each target protein. Because experimental data is not yet available to us for a large number of proteins, we simulate the distances by simply reading them off resolved target structures in the Protein Data Bank PDB [Berman et al., 2000]. To store the data for each protein, we used the ProML specification language [Hanisch et al., 2001].

We always selected distances between C atoms in order to be able to compare the results of the three sets (for glycine residues, we computed pseudo C atoms). However, it would be readily possible to extract similar distance constraints from real experimental data that is measured between sulfur atoms or hydrogen atoms as it is often the case in MS or NMR, respectively.

For the first set Cm of distance constraints, we mimic the measurement process of intramolecular homobifunctional cross-linkers by MALDI-TOF mass spectra and select distances in the range of 8Å to 14Å between aspartate, glutamate, and lysine residues [Green et al., 2001; Hoffmann et al., 2001]; the distance tolerance was chosen as = 2.5Å, which corresponds to the maximum standard deviation observed in reagent manufacturer and simulated data [Green et al., 2001]. Additionally, we included distance constraints that are derived from disulfide bridges as kind of natural cross-linking reagents into our set. The number of distance constraints in Cm was set to 0.1 constraints per residue.

For the second and third constraint set Cn1 (Cn2), we randomly picked 0.1 (0.2) artificial long-range NOE restraints per residue of distance 4Å to 6Å with a minimum sequential separation of three adjacent residues. This corresponds to sparse data that is acquired early in the NMR structure determination process [Bowers et al., 2000; Skolnick et al., 1997; Standley et al., 1999]. The distance tolerance of NOE enhancements was chosen to be more stringent with = 1.0Å.

Scoring functions

We apply the distance constraint filter to the ranking list of target-template alignment scores, which is produced by the threading program. We investigated four different scoring functions that validate the structure models given by the target-template alignments when checking the experimental distance constraints for violation. Two functions simply count the number of satisfied and violated distance constraints, while the other two use the idea that distance constraints conserved among the members of a fold class should receive more weight in the summation.

For this purpose, we extracted structurally conserved cores of the folds in the template library by means of structural superpositions and multiple alignments. We assign a higher weight to a distance constraint if its both ends lie within a core region of the template. In practice, we assume a weight of five, which amounts to scoring the distance constraint inside a core region five times as high as a non-core constraint.

In the following, let na be the size of a set containing all experimental distance constraints of a target s and nf  the number of fulfilled distance constraints in some target-template alignment A(s, t), whose threading score is denoted by r. The weighted sum computed from nf  and na under the consideration of structural cores is given by wf  and wa, respectively. Then the four scoring terms are defined as follows:

The parameters fa and wfa of the linear combined terms srfa and srwfa were chosen such that the threading score r and the constraint score sfa and swfa contribute roughly equally to the combined value if about 50% of all potential constraints are fulfilled in the top ranked threading alignments. As further investigations into the optimal choice of these parameters revealed, this assumption usually maximizes fold recognition rate in case of our sets Cm and Cn1 with 0.1 constraints/residue. However, the exact choice is not that relevant as maximum performance is achieved over a relatively broad interval of parameter values around the chosen values fa = 1600 and wfa = 1600. Interestingly, maximum performance for the set Cn2 with 0.2 constraints/residue was reached with the double parameter values fa = 3200 and wfa = 3200. This may be due to the fact that, in contrast to the 0.1-sets, the same absolute amount of fulfilled constraints is already obtained if only about 25% of all possible constraints of the 0.2-set are satisfied.

Threading alignments

We used the threading program 123D [Alexandrov et al., 1996] and its recent extension 123D* [Sommer et al., 2001] incorporating profile methods to predict the structure of all 81 target proteins in our benchmark set by threading them against the template library. The result of a 123D run for each target is a list of all 250 template proteins (excluding the target protein itself), which is ranked by the target-template alignment scores. We employed three different parameter sets to compute global alignments of increasing quality:

P1: Standard parameter set with gap insertion and extension cost 20 and 0.8.

P2: Optimized parameter set as published in [Zien et al., 2000].

P3: Improved threading with frequency profiles.

In addition, we generated standard-of-truth alignments via the structural superpositions computed by the program SARF2 [Alexandrov, 1996].



According to the fold recognition protocol, all target proteins are classified and their predicted folds compared with the true folds in order to calculate a recognition rate. The fold of a target structure is correctly recognized if the best-scoring template in the ranking list (not containing the target protein) belongs to the target fold class. In the special case that more than one template reaches the identical best score, the target protein is counted as correctly recognized only if all best-scoring templates share the same target fold.

In order to improve the fold recognition rate, we apply the distance constraint filter to the 123D results in a post-filtering step. This procedure amounts to a re-evaluation and validation of the computed threading alignments with one of the four scoring functions sfa, srfa, swfa, and srwfa. In this way, the aligned templates are checked for the violation of distance constraints contained in one of the three sets Cm, Cn1, and Cn2.

Table 2 shows the fold recognition rate for each pair of scoring function and constraint set, and compares it with the performance of the 123D scoring function r for every threading parameter set P1, P2, and P3.

Table 2: Recognition rates for 123D alignments, dependent on the scoring functions r, sfa, srfa, swfa, srwfa, the distance constraint sets Cm, Cn1, Cn2, and the threading parameter sets P1, P2, P3.

  r sfa srfa swfa srwfa  
  123D constr. 123D+constr. core-constr. 123D+core-constr.  

54% 21% 47% 49% 58% P1

60% 23% 63% 54% 73% P2

65% 20% 56% 43% 68% P3

54% 46% 57% 64% 67% P1

60% 52% 72% 65% 79% P2

65% 57% 67% 75% 77% P3

54% 52% 57% 72% 73% P1

60% 62% 75% 73% 84% P2

65% 64% 73% 79% 80% P3

Apparently, the combined scoring functions srfa and srwfa outperform both the 123D scoring function r and the simpler functions sfa and swfa. The reason may be that the distance constraint score exploits an orthogonal measure (the fraction of satisfied constraints) to complement the threading score r. While the 123D score gives an empirical estimate of the alignment quality, number of gaps, sequence and low structural similarity between the target and template, the distance constraint score serves as a high-quality indicator of structural accuracy and thus of indispensable properties of a correct template structure.

Moreover, the scoring functions swfa and srwfa, which are based on structural cores, also lead to a substantial increase of the recognition rate when compared with the functions sfa and srfa. Besides, the data shows that the alignment quality, as given by the 123D recognition rate, is important for the improvement potential of the scoring function.

Apart from that, the comparison of the performance of the distance constraint sets Cn1 and Cn2 reveals the expected relationship that the recognition rate increases with more distance constraints used. At a closer look, it appears that the distance constraints Cn1 obtained by NMR do slightly better than the MS constraint set Cm. This observation may be explained by the fact that the maximum of the atomic distance distribution of a protein is often attained towards values larger than 8Å and thus the satisfaction of a typical NOE distance constraint of 4Å to 6Å is more significant for the correctness of the chosen template structure. Another, but only minor, effect (as revealed in further tests not detailed here) is the different accuracy of the MS and NMR measurements expressed by the distance tolerance parameter. A third reason might be that the distance constraints from cross linkers are biased towards certain amino acids, which is not the case with NOEs.

Figure 3 depicts that the increase of the recognition rate also depends on the fold class. It is particularly interesting that the targets with ferredoxin-like fold (number 4 in the figure) are much better recognized by the combined function with 79%, while the 123D scoring function usually fails completely, which may be due to the special composition of these proteins.

Figure 3: Recognition rates per fold class, using the 123D parameter set P2 and the NOE distance constraint set Cn1. Each triple of bars shows the performance of the 123D scoring function r (left) and of the constraint scoring functions swfa (middle) and srwfa (right), both based on structural cores. The 12th triple of bars depicts the overall recognition rates averaged over all 11 fold classes.

Comparison with SARF alignments

In order to determine an upper bound on the recognition rate that can be maximally achieved, we evaluated the standard-of-truth SARF alignments on our benchmark set by the scoring functions sfa and swfa. Additionally, we used the number of aligned residues as an artificial scoring function a to obtain an estimate of the best possible recognition rate. This is feasible because SARF aligns only residues that are structurally superimposable within a predefined threshold of 3.0Å. Thus, in general, the more residues are aligned, the closer a structural relationship between the target and template protein can be assumed.

The results are given in Table 3, and some of them are illustrated in Figure 4. As shown, we cannot hope for reaching a recognition rate much higher than 90% on our particular protein benchmark set because some structural similarities are stronger between than within different SCOP fold classes. In accordance with the observations described in the previous section, we notice again an increased fold recognition rate among the distance constraint sets Cm, Cn1, Cn2, and a better performance of the scoring function swfa based on structural cores in contrast to the simpler function sfa. It is also striking that the better alignment quality produced by the SARF program yields higher recognition rates.

Table 3: Recognition rates for SARF alignments, dependent on the scoring functions a, sfa, swfa, and the distance constraint sets Cm, Cn1, Cn2

    a sfa swfa
    aligned res. constr. core-constr.

Cm MS 0.1-set 90% 65% 85%
Cn1 NOE 0.1-set 90% 79% 90%
Cn2 NOE 0.2-set 90% 85% 95%

To sum up, it is remarkable that we approach the maximum fold recognition rate already by using only few distance constraints. The best fold recognition rates found in Table 3 lie between 68% to 73% for MS (in agreement with Hoffmann et al. [2001]) and between 77% to 84% for NMR constraints. This is a significant improvement of up to 30% as compared to the best 123D* threading method with 65% recognition rate.

Figure 4: Recognition rates per fold class, using SARF alignments and the NOE distance constraint set Cn1. Each triple of bars shows the performance of the scoring function a (left) and of the constraint scoring functions sfa (middle) and swfa (right), the latter based on structural cores. The 12th triple of bars depicts the overall recognition rates averaged over all 11 fold classes.

Larger benchmark set

Our findings in the analysis detailed above are supported by results on a larger benchmark set. The template library, which is also used as the set PDB40D in [Sommer et al., 2001], consists of 2808 SCOP domains [Murzin et al., 1995] in 540 fold classes. The maximal sequence identity is 40% as determined by the ASTRAL-Server [Brenner et al., 2000]. We selected all 1613 single-domain target protein chains covering 283 fold classes from the corresponding set PDB40C, in which each target chain has at least one different template partner of its fold class in the library. We computed global alignments between the target and template sequences with the parameter set P2 and applied the distance constraint post-filter for the three sets Cm, Cn1, and Cn2 to the threading results with the scoring functions sfa and srfa. We did not use scoring functions based on structural cores because the cores were not available yet for each fold class. Generally, the construction of structural cores for all fold classes of the template domains is regarded as difficult. However, we will address this problem in future work with the derivation of cores for each template domain from the multiple structural alignment of its superfamily members as a first approximation.

Additionally, we employed a confidence function that computes the raw score gap as described in [Sommer et al., 2001] for each threading prediction by the program 123D. This function gives a quantitative estimate on the reliability of the prediction of the target fold. The raw score gap is the difference between the best threading score of the top-scoring template fold to the score of the next high-scoring template fold in the ranking list. A confidence threshold is introduced to discriminate between reliable and non-reliable predictions and to reduce the number of necessary experiments for the re-evaluation method. This modified fold recognition protocol regards the fold of a target chain as reliably predicted if the computed raw score gap is at least as high as a given threshold. In this case, the ranking list of threading alignments for the target chain is not subject to post-filtering procedures. If the confidence value is below the threshold, the computed ranking list is re-evaluated by a constraint scoring function.

The fold recognition rates on the larger benchmark set are shown in Table 4. As already observed for the smaller benchmark set, the NOE distance constraint sets Cn1 and Cn2 perform better than the MS constraint set Cm and improve the recognition rate by about 5%. The application of the confidence function is important to increase the recognition rates substantially for the set Cm. However, we expect further improvements of the recognition rate by constraint scoring based on structural cores in future work. The confidence thresholds Tm, Tn1, and Tn2 assumed for the constraint sets Cm, Cn1, and Cn2, respectively, are chosen as the minimum of all thresholds whose application results in the same maximum recognition rate after post-filtering.

Table 4: Recognition rates for 123D alignments on a larger benchmark set with threading parameter set P2, using the scoring functions r, sfa, srfa and the confidence thresholds Tm = 139, Tn1 = 340, Tn2 = 340 for the distance constraint sets Cm, Cn1, Cn2, respectively.

    r sfa srfa srfa
         123D         constr.      123D+constr.   with confidence thresholds

Cm MS 0.1-set    71.73% 16.21% 71.61% 73.16%
Cn1 NOE 0.1-set    71.73% 35.47% 76.26% 76.63%
Cn2 NOE 0.2-set    71.73% 44.77% 76.88% 77.12%

As depicted in the Figures 5 and 6, a lower confidence threshold decreases the number nl of target chains that are lost, i.e., that were correctly recognized by threading but not any more by post-filtering. In contrast, the number nw of targets that are won, i.e., that can be solely recognized by post-filtering, is sustained on a very high level over a large interval of threshold values. As can be seen by the comparison of the plots of the overall number nd = nw - nl of additional recognized target chains, it is better to choose the threshold Tm lower than the thresholds Tn1 and Tn2 because post-filtering with MS distance constraints seems to work worse than post-filtering with NOE restraints. Thus, only prediction for targets with rather low confidence values should be included in the MS re-evaluation.

Figure 5: Numbers nl (red dotted line) and nw (blue dashed line) of lost and won target chains together with the overall number nd (black solid line) in dependance of the chosen confidence threshold Tm after post-filtering with distance constraint set Cm.

Figure 6: Numbers nl (red dotted line) and nw (blue dashed line) of lost and won target chains together with the overall number nd (black solid line) in dependance of the chosen confidence thresholds Tn1 and Tn2 after post-filtering with distance constraint sets Cn1 and Cn2, respectively.

In general, the confidence measure appears to be a very good indicator of the threading prediction quality despite its apparent simplicity. Thus it helps to avoid post-filtering procedures for target protein chains whose fold is already predicted reliably, and, at the same time, the number of additional time-consuming and costly experiments to collect distance constraints are reduced in practice. In particular, only 29.20% (40.55%) of all target chains are included into the re-evaluation for optimally chosen confidence threshold Tm = 139 (Tn1 = Tn2 = 340) with the highest overall numbers of additional recognized targets after post-filtering.


We demonstrated that a small number of experimental distance constraints already suffices to improve the fold recognition rate considerably. It is possible to collect the distance constraints readily by standard experiments such as mass spectrometry and NMR spectroscopy. The introduction of a confidence measure on the reliability of threading predictions helps to reduce the number of necessary experiments. Our observations should be particularly beneficial for the current rapid structure determination efforts on a genomic scale.

Even in the hard case of low sequence similarity or missing homologues, suitable template structures can be selected reliably by post-filtering and re-ranking the list of threading alignments. The better performance is achieved with combined scoring functions that consider few additional constraints and their location in structural cores. Our comprehensive analysis also shows that the application of NOE constraints improves on the recognition rate achieved by exploitation of MS cross-link constraints. In addition, we show that the success of the post-filtering approach depends significantly on the alignment quality. Thus accurate threading programs remain necessary for generating high-quality alignments.

In future work, we will apply real experimental data and investigate potential scoring functions further, including the introduction of some significance value for the satisfaction of distance constraints and the analysis of other combinations of confidence measures. We are also developing a new threading algorithm that allows to incorporate the distance constraints directly into the alignment computation, which should improve the alignment quality towards the structural standard and, at the same time, aid to reduce the overall computation time.


We thank Heinz-Theodor Mevissen and Ingolf Sommer for helpful discussions. Part of this research has been funded by the DFG under contract no. ZI 616/2.