In Silico Biology 4, 0031 (2004); ©2004, Bioinformation Systems e.V.
1 Equipe de Bioinformatique Génomique et Moléculaire (EBGM),
INSERM E03-46, Université Denis Diderot - Paris 7, case 7113,
2, place Jussieu, 75251 Paris, France.
Phone: +33-1-44 27 77 31, Fax: +33-1-43 26 38 30
2 Université Pierre & Marie Curie - Paris 6, 75005 Paris, France.
3 CEA, 17 avenue des Martyrs, 38054 Grenoble, France.
Email: alexandre.debrevern@ebgm.jussieu.fr
*corresponding author
Edited by E. Wingender; received April 13, 2004; revised and accepted May 26, 2004; published May 29, 2004
A statistical analysis of the PDB structures has led us to define a new set of small 3D structural prototypes called Protein Blocks (PBs). This structural alphabet comprises 16 PBs, each defined by the (Φ, Ψ) dihedral angles of 5 consecutive residues. The amino acid distributions observed in sequence windows encompassing these PBs are used in a Bayesian approach to predict the local 3D structure of proteins from their sequences alone. LocPred is a software tool that lets users submit a protein sequence and obtain a prediction in terms of PBs. The prediction results are given both textually and graphically.
Key words: structure prediction, confidence index, Bayesian approach
A classical approach to simplifying 3D protein structures consists in describing the protein backbone in terms of secondary structures: repetitive α-helices and β-strands, with everything else called coil. The use of neural networks and homologous sequences has increased the secondary structure prediction rate to a value close to 80% [1-3]. However, even with such a rate, approximating the three-dimensional structure by only 3 states is very crude: 50% of the residues are assigned as "coil" even though they correspond to very different local structures.
To go further, various teams have proposed categorizing 3D structures through a structural alphabet, i.e. a set of small protein fragments frequently observed in a structural databank [4]. This structural description gives new insights into the 1D-3D relationship, revealing particular sequence specificities [5-9].
In a previous study we defined a structural alphabet composed of 16 average protein fragments, each 5 residues long, called Protein Blocks (PBs, see Figure 1) [6]. These PBs give a good 3D approximation of the local structures, with an average RMSD of 0.42 Å. They have also proved reliable for describing longer fragments [10-13]. The main structural characteristics of the Protein Blocks are as follows. PBs a to f may be related to the β-strand secondary structure: PB d corresponds to the more regular central part, PBs a, b and c to the N-caps, and PBs e and f to the C-caps. PBs k to p may be related to the α-helix secondary structure: PB m describes the central part of a right-handed helix, PBs k and l the N-caps, and PBs n to p the C-caps. Finally, PBs g to j are mainly associated with coil structures. A Bayesian approach based on the relationship between Protein Blocks and their amino acid propensities is used to perform a local structure prediction [6].
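The grouping of PBs described above can be written down as a small lookup table. The following is only an illustrative sketch: the one-letter E/H/C codes and the function name `pb_to_three_state` are assumptions made here for illustration, and the mapping is approximate, since the text only says that the PBs "may be related" to these secondary structures.

```python
# Approximate PB -> secondary-structure grouping, transcribed from the text.
# The E/H/C letters are an illustrative convention, not part of LocPred.
PB_CLASS = {}
for pb in "abcdef":
    PB_CLASS[pb] = "E"  # related to beta-strands (d is the regular core)
for pb in "ghij":
    PB_CLASS[pb] = "C"  # mainly associated with coil
for pb in "klmnop":
    PB_CLASS[pb] = "H"  # related to alpha-helices (m is the regular core)


def pb_to_three_state(pb_string):
    """Collapse a string of PB letters into a rough three-state string."""
    return "".join(PB_CLASS[pb] for pb in pb_string)


if __name__ == "__main__":
    print(pb_to_three_state("mmmnopacdddfklmm"))
```

Collapsing PBs this way discards exactly the detail the alphabet is meant to keep (caps versus cores, distinct coil conformations), so it is useful only for a quick comparison with three-state predictors.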
Thus, predicting the PB series from the protein sequence alone makes it possible to predict every region of the protein without ignoring the local conformations of the coil state. Moreover, it gives a precise description of the repetitive structures [13]. Bayesian prediction gives a lower prediction rate than more sophisticated methods such as Artificial Neural Networks [1-3]. Nevertheless, it makes it possible to analyze the role of each amino acid in the prediction and to compute an index that is directly correlated with the quality of the prediction (see Prediction confidence index section).
The purpose of this project was to develop a software tool named LocPred (Local structure Prediction) based on this alphabet. LocPred is written in Java and runs on many different platforms. The user can submit a protein sequence either in single-letter amino acid code or in Fasta format (Figure 2a).
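As an illustration of the two accepted input forms, here is a minimal sketch of a reader that takes either a bare one-letter sequence or a single Fasta record. The function name `read_sequence` and the choice to reject any character outside the 20 standard amino acids are assumptions; the text does not say how LocPred itself treats non-standard residues.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def read_sequence(text):
    """Return (header, sequence) from a bare sequence or one Fasta record.

    header is None when the input has no '>' line.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = None
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    # Join wrapped lines, drop inner spaces, normalise case.
    seq = "".join(lines).replace(" ", "").upper()
    unknown = sorted(set(seq) - VALID_RESIDUES)
    if unknown:
        # Assumption: unknown characters are an error rather than skipped.
        raise ValueError("unexpected characters in sequence: " + "".join(unknown))
    return header, seq


if __name__ == "__main__":
    print(read_sequence(">demo protein\nMKTAYIAK\nQRQISFVK"))
    print(read_sequence("mktayiakqrqisfvk"))
```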
The prediction is based on the observed distributions of the amino acids in sequence windows encompassing each PB. Three options are available:
(i) Bayesian prediction. Tested on more than 300 sequences from the Protein Data Bank, this approach gives an average prediction rate of 34.4%.
(ii) Sequence families approach. This approach was developed to optimize the sequence-structure relationship. For a given PB, the Bayesian approach relies on a single amino acid occurrence matrix. However, the same local fold, e.g. a PB, can be associated with different sequence clusters. So, using an optimization close to Kohonen's Self-Organizing Maps (SOM [14]), we have defined several new occurrence matrices for the most frequent PBs (for more details see [6]). They strengthen the sequence-structure relationship of these PBs. This clustering into sequence families improves the prediction rate to 40.7% on average.
(iii) New sequence families approach. We have recently improved the sequence families approach using a method related to simulated annealing. The prediction rate now reaches 48.7%.
The prediction score is computed along a sliding sequence window 15 residues long. For each sequence position, LocPred outputs the most probable PBs as well as the distribution of the probabilities associated with each PB (Figure 2b).
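The sliding-window Bayesian scoring can be sketched as follows. This is a minimal illustration, not LocPred's code: it assumes that the window positions contribute independently, that one occurrence table is used per PB (option (i)), and that positions without a full 15-residue window are simply skipped. The names `predict_pb_probabilities` and `toy_tables`, and the toy numbers, are hypothetical; real occurrence matrices would come from the structural databank.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PBS = "abcdefghijklmnop"
WINDOW = 15
HALF = WINDOW // 2


def predict_pb_probabilities(seq, occ, prior):
    """Posterior P(PB | window) for each position with a full window.

    occ[pb][j][aa] is P(aa at window offset j | pb); prior[pb] is P(pb).
    Returns one {pb: probability} dict per scored position.
    """
    rows = []
    for i in range(HALF, len(seq) - HALF):
        window = seq[i - HALF:i + HALF + 1]
        log_post = {}
        for pb in PBS:
            score = math.log(prior[pb])
            for j, aa in enumerate(window):
                score += math.log(occ[pb][j][aa])
            log_post[pb] = score
        # Normalise in log space to avoid underflow.
        top = max(log_post.values())
        weights = {pb: math.exp(v - top) for pb, v in log_post.items()}
        total = sum(weights.values())
        rows.append({pb: w / total for pb, w in weights.items()})
    return rows


def toy_tables():
    """Made-up tables: uniform, except PB 'm' favours Ala at the centre."""
    uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    occ = {pb: [dict(uniform) for _ in range(WINDOW)] for pb in PBS}
    for aa in AMINO_ACIDS:
        occ["m"][HALF][aa] = 0.5 if aa == "A" else 0.5 / 19
    prior = {pb: 1.0 / len(PBS) for pb in PBS}
    return occ, prior


if __name__ == "__main__":
    occ, prior = toy_tables()
    rows = predict_pb_probabilities("G" * 7 + "A" + "G" * 7, occ, prior)
    best = max(rows[0], key=rows[0].get)
    print(best, round(rows[0][best], 3))
```

Because each log-probability term belongs to one amino acid at one window offset, the contribution of every residue to the final score can be inspected directly, which is the interpretability advantage mentioned in the text.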
From this information, it is possible to define an entropy-based index called Neq (for equivalent Number of Protein Blocks), close to the one proposed in PSIPRED [15]. The Neq distinguishes strongly informative (Neq ~ 1) from weakly informative (Neq ~ 16) sequence regions. We have shown that a strong correlation exists between the Neq values and the PB prediction success at each position. Thus, Neq helps to separate regions that are likely to be well predicted from potentially misleading ones.
Users will want to know how reliable a prediction in terms of PBs is. We therefore use the average Neq value of the prediction and a linear regression model to compute the expected prediction rate for a protein (only available for the New sequence families approach). This estimate has a standard deviation of only 5%.
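A minimal sketch of such an index, assuming Neq is taken as the exponential of the Shannon entropy of the 16 PB probabilities at a position. The text states only that the index is entropy-based and ranges from about 1 to about 16; this form satisfies both limits, but the function name `neq` and the exact formula are assumptions here.

```python
import math


def neq(probabilities):
    """Equivalent number of PBs: exp of the Shannon entropy (natural log).

    Zero probabilities are skipped, following the convention 0 * ln(0) = 0.
    """
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return math.exp(entropy)


if __name__ == "__main__":
    certain = [1.0] + [0.0] * 15   # one PB takes all the probability
    flat = [1.0 / 16] * 16         # no preference among the 16 PBs
    print(neq(certain), neq(flat))
```

With this definition, a position where two PBs share the probability equally gives Neq = 2, so the value reads directly as "how many PBs are effectively in competition".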
In the same way, we have assessed the quality of the prediction at each position by taking into account the local Neq value and then proposed two distinct strategies. Both use a fixed prediction rate.
(i) The "global strategy": it consists in computing the optimal number of PBs at each position to ensure a given prediction rate. The number of selected PBs may therefore vary along the sequence. Figure 1c shows the results of the prediction for the protein-conjugating enzyme with the global strategy for a prediction rate of 65%. For instance, the first 7 residues are each associated with one single PB, the next two with 3 PBs.
(ii) The "local strategy": the protein sequence is predicted with a constant number of PBs per position (Figure 2c). This strategy identifies the regions that can be predicted with this prediction accuracy [6]. The corresponding PBs selected by each method can be downloaded.
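The difference between a variable and a constant number of PBs per position can be illustrated with a short sketch. The text does not describe how LocPred computes the optimal number of PBs, so the variable-count rule below substitutes a plain cumulative-probability cutoff as a stand-in; it is not the authors' procedure. The function names and the `coverage` parameter are hypothetical.

```python
def local_strategy(prob_rows, k):
    """Keep the k most probable PBs at every position (constant count)."""
    return [sorted(row, key=row.get, reverse=True)[:k] for row in prob_rows]


def global_strategy(prob_rows, coverage):
    """Keep a variable number of PBs per position.

    Stand-in rule (an assumption): add PBs in decreasing probability until
    their summed probability reaches `coverage`.
    """
    selected = []
    for row in prob_rows:
        chosen, total = [], 0.0
        for pb in sorted(row, key=row.get, reverse=True):
            chosen.append(pb)
            total += row[pb]
            if total >= coverage:
                break
        selected.append(chosen)
    return selected


if __name__ == "__main__":
    rows = [
        {"m": 0.7, "n": 0.2, "l": 0.1},    # confident position
        {"d": 0.4, "c": 0.35, "e": 0.25},  # ambiguous position
    ]
    print(global_strategy(rows, 0.65))
    print(local_strategy(rows, 2))
```

In this toy input the confident position keeps a single PB and the ambiguous one keeps two under the variable-count rule, while the constant-count rule returns two PBs for both.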
Online help is available at http://www.ebgm.jussieu.fr/~debrevern/LOCPRED/, together with the 3D structures of the PBs. These strategies are useful as a first step in an ab initio method [16] and could help to analyze and properly align sequences with low similarity. For homology modeling, when a 3D structure or a 3D model is available, a RasMol script [17] can be obtained to visualize the Neq variations along the structure. Similarly, the prediction can be compared with a 3D structure or model translated into Protein Blocks.
LocPred is freely available on the Internet at http://www.ebgm.jussieu.fr/~debrevern/LOCPRED and can also be installed locally (same URL). It runs over the World Wide Web in any Java-compatible web browser. The Java files are available at the same URL.
We would like to thank Estelle Calvez, Maxime Huvet, Laurent Fourrier and Aurélie Urbain for various tests and analyses, Joelle Hochez for data-processing support, and Patrick Fuchs and Anne-Claude Camproux for fruitful discussions.
This work was supported by a grant from the Ministère de l'Enseignement Supérieur et de la Recherche and from "Action Bioinformatique inter EPST" 2001-2002 (number 4B005F) and 2003-2004 ("Outil informatique intégré en Génomique Structurale. Vers une prédiction de la structure tridimensionnelle d'une protéine à partir de sa séquence." and "Plateforme de bioinformatique structurale - RPBS"). AdB was supported by a grant from the Fondation de la Recherche Médicale. CB and RG have grants from the Ministère de la Recherche. HV has a grant from the Commissariat à l'Énergie Atomique (CEA). CE and SH are Professors at the University Paris 7 - Denis-Diderot, Paris. AdB is a researcher at the French Institute for Health and Medical Research (INSERM).