|In Silico Biology 2, 0044 (2002); ©2002, Bioinformation Systems e.V.|
Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 52900, Israel.
Phone: +972-3-531-8124, Fax: +972-3-535-1824
* corresponding author
Edited by E. Wingender; received May 05, 2002; revised and accepted July 24, 2002; published August 09, 2002
We present a system for predicting protein-protein modifications, and demonstrate its usefulness in the field of signal transduction research. Signal transduction is one of the most important areas of investigation in biological research. One of the major mechanisms frequently employed by cells to regulate signal transduction processes involves protein phosphorylation by various kinases. As many as 1000 protein kinases and 500 protein phosphatases in the human genome are thought to be involved in phosphorylation processes which regulate all aspects of cell function. The complexity of such interactions stems from the enormous number of factors and interactions, which makes the identification of putative substrates for any given enzyme by straightforward experimentation increasingly difficult. We present here a data mining algorithm, based on the similarity between the modifier proteins and between the modified proteins, and on experimental constraints. The application presented here (PESI) focuses on substrate phosphorylation by various enzymes. This algorithm reduces the number of substrate candidates for experimental study by about two orders of magnitude. Moreover, this algorithm has already yielded predictions for previously unknown substrates of the enzymes PKC and PKC, which we have confirmed experimentally.
Biological pathways and interactions in the in vivo setting of the living cell are currently determined by intensive experimental work. However, the complexity of biological interactions, which involve a myriad of genes and gene products, multiple interactions and numerous intricate pathways, makes the results of conventional experimental analysis difficult to examine, interpret and verify; such work would benefit from computerized assistance. One example of a biological problem whose study could be assisted by computational solutions is the identification of protein interactions in signal transduction pathways.
Signal transduction is one of the most important areas of investigation in biological research, and involves many types of interactions. One of the major mechanisms frequently employed by cells to regulate their activity, and in particular to regulate signal transduction processes, involves changes in protein phosphorylation. Exposure of cells to different ligands leads to autophosphorylation and transphosphorylation processes, which in turn mediate activation of downstream signaling proteins. This allows the formation of signaling complexes composed of multiple proteins, mediating the transmission of a distinct signal affecting cellular function [2, 3]. As many as 1000 protein kinases and 500 protein phosphatases in the human genome are thought to be involved in phosphorylation processes. The targets of phosphorylation encompass a large group of signaling molecules, including receptors, cytokines, enzymes, transcription factors, transcriptional coregulators and chromatin-modifying factors. These modifications either positively or negatively regulate signal transduction activity to facilitate a program of gene expression that results in appropriate changes in cell behavior. In addition to binding of adapter proteins to docking sites associated with tyrosine phosphorylated residues, protein phosphorylation and dephosphorylation can directly regulate distinct aspects of transducer factor function, including cellular localization, protein stability, protein-protein interactions and DNA binding. One common feature of all of the regulatory systems affected by phosphorylation events is the specificity of the interaction associated with distinct signaling pathways.
The complexity of signal transduction pathways stems from the enormous number of factors and interactions that operate simultaneously at every moment in each cell. A recent keyword search of SwissProt revealed about 3000 proteins that participate in signal transduction processes; hence the number of protein interactions in the phosphorylation context alone should be enormous. This complexity severely limits experimental analysis and the correct integration and interpretation of data. Predicting these interactions in a virtual environment can serve to investigate possible interactions between different types of proteins in a distinct cellular milieu, and could predict the possible consequences for cellular behavior of changes in these proteins. Such problems are of interest to researchers in bioinformatics, who seek to suggest new interactions using computational methods. In this study we focused on elucidating, via computational means, how protein modifications affect or are affected by interactions between enzymes and their substrates. By 'enzymes' we refer to proteins that are able to modify other proteins. The modified protein, referred to as the substrate, might be either another enzyme or any other protein participating in the same signal transduction pathway. Each type of modification is defined as a separate parameter. We developed a program (PESI) which, by computation over known physical and biological parameters, can predict biological interactions between groups of proteins following their modification.
The majority of interactions underlying signal transduction processes is associated with the formation of signaling complexes composed of several proteins. Each complex contains an enzyme which is associated with its substrate, and additional proteins supporting the modification process. Two crucial aspects of the modification process determine in which signal transduction process, if any, the modification will occur. The first aspect is the ability of enzyme and substrate to bind to each other. This association requires three-dimensional compatibility between enzyme and substrate. Consequently, it is reasonable to assume that if two enzymes are highly similar in their substrate binding sites, which are, usually, also linked to the enzyme active site, then their substrates should be similar in their properties and especially in their enzyme binding site.
The second aspect is the involvement of additional proteins. Besides substrates, the majority of interactions which lead to complex formation will require additional protein factors in support of substrate modification. Therefore, the formation of this complex is one of the conditions that have to be fulfilled for a proper phosphorylation process to occur. Consequently, it is reasonable to assume that if two enzymes participate in similar protein complexes, which are composed of similar proteins, then these enzymes will bind similar substrates. The protein composition of modification complexes specifically depends on the cellular compartment and on specific physical and biochemical protein profiles, which are associated with each distinct signal transduction pathway.
The protein binding properties described above can be inferred from distinct protein attributes. Thus, the ability of an enzyme or a substrate to bind the same protein factors is determined by the presence of specific structural motifs and by biochemical properties. Both characteristics are suggestive of similarity between these proteins. An additional important feature is the existence of specific protein binding domains, and domains responsible for localization of proteins to defined cellular compartments. These properties are required to enable proteins to participate in the same complex and are also indicative of their similarity. Therefore, by creating a database which will include the structural, physical and biochemical attributes and similarities between specific proteins, and by using an appropriate data mining algorithm to search for protein similarities, we can predict protein-protein interactions or modifications and their effects on protein phosphorylation in the signal transduction pathway.
Note that the information that we suggest utilizing in this database does not include, at least at the current stage, direct protein sequence information, i. e. the actual sequence or sequence similarity scores between proteins. This is because it is far from trivial to relate detailed functional similarities to sequence similarities. The general question of inferring functional relationships from sequence and even structure relationships is very complicated (see, for example, the reviews in [5, 6]). The specific question about phosphorylation substrates is even more difficult, since most kinases share sequence motifs and thus show sequence similarities, but the level of similarity is not a clear indication of the specific reaction each protein can perform.
Our program does utilize sequence attributes that are relevant to phosphorylation activity (for example, whether the sequence contains an SH2 domain or a zinc finger motif). Thus, we submit that, at the current time, utilizing biochemical information that is usually readily available in a laboratory setting is more relevant than raw sequence information. However, our framework does technically allow a sequence to be included as another attribute of the protein and, if found useful, it might be utilized in future versions of the system.
The general idea underlying the PESI algorithm is that a set of experimentally validated interactions between proteins can be used to infer a much larger set of possible interactions. This is done using the simple assumption, which is supported by common biological experience, that if protein A is known to modify protein B, it is likely that proteins that are biologically similar to protein A could modify proteins which are biologically similar to protein B. The main challenge is to design the system in such a way that only meaningful predictions of possible interactions will be produced.
This general idea can be used for predicting various types of interactions between proteins (not only phosphorylation but also, e.g., cleavage) and for various types of predictions, such as finding a substrate for a given enzyme, finding an enzyme modifying a given substrate, searching for an interaction with certain characteristics, etc. To demonstrate the idea in a specific laboratory setting, the current implementation is aimed mainly at finding a phosphorylation target (i. e. substrate) for a known enzyme, where some experimental data is available about the substrate. In order to elucidate the full picture of each signal transduction process, one must discover every protein interaction, that is, the nature of every step in the signal transduction chain of events. Understanding the nature of such processes is dependent on determining the details of substrate modification by enzymes. The details of interest are: what enzyme modifies each substrate, and by which type of modification.
The data underlying the PESI system contain two types of information:
The PESI graphical user interface includes screens that enable the user to select the enzyme from the list of enzymes that exist in the system, and the type of modification specific to each enzyme. In addition, the interface enables the user to specify the known experimental data about the substrate, i. e., molecular weight, the possibility of undergoing phosphorylation on serine/threonine or tyrosine residues by serine/threonine or tyrosine enzymes, the existence of SH2 and/or SH3 domains, etc. In practice, it is quite common for the experimentalist to be able to provide these data. One of the main methods used by experimental researchers to determine the substrate for a specific enzyme is, first, to immunoprecipitate the proteins that physically associate with the enzyme of interest and separate them for specific detection. This results in hundreds of candidates for a substrate of the specific enzyme, with only their molecular weight approximately known, of which probably only one is suitable. Second, immunoblotting is performed in order to characterize the substrate, utilizing immunoglobulins raised against specific protein domains such as SH2 and SH3, as well as anti-phospho-serine/threonine, anti-phospho-tyrosine and other antibodies. In this way, experimentalists can provide data regarding the candidate protein. However, if data are not available, a "don't know" flag can be checked. The PESI main input screen is shown in Figure 1.
Table 1: A list of all modifications currently included in the PESI system.
|Modifier protein||Modified protein||Serine/Threonine||Tyrosine|
|Figure 1: PESI Main input-output dialogue box. Upper panel - data input interface. Left: selection of an enzyme from the existing list. Right: selection of the required type of modification. Center: specification of constraints, i. e. known experimental data about the substrate. Most fields enable a choice between three different options: the appropriate protein attribute is present, absent or unknown. Lower panel - data output interface, which presents to the user a sorted list of substrate candidates.|
The PESI prediction algorithm is described schematically in Figure 2. We start with the given enzyme and scan for all other enzymes within its "similarity zone", i. e., all of the proteins whose similarity score with the given enzyme is above a predefined threshold. Next, we search for enzymes in the similarity zone that participate in a known interaction. With these known interactions we carry the search over from the enzyme domain to the substrate domain. In the substrate domain, we perform a conceptually similar operation: for each substrate participating in the previously identified interactions, we build a similarity zone of all substrates that are sufficiently similar to it. The substrates in these similarity zones form our primary list of substrate candidates. This list is sorted according to the similarity score, which reflects both enzyme and substrate similarities.
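The two-stage search described above can be illustrated with a minimal sketch. It is not the PESI implementation: the function names, the toy proteins (E1, S1, ...), the toy scores, and in particular the choice to combine enzyme and substrate similarity by addition are all assumptions made for illustration, since the text states only that the final score reflects both similarities.

```python
from typing import Callable, Dict, List, Set, Tuple

Similarity = Callable[[str, str], float]


def similarity_zone(center: str, pool: Set[str], sim: Similarity,
                    threshold: float) -> Dict[str, float]:
    """Map each protein in `pool` scoring above `threshold` to its score."""
    scores = {p: sim(center, p) for p in pool}
    return {p: s for p, s in scores.items() if s > threshold}


def primary_candidates(enzyme: str,
                       known: Set[Tuple[str, str]],
                       substrates: Set[str],
                       sim: Similarity,
                       threshold: float) -> List[Tuple[str, float]]:
    """Enzyme zone -> known interactions -> substrate zones -> sorted list."""
    known_enzymes = {e for e, _ in known}
    enzyme_zone = similarity_zone(enzyme, known_enzymes, sim, threshold)
    best: Dict[str, float] = {}
    for e, s in known:
        if e not in enzyme_zone:
            continue  # this known interaction is outside the enzyme zone
        for cand, sub_score in similarity_zone(s, substrates, sim, threshold).items():
            # Assumption: enzyme and substrate similarities are simply added.
            combined = enzyme_zone[e] + sub_score
            best[cand] = max(best.get(cand, 0.0), combined)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: pairwise similarity scores between made-up proteins.
TOY_SCORES = {
    frozenset(("E1", "E2")): 40.0,
    frozenset(("E1", "E3")): 10.0,
    frozenset(("S1", "S2")): 35.0,
    frozenset(("S1", "S3")): 12.0,
}


def toy_sim(a: str, b: str) -> float:
    """Toy similarity: maximal (49) for identical proteins, else table lookup."""
    if a == b:
        return 49.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


known_interactions = {("E2", "S1"), ("E3", "S3")}
all_substrates = {"S1", "S2", "S3"}
candidates = primary_candidates("E1", known_interactions, all_substrates,
                                toy_sim, threshold=30.0)
print(candidates)
```

In this toy run only E2 falls inside the similarity zone of E1, so the known interaction E2 -> S1 carries the search into the substrate domain, where S1 and its close relative S2 become the primary candidates.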
The similarity score is based on the following parameters. Since our particular implementation concentrates on modification of the phosphorylation type, the parameters such as whether the protein is known to be a serine/threonine or tyrosine kinase, and whether it has the potential to be phosphorylated on serine/threonine or tyrosine residues, have a critical importance. Therefore these parameters were assigned a Boolean type, i. e. must match. Parameters such as the presence of SH2 and SH3 domains are also highly significant, because of their role in the formation of many protein complexes. The presence of these complexes, as mentioned above, could play a critical role in phosphorylation process.
|Figure 2: General scheme of the PESI prediction algorithm. Given the input enzyme, the algorithm finds all other enzymes within its "Enzyme similarity zone", i. e., all proteins whose similarity score with the given enzyme is above a predefined threshold. Known interactions of these enzymes give the "Similarity zones" of all candidate substrates. The intersection between these zones and the set of substrates defined by the "Constraints" (grey zone on the scheme) gives the candidate "Output substrates".|
The presence of domains such as a zinc finger, a leucine zipper or a nuclear localization signal, and the ability to be stimulated by GTP, also play a very significant role in our algorithm. The presence of these domains in the observed proteins indicates possible similarities in protein activity, participation in the same complexes, and/or localization to a similar cell compartment. All of these suggest that the proteins in question may be involved in shared biological pathways. As explained above, this supports the possibility that these proteins are phosphorylated by similar enzymes. The presence of specific structural motifs such as polyproline, and the ability of the proteins being compared to be stimulated by GTP, support the possibility that these proteins undergo physical association with similar proteins, again suggesting they may be phosphorylated by similar enzymes.
Another parameter, which participates only in the enzyme similarity algorithm, is the enzyme's capability for self-phosphorylation (autophosphorylation). In this respect we differentiate between serine/threonine and tyrosine autophosphorylation. There is abundant biological evidence for a difference between serine/threonine and tyrosine autophosphorylation. These differences correspond to distinct biological functions, and therefore bear on enzyme similarity in the context of our algorithm. In order to determine the threshold of similarity, i. e. the "radius" of the similarity zone, all the properties mentioned above were initially ranked according to their relative biological significance. These parameters and their relative weights are listed in Table 2. Parameters that reflect more significant biological properties get greater values. In the next stage, the values of these parameters and the similarity threshold were manually adjusted, using known protein interactions for calibration. The calibration was done such that known cases of interaction get a score above the threshold, and cases where it is known that the proteins do not interact get a score below the threshold. For the given set of parameters, the threshold was set to an additive score of above 30, reflecting a similarity zone of a match of about 2/3 of the weighted properties.
Table 2: The table presents the list of the relevant parameters, and their relative weights, for the determination of the similarity zone around each protein. For this set, proteins with an additive score above 30, reflecting a weighted match of about 2/3 of the properties, are considered similar.
|Protein properties||Relative value of the parameter|
|Ability to undergo phosphorylation on serine/threonine residues||4|
|Ability to undergo phosphorylation on tyrosine residues||5|
|Ability to undergo autophosphorylation on serine/threonine residues||2|
|Ability to undergo autophosphorylation on tyrosine residues||2|
|Whether the protein is a serine/threonine kinase||4|
|Whether the protein is a tyrosine kinase||5|
|Whether the protein is a phosphatase||5|
|Whether the protein includes SH2 domain||3|
|Whether the protein includes SH3 domain||3|
|Whether the protein includes NLS domain||3|
|Whether the protein includes ATP-binding domain||4|
|Whether the protein includes leucine zipper domain||3|
|Whether the protein includes zinc finger domain||2|
|Whether the protein includes polyproline domain||2|
|Whether the protein is stimulated by GTP||2|
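As a rough illustration of how the weights in Table 2 could yield an additive score, the sketch below sums the weights of all properties on which two proteins agree and compares the total with the threshold of 30. The property key names, the rule that a "match" means equal Boolean values (both present or both absent), and the three example proteins are assumptions; the sketch also does not model the "must match" treatment of the kinase-type parameters or the restriction of autophosphorylation to enzyme comparisons.

```python
from typing import Dict

# Relative weights copied from Table 2; the key names are hypothetical.
WEIGHTS: Dict[str, int] = {
    "phosphorylated_on_ser_thr": 4,
    "phosphorylated_on_tyr": 5,
    "autophosphorylation_ser_thr": 2,
    "autophosphorylation_tyr": 2,
    "is_ser_thr_kinase": 4,
    "is_tyr_kinase": 5,
    "is_phosphatase": 5,
    "has_sh2": 3,
    "has_sh3": 3,
    "has_nls": 3,
    "has_atp_binding": 4,
    "has_leucine_zipper": 3,
    "has_zinc_finger": 2,
    "has_polyproline": 2,
    "stimulated_by_gtp": 2,
}
THRESHOLD = 30  # proteins scoring above this value are considered similar


def similarity_score(a: Dict[str, bool], b: Dict[str, bool]) -> int:
    """Sum the weights of all properties on which the two proteins agree.

    Assumption: a property missing from a dict is treated as absent (False).
    """
    return sum(w for name, w in WEIGHTS.items()
               if a.get(name, False) == b.get(name, False))


def is_similar(a: Dict[str, bool], b: Dict[str, bool]) -> bool:
    return similarity_score(a, b) > THRESHOLD


# Hypothetical proteins, listing only the properties that are present.
kinase_a = {"is_ser_thr_kinase": True, "has_atp_binding": True,
            "phosphorylated_on_ser_thr": True, "has_sh3": True}
kinase_b = {"is_ser_thr_kinase": True, "has_atp_binding": True,
            "phosphorylated_on_ser_thr": True, "has_nls": True,
            "has_zinc_finger": True}
kinase_c = {"is_tyr_kinase": True, "has_atp_binding": True,
            "phosphorylated_on_tyr": True, "has_sh2": True}

print(sum(WEIGHTS.values()))                 # maximum attainable score
print(similarity_score(kinase_a, kinase_b))  # differ on SH3, NLS, zinc finger
print(similarity_score(kinase_a, kinase_c))  # differ on six properties
```

The weights in Table 2 sum to 49, so a threshold of 30 corresponds to agreement on somewhat more than 60% of the total weight, in line with the "about 2/3" stated in the text.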
The next step of our algorithm is to take all these primary candidates, and to check the degree of compatibility of their properties with the constraints, that is, the known experimental properties provided by the user for the substrate of interest. This enables us to narrow down the list of candidates. Thus, the final output of the search is a sorted list of the substrates predicted by the program, ranked by the product of their similarity score and their degree of compatibility with the constraints supplied by the user.
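The final ranking step can be sketched as follows. The text specifies that candidates are ranked by the product of similarity score and degree of compatibility, but not how compatibility is computed; here it is assumed to be the fraction of user-specified constraints that the candidate satisfies, with "unknown" constraints (`None`) skipped. The `Candidate` class, property names and example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    similarity: float   # similarity score from the primary search
    mw_kd: float        # molecular weight in kD
    properties: Dict[str, bool] = field(default_factory=dict)


def compatibility(cand: Candidate,
                  constraints: Dict[str, Optional[bool]],
                  mw_range: Optional[Tuple[float, float]] = None) -> float:
    """Fraction of specified constraints met (assumed definition)."""
    checks = [cand.properties.get(name, False) == wanted
              for name, wanted in constraints.items()
              if wanted is not None]          # None means "don't know"
    if mw_range is not None:
        low, high = mw_range
        checks.append(low <= cand.mw_kd <= high)
    return sum(checks) / len(checks) if checks else 1.0


def rank_candidates(cands: List[Candidate],
                    constraints: Dict[str, Optional[bool]],
                    mw_range: Optional[Tuple[float, float]] = None
                    ) -> List[Tuple[str, float]]:
    """Sort candidates by similarity score times constraint compatibility."""
    scored = [(c.name, c.similarity * compatibility(c, constraints, mw_range))
              for c in cands]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical query: SH2, leucine zipper and tyrosine phosphorylation are
# present, SH3 is unknown, molecular weight lies between 75 and 85 kD.
constraints = {"has_sh2": True, "has_leucine_zipper": True,
               "phosphorylated_on_tyr": True, "has_sh3": None}
x = Candidate("X", 80.0, 80.0, {"has_sh2": True, "has_leucine_zipper": True,
                                "phosphorylated_on_tyr": True})
y = Candidate("Y", 90.0, 90.0, {"has_sh2": True, "phosphorylated_on_tyr": True})

ranking = rank_candidates([x, y], constraints, mw_range=(75.0, 85.0))
print(ranking)
```

In this example candidate Y has the higher raw similarity score but meets only two of the four specified constraints, so it ranks below candidate X, which meets all of them.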
One of the crucial elements of the PESI prediction scheme is how it scores each feature in determining the level of similarity between proteins, and how it determines when the threshold of similarity has been reached. This has been so far done manually, using the prediction of several known interactions as a criterion to evaluate the performance of the system, and adjust the parameters accordingly.
In this section we describe the steps taken in order to evaluate the usefulness of the PESI program. First, we examine to what degree the program is self-consistent, in the sense that if we delete any modification from the database, the system will be able to retrieve it. Second, we show that the program is able to predict new modifications that were not included in the database, but verified in a subsequent literature search. Third, we report on several laboratory experiments we have actually performed to check novel predictions of the system. Strikingly, these experiments were able to validate two out of six of our predictions. We also include another prediction of the system, the interaction between ERK1 and SEK1. We are not equipped to validate this prediction experimentally, but since it might represent an important step in the MAPK signal transduction pathway, we hope that our prediction will inspire biologists involved in MAPK research to further investigation.
Modification retrieval: We first examined the program's reliability by utilizing a jackknife procedure, i. e. records were deleted from the table of known modifications, and then we attempted to retrieve, by using the program, the deleted records. Obviously the prediction efficiency depends on the level of database saturation, that is, the number of enzymes in the similarity zone of the enzyme we perform the search for, and the number of modifications already known for these enzymes. The program, with the current database saturation, was able to retrieve about 24% of deleted modifications (i. e. 17 out of 70). We regard this result as very promising, considering it was achieved with an extremely unsaturated database. Evidently, as we fill in more data in the PESI database, the prediction ability will significantly improve.
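The jackknife evaluation can be outlined as a leave-one-out loop over the table of known modifications. In the sketch below, the predictor is a deliberately trivial stand-in (it proposes the known substrates of other enzymes from the same hypothetical family), not the PESI search itself; all names and data are invented for illustration.

```python
from typing import Callable, Dict, Set, Tuple

Interaction = Tuple[str, str]  # (enzyme, substrate)
Predictor = Callable[[str, Set[Interaction]], Set[str]]


def jackknife_retrieval(known: Set[Interaction],
                        predict: Predictor) -> Tuple[int, int]:
    """Delete each known modification in turn and try to retrieve it."""
    hits = 0
    for held_out in sorted(known):
        remaining = known - {held_out}
        enzyme, substrate = held_out
        if substrate in predict(enzyme, remaining):
            hits += 1
    return hits, len(known)


# Hypothetical stand-in predictor: enzymes of the same family share substrates.
FAMILY: Dict[str, str] = {"E1": "fam1", "E2": "fam1", "E3": "fam2"}


def toy_predict(enzyme: str, remaining: Set[Interaction]) -> Set[str]:
    return {s for e, s in remaining
            if e != enzyme and FAMILY[e] == FAMILY[enzyme]}


toy_known = {("E1", "S1"), ("E2", "S1"), ("E3", "S2")}
hits, total = jackknife_retrieval(toy_known, toy_predict)
print(hits, total)

# The retrieval rate reported in the text: 17 of 70 deleted records.
reported_rate = 17 / 70
print(f"{reported_rate:.1%}")
```

The toy case also shows why retrieval depends on database saturation: the record for E3 cannot be recovered because no other enzyme with known modifications lies close to it.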
Modifications that were independently re-discovered by the system: As the majority of activation and deactivation processes in cell biology are mediated by modification of the phosphorylation state, we have concentrated on one of the groups of enzymes which are associated with phosphorylation changes - the kinase family of isoforms of protein kinase C (PKC).
We performed a search for a possible substrate for PKC, a serine/threonine kinase, utilizing our experimental data (constraints) about the candidate substrate. The program predicted the protein STAT3 as a substrate candidate. Details for this prediction query are given in Table 3. Later experiments have confirmed that STAT3 is indeed a substrate for PKC. A search of published materials has also confirmed that this modification has been observed.
Table 3: Description of the PESI prediction query for a substrate of PKC.
|Enzyme name||Type of phosphorylation||Known experimental data (constraints)|
|PKC||Serine/Threonine||Modified by serine/threonine kinase, modified by tyrosine kinase, presence of SH2 and leucine zipper, molecular weight between 75 and 85 kD. Other attributes set to absent or unknown.|
We also performed a search for a possible substrate for PKC, another serine/threonine kinase, again utilizing experimental constraints about the candidate substrate. The program predicted the protein p21 as a substrate candidate. Details for this prediction query are given in Table 4. This modification has not been reported in published materials. Thus we have turned to colleagues who are studying p21; they confirmed that indeed, p21 is a substrate of PKC (Dr. Kuroki, personal communication).
Table 4: Description of the PESI prediction query for a substrate of PKC.
|Enzyme name||Type of phosphorylation||Known experimental data|
|PKC||Serine/Threonine||Modified by serine/threonine kinase, modified by tyrosine kinase, molecular weight between 20 and 25 kD. Other attributes set to absent or unknown.|
In addition, we performed a search for a possible substrate for FER kinase, a tyrosine kinase, utilizing our experimental data (constraints) about the candidate substrate. The program predicted the protein STAT3 as a substrate candidate based on the attributes of modification by tyrosine kinase, presence of SH2 and leucine zipper domains, and molecular weight of 75 to 85 kD. This modification was discussed previously in our lab, but never directly observed. Subsequent to the program's prediction, experimental work in the lab has confirmed the prediction that STAT3 is indeed a substrate for FER kinase.
Novel modifications suggested by the system and confirmed by experiments.
At this stage we were interested in additional predictions of substrates of PKC and PKC. Different initial constraints were suggested for these enzymes; accordingly, the program has proposed the following:
Among the suggested possible modifications, STAT1 was experimentally shown to be a substrate of PKC and c-Jun was experimentally shown to be a substrate of PKC (Figure 3). Experiments were performed utilizing immunoprecipitation of distinct enzymes (PKC isoforms) and analysis by Western blotting utilizing antibodies raised against specific substrates.
We believe that adding more data to the database (as such data become available in the literature) will lead to additional substrate predictions for PKC enzymes. As it is, our prediction algorithm has the valuable benefit of reducing the number of substrates serving as candidates for experimental verification, from hundreds of possible substrate candidates for PKC and PKC to the few most promising candidates - a reduction by one or two orders of magnitude.
Novel modification suggested by the system and is yet to be confirmed:
Our most recent prediction query, described in Table 5, has retrieved the protein SEK1 as a candidate substrate of ERK1. Whether this prediction is correct is as yet unknown. A survey of published materials has revealed that both proteins are reported as belonging to the shared MAPK signal transduction pathway. Therefore, we suggest that this unpublished modification is worthy of experimental testing.
|Figure 3: Cell lysates were immunoprecipitated, utilizing immunoglobulins raised against PKC (A) or PKC (B). Immunoblotting was performed in order to characterize the substrate, utilizing immunoglobulins raised against appropriate proteins. A. Among the suggested possible proteins (STAT1, STAT5 and -catenin), STAT1 is shown to be a substrate of PKC. B. Among the suggested possible proteins (STAT1, c-Jun and -catenin), c-Jun is shown to be a substrate of PKC.|
Table 5: Description of the PESI prediction query for a substrate of ERK1.
|Enzyme name||Type of phosphorylation||Known experimental data|
|ERK1||Serine/Threonine||Modified by serine/threonine kinase, modified by tyrosine kinase, molecular weight between 42 and 48 kD. Other attributes set to absent or unknown.|
We have presented here a prediction algorithm, PESI, based on the similarity between enzymes and between their substrates. It enabled us to decrease the number of substrate candidates for experimental study by about two orders of magnitude, from a few hundred, which are too numerous for feasible experimental testing, to several candidates that can be tested experimentally.
The general idea underlying the PESI algorithm is that a set of experimentally validated interactions between proteins can be used to infer a much larger set of possible interactions. That is based on the simple assumption that if two proteins are known to interact, then proteins that are biologically similar to these proteins respectively, are good candidates to interact with each other. Although the initial predictions were made based on a relatively small, unsaturated database, our algorithm has already yielded predictions for previously unknown substrates of the enzymes PKC and PKC, some of which we have confirmed experimentally.
Due to the enormous numbers of proteins involved in signal transduction pathways and the interactions between them, much effort has been devoted recently to computerized systems that can manage the rapidly accumulating amounts of information on signal transduction pathways. Spad  is an impressive compilation of signaling pathways and networks. Wit2  includes, as part of an effort for functional annotation of genomes, a computerized description of known pathways. Similarly the stem cell database  contains a component describing kinases and signaling. A general review of the status of many projects that aim to computerize information about biological interactions and networks can be found in . However, to the best of our knowledge, our effort is the first system in which this information has been utilized as a data-mining tool for predicting presently unknown interactions in a practical setting.
Altogether, PESI was able to re-discover 17 modifications that were included in the system in a jackknife prediction. Three cases of modifications that were discussed but not verified experimentally at the time of the prediction were supported by PESI and confirmed by later experiments: FER kinase -> STAT3, PKC -> p21, PKC -> STAT3. Two cases of novel modifications were discovered: PKC -> STAT1, PKC -> c-Jun. These predictions were experimentally validated by us (Figure 3). One case, ERK1 -> SEK1, was predicted and still awaits confirmation.
Clearly, the size of the system and the number of predictions made so far are rather small. Thus, the data presented here should be considered a proof of concept rather than a production system. Nevertheless, we are encouraged by the fact that even with such a limited scope, productive predictions have been made.
Obviously, the prediction efficiency of the PESI system will be greatly improved with the growth of the databases on which our system relies. Moreover, our system's prediction efficiency would probably be improved by increasing the number of structural and functional protein properties included in the calculation of the similarity score. This, however, is not a trivial task, because the addition of every single property might require a re-calibration of the weights assigned to each property in each scoring procedure. As mentioned above, the current system has been calibrated manually. Future extensions of the system to include a larger number of properties will require some type of automated calibration process, possibly utilizing a learning or evolutionary algorithm.
Our current PESI system is calibrated for the prediction of phosphorylation interactions. However, the same basic idea of a search based on similarity scores and limited by experimental constraints can be applied for the prediction of other types of protein-protein interactions like cleavage, docking, inhibition etc. Each one of these applications will require a definition and calibration of the set of parameters that characterize functional similarity for these interactions.