| In Silico Biology 2, 0027 (2002); ©2002, Bioinformation Systems e.V. |
| GCB '01 |
Department of Physics,
Institute of Science
Tianjin University, Tianjin 300072, P. R. China
Fax:86-22-87890061
LiuHui Center for Applied Mathematics,
Nankai University and Tianjin University, Tianjin 300072, P. R. China
Email: zpfeng@eyou.com
Edited by E. Wingender; received November 22, 2001; revised and accepted December 18, 2001; published March 28, 2002
This paper reviews the problem of predicting the subcellular location of a protein. Five ways of extracting information from the global sequence for use with the Bayes discriminant algorithm are reviewed: 1) the auto-correlation functions of amino acid indices along the sequence; 2) the quasi-sequence-order approach; 3) the pseudo-amino acid composition; 4) the unified attribute vector in Hilbert space; 5) Zp parameters extracted from the Zp curve. The actual predictive accuracy is closely related to the degree of similarity between the training and testing sets, or to the average degree of pairwise similarity in the dataset in a cross-validated study. Many scholars consider that, because of the high pairwise sequence identity of the datasets, the high predictive accuracy currently reported cannot guarantee that the available algorithms are effective in practical prediction. Others argue that a dataset used for developing software should reflect the reality determined by Mother Nature: some subcellular locations really contain only a small number of proteins, some of which have a high percentage of sequence similarity. Owing to the complexity of the problem itself, sophisticated and specialized programs are needed both for constructing datasets and for improving the prediction. In any case, finding the targeting information in the mature protein sequence and properly combining it with sorting signals may further improve the overall predictive accuracy and make the prediction practical.
Key words: subcellular location, N-terminal targeting sequences, sorting signals, targeting information, amino acid composition, quasi-sequence-order-effect, pseudo-amino acid composition, auto-correlation functions, unified attribute vector, Zp curve, Zp parameters, Bayes discriminant algorithm, component-coupled algorithm, k-nearest neighbor method, hidden Markov model, neural networks, Support Vector Machine (SVM), jackknife test, hydrophobicity, pairwise sequence similarity
Within the last couple of years the complete sequences of a number of genomes have been determined. The central challenge of bioinformatics is the rationalization of the mass of sequence information, with a view not only to deriving more efficient means of data storage, but also to designing more incisive analysis tools. Since subcellular location plays a crucial role in protein function, the availability of systems that can predict location from sequence will be essential to the full characterization of expressed proteins. Experimental determination of subcellular location is mainly accomplished by three approaches: cell fractionation, electron microscopy and fluorescence microscopy. As currently practiced, these approaches are time consuming, subjective, and highly variable [Murphy, 2000]. Assigning a function to a given protein has proved to be especially difficult where no clear homology to proteins of known function exists [Bork et al., 1994]. Since the pioneering efforts to predict subcellular location from protein sequence [Nakai and Kanehisa, 1991; 1992; Nakashima and Nishikawa, 1994], a variety of projects have been engaged in clarifying the functions of protein sequences and systematically determining their subcellular locations.
Except for a small number of proteins that are coded in the genomes of mitochondria and chloroplasts, all other proteins are synthesized in the cytosol. Proteins need to be sorted to one or other subcellular compartment to perform their functions. Sorting usually relies on the presence of an N-terminal targeting sequence, which is proteolytically removed after entry. For further sorting within the organelle, additional targeting information may be located in a secondary targeting sequence, either placed adjacent to the original targeting sequence or in other regions of the protein [McGeoch, 1985; Folz and Gordon, 1987; Ladunga et al., 1991; Chou, 2000a; Chou, 2001a; Emanuelsson et al., 2000]. Based on the sorting signal peptide or mature protein sequence of a protein, the subcellular location has been predicted and much progress has been achieved in recent years.
Currently several WWW servers for predicting the subcellular location of a protein are available. Nakai and Kanehisa proposed PSORT for the prediction of protein locations in cells in 1991, and it has been greatly improved over the past few years [Nakai and Kanehisa, 1991; 1992; Horton and Nakai, 1996; 1997; Nakai and Horton 1999]. The 1999 Nobel Prize in Physiology or Medicine was awarded to Günter Blobel for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell." The first such signal to be discovered was the secretory signal peptide, which is the signal predicted by SignalP (http://www.cbs.dtu.dk/services). Gunnar von Heijne's group reported predictors designed to identify either secretory signal peptides (SignalP) [Nielsen et al., 1997] or chloroplast transit peptides (ChloroP) [Emanuelsson, et al., 1999] in a protein sequence. Then, in 2000, they integrated and extended these efforts and presented a novel subcellular location predictor, TargetP, that assigns one of four different locations (chloroplast, mitochondrion, ER/Golgi/secreted, and "other") to a query sequence. MitoProt is a program that predicts mitochondrial and chloroplast proteins harboring targeting sequences [Claros, 1995; Claros and Vincens, 1996]. Predotar is another program that recognizes the N-terminal targeting sequences of classically targeted mitochondrial and chloroplast precursor proteins; it is good at distinguishing between mitochondrial and plastid targeting sequences [Gierasch, 1989].
Chou and Elrod carried out extensive research in this area, mainly based on the amino acid composition [1998; 1999a; 1999b], the quasi-sequence-order effect [Chou, 2000b], and the pseudo-amino acid composition [Chou, 2001b]. Inspired by Dr. Chou's talk in 1998, we have been engaged in the same research since then and have provided an experimental version of an online server that uses more information in addition to the amino acid composition. Recently, Hua and Sun provided an online server called SubLoc to predict subcellular location [Hua and Sun, 2001]. For convenience, the prediction websites for protein subcellular location and their typical features are listed as follows:
|
Generally there are three steps involved in all the predictive methods. First, an objective and representative dataset should be constructed for deriving the characteristic information of each subcellular location. Next, the attribute parameters or descriptors that represent the targeting information should be extracted. Finally, using some algorithm, the likelihood of the query protein being sorted to each possible location is compared and the predictive result is suitably evaluated. In this paper, the commonly used predictive methods, especially the different ways of extracting information from the global sequence of a protein, and the current problems that hamper the prediction are briefly reviewed.
The predictive methods now in wide use are machine learning approaches and statistical approaches. In the first version of PSORT, Nakai and Kanehisa presented an artificial intelligence technique, an expert system equipped with two groups of rules. The first group of rules calls various subprograms and stores the results in the so-called working memory, whereas the second group of rules combines these results to make the final prediction [Nakai, 2000]. The subprograms include the methods used by McGeoch and von Heijne for signal peptide prediction [McGeoch, 1985; von Heijne, 1986], the method of Klein et al. [1985, 1988] for predicting lipoproteins and transmembrane segments, and the observations of Yamaguchi et al. [1988] on lipoprotein sorting. Horton and Nakai [1996] improved the prediction with a new probabilistic reasoning model. They then proposed a simpler and well-known algorithm, the k-nearest neighbor method, to further improve the prediction by ignoring the inherent hierarchy between various signal-recognition events and treating all variables equally [Horton and Nakai, 1997]. Fujiwara et al. [1997] developed a hidden Markov model (HMM) that represents various known sequence characteristics of mitochondrial targeting signals. Emanuelsson et al. [1999] presented ChloroP, based on a neural network, for identifying chloroplast transit peptides and their cleavage sites. They claimed that the performance of ChloroP is well above that of PSORT as a chloroplast location predictor. TargetP is built from two layers of neural networks, where the first layer contains one dedicated network for each type of presequence, and the second is an integrating network that outputs the actual prediction [Emanuelsson, et al., 2000].
The molecular mechanisms related to signal peptides are rather complex, and many of them should be interpreted in context. Therefore, methods for predicting locations should be developed on the basis of the wealth of knowledge on the protein sorting process produced by extensive studies in cell biology [Nakai, 2000]. Besides, in large genome analysis projects genes are usually assigned automatically, and these assignments are often unreliable for the 5'-regions. This can lead to leader sequences being missing or only partially included, thereby causing problems for prediction algorithms that depend on them. A method based on the amino acid composition should be comparatively robust to this sort of ambiguous assignment [Reinhardt and Hubbard, 1998]. This approach was originally suggested by the results of Nakashima and Nishikawa [1994], who found that intracellular and extracellular proteins can be discriminated by amino acid composition and residue pair frequencies. Cedano et al. [1997] extended the discriminative classes from two to five, i.e., extracellular, integral membrane, anchored membrane, intracellular and nuclear. Inspired by the novel approach in protein structural class prediction [Chou and Zhang, 1994; Chou, 1995; Chou and Zhang, 1995], they also developed software called ProtLock for predicting subcellular locations. This represents remarkable progress in the research. Reinhardt and Hubbard [1998] constructed a prokaryotic protein dataset, which included cytoplasmic, periplasmic and extracellular proteins, and a eukaryotic protein dataset, which included nuclear, mitochondrial, cytoplasmic and extracellular proteins. They used the Stuttgart Neural Network Simulator for prediction based on the amino acid composition, with two different types of neural networks [Reinhardt and Hubbard, 1998]. The overall accuracy was 80.9% for the three prokaryotic classes and 66.1% for the four eukaryotic classes.
Based on the same dataset, Cai et al. [2000a] used a typical neural network model, T. Kohonen's self-organization model, to predict the subcellular location of proteins from their amino acid composition. The overall accuracy was improved to 84.4% and 70.6%, respectively. Yuan applied a first-order Markov chain and extended the residue pair probability to higher-order models to predict protein subcellular locations on the same dataset used by Reinhardt and Hubbard; the overall accuracy reached 89% for prokaryotic sequences and 73% for eukaryotic sequences [Yuan, 1999]. Recently, Cai et al. [2000b] and Hua and Sun [2001] independently predicted subcellular location with a Support Vector Machine (SVM) based merely on the amino acid composition, and obtained higher predictive accuracy. SVM is a new discriminative method [Vapnik, 1995], which is not only well founded theoretically, because it is based on extremely well developed machine learning theory, but is also superior in practical applications [Hua and Sun, 2001]. Another advantage of SVM is that it converges faster in learning than neural networks [Ding and Dubchak, 2001].
Bahar et al. [1997] and Andrade et al. [1998] each analyzed the correlation between structural class and subcellular location. Andrade et al. [1998] examined all eukaryotic proteins with known 3D structure. They found that the total amino acid composition carries a signal that identifies the subcellular location, and that this signal is due almost entirely to the surface residues. This discovery may lead to an improvement in prediction accuracy by decreasing the noise, provided that protein surface accessibility can be predicted correctly.
The first attempt to use standard statistical methods for predicting subcellular location was ProtLock, proposed by Cedano et al. [1997]. Stimulated by the encouraging results of structural class prediction [Chou and Zhang, 1994; Chou, 1995; Chou and Zhang, 1995], they adopted the Mahalanobis distance in their algorithm. However, their definition of the Mahalanobis distance differs from the one originally introduced by Chou and Zhang for protein structural class prediction: they used a pooled covariance matrix in the calculation, whereas Chou and Zhang applied class-dependent covariance matrices. Stimulated by further improvements in protein structural class prediction [Chou et al., 1998; Liu and Chou, 1998; Chou and Maggiora, 1998; Chou, 1999; Bu et al. 1999], the groups of Chou and of Zhang each improved the algorithm by adding a term that reflects the difference of subset sizes in the training dataset, and used the algorithm in the prediction of protein subcellular location. Zhang and his colleagues added $\ln|C_\mu|$ to the Mahalanobis distance to form the Bayes discriminant function, where $C_\mu$ is the covariance matrix of class $\mu$ [Feng and Zhang, 2000; Feng, 2001; Feng and Zhang, 2001; 2002], whereas Chou and Elrod added the logarithm of the product of all positive eigenvalues of $C_\mu$, the covariance matrix of class $\mu$, to form their discriminant function [Chou and Elrod, 1998; 1999a; 1999b; Chou, 2000b; 2001b]. Zhang and his colleagues called the algorithm that selects the class with the least Bayes discriminant function the Bayes discriminant algorithm. Chou and Elrod called their algorithm, which selects the class with the least discriminant function, the component-coupled algorithm [Chou and Elrod, 1998; 1999a; 1999b; Chou, 2000b; 2001b]. Mathematically, the determinant $|C_\mu|$ of a matrix equals the product of all its eigenvalues, and the eigenvalues of a positive-definite covariance matrix are all positive.
Accordingly, the so-called Bayes algorithm used in our group is exactly the same as Chou's component-coupled algorithm, as elaborated recently in two independent papers [Cai, 2001; Zhou and Assa-Munt, 2001]. Based on this algorithm and the amino acid composition, Chou predicted protein subcellular location as well as membrane protein types and locations [Chou and Elrod, 1998; 1999a; 1999b]. Since the information within the primary sequence is greatly reduced by considering the amino acid composition alone, the sequence order of amino acids in the query protein has also been taken into account. We developed several methods to extract more information besides the amino acid composition: auto-correlation functions of the amino acid indices along the primary sequence [Feng and Zhang, 2000; 2001], the unified attribute vector in Hilbert space [Feng, 2001] and Zp parameters [Feng and Zhang, 2002]. Chou developed the quasi-sequence-order approach [2000b] and the pseudo-amino acid composition [2001b]. All of these methods are briefly reviewed below.
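As an illustration of the discriminant function described above, the following is a minimal sketch in Python with NumPy, not the code of either group. It assumes the function is the squared Mahalanobis distance to the class mean plus the log-determinant of the class covariance matrix, and that the query is assigned to the class with the smallest value. The names `fit_classes`, `bayes_discriminant` and `predict`, and the synthetic 3-D data, are hypothetical.

```python
import numpy as np

def fit_classes(X, y):
    """Estimate a (mean, covariance) pair for every class label in y."""
    params = {}
    for label in np.unique(y).tolist():
        Xc = X[y == label]
        params[label] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return params

def bayes_discriminant(x, mean, cov):
    """Squared Mahalanobis distance to the class mean plus ln|C|.

    Assumes cov is non-singular; with 20 normalized amino acid
    frequencies one component has to be dropped first.
    """
    diff = x - mean
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    _, log_det = np.linalg.slogdet(cov)
    return mahalanobis_sq + log_det

def predict(x, params):
    """Return the label whose discriminant function is smallest."""
    return min(params, key=lambda label: bayes_discriminant(x, *params[label]))

# Synthetic demonstration data (not protein data): two Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(5.0, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
params = fit_classes(X, y)
print(predict(np.zeros(3), params), predict(np.full(3, 5.0), params))
```

For a positive-definite covariance matrix, `log_det` equals the sum of the logarithms of the eigenvalues, which is why the two formulations discussed above give the same value.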
Since the information determining the subcellular location of a protein is mostly encoded in its amino acid sequence, the prediction of subcellular locations is of great theoretical interest as an interpretation of genetic information. Prediction of protein sorting signals from the amino acid sequence is of great importance in the field of proteomics today [McGeoch, 1985; Folz and Gordon, 1987; Ladunga et al., 1991; Nielsen et al., 1999; Chou, 2000a; Chou, 2001b]. But the molecular mechanisms related to signal peptides are rather complex, and many of them should be interpreted in context. Therefore, we should make full use of our knowledge to understand the mechanisms of the sorting process and improve the prediction. Nakai has contributed an excellent review on protein sorting signals and the prediction of subcellular location based on them [Nakai, 2000]. Here I review only some methods of extracting sequence descriptors from global sequences. Compared with other cross-validation methods, such as the independent-data test or the sub-sampling test often used by many investigators, the jackknife test is the most rigorous, as elucidated by Chou and Zhang [1995].
1. Auto-correlation functions of the amino acid indices along the primary sequence
Besides the amino acid frequencies in the prediction, the auto-correlation functions based on the hydrophobicity or hydrophilicity profile of amino acids along the primary sequence of the query protein have been considered in the Bayes algorithm. In other words, in addition to the 20-D components of the amino acid frequencies, other m-D components should be added to form a 20+m-D vector. Thus the attribute vector is defined as
$$X = (f_1, f_2, \ldots, f_{20}, r_1, r_2, \ldots, r_m)^{T} \tag{1}$$
where fi(i=1, 2, ..., 20) are the occurrence frequencies of the 20 amino acids in the protein concerned, ri(i=1, 2, ..., m) are the auto-correlation functions, and m is an integer to be determined by the optimum prediction. To calculate the auto-correlation functions, first replace each residue in the primary sequence by its amino acid index. An amino acid index is a set of 20 numerical values representing any of the different physicochemical properties of the 20 amino acids. Consequently, the replacement results in a numerical sequence
$$h_1, h_2, h_3, \ldots, h_N \tag{2}$$
where $h_i$ is the amino acid index of the $i$-th residue and $N$ is the number of residues of the query protein. Occasionally a sequence contains non-standard amino acids; these are simply assigned zero. Note that before defining the auto-correlation function, the amino acid index should be centralized and standardized. In other words, if we denote by $\bar{h}$ and $\sigma$ the mean and standard deviation of the 20 numerical values, each $h_i$ in Eq. (2) is assigned the value $(h_i - \bar{h})/\sigma$. For simplicity, this value is still denoted by $h_i$ hereafter. The auto-correlation function $r_n$ is defined as [Cornette et al., 1987; Zhang et al., 1998]

$$r_n = \frac{1}{N-n} \sum_{i=1}^{N-n} h_i h_{i+n}, \qquad n = 1, 2, \ldots, m \tag{3}$$

where $h_i$ is the centralized and standardized amino acid index of the $i$-th residue and $m$ is an integer chosen for optimum prediction. Using the descriptor shown in Eq. (1) and the Bayes algorithm, the overall predictive accuracy in both the resubstitution and jackknife tests is improved with the hydrophilicity values of Hopp and Woods [1981], the hydrophobicity values of Ponnuswamy et al. [1980], and several other sets of amino acid indices collected by Tomii and Kanehisa [1996]. The improvement ranges from 3% to 11% for datasets with different pairwise sequence similarity [Feng and Zhang, 2000; 2001].
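The construction of the (20+m)-D attribute vector can be sketched as follows. This is an illustrative Python sketch, not the authors' program. It assumes the auto-correlation function is the average of the products of standardized index values for residues n positions apart, and that the population standard deviation over the 20 values is used for standardization. The table `HOPP_WOODS` holds the Hopp and Woods hydrophilicity values as commonly tabulated and should be checked against the original publication; the example sequence is arbitrary.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hopp-Woods hydrophilicity values as commonly tabulated (verify before use).
HOPP_WOODS = {
    "A": -0.5, "C": -1.0, "D": 3.0, "E": 3.0, "F": -2.5,
    "G": 0.0, "H": -0.5, "I": -1.8, "K": 3.0, "L": -1.8,
    "M": -1.3, "N": 0.2, "P": 0.0, "Q": 0.2, "R": 3.0,
    "S": 0.3, "T": -0.4, "V": -1.5, "W": -3.4, "Y": -2.3,
}

def standardize(index):
    """Centralize and standardize the 20 index values."""
    values = np.array([index[a] for a in AMINO_ACIDS])
    mean, std = values.mean(), values.std()
    return {a: (index[a] - mean) / std for a in AMINO_ACIDS}

def composition(seq):
    """Occurrence frequencies of the 20 standard amino acids."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / counts.sum()

def autocorrelation(seq, index, m):
    """r_n for n = 1..m; requires m < len(seq)."""
    h = np.array([index.get(a, 0.0) for a in seq])  # non-standard residues -> 0
    N = len(h)
    return np.array([np.dot(h[:N - n], h[n:]) / (N - n) for n in range(1, m + 1)])

def attribute_vector(seq, index, m):
    """Concatenate the 20 frequencies with the m auto-correlation values."""
    return np.concatenate([composition(seq), autocorrelation(seq, index, m)])

index = standardize(HOPP_WOODS)
v = attribute_vector("MKKLLPTAAAGLLLLAAQPAMA", index, m=5)
print(v.shape)
```

The value of `m` is left to the caller because the text treats it as a parameter tuned for optimum prediction.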
2. Quasi-sequence-order approach
The quasi-sequence-order approach proposed by K. C. Chou [2000b] can be summarized as follows. Consider a protein sequence with $N$ amino acids

$$R_1 R_2 R_3 \cdots R_N \tag{4}$$

where $R_1$ represents the residue at sequence position 1, $R_2$ the residue at sequence position 2, and so forth. The sequence-order effect can be approximately reflected through a set of sequence-order-coupling numbers defined as

$$\tau_j = \frac{1}{N-j} \sum_{i=1}^{N-j} J_{i,i+j}, \qquad j = 1, 2, \ldots, n \tag{5}$$

where $\tau_j$ is called the $j$-th-rank sequence-order-coupling number, which reflects the coupling mode between all the $j$-th most contiguous residues along a protein sequence. The coupling factor $J_{i,k}$ is a function of amino acids $R_i$ and $R_k$. Chou [2000b] chose the following definition in his study

$$J_{i,k} = D^2(R_i, R_k) \tag{6}$$

where $D(R_i, R_k)$ is the physicochemical distance from amino acid $R_i$ to amino acid $R_k$, which is derived from the residue properties of hydrophobicity, hydrophilicity, polarity, and side-chain volume. Schneider and Wrede [1994] gave 20x20=400 such distance values. Generally, the physicochemical distance from amino acid $R_i$ to amino acid $R_k$ is different from that of the reverse. This feature is an additional advantage that helps to distinguish the directionality of protein sequence order. Therefore, Chou used the (20+n)-D vector

$$X = (x_1, x_2, \ldots, x_{20}, x_{21}, \ldots, x_{20+n})^{T} \tag{7}$$

and

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{n} \tau_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{n} \tau_j}, & 21 \le u \le 20+n \end{cases} \tag{8}$$

where $f_i$ is the normalized occurrence frequency of the 20 amino acids in the protein $X$, $\tau_j$ ($j = 1, 2, 3, \ldots, n$) is the $j$-th-rank sequence-order-coupling number calculated from Eqs. (5) and (6), and $w$ in Eq. (8) is the weight factor for the sequence-order effect. The first 20 components therefore reflect the influence of the amino acid composition, whereas the remaining $n$ components reflect the effect of sequence order. In Chou's study [2000b], the predictive accuracy improved by about 4-5% with w = 0.1 and n = 13 for his own dataset comprising 12 types of subcellular location.
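The quasi-sequence-order descriptor can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Chou's implementation: it takes the j-th-rank coupling number to be the average squared distance between residues j positions apart, and normalizes the composition and weighted coupling terms by their common sum. The two-letter alphabet and the asymmetric distance table `TOY_DIST` are invented toy values; a real calculation would use the 20x20 physicochemical distance matrix of Schneider and Wrede [1994].

```python
import numpy as np

def coupling_numbers(seq, dist, n):
    """tau_j for j = 1..n: mean of D(R_i, R_{i+j})**2 over all i."""
    N = len(seq)
    return np.array([
        sum(dist[(seq[i], seq[i + j])] ** 2 for i in range(N - j)) / (N - j)
        for j in range(1, n + 1)
    ])

def quasi_sequence_order_vector(seq, alphabet, dist, n, w=0.1):
    """Composition plus n weighted coupling numbers, jointly normalized."""
    f = np.array([seq.count(a) for a in alphabet], dtype=float) / len(seq)
    tau = coupling_numbers(seq, dist, n)
    denominator = f.sum() + w * tau.sum()
    return np.concatenate([f, w * tau]) / denominator

# Toy, deliberately asymmetric distances on a two-letter alphabet.
TOY_DIST = {("A", "A"): 0.0, ("A", "B"): 1.0, ("B", "A"): 2.0, ("B", "B"): 0.0}

tau = coupling_numbers("ABBA", TOY_DIST, n=2)
x = quasi_sequence_order_vector("ABBA", "AB", TOY_DIST, n=2)
print(tau, x)
```

Because the distance table is indexed by ordered pairs, reversing the sequence can change the coupling numbers whenever the distances are asymmetric, which is the directionality property mentioned above.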
3. Pseudo-amino acid composition
In the pseudo-amino acid composition proposed by K. C. Chou, the first two steps of extracting information from the sequence are similar to those of our auto-correlation method described above. However, he used $L$, $R_i$ and $\Theta(R_i, R_{i+n})$ instead of $N$, $h_i$ and $h_i h_{i+n}$ in Eqs. (2) and (3). A primary sequence is therefore represented as

$$R_1 R_2 R_3 \cdots R_L \tag{9}$$

and

$$\theta_n = \frac{1}{L-n} \sum_{i=1}^{L-n} \Theta(R_i, R_{i+n}) \tag{10}$$

where $\theta_n$ is called the $n$-th-tier correlation factor, which reflects the sequence-order correlation between all the $n$-th most contiguous residues along a protein chain, and $\Theta(R_i, R_j)$ is called the correlation function, which is given by

$$\Theta(R_i, R_j) = \frac{1}{3} \left\{ [H_1(R_j) - H_1(R_i)]^2 + [H_2(R_j) - H_2(R_i)]^2 + [M(R_j) - M(R_i)]^2 \right\} \tag{11}$$

where $H_1(R_i)$, $H_2(R_i)$, and $M(R_i)$ are the hydrophobicity value, hydrophilicity value, and side-chain mass of the amino acid $R_i$, respectively, and $H_1(R_j)$, $H_2(R_j)$, and $M(R_j)$ are the corresponding values for the amino acid $R_j$. Before the values of the amino acid indices are substituted, all the indices should be centralized and standardized, as in our auto-correlation function approach described above.
Hence, as in Eqs. (7) and (8), a (20+n)-D vector was used in the Bayes discriminant algorithm, but with $\theta_n$ defined in Eqs. (10) and (11) in place of $\tau_n$ defined in Eqs. (5) and (6). In the new (20+n)-D vector, the first 20 components are the normalized occurrence frequencies of the 20 amino acids in the protein X, which reflect the effect of the amino acid composition. The remaining n components reflect the effect of sequence order for the same protein; w is the weight factor for the sequence-order effect. This definition of the pseudo-amino acid composition can incorporate more correlation factors of physicochemical effects. In his study, Chou improved the prediction accuracy in the jackknife test by about 15% for the nine locations of membrane proteins with n = 21, by about 5% for the five types of membrane proteins with n = 20, and by about 5% for the 12 subcellular locations with n = 8.
The hydrophilicity value of Hopp and Woods [1981], the hydrophobicity value of Tanford, and the mass of the amino acid side chain [Chou, 2001b] were used in his study.
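A compact sketch of the correlation factors may help here. It is hedged in the same way as any reconstruction: it assumes the correlation function is the mean, over the three properties, of the squared differences of standardized property values, and that the n-th-tier factor is its average over all residue pairs n positions apart. The property tables `H1`, `H2` and `M` below are invented toy values on a two-letter alphabet, and the weight `w=0.05` in the example is arbitrary; none of these come from Chou's paper.

```python
import numpy as np

def standardize(prop):
    """Centralize and standardize one property table over its alphabet."""
    values = np.array(list(prop.values()), dtype=float)
    mean, std = values.mean(), values.std()
    return {a: (v - mean) / std for a, v in prop.items()}

def correlation_factors(seq, props, n_max):
    """theta_n for n = 1..n_max from a list of standardized property tables."""
    P = np.array([[p[a] for a in seq] for p in props])  # properties x residues
    L = len(seq)
    thetas = []
    for n in range(1, n_max + 1):
        diff = P[:, n:] - P[:, :L - n]
        # mean over properties gives Theta per pair, mean over pairs gives theta_n
        thetas.append((diff ** 2).mean(axis=0).mean())
    return np.array(thetas)

def pseudo_composition(seq, alphabet, props, n_max, w):
    """Composition plus weighted correlation factors, jointly normalized."""
    f = np.array([seq.count(a) for a in alphabet], dtype=float) / len(seq)
    theta = correlation_factors(seq, props, n_max)
    return np.concatenate([f, w * theta]) / (f.sum() + w * theta.sum())

# Toy property tables (hypothetical values, two-letter alphabet).
H1 = standardize({"A": 1.0, "B": 3.0})
H2 = standardize({"A": 0.5, "B": -0.5})
M = standardize({"A": 10.0, "B": 30.0})

theta = correlation_factors("ABBA", [H1, H2, M], n_max=2)
x = pseudo_composition("ABBA", "AB", [H1, H2, M], n_max=2, w=0.05)
print(theta, x)
```

Passing the property tables as a list keeps the sketch open to additional physicochemical properties, in line with the remark that this definition can incorporate more correlation factors.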
4. Unified attribute vector in Hilbert space
In the unified attribute vector representation, each protein is represented by a 20-D vector of unit length in Hilbert space. Hence, all proteins have their representative points on the surface of a 20-D unit sphere. The representative points of proteins in the same family, or with higher sequence identity, lie closer together on the surface.
As a first approximation, the components of the vector were taken to be the square roots of the occurrence frequencies of the 20 amino acids in the protein concerned. Therefore, we used $\sqrt{f_i}$ to replace $f_i$ in the Bayes algorithm, which ensures that the attribute vector has unit length. With this simple modification of how the amino acid composition is used, the overall predictive accuracy could be improved by 3% to 5% for different databases [Feng, 2001]. This representation has several advantages. First, since the frequencies of the 20 amino acids are normalized, only 19 of them are independent. Thus a (19+m)-D covariance matrix is actually involved in all the calculations of the Bayes algorithm, such as in the approach of auto-correlation functions of the amino acid index or the pseudo-amino acid composition. Otherwise the covariance matrix is singular, and the calculation of $C_\mu^{-1}$ and of the logarithm of the determinant $|C_\mu|$ diverges. In the new representation, all 20 components can be used without causing any problems in the Bayes algorithm.
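The effect of the square-root transformation on the covariance matrix can be checked numerically. The sketch below is illustrative only: the function name `unit_vector` is hypothetical, and the compositions are random draws from a Dirichlet distribution rather than real proteins. It shows that the transformed vector has unit length, that the covariance matrix of the raw frequencies has an eigenvalue at zero (because the frequencies sum to one), and that the covariance matrix of the square-root components does not.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def unit_vector(seq):
    """Square roots of the amino acid frequencies; the result has length 1."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return np.sqrt(counts / counts.sum())

u = unit_vector("MKKLLPTAAAGLLLLAAQPAMA")
print(np.linalg.norm(u))

# Random compositions (synthetic, not protein data): every row sums to 1.
rng = np.random.default_rng(0)
F = rng.dirichlet(np.ones(20), size=200)

# Smallest eigenvalue of each 20x20 covariance matrix.
min_eig_freq = np.linalg.eigvalsh(np.cov(F, rowvar=False))[0]
min_eig_sqrt = np.linalg.eigvalsh(np.cov(np.sqrt(F), rowvar=False))[0]
print(min_eig_freq, min_eig_sqrt)
```

The linear constraint on the frequencies confines them to a hyperplane, whereas the square-root vectors lie on a curved surface, so their covariance matrix is generally of full rank.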
5. Zp curve and Zp parameter
The Z curve is a 3D space curve that constitutes a unique representation of a given DNA sequence [Zhang and Zhang, 1991; 1994]. The 20 amino acids can be classified into four groups according to their hydrophobic and charge properties: apolar, polar, positively charged and negatively charged. If we use A to represent the hydrophobic group, P the polar group, C+ the positively charged group, and C- the negatively charged group, these four letters A, P, C+ and C- correspond, respectively, to the four nucleotide letters of a DNA sequence. Thus each protein sequence can be represented by a 3D space curve according to the definition of the Z curve for DNA sequences, and we call it the Zp curve. The Zp curve is therefore a 3D space curve that represents a protein sequence based on the hydrophobicity and charge properties of the amino acids along the primary sequence. The curve intuitively reflects hydrophobic and Coulomb interactions, which suggests that Zp curves carry the main information contained in the primary structure of proteins. The curve can also be drawn with any other reduced alphabet that divides the amino acids into four groups.
Some parameters (called Zp parameters) may be extracted from the Zp curve. The Zp parameters may be extracted either from all three components of the curve or from an individual component, such as the x component. Consider the x component as an example. The equation kx=1 means that only one parameter (r1) is extracted from the x component; it is defined by r1=xN/N, where xN is the coordinate of the end point of the x component curve and N is the length of the protein. Similarly, the equation kx=2 means that two parameters, r1=xN/N and r2=x(N/2)/(N/2), are extracted from the x component curve. Furthermore, for kx=3 we define r1=xN/N, r2=x(2N/3)/(2N/3), and r3=x(N/3)/(N/3). The procedure can be continued. Consequently, one can obtain as many parameters as desired, based on different values of kx, ky and kz. Obviously, if kx=ky=kz, the number of Zp parameters extracted is always a multiple of three; otherwise, the number of Zp parameters may be any positive integer. For example, using the Bayes discriminant algorithm, an overall predictive accuracy of 81.5% is achieved in the jackknife test based merely on ky=13 for the prokaryotic protein database used by Reinhardt and Hubbard [1998]. This result is slightly better than that (80.9%) of the neural network method based on the amino acid composition. Besides, by combining the Zp parameters described above with the amino acid composition, a considerable improvement can be achieved in the jackknife test compared with the method based merely on the amino acid composition [Feng and Zhang, 2002].
The Zp parameters differ from the amino acid composition in being position specific: they are extracted at different positions along the primary structure, whereas the amino acid composition summarizes the whole sequence. The Zp approach effectively partitions the sequence into cumulative regions growing from the N-terminus and provides information about the length of sequence chunks. The higher predictive accuracy suggests that a particular length of N-terminal fragment may be important in protein targeting.
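The Zp construction can be sketched as follows, with several assumptions that are not fixed by the text and are therefore labeled as such. The partition of the 20 amino acids in `GROUPS` is one plausible choice, not necessarily the one used by the authors. The correspondence of A, P, C+ and C- to the nucleotides A, C, G and T is assumed in that order, with the usual Z curve components x = (A+G)-(C+T), y = (A+C)-(G+T) and z = (A+T)-(C+G) computed on cumulative counts. Positions such as N/2 are rounded down by integer division.

```python
import numpy as np

# Hypothetical partition into apolar (A), polar (P), positive (+), negative (-).
GROUPS = {"A": "AVLIMFWPGC", "P": "STNQYH", "+": "KR", "-": "DE"}
GROUP_OF = {aa: g for g, members in GROUPS.items() for aa in members}

# Increment of (x, y, z) per residue, assuming A, P, C+, C- play the roles
# of the nucleotides A, C, G, T in the Z curve definition.
STEP = {"A": (1, 1, 1), "P": (-1, 1, -1), "+": (1, -1, -1), "-": (-1, -1, 1)}

def zp_curve(seq):
    """Return an N x 3 array; row n-1 holds (x_n, y_n, z_n).

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    steps = np.array([STEP[GROUP_OF[aa]] for aa in seq])
    return np.cumsum(steps, axis=0)

def zp_parameters(component, k):
    """k parameters from one component: r_1 = c_N/N, then shorter prefixes."""
    N = len(component)
    params = []
    for j in range(k, 0, -1):          # positions N, (k-1)N/k, ..., N/k
        pos = max(1, (N * j) // k)     # integer division is an assumption
        params.append(component[pos - 1] / pos)
    return np.array(params)

curve = zp_curve("AAKKDDSS")
rx = zp_parameters(curve[:, 0], k=2)   # r1 = x_N/N, r2 = x_(N/2)/(N/2)
print(curve[-1], rx)
```

Each parameter is the component value at a prefix of the sequence divided by the prefix length, which is what makes the descriptor sensitive to where along the chain the hydrophobic and charged residues occur.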
How the targeting information is encoded in the primary structure is not well understood for all the organelles. Therefore, wherever possible, a range of different analysis methods that extract more information should be useful. The methods described above are only some of the effective ways of extracting sequence descriptors. We will search for more specific sequence descriptors in further research.
Since the targeting process is very complicated, the incomplete understanding of protein sorting presents a barrier to current attempts to predict subcellular location accurately. In prediction by the signal approach, the real sorting processes are imitated, which can be useful for verifying the generality of current knowledge [Nakai, 2000]. The limitations of the prediction are reflected in the state of the currently available online servers. For example, Predotar will fail to recognize many or all outer membrane proteins, and many inter-membrane space and inner membrane proteins [Gierasch, 1989]. MitoProt II supplies a series of parameters that permit theoretical evaluation of mitochondrial targeting sequences and their importability, and thus makes it possible to predict mitochondrial proteins. In a cross-validation test, its accuracy was estimated to be 75% [Claros and Vincens, 1996; Claros et al., 1997]. TargetP predicted chloroplast, mitochondrion, ER/Golgi/secreted, and "other" locations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets [Emanuelsson et al., 1999; 2000]. Besides, many proteins may have additional sorting pathways that direct them to their specific sites. The prediction is also hindered by the absence of this kind of knowledge.
Prediction from the mature protein sequence allows a simple and unified treatment of all sequences, which is convenient for practical use and objective testing. However, extracting information effectively from the sequence is very hard. The amino acid composition is a set of important parameters, but the same amino acid composition corresponds to diverse sequences. The large standard deviation for the sequences in each subcellular location indicates that amino acid composition is not always conserved. Several algorithms based on global protein sequences now exist, and they lead to higher predictive accuracy for the datasets constructed by Reinhardt and Hubbard [1998] and by Chou and Elrod [1999a; 1999b]. However, influenced by experience with protein domain structural class prediction, some scholars consider that the high predictive accuracy currently reported cannot guarantee that these algorithms are all effective in practical prediction. It is well established that protein domains having more than 30% sequence identity adopt the same fold structures [Sander and Schneider, 1991; Blundell and Johnson, 1993; Flores et al., 1993; Hilbert et al., 1993; Rost and Sander, 1996; Cuff and Barton, 1999]. Therefore, the structural class of a new protein domain with more than 30% sequence identity to a protein domain of known structure can easily be predicted by sequence alignment, and any prediction method for protein domain structural classes should only address those protein domains for which no homologous protein domains are found in the Protein Data Bank. The actual predictive accuracy is closely related to the degree of similarity between the training and testing sets, or to the average degree of pairwise similarity in the dataset for a cross-validated study [Reinhardt and Hubbard, 1998; Nakai, 2000; Feng, 2001; Feng and Zhang, 2002].
Since the database of Reinhardt and Hubbard has pairwise sequence identity of less than 90%, a larger standard dataset with low sequence similarity is needed for both prokaryotic and eukaryotic proteins in order to evaluate the effect of the algorithms. On the other hand, even at 40% sequence identity it is still not safe to infer that two proteins have the same subcellular location [Andrade et al., 1998]. This implies that the homology problem has a different meaning in the prediction of protein subcellular location than in the prediction of protein structural class. In other words, in the prediction of subcellular location one should not insist too strongly that the training dataset be as non-homologous as in the prediction of protein structural class. We have tried to construct such a database from the latest version of the SWISSPROT database [Bairoch and Apweiler, 1997]. However, on inspecting the "subcellular location" field in the database, we find that current experimental knowledge of some subcellular locations is also very limited. For most proteins, there is no description at all in the "subcellular location" field. For many proteins, there is a brief but very general description, such as "cytoplasmic" or "nuclear". We could construct a database based on these proteins only. However, after several filtering steps, such as excluding proteins with ambiguous or uncertain descriptions and reducing sequence identity with FASTA [Pearson and Lipman, 1988] or the RedHom server at CBS [Lund et al., 1997], few sequences remained for some organelles. On the other hand, such general descriptions of subcellular location may merge many specific properties of protein function, which makes the extraction of information even harder.
In the SWISSPROT database, as Murphy et al. pointed out, the "subcellular location" field of many proteins contains unstructured text that varies from very general to quite specific. The ambiguities in database descriptions or experimental reports reflect imprecision and investigator-to-investigator variation in terminology, uncertainty about the actual location of many proteins, and the fact that many proteins cycle between different locations [Murphy et al., 2000]. Therefore, the subcellular location is only approximately known for most known proteins. In other words, the available training data are not as "standard" as we expected. Based on the reality determined by Mother Nature, several scholars pointed out that the limitation of the Reinhardt and Hubbard [1998] dataset is not that it contains some sequences with pairwise sequence identity close to 90%, but that it lacks practical value, since it covers only 3 or 4 subcellular locations. By comparison, the datasets constructed by Chou and Elrod [Chou and Elrod, 1999a; 1999b; Chou, 2000b; 2001b] have much greater application value [Nakai, 2000]. Since some subcellular locations, such as the Golgi apparatus and the vacuole, really contain only a small number of proteins, some of which even have a high percentage of sequence similarity, including them on the basis of this reality is reasonable. The most fundamental rationale for any statistical prediction method based on a training dataset is that the data in the same class or group must have some similarity. In the current case, the similarity can be reflected through the amino acid composition, the quasi-sequence-order factor, the pseudo-amino acid composition, different ranks of sequence-order-coupling modes, or other features.
An ideal training dataset should satisfy the following two criteria: (1) it is broadly representative of each of the classes concerned; (2) the data in each subset are highly clustered, without overlapping those of the other subsets (which may not be straightforward to achieve). Accordingly, it is fully acceptable for a dataset to contain some sequences with 90% pairwise similarity as long as it is broadly representative. For example, the training dataset of 2319 protein sequences derived by Chou [2001b] for 12 subcellular locations also contains some sequences with high sequence identity, but the average sequence identity in each of the 12 subsets is very low (smaller than 12%), as shown in that paper. Datasets like these are of course very useful in both theoretical study and practical application, although they are still far from perfect. However, one should realize that a good algorithm should achieve both high sensitivity and high specificity for each subclass.
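The requirement of high sensitivity and high specificity for each subclass can be made concrete with a short Python sketch. It assumes the common one-versus-rest definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), which may differ from the definitions used in some of the cited papers. The function name and the toy labels are hypothetical.

```python
def per_class_sensitivity_specificity(true_labels, predicted_labels):
    """Return {class: (sensitivity, specificity)} using one-vs-rest counts.

    A value is None when its denominator is zero.
    """
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    results = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        sensitivity = tp / (tp + fn) if (tp + fn) else None
        specificity = tn / (tn + fp) if (tn + fp) else None
        results[c] = (sensitivity, specificity)
    return results

# Hypothetical toy predictions for three locations.
true_loc = ["cyto", "cyto", "cyto", "nuc", "nuc", "golgi"]
pred_loc = ["cyto", "cyto", "nuc", "nuc", "nuc", "cyto"]

scores = per_class_sensitivity_specificity(true_loc, pred_loc)
for location, (sens, spec) in scores.items():
    print(location, sens, spec)
```

In this toy case the overall accuracy is 4/6, yet the sensitivity for the small "golgi" class is zero. A single overall accuracy figure can hide this kind of failure on sparsely populated locations.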
Owing to the complexity of the problem itself, a good dataset cannot be obtained easily by straightforward extraction from any current database. Constructing a useful dataset for protein subcellular location prediction is indeed a painstaking and time-consuming task. Sometimes, in order to clarify an ambiguous description in the database, one has to spend a great deal of time reading the original papers. To deal with this difficult situation, some very sophisticated and specialized programs are needed both for constructing datasets and for improving the prediction. Furthermore, finding the target information in the mature protein sequence and properly combining it with sorting signals in the reasoning algorithm may further improve the overall predictive accuracy and bring the prediction into practice. The next ten years will also be exciting for sequence analysts [Nakai, 2000], and proteomics will come of age when its revelations about formerly uncharacterized proteins directly drive imaginative hypotheses about their functions [Fields, 2001].
The author of this paper is grateful to Prof. C. T. Zhang for discussions,
and also greatly appreciates the help of Dr. K. C. Chou, Dr. K. Nakai,
and Dr. A. Reinhardt. The present research was supported in part by the
NSFC (grant no. 90103031).