| In Silico Biology 9, 0004 (2008); ©2008, Bioinformation Systems e.V. |
1 Center for Pharmacoinformatics, National Institute of Pharmaceutical Education and Research S.A.S. Nagar, Sector 67, S.A.S Nagar, Punjab 160 062, India
2 Department of Biotechnology, National Institute of Pharmaceutical Education and Research S.A.S. Nagar, India
Email: virag2k@gmail.com; pradeepniper@gmail.com; nilanjanroy@niper.ac.in
* Corresponding author
Email: prabhagarg@niper.ac.in
Edited by H. Michael; received August 23, 2008; revised December 01, 2008; accepted December 03, 2008; published December 23, 2008
High-throughput genome sequencing projects continue to churn out enormous amounts of raw sequence data. However, most of this raw sequence data is unannotated and, hence, not very useful. Among the various approaches to decipher the function of a protein, one is to determine its localization. Experimental approaches for proteome annotation including determination of a protein's subcellular localizations are very costly and labor intensive. Besides the available experimental methods, in silico methods present alternative approaches to accomplish this task. Here, we present two machine learning approaches for prediction of the subcellular localization of a protein from the primary sequence information. Two machine learning algorithms, k Nearest Neighbor (k-NN) and Probabilistic Neural Network (PNN) were used to classify an unknown protein into one of the 11 subcellular localizations. The final prediction is made on the basis of a consensus of the predictions made by two algorithms and a probability is assigned to it. The results indicate that the primary sequence derived features like amino acid composition, sequence order and physicochemical properties can be used to assign subcellular localization with a fair degree of accuracy. Moreover, with the enhanced accuracy of our approach and the definition of a prediction domain, this method can be used for proteome annotation in a high throughput manner.
Availability: SubCellProt is available at www.databases.niper.ac.in/SubCellProt.
Keywords: protein function, subcellular localization, machine learning, PNN, kNN
High-throughput genome sequencing projects are producing an enormous amount of raw sequence data. As of January 2008, the sequencing of around 800 organisms is complete, while the sequencing of more than 3500 organisms is still underway [1]. The protein sequence databases continue to expand, but the methods that can be reliably used to characterize these proteins are far from adequate. The major drawbacks of the experimental methods used to characterize the proteins of various organisms are the time involved, the high cost and the fact that these methods are not amenable to high-throughput operation. In silico approaches provide a viable solution to these problems. Computational characterization of the features of the proteins found or predicted in completely sequenced proteomes is an important task in the search for knowledge of protein function [2]. Protein subcellular localization, a consequence of protein trafficking, is a key functional characteristic of proteins. Localization of proteins to their intended compartments is vital for the structural and functional integrity of the cell. Therefore, comprehensive knowledge of the subcellular localization of proteins is essential for understanding their roles and interacting partners in cellular metabolism.
Exhaustive experimental studies have been carried out to determine the subcellular localization of the entire proteome in yeast [3] and in Arabidopsis [4]. However, such approaches are not practicable for the very large number of protein sequences now being produced, as they would consume considerable resources. A computational approach for reliable prediction of subcellular localization can therefore help narrow the gap between sequence information and the associated functional information.
Computational methods for predicting subcellular localization can be broadly divided into two types: predictions on the basis of finding homology to annotated proteins and machine learning approaches for making predictions. The former method attempts to classify an unknown protein to a subcellular class on the basis of homology [5] or conserved domains or motifs [6]. A major limitation of this method is its inability to correctly classify a protein which has insufficient homology to a known protein. The other approach, i.e. machine learning techniques, infers a predictive model from proteins with known localizations. This approach requires the conversion of the amino acid sequence to indices that characterize the biochemical properties of the protein. Then, it can be used to determine the localization of a newly discovered protein, even if the sequence is different from the known ones. Based on these principles, several computational methods have been developed over the past years. Many among these are based on Support Vector Machines (SVM) such as Subloc [7], ESLPred [8], HSLPred [9], pSLIP [10], CELLO [11] and BaCelLo [12] while others use nearest neighbor algorithms such as Plant-PLoc [13], Euk-PLoc [14] and Hum-PLoc [15]. There are still others which use a hybridization classifier to make predictions [16-18]. Many of these methods are implemented as public servers where a user can simply input a protein sequence and run the prediction. Some of the available public servers include CELLO [12], MultiLoc [19], Proteome Analyst [20], TargetP [21] and WoLF PSORT [22]. Of late, efforts have also been made to predict the sub-organelle localizations like subnuclear [23] and submitochondrial [24]. However, most of these methods employ a small dataset for training and are generally based on a single algorithm. Moreover, a prediction domain is never clearly defined. This aspect is particularly important when a molecular biologist is trying to characterize a newly discovered protein.
In this work, two machine learning algorithms, namely k-Nearest Neighbor (k-NN) and Probabilistic Neural Network (PNN), were used to develop a predictive model for the subcellular localization of proteins. Each prediction is accompanied by a probability so that the end user can gauge the confidence in the prediction.
Datasets
Sequences of 135,527 eukaryotic proteins were retrieved from the UniProtKB/Swiss-Prot database (release 55.2). The following criteria were used to filter the dataset:
After applying the above criteria, a dataset of 38,354 proteins was obtained.
Several programs such as cd-hit [25], PISCES [26] and UniqueProt [27] are freely available to obtain a non-redundant or representative set of proteins from a redundant set. In this work, we used the cd-hit program to reduce the homology bias in the dataset. The dataset of 38,354 protein sequences was run through cd-hit using an identity threshold of 90%, chosen to cover the maximum descriptor space for training the models. Finally, a dataset of 26,156 proteins was obtained, which was used for training.
An independent dataset was used to evaluate the performance of the algorithm. This dataset was used in a previous study by Chou and Shen [28]. The same inclusion and exclusion criteria were applied to this dataset. Proteins which corresponded to more than one location were removed. In addition, a large proportion of the proteins in this test set were already part of the training set; these proteins were removed from the test set and the remaining ones were used to evaluate the performance of the model. This dataset is very stringent in the sense that no two of its proteins share more than 25% sequence identity. In this way, a non-redundant dataset was used for testing. The training dataset contained 26,156 protein sequences for the 11 different locations. The independent test dataset contained 1,660 sequences. Tab. 1 shows the number of proteins in the training and the test set for the respective locations.
| Table 1: | The number of sequences in each location for training and test set. |
| Location | No. of sequences (Training set) | No. of sequences (Test set) |
| Nucleus | 4442 | 227 |
| Extracellular | 5409 | 324 |
| Mitochondria | 2803 | 179 |
| Cytoplasm | 3419 | 148 |
| Chloroplast | 3579 | 307 |
| Plasma membrane | 4849 | 303 |
| Endoplasmic reticulum | 828 | 62 |
| Golgi apparatus | 287 | 32 |
| Peroxisome | 186 | 27 |
| Lysosome | 150 | 22 |
| Vacuole | 204 | 29 |
Construction of feature vector
Subcellular localization to the secretory pathway, chloroplasts and mitochondria is mainly associated with the presence of a signal peptide at the N-terminus of the protein sequence [19]. However, some proteins are secreted by non-classical routes and do not require N-terminal signal peptides [29, 30]. Nuclear localization signals are not necessarily located at the N-terminus. Furthermore, some proteins are translocated into mitochondria owing to a localization peptide in the C-terminal region. Thus, a reliable method for subcellular localization prediction must take into account features from the whole protein sequence.
Pseudo-amino acid composition (Pse-AAC)
The concept of pseudo-amino acid composition was introduced by Chou [31]. This composition takes into account three physicochemical properties of the constituent amino acids, namely hydrophobicity, hydrophilicity and mass, together with the percentage content of amino acids. Each protein sequence is represented by an n-dimensional vector. The first 20 components of this vector represent the conventional amino acid composition, while the next (n−20) components incorporate the sequence order effects along the length of the protein sequence.
Consider a protein chain of L amino acid residues:
| $R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L$ | (1) |
These amino acid residues are replaced by their corresponding normalized values of hydrophilicity, hydrophobicity and mass (Tab. 2). The sequence order effects are incorporated by the use of the following factors:
![]() | where λ < L | (2) |
Here θ1 is the first-tier correlation factor, which reflects the sequence order correlation between all the most contiguous residues along a protein chain (λ = 1); θ2 is the second-tier correlation factor, which reflects the sequence order correlation between all the second most contiguous residues (λ = 2); and so forth. In general, θn is the n-th tier correlation factor. The greater the value of λ, the more of the sequence order effect is incorporated in the Pse-AAC. However, a large value of λ is not always practical, as it increases the dimensionality of the data. We have taken λ = 9, since it adds only nine dimensions while ensuring that the sequence order effects are incorporated adequately.
μ is a normalization term and, as suggested by Chou [31], can range from 0.05 to 0.5. We tried different values of μ, and the best results were achieved with μ = 0.5. In Equation (2), the correlation function (Θ) is given by
| $\Theta(R_t, R_{t+\lambda}) = \dfrac{[H_1(R_t) - H_1(R_{t+\lambda})]^2 + [H_2(R_t) - H_2(R_{t+\lambda})]^2 + [M(R_t) - M(R_{t+\lambda})]^2}{3}$ | (3) |
where H1(Rt), H2(Rt), and M(Rt) are, respectively, the hydrophobicity value, hydrophilicity value, and side-chain mass of the amino acid Rt, and H1(Rt+λ), H2(Rt+λ), and M(Rt+λ) the corresponding values for the amino acid Rt+λ.
| Table 2: | Amino acid indices [31] |
| Amino acid | Hydrophilicity | Hydrophobicity | Mass |
| A | −0.148032339 | 0.620140363 | −1.551627700 |
| C | −0.407738196 | 0.290065654 | −0.516130590 |
| D | 1.669908661 | −0.900203753 | −0.127819174 |
| E | 1.669908661 | −0.740167530 | 0.325210811 |
| F | −1.186855767 | 1.190269407 | 0.907677935 |
| G | 0.111673519 | 0.480108668 | −2.004657685 |
| H | −0.148032339 | −0.400090557 | 0.616444373 |
| I | −0.823267567 | 1.380312421 | −0.192537744 |
| K | 1.669908661 | −1.500339588 | 0.325210811 |
| L | −0.823267567 | 1.060239976 | −0.192537744 |
| M | −0.563561710 | 0.640144891 | 0.389929380 |
| N | 0.215555861 | −0.780176586 | −0.160178459 |
| P | 0.111673519 | 0.120027167 | −0.677927014 |
| Q | 0.215555861 | −0.850192433 | 0.292851526 |
| R | 1.669908661 | −2.530572772 | 1.231270782 |
| S | 0.267497033 | −0.180040751 | −1.033879145 |
| T | −0.096091167 | −0.050011320 | −0.580849160 |
| V | −0.667444053 | 1.080244504 | −0.645567729 |
| W | −1.654326310 | 0.810183378 | 2.169690037 |
| Y | −1.082973424 | 0.260058862 | 1.425426490 |
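The correlation function of Eq. (3) and the tiered correlation factors can be sketched in a few lines of Python using the indices of Tab. 2. This is only an illustrative sketch, not the authors' implementation: it assumes that θλ is the average of Θ over all residue pairs that are λ positions apart, as in Chou's original formulation [31], and it does not apply the normalization term μ. The names `INDICES`, `big_theta` and `correlation_factors` are hypothetical.

```python
# Normalized indices copied from Tab. 2: (hydrophilicity, hydrophobicity, mass).
INDICES = {
    "A": (-0.148032339, 0.620140363, -1.551627700),
    "C": (-0.407738196, 0.290065654, -0.516130590),
    "D": (1.669908661, -0.900203753, -0.127819174),
    "E": (1.669908661, -0.740167530, 0.325210811),
    "F": (-1.186855767, 1.190269407, 0.907677935),
    "G": (0.111673519, 0.480108668, -2.004657685),
    "H": (-0.148032339, -0.400090557, 0.616444373),
    "I": (-0.823267567, 1.380312421, -0.192537744),
    "K": (1.669908661, -1.500339588, 0.325210811),
    "L": (-0.823267567, 1.060239976, -0.192537744),
    "M": (-0.563561710, 0.640144891, 0.389929380),
    "N": (0.215555861, -0.780176586, -0.160178459),
    "P": (0.111673519, 0.120027167, -0.677927014),
    "Q": (0.215555861, -0.850192433, 0.292851526),
    "R": (1.669908661, -2.530572772, 1.231270782),
    "S": (0.267497033, -0.180040751, -1.033879145),
    "T": (-0.096091167, -0.050011320, -0.580849160),
    "V": (-0.667444053, 1.080244504, -0.645567729),
    "W": (-1.654326310, 0.810183378, 2.169690037),
    "Y": (-1.082973424, 0.260058862, 1.425426490),
}


def big_theta(a, b):
    """Eq. (3): mean squared difference of the three indices of residues a and b."""
    return sum((x - y) ** 2 for x, y in zip(INDICES[a], INDICES[b])) / 3.0


def correlation_factors(seq, max_lag=9):
    """Return [theta_1, ..., theta_max_lag] for a protein sequence.

    Assumption: theta_lag is the average of Eq. (3) over all residue pairs
    that are `lag` positions apart; the term mu is not applied here.
    """
    length = len(seq)
    if max_lag >= length:
        raise ValueError("the largest lag must be smaller than the sequence length")
    return [
        sum(big_theta(seq[t], seq[t + lag]) for t in range(length - lag)) / (length - lag)
        for lag in range(1, max_lag + 1)
    ]
```

Residues outside the 20 standard amino acids (for example X or B) raise a `KeyError` in this sketch; how such residues were handled in the original work is not stated.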
The sequence order effect of a protein is reflected through a set of sequence-correlation factors θ1, θ2, ..., θλ as defined by Eqs. (2) and (3). To incorporate these sequence-correlation factors alongside the amino acid composition, we define a (20 + 9)-D feature vector instead of the conventional 20-D vector. The first 20 components represent the amino acid composition, while the remaining 9 components capture the sequence order effects.
Amino acid composition (AAC)
The first twenty components of the modified feature vector are obtained by calculating the amino acid frequencies and subjecting them to normalization according to the following equation:
| $f_u = \dfrac{f_i - f_{\min}}{f_{\max} - f_{\min}}$ | (4) |
where
$f_u$ is the normalized occurrence frequency of each of the twenty amino acids;
$f_i$ is the amino acid frequency for a particular amino acid in a protein sequence;
$f_{\max}$ is the largest amino acid frequency for any amino acid in the protein sequence;
$f_{\min}$ is the smallest amino acid frequency for any amino acid in the protein sequence.
The next 9 components are the correlation factors θ1 to θ9, calculated according to Equations (2) and (3). In this work, each protein sequence is thus represented as a 29-dimensional vector.
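As an illustration of Eq. (4), the following Python sketch computes the 20 min-max normalized amino acid frequencies of a sequence. The function name `normalized_aac`, the alphabetical residue order, and the choice of returning all zeros when every frequency is identical are assumptions made for this example only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 components


def normalized_aac(seq):
    """Return the 20 amino acid frequencies of `seq`, min-max scaled as in Eq. (4)."""
    if not seq:
        raise ValueError("empty sequence")
    # Raw frequency of each amino acid in this sequence.
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    f_min, f_max = min(freqs), max(freqs)
    if f_max == f_min:
        # Degenerate case (all frequencies equal): Eq. (4) is undefined,
        # so this sketch returns zeros by convention.
        return [0.0] * len(AMINO_ACIDS)
    return [(f - f_min) / (f_max - f_min) for f in freqs]
```

For the toy sequence `"AAC"`, the most frequent residue (A) maps to 1, C maps to 0.5, and every absent residue maps to 0.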
Algorithms
Two algorithms, k-Nearest Neighbor (k-NN) and Probabilistic Neural Network (PNN) were trained on the same learning dataset and their performance was evaluated on the same independent test dataset.
k-Nearest Neighbor (k-NN)
In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the k closest training examples in the feature space. To apply the k-NN algorithm to a classification problem, a labeled dataset and a metric to measure the proximity of two vectors in an n-dimensional space are required. In this work, the cosine metric is used to measure the proximity of one protein to another. For each protein, its dot product with every other protein in the dataset is calculated. From this dot product, the cosine of the angle between the two protein vectors in the n-dimensional space is calculated as given by Equation (5):
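| $\cos\theta = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, where A and B are the feature vectors of the two proteins | (5) |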
Based on this metric, the nearest neighbor(s) of an unknown object are identified, and their class is assigned to the unknown object. When only the single closest neighbor is considered, the algorithm is known as the Nearest Neighbor algorithm. When more than one neighbor is considered, the class of the unknown object is assigned by either a majority rule or a consensus rule. The greater the cosine between two proteins, the closer the two proteins are. A cosine of 1 corresponds to an angle of 0°, i.e. the two feature vectors point in exactly the same direction.
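The cosine-based nearest neighbor search described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: feature vectors are plain Python lists, ties in the majority vote are resolved in favor of the closest neighbor, and the names `cosine` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between vectors a and b (dot product over norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def knn_predict(query, train_vectors, train_labels, k=1):
    """Return (predicted label, angle in degrees to the closest training vector)."""
    # Rank training examples by decreasing cosine, i.e. increasing angle.
    ranked = sorted(
        ((cosine(query, v), label) for v, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = ranked[:k]
    # Majority vote among the k nearest neighbors; Counter keeps first-seen
    # order for equal counts, so ties go to the closest neighbor.
    label = Counter(lab for _, lab in top).most_common(1)[0][0]
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    best_cos = max(-1.0, min(1.0, top[0][0]))
    return label, math.degrees(math.acos(best_cos))
```

Returning the angle to the closest hit alongside the label is convenient because that angle is what the confidence assignment later in the paper is based on.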
Probabilistic Neural Networks (PNN)
PNN are conceptually similar to k-NN models. The k-NN algorithm considers only the nearest neighbor(s) of an unknown object in the feature space when assigning a class to it, whereas the PNN algorithm considers all training examples in determining the final class of the unknown object. In the first layer, the distance from the point being evaluated to each of the training points is computed, and a radial basis function (RBF), also called a kernel function, is applied to the distance to compute the weight of each point. The radial basis function is so named because the radius distance is the argument to the function. The second layer sums these contributions for each class of inputs to produce as its net output a vector of probabilities. Finally, a compete transfer function on the output of the second layer picks the maximum of these probabilities, and produces a 1 for that class and a 0 for the other classes.
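The three PNN stages described above (kernel weights, per-class summation, winner selection) can be illustrated with a short Python sketch. The source does not state the kernel, the distance measure or the spread parameter, so the Gaussian kernel on Euclidean distance and the `sigma` value used here are assumptions, and `pnn_predict` is a hypothetical name.

```python
import math
from collections import defaultdict


def pnn_predict(query, train_vectors, train_labels, sigma=0.1):
    """Return (winning class, dict of normalized class scores) for `query`."""
    class_sums = defaultdict(float)
    for vector, label in zip(train_vectors, train_labels):
        # First layer: squared Euclidean distance passed through a Gaussian RBF
        # (assumed kernel; sigma is an assumed spread parameter).
        dist_sq = sum((q - x) ** 2 for q, x in zip(query, vector))
        weight = math.exp(-dist_sq / (2.0 * sigma ** 2))
        # Second layer: sum the contributions of each class.
        class_sums[label] += weight
    total = sum(class_sums.values())
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; try a larger sigma")
    probabilities = {label: s / total for label, s in class_sums.items()}
    # Competitive output: the class with the largest summed contribution wins.
    winner = max(probabilities, key=probabilities.get)
    return winner, probabilities
```

Unlike the k-NN sketch, every training example contributes to the decision here; the spread `sigma` controls how quickly the influence of distant examples decays.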
The dataset of 26,156 proteins was used to build the models, and five-fold cross validation was used to validate them. The performance of the models was also evaluated on an independent test set of 1,660 proteins. The usual metrics for assessing a classification algorithm, viz. precision, recall/sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC), were calculated. Sensitivity and specificity are two competing but non-exclusive measures of quality that are useful for testing the performance of classification methods. Ideally, both values should be as close to 1 as possible. The MCC provides a single measure that evaluates sensitivity and specificity together. The results for five-fold cross validation and for the test set for both algorithms are given in Tabs. 3-6.
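For reference, the per-location metrics listed above can be computed from true and predicted labels with the standard one-vs-rest definitions, as in the sketch below. The function name `one_vs_rest_metrics` and the convention of returning 0 when a denominator is zero are choices made for this example, not details taken from the original work.

```python
import math


def one_vs_rest_metrics(true_labels, predicted_labels, location):
    """Precision, recall, specificity, accuracy and MCC for one location."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == location and truth == location:
            tp += 1
        elif pred == location:
            fp += 1
        elif truth == location:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

Because each location is scored against all others combined, rare locations have many true negatives, which is why specificity and accuracy can be high even where precision, recall and MCC are low.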
| Table 3: | The statistical parameters associated with five-fold cross validation (k-NN method). |
| Location | n | Precision | Recall/Sensitivity | Specificity | Accuracy | MCC |
| Chloroplast | 3579 | 0.77 | 0.78 | 0.96 | 0.94 | 0.74 |
| Cytoplasm | 3419 | 0.50 | 0.55 | 0.92 | 0.87 | 0.45 |
| Endoplasmic reticulum | 828 | 0.47 | 0.45 | 0.98 | 0.97 | 0.44 |
| Extracellular | 5409 | 0.84 | 0.74 | 0.96 | 0.92 | 0.74 |
| Golgi apparatus | 287 | 0.19 | 0.16 | 0.99 | 0.98 | 0.17 |
| Lysosome | 150 | 0.51 | 0.53 | 1.00 | 0.99 | 0.51 |
| Mitochondria | 2803 | 0.66 | 0.61 | 0.96 | 0.92 | 0.59 |
| Nucleus | 4442 | 0.60 | 0.64 | 0.91 | 0.87 | 0.54 |
| Peroxisome | 186 | 0.26 | 0.31 | 0.99 | 0.99 | 0.28 |
| Plasma membrane | 4849 | 0.71 | 0.76 | 0.93 | 0.90 | 0.67 |
| Vacuole | 204 | 0.26 | 0.20 | 1.00 | 0.99 | 0.22 |
| Table 4: | The statistical parameters associated with independent test set (k-NN method). |
| Location | n | Precision | Recall/Sensitivity | Specificity | Accuracy | MCC |
| Chloroplast | 307 | 0.84 | 0.72 | 0.97 | 0.92 | 0.74 |
| Cytoplasm | 148 | 0.53 | 0.66 | 0.94 | 0.92 | 0.55 |
| Endoplasmic reticulum | 62 | 0.58 | 0.55 | 0.98 | 0.97 | 0.55 |
| Extracellular | 324 | 0.90 | 0.79 | 0.98 | 0.94 | 0.81 |
| Golgi apparatus | 32 | 0.56 | 0.56 | 0.99 | 0.98 | 0.55 |
| Lysosome | 22 | 0.86 | 0.86 | 1.00 | 1.00 | 0.86 |
| Mitochondria | 179 | 0.66 | 0.65 | 0.96 | 0.93 | 0.62 |
| Nucleus | 227 | 0.67 | 0.80 | 0.94 | 0.92 | 0.68 |
| Peroxisome | 27 | 0.63 | 0.56 | 0.99 | 0.99 | 0.58 |
| Plasma membrane | 303 | 0.77 | 0.80 | 0.95 | 0.92 | 0.74 |
| Vacuole | 29 | 0.50 | 0.41 | 0.99 | 0.98 | 0.45 |
| Table 5: | The statistical parameters associated with five-fold cross validation (PNN method). |
| Location | n | Precision | Recall/Sensitivity | Specificity | Accuracy | MCC |
| Chloroplast | 3579 | 0.78 | 0.78 | 0.96 | 0.94 | 0.74 |
| Cytoplasm | 3419 | 0.50 | 0.53 | 0.92 | 0.87 | 0.44 |
| Endoplasmic reticulum | 828 | 0.47 | 0.44 | 0.98 | 0.97 | 0.44 |
| Extracellular | 5409 | 0.85 | 0.73 | 0.97 | 0.92 | 0.74 |
| Golgi apparatus | 287 | 0.19 | 0.15 | 0.99 | 0.98 | 0.16 |
| Lysosome | 150 | 0.51 | 0.53 | 1.00 | 0.99 | 0.52 |
| Mitochondria | 2803 | 0.67 | 0.61 | 0.96 | 0.93 | 0.60 |
| Nucleus | 4442 | 0.60 | 0.66 | 0.91 | 0.87 | 0.55 |
| Peroxisome | 186 | 0.27 | 0.27 | 0.99 | 0.99 | 0.27 |
| Plasma membrane | 4849 | 0.70 | 0.77 | 0.92 | 0.90 | 0.67 |
| Vacuole | 204 | 0.26 | 0.19 | 1.00 | 0.99 | 0.22 |
| Table 6: | The statistical parameters associated with independent test set (PNN method). |
| Location | n | Precision | Recall/Sensitivity | Specificity | Accuracy | MCC |
| Chloroplast | 307 | 0.84 | 0.70 | 0.97 | 0.92 | 0.72 |
| Cytoplasm | 148 | 0.54 | 0.65 | 0.95 | 0.92 | 0.55 |
| Endoplasmic reticulum | 62 | 0.58 | 0.55 | 0.98 | 0.97 | 0.55 |
| Extracellular | 324 | 0.90 | 0.80 | 0.98 | 0.94 | 0.81 |
| Golgi apparatus | 32 | 0.59 | 0.59 | 0.99 | 0.98 | 0.59 |
| Lysosome | 22 | 0.95 | 0.86 | 1.00 | 1.00 | 0.90 |
| Mitochondria | 179 | 0.67 | 0.66 | 0.96 | 0.93 | 0.62 |
| Nucleus | 227 | 0.65 | 0.81 | 0.93 | 0.91 | 0.67 |
| Peroxisome | 27 | 0.57 | 0.44 | 0.99 | 0.99 | 0.50 |
| Plasma membrane | 303 | 0.77 | 0.83 | 0.94 | 0.92 | 0.75 |
| Vacuole | 29 | 0.58 | 0.38 | 1.00 | 0.98 | 0.46 |
The results obtained with the two algorithms were compared with respect to the different statistical parameters. The two algorithms perform almost equally well in making the classification. For the five-fold cross validation results obtained with k-NN (Tab. 3), the MCC ranges from 0.17 to 0.74 across locations. The MCC is low for three locations, viz. Golgi apparatus, peroxisome and vacuole (0.17, 0.28 and 0.22, respectively). Similarly, for the five-fold cross validation results obtained with the PNN method (Tab. 5), the MCC for these three locations is again low (0.16, 0.27 and 0.22). These low values can be attributed to the comparatively small number of proteins in the training set for these locations, so that the descriptor space is not adequately covered. Conversely, the locations with a higher MCC (>0.65), e.g. chloroplast, extracellular and plasma membrane, have a large number of proteins in the training set. The same trend was observed for the test set (Tabs. 4 and 6). Since the performance of both algorithms depends on the number of instances seen during training, the statistical parameters are likely to improve if more instances are added to the training set.
The models are built for 11 subcellular localizations only; if the query protein has a localization other than these 11, the models will still classify it into one of them. Since the objective of this work is to classify proteins into subcellular localizations correctly, the probability of correct classification has been estimated from the angle θ between the query protein and its closest hit. To this end, a retrospective analysis of the results obtained with the k-NN algorithm was carried out, and the prediction accuracies of the five-fold cross validation and of the independent test set were used as a basis to define the probabilities (Tab. 7). For both sets of results, we examined how the prediction accuracy varies with the range of the angle between a query protein and its closest hit. It is evident that the percentage of correct predictions increases as θ decreases. This implies that the lower the value of θ between a query protein and its nearest neighbor in the dataset, the higher the confidence in the prediction. This is fundamental to our algorithm: the nearer an unknown object is to a known object, the better the chances of making a correct prediction. The prediction accuracies for the angle ranges (6 > θ ≥ 0) and (8 > θ ≥ 6) in five-fold cross validation are 85.39% and 60.72%, respectively. Similarly, the prediction accuracies for the independent dataset for these two ranges are 90.21% and 61.57%. The same behavior was thus observed for both the five-fold cross validation and the independent test results. We therefore take 6° as the cut-off value for the angle between the query protein and its closest hit. If the angle between an unknown protein and its nearest neighbor is lower than 6°, the confidence in the predicted location class is high. Conversely, if θ is 6° or more, the confidence in the predicted location class is low.
| Table 7: | The ranges of angle and the corresponding percent accuracy. |
| Angle θ (degree) | % Accuracy (five-fold cross validation) | % Accuracy (test set) |
| 6 > θ ≥ 0 | 85.39 | 90.21 |
| 8 > θ ≥ 6 | 60.72 | 61.57 |
| 10 > θ ≥ 8 | 56.39 | 53.78 |
| 12 > θ ≥ 10 | 56.22 | 49.56 |
To finally classify a protein into one of the eleven subcellular localizations, we used the following methodology. If both algorithms give the same prediction for an unknown protein, it is assigned to that consensus location. The θ value associated with the k-NN prediction is then examined; if it is less than 6°, a high probability of 0.85 is assigned to the prediction that the protein is targeted to the consensus location. If the predictions from both algorithms are the same but θ is at least 6° and less than 8°, a lower probability of 0.6 is assigned. The probabilities assigned to the various possibilities are given in Tab. 8.
| Table 8: | The probabilities associated with each possibility. |
| Prediction by k-NN | Prediction by PNN | Angle θ (degree) | Result | Probability |
| A | A | 6 > θ ≥ 0 | Protein is classified into the consensus location | 0.85 |
| A | A | 8 > θ ≥ 6 | Protein is classified into the consensus location | 0.6 |
| A | B | 6 > θ ≥ 0 | Protein is classified into two locations | 0.4 for both the localizations i. e. A and B |
| A | B | 8 > θ ≥ 6 | Protein is classified into two locations | 0.25 for both the localizations i. e. A and B |
| A | A | 10 > θ ≥ 8 | Protein is classified into the consensus location | 0.25 |
| Others | | | | 0.1 |
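The decision rules of Tab. 8 can be written compactly as a lookup on whether the two algorithms agree and on the k-NN angle θ. The sketch below is one possible reading of the table: the function name `consensus_prediction` is hypothetical, and returning every predicted location for the "others" row is an assumption, since the table does not say which location is reported in that case.

```python
def consensus_prediction(knn_label, pnn_label, theta):
    """Return (list of predicted locations, probability) following Tab. 8.

    `theta` is the angle in degrees between the query protein and its
    nearest neighbor found by k-NN.
    """
    agree = knn_label == pnn_label
    locations = [knn_label] if agree else [knn_label, pnn_label]
    if 0 <= theta < 6:
        probability = 0.85 if agree else 0.4
    elif 6 <= theta < 8:
        probability = 0.6 if agree else 0.25
    elif 8 <= theta < 10 and agree:
        probability = 0.25
    else:
        # "Others" row of Tab. 8; reporting all predicted locations here
        # is an assumption of this sketch.
        probability = 0.1
    return locations, probability
```

For example, if both algorithms predict the nucleus and θ is 3°, the sketch returns the single location "Nucleus" with probability 0.85; if they disagree at θ = 7°, both locations are returned with probability 0.25 each.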
The models developed here can be used to annotate uncharacterized proteomes with eleven subcellular localizations in a high-throughput manner. The work presents two machine learning approaches for assigning subcellular localizations to uncharacterized proteins. Since the training dataset was drawn from a wide range of eukaryotic organisms, the approach should be applicable to any eukaryotic organism. The use of a redundancy-reduced dataset for training also helps the model generalize. The key features of this approach are the use of two different algorithms for prediction and the angle cut-off that we have stated, so that the degree of confidence in a prediction is known. This approach can be used to assign locations to uncharacterized proteins and may prove a useful tool to facilitate the annotation of various proteomes.
Availability: SubCellProt is available at www.databases.niper.ac.in/SubCellProt
Uniprot Accession numbers and their respective locations for Training set (TrainingSet.xls) and Test set (TestSet.xls).