Volume 2

Special Issue
GCB'01




In Silico Biology 2, 0027 (2002); ©2002, Bioinformation Systems e.V.  


An overview on predicting the subcellular location of a protein

Zhi-Ping Feng

Department of Physics, Institute of Science, Tianjin University, Tianjin 300072, P. R. China
LiuHui Center for Applied Mathematics, Nankai University and Tianjin University, Tianjin 300072, P. R. China
Email: zpfeng@eyou.com


Edited by E. Wingender; received November 22, 2001; revised and accepted December 18, 2001; published March 28, 2002


Abstract

This paper reviews the problem of predicting the subcellular location of a protein. Five approaches to extracting information from the whole sequence, each used with the Bayes discriminant algorithm, are reviewed: 1) the auto-correlation functions of amino acid indices along the sequence; 2) the quasi-sequence-order approach; 3) the pseudo-amino acid composition; 4) the unified attribute vector in Hilbert space; and 5) the Zp parameters extracted from the Zp curve. The predictive accuracy actually achieved depends closely on the degree of similarity between the training and testing sets or, in a cross-validation study, on the average pairwise similarity within the dataset. Many researchers hold that the high accuracies currently reported do not guarantee that the available algorithms are effective in practical prediction, because the datasets have high pairwise sequence identity. Others argue that a dataset used for developing software should reflect biological reality: some subcellular locations really do contain only a small number of proteins, and some of those proteins share a high percentage of sequence similarity. Owing to the complexity of the problem, sophisticated and specialized programs are needed both for constructing datasets and for improving prediction. In any case, finding the targeting information in the mature protein sequence and properly combining it with sorting signals may further improve the overall predictive accuracy and bring the prediction into practical use.

Key words: subcellular location, N-terminal targeting sequences, sorting signals, targeting information, amino acid composition, quasi-sequence-order-effect, pseudo-amino acid composition, auto-correlation functions, unified attribute vector, Zp curve, Zp parameters, Bayes discriminant algorithm, component-coupled algorithm, k-nearest neighbor method, hidden Markov model, neural networks, Support Vector Machine (SVM), jackknife test, hydrophobicity, pairwise sequence similarity
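
As an illustration of the first measure listed in the abstract, the following minimal sketch computes auto-correlation values of an amino acid index along a sequence. It rests on assumptions that this page does not state: the index is taken to be the Kyte-Doolittle hydropathy scale, the lag-n value is taken to be the average of h(i)·h(i+n) over all valid positions i, and the function name `autocorrelation` is hypothetical. The methods reviewed in the paper may use a different index or normalization.

```python
# Kyte-Doolittle hydropathy values, used here only as an example
# of an amino acid index (an assumption, not taken from this page).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation(seq, max_lag, index=KYTE_DOOLITTLE):
    """Return [r_1, ..., r_max_lag] for a protein sequence.

    r_n is assumed to be the mean of h(i) * h(i + n) over all
    positions i for which both residues lie inside the sequence.
    A residue missing from `index` raises KeyError.
    """
    values = [index[aa] for aa in seq.upper()]
    length = len(values)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    result = []
    for lag in range(1, max_lag + 1):
        pairs = length - lag  # number of residue pairs separated by `lag`
        total = sum(values[i] * values[i + lag] for i in range(pairs))
        result.append(total / pairs)
    return result
```

For example, `autocorrelation("MKTAYIAKQR", 3)` returns three numbers that could serve as part of a fixed-length feature vector for a classifier, regardless of the length of the input sequence.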