
In Silico Biology 8, 0042 (2008); ©2008, Bioinformation Systems e.V.  



Analysis of n-gram based promoter recognition methods and application to whole genome promoter prediction

T. Sobha Rani* and Raju S. Bapi

Computational Intelligence Lab, Department of Computer and Information Sciences, University of Hyderabad, Hyderabad, India

* Corresponding author
   Email: tsrcs@uohyd.ernet.in


Edited by H. Michael; received May 27, 2008; revised September 15, 2008; accepted September 19, 2008; published January 01, 2009


Abstract

Promoter prediction is an important and complex problem. Pattern recognition algorithms typically require features that can capture this complexity. Promoter sequences may be biased towards certain combinations of base pairs. To detect such biases, n-grams are usually extracted and analyzed. An n-gram is a run of n contiguous characters from a given character stream, here a DNA sequence segment. In this work a systematic study is made of the efficacy of n-grams for n = 2, 3, 4, 5 in promoter prediction, using them as features for a neural network classifier of E. coli and Drosophila promoters. For E. coli, n = 3 appears to give the best prediction performance, and for Drosophila, n = 4. Using the 3-gram features, promoter prediction is carried out on the genome sequence of E. coli. The results are encouraging in terms of positive identification of promoters in the genome when compared with software packages such as BPROM, NNPP, and SAK. Whole-genome promoter prediction was also performed for Drosophila, using 4-gram features.
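
As an illustration of the n-gram definition above, the following minimal Python sketch turns a DNA segment into a fixed-length feature vector with one entry per possible n-gram (4^n entries in total). The function name `ngram_features`, the use of overlapping windows, the normalization to relative frequencies, and the skipping of windows that contain ambiguous bases are all assumptions made for this example. The abstract does not specify these details, so this should not be read as the authors' exact feature construction.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def ngram_features(sequence, n):
    """Return relative frequencies of all 4**n n-grams in lexicographic order.

    Hypothetical helper for illustration: overlapping windows are counted,
    and windows containing characters outside ACGT (e.g. N) are skipped.
    """
    sequence = sequence.upper()
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    # Slide a window of width n over the sequence, one base at a time.
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        if gram in counts:
            counts[gram] += 1
            total += 1
    if total == 0:
        # Sequence shorter than n, or no valid window: all-zero vector.
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


if __name__ == "__main__":
    features = ngram_features("TTGACATATAAT", 3)
    print(len(features))  # number of 3-gram features
```

With this layout, n = 3 gives 64 features per sequence and n = 4 gives 256, and such a vector could serve as the input to a classifier such as a neural network.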


Keywords: biological data sets, machine learning method, neural networks, in silico method for promoter prediction, binary classification, cascaded classifiers