|In Silico Biology 7, 0010 (2007); ©2007, Bioinformation Systems e.V.|
1 Department of Computer Engineering and Informatics,
University of Patras, GR-26500, Rio, Patras, Greece
2 Abramson Family Cancer Research Institute, 412 Curie Blvd. Philadelphia, PA 19104-6160, University of Pennsylvania, School of Medicine, Philadelphia, USA
3 Department of Environmental and Natural Resources Management University of Ioannina, Greece
* Corresponding author
Edited by E. Wingender; received August 05, 2006; revised October 24 and December 12, 2006; accepted December 12, 2006; published January 06, 2007
Ab initio coding potential prediction in a bacterial genome is an important step in determining an organism's transcriptional regulatory function. Extensive studies of genes structure have been carried out in a few species such as Escherichia coli, fewer resources exist in newly sequenced genomes like Wolbachia. A model of gene prediction trained on one species may not reflect the properties of other, distantly related prokaryotic organisms. These issues were encountered in the course of predicting genes in the genome of Wolbachia, very important gramnegative bacteria that form intracellular inherited infections in many invertebrates. We describe a coding potential predictor based on artificial neural networks and we compare its performance by using different architectures, learning algorithms and parameters. We rely on a dataset of positive samples constructed from coding sequences and on a negative dataset consisted of all the intergenic regions that were not located between the genes of an operon. Both datasets, positive and negative, were output as fasta formatted files and were used for neural network training. The fast, adaptive, batch learning algorithm Resilient propagation, exhibits the best overall performance on a 64input-10hidden-1output nodes multi-layer perceptron neural network.
Keywords: gene prediction, prokaryotes, Wolbachia, artificial neural networks, resilient backpropagation
The complete sequence of the 1,267,782 bp single circular molecular genome of Wolbachia pipientis wMel, an obligate intracellular bacteria of Drosophila melanogaster, has now been determined. Its overall GC content is 35.1% . Wolbachia, which is found in a variety of invertebrate species, is of great interest due to its diverse interactions with different hosts, which range from many forms of reproductive parasitism to mutualistic symbioses.
Analysis of the wMel genome, in particular phylogenomic comparisons with other intracellular bacteria, has revealed many insights into the biology and evolution of wMel and Wolbachia in general .
With the availability of the complete genomes of both the host (Drosophila melanogaster) and the symbiotic microorganism (Wolbachia) species and the existence of excellent genetic tools for the host, the Drosophila-Wolbachia symbiosis is now an ideal system for studying the biology and evolution of Wolbachia infections. An interesting problem for assembly is the high percentage of contamination from Drosophila in the sequencing library, a consequence of the fact that the organism had to be grown inside flies to prepare its DNA for sequencing.
Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced. In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box  and the transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame, which is typically many hundred or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 out of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20-25 codons, or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequence of this length. These characteristics make prokaryotic gene finding more straightforward compared to eukaryotes where the existence of introns and the regulation of gene expression from many different elements make the gene prediction much more complicated.
A number of computational methods exist for coding prediction [4 - 10]. The majority of these methods identify coding potential in prokaryotes using complex probabilistic models, such as hidden Markov models, in order to combine information from a variety of different signal and content measurements. To the best of our knowledge the Glimmer  and GeneMark  are the most widely used gene finders for prokaryotes. Nevertheless, computational gene prediction methods have yet to achieve perfect accuracy, even in the relatively simple prokaryotic genomes. Problems in gene prediction centre on the fact that many protein families remain uncharacterized. As a result, it seems that only approximately half of an organism's genes can be confidently predicted on the basis of homology to other known genes, so ab initio prediction methods are usually employed to identify many protein-coding regions of DNA. In this work, we seek to compare and evaluate the performance of specially designed neural networks for coding potential prediction in the Wolbachia strain wMel.
We have implemented several feedforward neural networks (FFNN) with different architectures, different learning algorithms and starting parametric values and we have compared them by means of their performance in coding potential prediction. All the multi-layer perceptrons had a fixed number of input-output while the hidden layer could have a variable length of nodes with permitted values range from zero to ten. That is 64 input layer nodes, 0 to 10 nodes for a single hidden layer and 1 node for the output layer. The input layer has 64 nodes because this is the number of different codons in all living organisms. The output layer has only one node because the output has a binary format one for the positive case and zero for the negative. A sigmoidal (S-shaped) function used to transform the activation level of a unit (neuron) into an output signal.
We made use of the codon usage frequencies in order to train and evaluate our predictor. Codons are triplets of nucleotides that together specify an amino acid residue in a protein. Most organisms use 20 amino acids to make their proteins. Because there are four possible nucleotides with the nucleobases adenine (A), guanine (G), cytosine (C) and thymine (T) in DNA, there are 64 possible triplets to recognize only 20 amino acids plus the translation termination signal. Because of this redundancy, most amino acids are coded by more than one triplet. Different organisms often show particular preferences for one of the several codons that encode the same given amino acid. How these preferences arise is a much debated area of molecular evolution. Finding the codons that an organism prefers to use in order to build its proteins can predict the rest of this organism's genes. This, of course, applies to amino acids that are encoded by more than one codon. In most living organisms, including bacteria, all but two amino acids are encoded by at least two codons. Only two amino acids are specified by a single codon. One of these is the amino acid methionine, specified by the codon AUG which also specifies the start of translation, and the other is tryptophan, specified by the codon UGG.
Positive and negative sequence datasets were extracted from the Wolbachia wMel genomic sequence using locally developed Perl scripts. A positive sequence dataset was constructed from sequences that were very likely to encode a real gene. To this end, as a first step, all the long Open Reading Frames (ORFs) (longer than 300 nucleotides) were extracted from the raw genomic sequence of the Wolbachia strain wMel. ORFs that are long enough are more likely to represent real genes. As a result, a downstream analysis based on these ORFs is more reliable. Subsequently, the homologs of all these ORFs were searched. Searches were conducted against the "non-redundant" (known as "nr") database of NCBI (National Center for Biotechnology Information), using the BLAST program . If an ORF did not have a good matching in the database with another protein of known function, then this ORF was filtered out and not included in any downstream analysis. Significant database matches were those having at least 70% identity at the amino acid level while the alignment length should be above 80% of the ORF's length.
Negative sequence dataset was constructed from non-coding sequences. In order to successfully select these sequences operon prediction was carried out. Operons in prokaryotic microorganisms (like Wolbachia) are groups of genes that are simultaneously transcribed. Usually all these genes are located in the same strand and have a small spacing between them. We picked the 50 bp spacing threshold in order to identify genes (of the same strand) belonging to the same operon. Subsequently, we chose as negative sequences (non-coding) those located between two neighboring positive (coding) sequences but not located between two coding sequences which belonged to the same operon. Thus, we chose as negative sequences those which are not simply non-coding, but also non-transcribed. As it is known, intergenic sequences (sequences between genes/ORFs) which are located between genes of the same operon are transcribed even though they are not translated.
Both datasets, positive and negative, were output as fasta formatted files and used for neural network training. The positive dataset contains 1122 positive data and the negative dataset 1298. One of the major advantages of the neural nets is their ability to generalize. This means that a trained neural network could classify data from the same class as the learning data that it has never seen before. In our problem we have only a small part of all possible patterns (codon usage in the Wolbachia genome) for the generation of a neural net. Thus, to reach the best generalization, the dataset should be split into three parts and our dataset was separated into training, evaluation and test sub-sets. The cardinality of each set is the following:
The training set is used to train a neural net. The error of this dataset is minimized during training. Finally, codon usage frequencies were extracted from both positive and negative sequences and printed out as vectors that can be analyzed by the Stuttgart Neural Network Simulator SNNS .
For this latter sequence manipulation we used a locally developed java script. Briefly, the steps followed for pattern generation are the following:
For the training process two online and two batch learning algorithms were used. The Standard Backpropagation algorithm and the Backpropagation with momentum were used for online learning, while for batch learning we have used the Resilient propagation (Rprop) and the Quick Propagation.
In online learning, the weight changes are applied to the network after each training pattern that is after each forward and backward pass. In offline learning or batch learning the weight changes are cumulated for all patterns in the training file and the sum of all changes is applied after one full cycle (epoch) through the training pattern file.
The validation set consists of 604 (280 positive, 324 negative) data and it was used to tune the parameters like the network architecture, (number of hidden units) in a neural network. This approach is called the hold out method . The learning procedure was stopped in the minimum of the validation set error. At this point the net generalizes best. When learning is not stopped, overtraining occurs and the performance of the net on the whole data decreases, despite the fact that the error on the training data still gets smaller. In our experiment we perform one validation cycle every 100 training cycles. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.
It is unsafe to judge the network performance from the training data alone. Therefore it is of great need to load in a test set once in a while to ensure that the net is not over-training and generalizing correctly.
A test set consisted of 281 positive and 325 negative data not previously shown to the networks was used to assess the overall performance (generalization) of the neural networks. The estimation of the accuracy for each neural network predictor is presented in detail in Tab. 1 in the Results section.
Neural network learning algorithms
The most famous online learning algorithm is the Standard Backpropagation . In the Standard Backpropagation learning algorithm online training is usually significantly faster than batch training, especially in the case of large training sets with many similar training examples.
The backpropagation weight update rule, also called delta-rule works as follows:
|η :||learning rate|
|δj :||This is an estimation of the slope of the error surface and between the real output and the teaching input of unit j|
|tj :||teaching input of unit j|
|oi :||output of the preceding unit i|
|oj :||output of unit j|
|i :||index of a predecessor to the current unit j with link wij from i to j|
|j :||index of the current unit|
|k :||index of a successor to the current unit j with link wjk from j to k|
|wjk :||weight of the link from unit j to unit k|
|f 'j (netj) :||the derivative of the activation function of unit j|
The standard backpropagation takes two parameters: the learning rate and the parameter dmax. We define the parameter dmax as the maximum difference between a teaching value tj and an output oj of an output unit which is tolerated. For example if we want values above 0.9 to be regarded as 1 and values below 0.1 as 0, then dmax should be set to 0.1. We use this parameter in order to prevent the overtraining of the network. In our experiments the dmax was set to take values from 0 to 0.1.
Additionally, a range of learning rates that specify the step width of the gradient descent was taken at uniform intervals. This was done in order to investigate the most suitable learning rates for the best performance of the neural network. Learning rates: 0.01, 0.03, 0.06,...1.
Backpropagation with momentum
An enhanced version of backpropagation uses a momentum term and a flat spot elimination. The momentum term introduces the old weight change as a parameter for the computation of the new weight change. This avoids oscillation problems common with the regular backpropagation algorithm when the error surface has a very narrow minimum area. The new weight change is computed by:
|η :||learning rate|
|α :||is a constant that specifies the influence of the momentum|
|c :||is the flat spot elimination value, a constant value which is added to the derivative of the activation function to enable the network to bypass the flat spots of the error surface.|
The effect of these enhancements is that the flat spots of the error surface are traversed relatively rapidly with a few big steps, while the step size is decreased as the surface gets rougher. This adaptation of the step size increases learning speed significantly. Note that the old weight change is lost every time the parameters are modified, new patterns are loaded, or the network is modified. The momentum term η was set to take values from the interval (0-0.99) and the learning rate was the same as in the standard backpropagation case, the parameter c was set to take values from (0-0.1).
Resilient propagation (Rprop)
Rprop  is a local adaptive learning scheme, performing supervised batch learning in multi-layer perceptrons. The basic principle of Rprop is to eliminate the harmful influence of the size of the partial derivative on the weight step. As a consequence, only the sign of the derivative is considered to indicate the direction of the weight update. The size of the weight change is exclusively determined by a weight-specific, so-called update-value
denotes the summed gradient information over all patterns of the pattern set.
The Rprop algorithm takes three parameters: the initial update value Δ0, a limit for the maximum step size Δmax, and the weightdecay exponent α. When learning starts, all update-values are set to an initial value 0.01. In order to prevent the weights from becoming too large, the maximum weight-step determined by the size of the update-value, is limited. The upper bound is set by the second parameter of Rprop, the maximum step size. The default upper bound is set arbitrarily to 40. Nevertheless it was proven advantageous to allow only very cautious (small) steps, in order to prevent the algorithm getting stuck too quickly in suboptimal local minima. The weight decay α term is used as an exponent (x = 10−α ) and the allowed values are within the range from 1 to 5. Note that the weight-decay parameter α denotes the exponent, to allow comfortable input of very small values. Hence, a choice of α = 4 corresponds to a ratio of weight decay term to output error of 1:10000. This parameter determines the relationship of two goals, namely to reduce the output error (the standard goal) and to reduce the size of the weights (so as to improve generalization).
Quick propagation (QuickProp)
Another method to speed up the learning is to use information about the curvature of the error surface. Unfortunately, this would require the computation of the second order derivatives of the error function, which can be time consuming. To avoid that, Quickprop  assumes the error surface to be locally quadratic and attempts to jump in one step from the current position directly into the minimum of the parabola. Quickprop computes the derivatives in the direction of each weight. After computing the first gradient with regular backpropagation, a direct step to the error minimum is attempted by:
|wij:||weight between units i and j|
|Δ(t+1):||actual weight change|
|S(t+1):||partial derivative of the error function by wij|
|S(t):||the previous partial derivative|
The QuickProp takes three parameters. These are the learning parameter, the maximum growth parameter, and the maximum difference dj = tj − oj between a teaching value tj and an output oj of an output unit which is tolerated. We have run extensive tests to determine the problem dependence of these parameters. The available learning rates vary from 0.01 to 1 in uniform intervals. The maximum growth parameter that specifies the maximum amount of weight change which is added to the current change was set to take values from 1.2 to 4. The maximum difference is preferably a very small number (from 0 to 0.2) so as to speed up the learning process.
Both Rprop and QuickProp try to adapt their learning process to the topology of the error function. This actually means that weight-update and adaptation are performed after the gradient information of the whole pattern set is computed.
The mean square error (MSE) is the condition to terminate training of all the neural networks. MSE is originally set at exp(−5), however, these learning methods can also stop once the training exceeds the number of epoch set. The number of epochs is set at 500. For the batch learning algorithms the training cycles (epochs) used were 500. Each experiment repeated 5 times and the training data consisted of 1210 instances used to fit the parameters (weights) of every network. The patterns were presented randomly once during each cycle.
The experiments looked at a fairly broad range of values for all the networks parameters. Rprop outperformed all the other algorithms by means of accuracy and training time. When the learning algorithm used is resilient (Rprop), then it has a quicker convergence time (fewer training cycles, or epochs) than QuickProp, when both exposed to the same training data. Additionally, it was interesting to see what happened after 500 cycles. As it was expected overtraining occurred in cases with large learning rates, and thrashing or oscillation was exhibited.
|Table 1:||Results for the testing set|
|Neural networks learning algorithm|
|Rprop||QuickProp||BackProp||BackProp with momentum|
|Number of positive examples||281||281||281||281|
|Number of negative examples||325||325||325||325|
|Best neural networks||SNN-i64-h10-o1||SNN-i64-h5-o1||SNN-i64-h5-o1||SNN-i64-h10-o1|
|Best learning rate||0.1||0.3||0.1|
|Best max growth||2.75|
|True negatives, TN||279||292||202||289|
|False negatives, FN||9||29||2||47|
|True positives, TP||272||252||279||234|
|False positives, FP||46||33||123||36|
|Time||0H 1M 33S||0H 3M 56S||0H 12M 29S||0H 56M 58S|
The best values for each algorithm together with the threshold for which the sensitivity and specificity are the highest possible are presented in Tab. 1 for Rprop, Quickprop, BackProp and BackProp with momentum respectively. We define the sensitivity as TP/(TP + FN) (that is the probability of correctly prediction a positive example) and the specificity as TP/(TP + FP) (the probability of a positive prediction to be correct). We define the term Accuracy as the ratio of the number of correct predictions to the total number of irrelevant and relevant predictions retrieved. The networks were simulated using the Stuttgart Neural Network Simulator . The simulations were conducted on a machine, running Red Hat Linux 9.0.
In this paper, we defined the problem of coding region finding in Wolbachia as a prediction problem. We then applied four different neural network training algorithms to solve this problem. We listed the performance of the four methods in coding potential prediction in Tab. 1. We found that the resilient propagation yielded the best performance in predicting coding regions in a 64-10-1 neural network.
The need for training data limits the utility of most gene predictors to organisms with many experimentally characterized genes, such as Escherichia coli [4, 6] and Bacillus subtilis . In contrast, new genomes like Wolbachia lack a set of validated genes, so predictors that train organism-specific models cannot easily be applied to them. In such cases, there is an uncertainty whether the training sets of genes from one species might be useful in another. For example, a gene predictor based on intergenic distances in E. coli worked equally well when applied to the known operons of B. subtilis . However, other gene finders trained in one organism have proved less portable when the target species is not closely related to that used for training. Portability of a gene finder may depend on the types and amount of information used for training, particularly if no explicit effort is made to produce a predictor that is portable across genomes. Our neural network based predictor is an important tool that: 1) does not demand a large training set for the genomes of interest, and 2) is designed in a generalized manner such that it enables fast and reliable prediction of coding regions in any other bacterium that contain few, sufficiently long and well conserved genes.
The authors would like to thank A. Hatzigeorgiou for the financial support and helpful comments for this work.