|In Silico Biology 3, 0005 (2003); ©2003, Bioinformation Systems e.V.|
1 Integrated Genomics, P.O. Box 348, Moscow 117333, Russia
2 State Scientific Center GosNIIGenetika, Moscow 113545, Russia
* corresponding author
Edited by H. Michael and E. Wingender; received September 30, 2002; revised and accepted January 10, 2003; published February 11, 2003
There exist numerous algorithms for identification of regulatory signals in unaligned DNA fragments. Here we present two genetic algorithms for signal identification and describe their implementation and testing on simulated and real data. The first algorithm selects the start position of the signal in a given fragment. The second one builds a "universal" word that is recognized by the transcription factor. We compare these approaches and study the behavior of the genetic algorithm.
Key words: regulation signal, genetic algorithms
Recognition of regulatory sites in DNA fragments is one of the classical problems of computational molecular biology. Lately it has become particularly popular because of the increasing number of completely sequenced bacterial genomes and the mass application of DNA chips, which allow for comparative analysis of regulatory interactions.
Existing algorithms for the identification of regulatory signals can be roughly subdivided into stochastic and deterministic ones. The former class includes simulated annealing and the Gibbs sampler. The deterministic class includes greedy algorithms [4, 5], expectation-maximization [6, 7, 8, 9], DMS, MEME [11, 12, 13], as well as ConsInd and MatInd, WORDUP [16, 17], CONSENSUS, WINNOWER, pattern graphs [20, 21, 22], and numerous other algorithms. The genetic algorithms suggested here belong to the stochastic class. Genetic algorithms have also been applied to the similar problem of signal identification in eukaryotic genomes.
The goal of this work was to assess the applicability of genetic algorithms to the task of regulatory signal identification.
The following abstract concepts will be used (italicized to avoid confusion with the standard biological terms): genome, gene, allele, quality of a genome, population, crossing, selection, and mutation. Two algorithms will be described that perform the same steps but differ in the definitions of the genome and related concepts (gene, allele, and quality of the genome).
General algorithm: This is the common part of both algorithms. The detailed definitions of genome and related concepts for each algorithm will be given later.
A genome represents a set of genes. Each genome is characterized by its quality. At each step the algorithm processes a population, that is, a set of genomes, and performs the following operations:
Crossing: select at random a pair of genomes and generate a new one:
Genome 1:   S1, S2, ..., Sk, Sk+1, ..., Sn
Genome 2:   T1, T2, ..., Tk, Tk+1, ..., Tn
New genome: S1, S2, ..., Sk, Tk+1, ..., Tn
The cut position k is drawn from a uniform random distribution.
Selection: Delete the genome with the lowest quality.
Mutation: Select a random gene in a random genome and change the current allele to a random one.
These steps are iterated for some fixed time.
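The crossing, selection, and mutation steps above can be sketched as a single generic loop. This is an illustrative sketch, not the authors' implementation; the callback names (random_genome, quality, mutate_gene) are ours, and the two concrete algorithms differ only in how these callbacks are defined.

```python
import random

def run_ga(random_genome, quality, mutate_gene, population_size, iterations):
    """Generic loop shared by both algorithms (a sketch under assumed names).

    Genomes are tuples of alleles; random_genome() makes a fresh genome,
    quality(genome) scores it, mutate_gene() draws a random allele.
    """
    population = [random_genome() for _ in range(population_size)]
    for _ in range(iterations):
        # Crossing: pick two genomes and cut at a uniform random position k.
        g1, g2 = random.sample(population, 2)
        k = random.randint(1, len(g1) - 1)
        population.append(g1[:k] + g2[k:])
        # Selection: delete the genome with the lowest quality.
        population.remove(min(population, key=quality))
        # Mutation: replace a random gene of a random genome with a random allele.
        genome = random.choice(population)
        i = random.randrange(len(genome))
        population[population.index(genome)] = (
            genome[:i] + (mutate_gene(),) + genome[i + 1:]
        )
    return max(population, key=quality)
```

The population size stays constant because each crossing (adding one genome) is followed by one selection (removing one genome).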
Definition of genome in algorithm 1. Each fragment corresponds to a gene, and each position specifying a candidate site is an allele. Thus a set of candidate sites, one per fragment, generates a set of alleles, that is, a genome. A special allele value indicates that the fragment contains no site. The quality is defined as the information content of the respective set of sites:

    I = sum over positions k and nucleotides i of f(i, k) * log( f(i, k) / 0.25 )        (1)

where f(i, k) is the relative frequency of nucleotide i at position k. With pseudocounts, the frequencies are estimated as

    f(i, k) = ( n(i, k) + 0.5 ) / ( N + 2 )        (2)

where n(i, k) is the count of nucleotide i at position k, N is the number of sites in the set, and 0.5 and 2 are the pseudocounts.
Definition of genome in algorithm 2. Here we define a genome as a word representing a candidate signal. Thus each position in this word is a gene, and the nucleotide occupying this position is an allele. To calculate the quality, the algorithm performs the following operations. For each fragment it identifies the best instance of the genome, that is, the site in the current fragment that is most similar to the genome word. The algorithm then selects the top subset of sites for which the information content is maximal. The quality of the genome is the information content of the produced set of best sites.
If we used the naive formula of the information content (1) without pseudocounts, the selected set would consist of a single site, since one site alone has the maximum quality. Using pseudocounts (2), however, we can select a non-trivial set of sites and thus generate a non-trivial genome.
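The quality computation of algorithm 2 can be sketched as below. This is our reading of the description, not the original code: best_instance and genome_quality are hypothetical names, the subset search simply tries every prefix of the instances sorted by similarity to the word, and the information content uses the pseudocount form suggested by the text.

```python
import math

def information_content(sites):
    """Information content with pseudocounts: f = (n + 0.5) / (N + 2)."""
    total = 0.0
    for k in range(len(sites[0])):
        column = [s[k] for s in sites]
        for i in "ACGT":
            f = (column.count(i) + 0.5) / (len(sites) + 2)
            total += f * math.log2(f / 0.25)
    return total

def best_instance(word, fragment):
    """The site in the fragment most similar to the genome word."""
    mism = lambda s: sum(a != b for a, b in zip(word, s))
    windows = [fragment[p:p + len(word)]
               for p in range(len(fragment) - len(word) + 1)]
    return min(windows, key=mism)

def genome_quality(word, fragments):
    """Quality of a genome word: the information content of the top
    subset of best instances, maximized over the subset size."""
    mism = lambda s: sum(a != b for a, b in zip(word, s))
    instances = sorted((best_instance(word, f) for f in fragments), key=mism)
    return max(information_content(instances[:m])
               for m in range(1, len(instances) + 1))
```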
Both algorithms were tested on simulated data. Each test file contained ten fragments of length 200. The signal was a fixed word of twenty nucleotides. The sites were modeled by introducing mismatches into the signal word and then inserting the resulting words into the sequence fragments at random positions. The number of mismatches varied from one to seven. To model corrupted samples, some fragments contained no site.
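A generator for such test data can be sketched as follows. The function and parameter names are ours; the defaults mirror the setup described above (ten fragments of length 200, a twenty-nucleotide signal word, a chosen number of mismatches, and optionally some fragments left without a site).

```python
import random

def make_test_file(signal, n_fragments=10, fragment_len=200,
                   mismatches=3, n_with_site=10, seed=None):
    """Simulated test data: random DNA fragments, with a mismatched
    copy of the signal planted at a random position in the first
    n_with_site fragments."""
    rng = random.Random(seed)
    fragments = []
    for j in range(n_fragments):
        seq = [rng.choice("ACGT") for _ in range(fragment_len)]
        if j < n_with_site:
            site = list(signal)
            # Introduce the given number of mismatches at distinct positions.
            for p in rng.sample(range(len(site)), mismatches):
                site[p] = rng.choice("ACGT".replace(site[p], ""))
            start = rng.randrange(fragment_len - len(site) + 1)
            seq[start:start + len(site)] = site
        fragments.append("".join(seq))
    return fragments
```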
In addition, algorithm 2 was tested on two real samples from the E. coli genome. The "Purine" sample initially consisted of 19 PurR-binding sites, and the "Arginine" sample contained 11 ArgR-binding sites. Most fragments in the "Purine" sample contained one site. The signal in the "Arginine" sample was weak and duplicated in each fragment. We then masked the sites one by one, while retaining the corresponding sequence fragments, thus imitating contamination of the sample by spurious sequences. Each time, the strongest site in the sample was deleted.
As both algorithms are stochastic and thus identify the signal with some probability, each test was repeated ten times.
The results for algorithm 1 are presented in Table 1. The dependence of the mean quality and its standard deviation on the number of iterations for different population sizes (500 and 1000) is presented in Fig. 1 and Fig. 2. It can be seen that in larger populations the mean is larger and the standard deviation is smaller, although the stationary values are reached later. The iteration process should be stopped when the mean becomes practically constant and the standard deviation is close to zero.
Table 1: The probability of correct identification of the signal for algorithm 1.
(Columns: number of fragments with a signal.)
If two different signals are introduced into each fragment simultaneously, only one of them (an arbitrary one) is found. In these tests the best results were obtained with a population of 1000 genomes at 200000 iterations.
Table 2: The probability of correct identification of the signal for algorithm 2
(Columns: number of fragments with a signal.)
Analysis of Table 2 shows that the mean probability of identifying the correct signal is higher with algorithm 2. To reach the same performance level as algorithm 1, this algorithm required a smaller population size; the number of iterations was also smaller than for algorithm 1, although each iteration took more time. The best results were obtained with a population of 150 at 2000 iterations. The results of these tests demonstrated that algorithm 2 is more efficient than algorithm 1. Moreover, the results produced by algorithm 2 showed very weak dependence on the number of fragments without a site.
Definition 1. A real site is found if there exists a predicted site that overlaps the former by at least half of its length.
Definition 2. A predicted site is hit if there exists a real site that overlaps the former by at least half of its length.
Then we calculate two values that serve as characteristics of the algorithm's performance:
Sf = found/real
Sh = hit/predicted
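These two performance measures can be sketched in code. This is our illustration, with sites represented as hypothetical (start, end) half-open intervals; "at least half of its length" refers to the length of the site being tested (the real site for Sf, the predicted site for Sh).

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def performance(real, predicted):
    """Sf = found/real and Sh = hit/predicted, as defined above."""
    found = sum(1 for r in real
                if any(overlap(r, p) * 2 >= r[1] - r[0] for p in predicted))
    hit = sum(1 for p in predicted
              if any(overlap(p, r) * 2 >= p[1] - p[0] for r in real))
    return found / len(real), hit / len(predicted)
```

For example, if one of two real sites is covered well and the other is missed, while one of two predictions is correct, both Sf and Sh are 0.5.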
Table 3: The mean value of Sf and Sh for the "Purine" test.
Table 4: The mean value of Sf and Sh for the "Arginine" test.
It can be seen that for a smaller number of fragments with a site, the probability of signal identification (the Sf value) is lower. This is caused by the testing procedure, in which the strongest site is deleted from the sample at each step.
Comparison of the performance of both algorithms leads to some interesting observations.
The typical size of a genome (the number of genes) is the same in both algorithms, but in algorithm 1 the number of potential alleles for each gene is very large. This is probably why algorithm 1 requires a larger population size and more iterations to obtain reasonable results. Algorithm 2, in contrast, uses a small number (four) of potential alleles per gene, which allows for smaller populations and significantly reduces the number of iterations.
It would be reasonable to use position weight matrices instead of words in algorithm 2, but in that case the number of potential alleles would be very large, and thus a large number of iterations would be required.
To calculate the quality in algorithm 2, we use the top subset of sites, consisting of the best occurrences of the genome word. This makes the algorithm robust with respect to the number of fragments without a site (Table 2). The testing results demonstrate that algorithm 2 is applicable to signal identification in real biological data.
In principle, it is possible to use the concepts of algorithm 2 for the identification of sites of unknown length: asymmetric crossing produces words of different lengths. However, in this case it is difficult to define the quality. This will be implemented during further development of this project. We also plan to modify the algorithm for the identification of complex multibox signals, e.g., promoters.
This study was partially supported by a grant from HHMI (55000309). We are grateful to L. Danilova for processing our results and to M. Gelfand and D. Vinogradov for useful discussions.