In Silico Biology 3, 0005 (2003); ©2003, Bioinformation Systems e.V.  
BGRS 2002

Two genetic algorithms for identification of regulatory signals

Elena D. Stavrovskaya *, 1 and Andrey A. Mironov 1, 2




1 Integrated Genomics, POBox 348, Moscow, 117333, Russia
2 State Scientific Center GosNIIGenetika Moscow 113545, Russia.
   Email: esta191@fromru.com
*  corresponding author





Edited by H. Michael and E. Wingender; received September 30, 2002; revised and accepted January 10, 2003; published February 11, 2003



Abstract

There exist numerous algorithms for identification of regulatory signals in unaligned DNA fragments. Here we present two genetic algorithms for signal identification and describe their implementation and testing on simulated and real data. The first algorithm selects the start position of the signal in a given fragment. The second one builds a "universal" word that is recognized by the transcription factor. We compare these approaches and study the behavior of the genetic algorithm.

Key words: regulation signal, genetic algorithms



Introduction

Recognition of regulatory sites in DNA fragments is one of the classical problems of computational molecular biology. Lately it has become particularly popular because of the increasing number of completely sequenced bacterial genomes and mass application of DNA chips. This allows for the comparative analysis of regulatory interactions.

Existing algorithms for identification of regulatory signals can be roughly subdivided into stochastic and deterministic ones. The former class includes simulated annealing [1] and the Gibbs sampler [2], [3]. The deterministic algorithms are greedy algorithms [4, 5], expectation-maximization [6, 7, 8, 9], DMS [10], MEME [11, 12, 13] as well as ConsInd [14] and MatInd [15], WORDUP [16, 17], CONSENSUS [18], WINNOWER [19], pattern graphs [20, 21, 22], and numerous other algorithms. The genetic algorithms suggested here belong to the stochastic class. In [23], genetic algorithms were applied to a similar problem of the signal identification in eukaryotic genomes.

The goal of this work was to check application of genetic algorithms in the task of regulatory signals identification.



Description of algorithms

The following abstract concepts will be used (to avoid confusion with standard biological terms, they will be italicized): genome, gene, allele, quality of a genome, population, crossing, selection and mutation. Two algorithms will be described that make the same steps, but differ in definitions of genome and related concepts (gene, allele and quality of the genome).

General algorithm:  This is the common part of both algorithms. The detailed definitions of genome and related concepts for each algorithm will be given later.

Genome represents a set of genes. Each genome is characterized by its quality. At each step the algorithm processes population, that is a set of genomes, and performs the following operations:


Crossing: select at random a pair of genomes and generate a new one:
Genome 1: S1, S2, ... Sk Sk+1 ... Sn
Genome 2: T1, T2, ...   Tk Tk+1 ... Tn
New genome:    S1, S2, ... Sk Tk+1 ... Tn

Position k of the cut is given by the random uniform distribution.


Selection: Delete the genome with the lowest quality.

Mutation: Select a random gene in a random genome and change the current allele to a random one.

These steps are iterated for some fixed time.


Definition of genome in algorithm 1.   Each fragment corresponds to a gene, and each position, specifying a candidate site, is an allele. Thus a set of candidate sites, one in each fragment, generates a set of alleles, that is, a genome. A specific value of allele means that the fragment does not contain a site. The quality is defined as the information content of the respective set of sites [9]:

where f(i, k) is the relative frequency of nucleotide i at position k, 0.5/ and 2/ are pseudocounts.

Definition of genome in algorithm 2.   Here we define a genome as a word which represents a candidate signal. Thus each position in this word is a gene and a nucleotide occupying this position is an allele. To calculate the quality, the algorithm performs the following operations. For each fragment it identifies the best instance of the genome, that is, the site in the current fragment, which is most similar to the genome word. Then the algorithm selects the top set of sites, such that the information content is maximum. The quality of the genome is the information content of the produced set of best sites.

If we use the naive formula of the information content (1) without pseudocounts, the selected set of sites would contain only one site, as it will have the maximum quality. However, using pseudocounts (2) we can select a non-trivial set of sites and thus generate a non-trivial genome.



Testing

Both algorithms were tested on simulated data. Each test file contained ten fragments of length 200. The signal was a fixed word of twenty nucleotides. The sites were modeled by introducing mismatches into the signal word and then inserting the resulting word into the sequence fragments at random positions. The number of mismatches varied from one through seven. To model corrupted samples, some fragments did not contain the sites.

In addition, algorithm 2 was tested on two real samples from the E.coli genome. The "Purine" sample initially consisted of 19 PurR-binding sites, and the "Arginine" sample contained 11 ArgR-binding sites. Most fragments in the "Purine" sample contained one site. The signal in the "Arginine" sample was duplicated in each fragment and weak. Then we masked one by one sites from the samples, but retained the corresponding sequence fragments, thus imitating contamination of the sample by spurious sequences. Each time the strongest site of the sample was deleted.

As both algorithms are stochastic and thus identify the signal with some probability, each test was repeated ten times.



Results

The results for algorithm 1 are presented in Table 1. The dependency of the mean quality and its standard deviation on the number of iterations for different population sizes (500 and 1000) are presented in Fig. 1 and Fig. 2. It can be seen that in larger populations the mean is larger and the standard deviation is smaller, although the stationary values are reached later. The iteration process should be stopped when the mean becomes practically constant and the standard deviation is close to zero.


Table 1: The probability of correct identification of the signal for algorithm 1.

  Number of fragments with a signal
  10 9 8 7 6 5
0 1 0.89 1 1 0.7 0.64
1 1 1 1 0.77 0.8 0.8
2 0.99 0.82 1 0.86 0.7 0.6
3 0.93 0.87 0.94 0.83 0.73 0.5
4 0.76 0.61 0.68 0.73 0.58 0.2
5 0.56 0.56 0.64 0.61 0.58 0.24
6 0.6 0.32 0.41 0.4 0.4 0.14
7 0.54 0.17 0.38 0.37 0.08 0.04




Figure 1: The mean quality for algorithm 1.


Figure 2: The standard deviation of quality for algorithm 1.

If two different signals are introduced in each fragment simultaneously, only one of them is found (an arbitrary one). In these tests the best results were obtained with the population of 1000 genomes at 200000 iterations.

The results obtained for algorithm 2 are given in Table 2 and Fig. 3 and Fig. 4.


Table 2: The probability of correct identification of the signal for algorithm 2

  Number of fragments with a signal
  10 9 8 7 6 5
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 0.99 1.0 0.99 1.0 1.0
2 1.0 0.99 1.0 1.0 1.0 0.96
3 0.96 0.99 1.0 1.0 0.9 0.96
4 0.91 0.97 0.98 0.97 0.98 0.98
5 0.91 0.89 0.95 0.87 0.92 0.76
6 0.64 0.89 0.81 0.64 0.53 0.74
7 0.5 0.6 0.54 0.36 0.35 0.36




Figure 3: The mean quality for algorithm 2.


Figure 4: The standard deviation of quality for algorithm 2.

Analysis of Table 2 shows that the mean probability to identify the correct signal with algorithm 2 is higher. To obtain the same performance level as for algorithm 1, this algorithm required smaller population size. The number of iterations also was smaller than for algorithm 1, although each iteration took more time. The best results were obtained with the population of 150 at 2000 iterations. Results of this tests demonstrated that algorithm 2 is more efficient then algorithm 1. Moreover, the results produced by algorithm 2 demonstrated very weak dependence on the number of fragments without a site.

The results of algorithm 2 on real data are given in Tables 3 and 4. To assess the output of the algorithm we used the following criterion:

Definition 1.   A real site is found, if there exists a predicted site that overlaps with it the former by at least half of its length.

Definition 2.   A predicted site is hit, if there exists a real site that overlaps with it the former by at least half of its length.


Then we calculate two values that serve as characteristics of the algorithm's performance:

Sf = found/real

Sh = hit/predicted



Table 3: The mean value of Sf and Sh for the "Purine" test.

Number of fragments with a signal Sf Sh
   190.551
   180.51
   170.590.98
   160.580.98
   150.570.94
   140.670.97
   130.640.98
   120.640.95
   110.570.86
   100.590.73
   90.590.69
   80.630.66
   70.660.75
   60.40.39
   50.440.35
   40.450.28
   30.270.13


Table 4: The mean value of Sf and Sh for the "Arginine" test.

Number of fragments with a signal Sf Sh
   9 0.84 1
   8 0.2 0.25
   7 0.21 0.3
   6 0.08 0.1
   5 0.14 0.13
   4 0.03 0.02
   3 0.03 0.02


It can be seen that for smaller number of fragments with a site, the probability of signal identification (Sf value) is lower. This is caused by the characteristics of the testing procedure, where we delete each time the strongest site from the sample.



Discussion

Comparison of the performance of both algorithms leads to some interesting observations.

The typical size of a genome (the number of genes) in both algorithms is the same, but in algorithm 1, the number of potential alleles for each gene is very large. Probably this is the reason why algorithm 1 requires larger population size and more iterations to obtain reasonable results. At the same time, algorithm 2 uses a small number (four) of potential alleles for each gene. It allows for smaller populations and significantly reduces the number of iterations.

It would be reasonable to use position weight matrices instead of words in algorithm 2, but in this case the number of potential alleles would be very large and thus a large number of iterations would be required.

To calculate the quality in algorithm 2 we use the top subset of sites, which consists of the best occurrences of the genome word. It makes this algorithm robust with regard to the number of fragments without a site (Table 2). The results of testing demonstrate that algorithm 2 is applicable for signal identification in real biological data.

In principle it is possible to use the concepts of algorithm 2 for identification of sites of unknown length. Asymmetrical crossing produces words with different lengths. However, in this case it is difficult to determine the quality. This will be implemented during further development of this project. We also plan to modify this algorithm for identification of complex multibox signals, e. g. promoters.



Acknowledgements

This study was partially supported by a grant from HHMI (55000309). We are grateful to L. Danilova for processing our results and to M. Gelfand and D. Vinogradov for useful discussions.



References

  1. Lukashin, A. V., Engelbrecht, J. and Brunak, S. (1992). Multiple alignment using simulated annealing: branch point definition in human mRNA splicing. Nucleic Acids Res. 20, 2511-2516.

  2. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. and Wootton, J. C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208-214.

  3. Roth, F. P., Hughes, D., Estep, P. W. and Church, G. M. (1998). Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939-945.

  4. Stormo, G. D. and Hartzell, G. W. 3rd. (1989). Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA  86, 1183-1187.

  5. Hertz, G. Z., Hartzell, G. W. 3rd and Stormo, G. D. (1990). Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6, 81-92.

  6. Lawrence, C. E. and Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41-51.

  7. Cardon, L. R. and Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 223, 159-170.

  8. Frishman, D., Mironov, A. and Gelfand, M. (1999). Starts of bacterial genes: estimating the reliability of computer predictions. Gene 234, 257-265.

  9. Gelfand, M. S., Koonin, E. V. and Mironov, A. A. (2000). Prediction of transcription regulatory sites in Archaea by a comparative genomic approach. Nucleic Acids Res. 28, 695-705.

  10. Hu, Y.-J., Sandmeyer, S., McLaughlin, C. and Kibler, D. (2000). Combinatorial motif analysis and hypothesis generation on a genomic scale. Bioinformatics 16, 222-232.

  11. Bailey, T. L. and Elkan, C. P. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Con. Intell. Syst. Mol. Biol. 2, 28-36.

  12. Bailey, T. L. and Elkan, C. P. (1995). The value of prior knowledge in discovering motifs with MEME. Proc. Int. Con. Intell. Syst. Mol. Biol. 3, 21-29.

  13. Grundy, W. N., Bailey, T. L. and Elkan, C. P. (1996). ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool. Comput. Appl. Biosci. 12, 303-310.

  14. Frech, K., Herrmann, G. and Werner, T. (1993). Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res. 21, 1655-1664.

  15. Quandt, K., Frech, K., Karas, H., Wingender, E. and Werner, T. (1995). MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23, 4878-4884.

  16. Pesole, G., Prunella, N., Liuni, S., Attimonelli, M. and Saccone, C. (1992). WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res. 20, 2871-2875.

  17. Liuni, S., Prunella, N., Pesole, G., D'Orazio, T., Stella, E. and Distante, A. (1993). SIMD parallelization of the WORDUP algorithm for detecting statistically significant patterns in DNA sequences. Comput. Appl. Biosci. 9, 701-707.

  18. Hertz, G. Z. and Stormo, G. D. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15, 563-577.

  19. Pevzner, P. A. and Sze, S. H. (2000). Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 269-278.

  20. Jonassen I. (1997). Efficient discovery of conserved patterns using a pattern graph. Comput. Appl. Biosci. 13, 509-522.

  21. Brazma, A., Jonassen, I., Vilo, J. and Ukkonen, E. (1998). Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 8, 1202-1215.

  22. Marsan, L. and Sagot, M.-F. (2000). Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol. 7, 345-362.

  23. Kel-Margoulis, O. V., Ivanova, T. G., Wingender E. and Kel, A. E. (2002). Automatic annotation of genomic regulatory sequences by searching for composite clusters. Pac. Symp. Biocomput. 7, 187-198.