In Silico Biology 3, 0004 (2003); ©2003, Bioinformation Systems e.V.  
BGRS 2002

An algorithm for identification of regulatory signals in unaligned DNA sequences, its testing and parallel implementation

Ludmila V. Danilova1,*, Vassily A. Lyubetsky1 and Mikhail S. Gelfand2,3




1 Institute for Information Transmission Problems RAS, Moscow, 101447, Russia
2 Integrated Genomics-Moscow, Russia
3 State Scientific Center GosNIIGenetika, Moscow, 113545, Russia

*  corresponding author
   Email: dlv2k@mail.ru





Edited by T. Werner; received September 30, 2002; revised and accepted April 04, 2003; published April 14, 2003



Abstract

We describe an algorithm (IRSA) for identification of common regulatory signals in samples of unaligned DNA sequences. The algorithm was tested on randomly generated sequences of fixed length with implanted signal of length 15 with 4 mutations, and on natural upstream regions of bacterial genes regulated by PurR, ArgR and CRP. Then it was applied to upstream regions of orthologous genes from Escherichia coli and related genomes. Some new palindromic binding and direct repeats signals were identified. Finally we present a parallel version suitable for computers supporting the MPI protocol. This implementation is not strictly bounded by the number of available processors. The computation speed linearly depends on the number of processors.

Key words: bioinformatics, regulatory signals, bacterial genomes, orthologous genes, parallel computing, MPI protocol



Introduction

Recognition of common regulatory signals in sets of DNA sequence fragments is an old and still actual problem of computational molecular biology. The pattern-based approach to this problem reduces to identification of a set of similar words present in (almost) all fragments from a given sample [Pevzner and Sze, 2000; Bailey and Elkan, 1995; Li et al., 1999; Gelfand et al., 2000] The sample can consist of upstream regions of genes believed to be co-regulated based on experimental data [Staden, 1989; Pesole et al., 1992; Wolfertstetter et al., 1996; van Helden et al., 1998; Tompa, 1999] or of regions upstream of orthologous genes from related genomes [McCue et al., 2001; Terai et al., 2001], for a review see [Gelfand, 1999]. In both cases it is assumed that these fragments contain sites that can be bound by some common transcription factor.

The plan of the paper is as follows. First, we describe the algorithm. Then we test it on simulated data in a controlled setting and study its tolerance to the increase of the fragments' length. Then we apply the algorithm to natural data and test its tolerance to the sample corruption by fragments not containing binding sites. Next, we describe the results of a large-scale analysis of regulation of orthologous genes of gamma-proteobacteria. Finally, we present a parallel version of the algorithm designed for such huge datasets.



The algorithm

The input of the algorithm is n sequences of arbitrary length and the output is a system, that is, a set of similar words of length l (in a particular case, one word per sequence). The system quality is the total pairwise similarity of words from this system. Similarity of two words is some function F (x, y), e. g. the Hamming distance between words, or any other fixed metric. The similarity function may also take into account additional features of words, e. g. their palindromicity.

Figure 1 features the general layout of the algorithm. At step (1) the input sequences are split into subwords of length l and all pairwise similarities F (x, y) are computed and stored. Then (2) a random graph is constructed that determines the order in which the fragments from the sample are processed (this step will be described later). The auxiliary graph G contains n vertices corresponding to input sequences. At step (3) the sequence fragments are assigned to the graph vertices. For each permutation (4) quadruples of fragments are considered in the order given by the current random graph G, and in each quadruple the best subset of words is selected. Output (6) is the set of the most similar words in all sequences.



Figure 1: Layout of the algorithm.


At each step of construction of the auxiliary graph G, the set of vertices is split into two subsets of (almost) equal size and an arc joining two subsets is selected. Figure 2 shows the graph G for n = 7. Consider this example in more detail.



Figure 2: Construction of the auxiliary graph.


Below we describe two main procedures, construction of the graph G and construction of candidate subsystem after merging the subsets.

Considered one fixed assignment r of sequences to the vertices of graph G, and let r(A) denote the sequence assigned to vertex A. Let us call sequence r(A) a sequence over A. A new step (called assembling) is executed for the fixed r; it performs transition from the data on two "neighboring" (and smaller) parts P1 and P2 of graph G to the data on their merging into a (larger) part P of graph G or, to be precise, transition from the data on merged sequences over A (where A runs P1 part) and the data on merged sequences over B (where B runs P2 part) to the data on merged sequences over C (where C runs P part).

Let sequences over P denote merged sequences r(A), where A runs P part. The system of words (no more than) one per each sequence over P is called a system of words over P. Of course, we search for a system over P = G which is called the signal; however, we search this signal as a special case over an arbitrary P.

Finally, the best possible system of words (no more than one per each sequence over P) with the signal value over the arc for part P equal to two predefined (arbitrary in the general case) words x and y from the corresponding sequences is called an extension of words x and y to the system over P. Clearly, extension of some words x and y to the system over P can be not a (sensible) system over P due to these "fixed" x and y values.

Thus, the above-mentioned transition (iterative step) consists in the following (Figure 3). Assume we have determined two sets of t best systems all over the parts P1 and P2, respectively; these parts with the arcs (A, C) and (B, D), respectively, were obtained by splitting subgraph P (of the initial graph G) with the arc (A, B). Clearly, simple merging of the best system over P1 and the best system over P2 is not even a (sensible) system over P (in the general case). Hence we will use a more delicate solution. The above set, e. g., for P1 consists of "best (possible) systems with fixed ends" over P1 rather than of the best systems over P1; i. e., the first set consists of extensions of any two words a and c from the sequences over A and C, respectively (where the pairwise quality of a and c exceeds certain fixed threshold u), to the P1 set. In a similar way, the second set is composed of t best extensions to the whole set P2 of words b and d arbitrarily selected over vertices B and D.



Figure 3: Iterative step of Assembling.


For example (Figure 2), let P1 = {1, 5, 6, 3} and P2 = {2, 7, 4}; (A, C) = (1, 3) and (C, D) = (2, 4); while P equals merged sets P1 and P2 (coincidence of P and G in this case is clearly a random event; it always happens at the end of assembling rather than at the preceding steps). Let t = 1. At the previous iterations of the considered cycle two functions f(a, c) and g(b, d) were calculated. Function f(a, c) on any pair of words (similarity between a and c from sequences over A and C exceeds certain fixed threshold u) outputs the best signal in the sample {r(1), r(3), r(5), r(6)} corresponding to the P1 set (at the fixed r). In a similar way, function g(b, d) on any pair of words (b and d from sequences over B and D are similar enough in the same sense) outputs the best signal in the sample {r(2), r(4), r(7)} corresponding to the P2 set (at the same fixed r). This iteration underlies the building function h(a, c) outputting analogous information for a larger set P.

Namely, search for all words c and d from the sequences over C and D, respectively, with the pairwise quality of words a and c as well as b and d above this threshold (to be precise, with defined extensions) allows us to select (using f and g, respectively) pairs (a, c) and (b, d) extensions on P1 and P2 giving us t best signals over the whole P upon merging all P.

If the threshold condition cannot be satisfied, the corresponding value of the signal over P is set to zero; stated differently, the threshold condition can provide for partially defined signals over G, which is, however, natural in terms of statement of the initial problem.

The generation of permutations the sequences by graph G vertices provides for at least single assignment of each sequence pair are an arc in graph G; in this case the next permutation is selected to increase the number of jointed sequence pairs not joined in previous iterations. This provides for sensible amount of iterations of the external cycle and sufficient diversity of the calculated permutations r.



Results and discussion


Testing: simulated data

Pevzner and Sze [2000] suggested a general scheme for testing signal identification algorithms on simulated data. Eight samples (the average value of these samples was used in Table 1 and they were used as tests 1-8 in Table 2) each containing n = 20 sequences of length m are generated. A common signal of length l = 15 is implanted in each sequence with 4 mutations. The fragments length m varies from 100 through 1000.

Definition 1.   Let K be set of known signal positions in a sample and let P be the set of predicted positions. Then the performance coefficient is |KP|/|KP| [Pevzner and Sze, 2000].

Table 1 gives the average performance coefficient of our IRSA algorithm as defined for these eight samples. For comparison we reproduce here the data for several algorithms reported in Pevzner and Sze [2000].

Table 1: Testing results (performance coefficients) of IRSA in comparison with data for other algorithms [Pevzner and Sze, 2000].
Programs Length of sequence fragments (m)
100200300400 500600700800 9001000
CONSENSUS0.920.940.530.31 0.290.07 0.150.090.010.04
GibbsDNA0.930.960.510.46 0.290.120.090.340.000.12
MEME0.910.780.590.370.17 0.100.020.030.000.00
WINNOWER (k=2)0.980.980.970.950.970.92 0.580.020.020.02
WINNOWER (k=3)0.980.980.970.940.970.92 0.900.930.900.88
SP-STAR0.980.981 0.960.96 0.840.830.690.64 0.23
IRSA 0.99 0.950.910.74 0.64 0.600.47 0.37 0.310.28


The only algorithm with better performance is WINNOWER with k = 3. However, the upper estimate of the number of steps of the winning algorithm is exponential relative to the length l of the signal (22l), although, as shown in Pevzner and Sze [2000] the actual run time of the algorithm in realistic situations is lower. Still, it is not clear, whether this algorithm is applicable to signal of length l essentially more then 20.

On the other hand, the upper estimate of the number of steps in the consequent variant of IRSA is: n2m3l, where n is the number of sequences, m is the maximum of length of input fragments and l the length of the signal. In addition, a good parallelized version of IRSA can be constructed (as it can be seen from the Appendix).

We also considered the following discrete criterion for the same data (Table 2)

Definition 2.   A real site is assumed to be found if it is overlapped by some predicted site by at least half of its length.

Definition 3.   A predicted site is assumed to be hit if it is overlapped by some real site by at least half of its length.

Definition 4.   Sensitivity Sf = found/real. Specificity Sh = hit/predicted.

Table 2: Sensitivity Sf for the simulated samples.
  Length of sequences
100200300400 500600700 8009001000
test 11.001.00 1.000.850.00 0.85 0.500.650.400.40
test 21.000.950.700.80 0.750.85 0.600.250.30 0.25
test 31.001.000.85 0.900.90 0.750.350.350.250.15
test 41.001.001.00 0.80 0.850.750.70 0.450.150.35
test 5 0.95 0.850.951.00 0.900.000.75 0.400.25 0.40
test 60.95 0.950.850.85 0.750.85 0.700.50 0.50 0.25
test 71.00 0.900.900.85 0.800.800.50 0.300.350.30
test 81.00 1.001.000.70 0.750.550.30 0.250.400.25
Average0.99 0.96 0.910.84 0.710.68 0.55 0.390.33 0.29


Testing: natural data

The upstream regions of genes regulated by PurR, ArgR and Crp are considered. The PurR sample consisted of 19 sequences of length 200 with 21 sites of length 16 bp (two sequences contain two sites each). The ArgR sample consisted of 9 sequences of length 200 with 19 sites of length 18 bp (one sequence contained three sites and the remaining sequences contained two sites each). The CRP sample consists of 31 sequences of length 200 with 48 sites of length 22 bp (16 sequences contained one site each and the remaining sequences contained 2 through 4 sites). If a sequence contained more than one site, additional sites were recognized when the first found site was masked. To imitate corruption of the sample by siteless fragments, we eliminated one by one the strongest site from each sample while there were more then 3 sites in the sample or the average similarity of the remaining sites exceeded 1.

The values of sensitivity Sf and specificity Sh for the PurR, ArgR and CRP samples are given in Tables 3, 4, 5, respectively.


PurR sample.   Loss of true sites started when sites in eight fragments were masked. Still, even when half fragments had no sites, the majority of remaining sites were correctly recognized.

ArgR sample.   Arginine box is a weak signal and the binding specificity is cooperative recognition of site pairs by multimeric repressor complexes [Maas, 1994]. Nevertheless, ArgR binding sites were identified even when a considerable number of the best sites were eliminated. The first lost site was observed after five sites were eliminated.

CRP sample.   The sample of CRP binding sites included numerous weak sites. Many of these sites were not found even in the primary search over the uncorrupted sample (Table 5). Still, the specificity and sensitivity remained at the acceptable levels while more than 90% of sequence fragments in the sample contained sites.

Table 3: The results for PurR samples.
Number of sites in sample Number of sequences without sites Number of found sites Sensitivity Sf Specificity Sh
210190.901.00
200190.951.00
191180.950.95
182170.940.89
173160.940.84
164150.940.79
155140.930.74
146130.930.68
137110.850.58
128100.830.53
11960.550.32
101060.600.32
91150.560.26
81250.630.26
71320.290.11
61420.330.11
51520.400.11
41600.000.00
31600.000.00


Table 4: The results for ArgR samples.
Number of sites in sample Number of sequences without sites Number of found sites Sensitivity Sf Specificity Sh
19090.471.00
18090.501.00
17090.531.00
16090.561.00
15090.601.00
14080.570.89
13080.620.89
12140.330.44
11120.180.22
10120.200.22
9120.220.22
8120.250.22
7220.290.22
6320.330.22
5420.400.22
4500.000.00
3600.000.00


Table 5: The results for CRP samples.
Number of sites in sample Number of sequences without sites Number of found sites Sensitivity Sf Specificity Sh
480270.560.87
470230.490.74
460240.520.77
451250.560.81
442190.430.61
432180.420.58
422170.400.55
412140.340.45
403150.380.48
393130.330.42
384100.260.32
37590.240.29
36690.250.29
35660.170.19
34780.240.26
33860.180.19
32960.190.19
31970.230.23
301060.200.19
291150.170.16
281140.140.13
271250.190.16
261250.190.16
251250.200.16
241330.130.10
231460.260.19
221440.180.13
211430.140.10
201430.150.10
191540.210.13
181630.170.10
171730.180.10
161740.250.13
151840.270.13
141940.290.13
132030.230.10
122120.170.06
112100.000.00



Application: groups of orthologous genes

Complete genomes of gamma-proteobacteria Escherichia coli [Blattner et al., 1997], Escherichia coli O157 [Hayashi et al. 2001], Salmonella typhi [Parkhill et al., 2001b], Salmonella typhimurium [McClelland et al., 2001], Yersinia pestis [Parkhill et al., 2001a], Vibrio cholerae [Heidelberg et al., 2000], Haemophilus influenzae [Fleischmann et al., 1995], Pasteurella multocida [May et al., 2001] were considered.

A pair of genes from two genomes was considered to be orthologous if these two genes were the closest relatives of each other in these two genomes [Tatusov et al., 1997]. Then, pairs of orthologs were merged into clusters using the single linkage criterion. Transitivity was not required and small differences in the similarity level were ignored (thus one gene could have more than one ortholog in any given genome).

Upstream regions of length 200 bp were selected. No overlaps with other genes were allowed, so if the distance to the upstream gene was shorter than 200 bp, only the spacer was considered.

Closely similar fragments were filtered out, retaining E. coli fragments whenever possible. This allowed us to search for conserved regulatory signals without interference from insufficiently divergent sequences from closely related genomes (strains). The criterion of excessive similarity was existence of matches in at least 35 out of any 40 consecutive positions.

After the filtration step, there were 1967 subsamples of at least three fragments each. After that, 345 sequences of known regulatory sites of E. coli were taken from the dpinteract database [Robison et al., 1998] and matched to the samples. Both directions of DNA sequence fragments were considered. Overall, 311 known sites were mapped to 239 sequences. Other sites were not found either because they were located outside of the selected fragments or because the regulated genes had no orthologs. Known sites could be placed in several samples if they were located between divergently transcribed genes, since each known site was considered in direct and complementary directions.

Thus the input files contained 3 through 40 sequences fragments of length 40 through 200 bp. Signals of length 15, 20, 22, and palindromic signals of length 15, 16 and 22 were considered. The results are shown in Table 6. Ninety nine out of 311 known sites were found (that is, coincided with predicted signals or were subwords of the signals). Other sites were not found either becase the signals were too weak to be identified or because orthologs lost the regulation.

Table 6: Number of known sites and detected sites.
RegulatorNumber of known sites Number of known sites in samples Known sites on the direct strand Known sites on the complementary strand Detected sites
ArcA149279
ArgR17201463
CpxR126422
Crp494126153
CspA46331
CynR24221
CytR54400
DeoR31100
DnaA84311
FadR79721
FarR48264
Fnr1410641
FruR126334
Fur99634
GalR74312
GcvA41101
GlpR1314954
Hns1511654
Hu31100
IclR21011
LacI30000
LexA19171349
MalT1017985
MelR24220
MetJ152013711
MetR810731
NarL117431
NarP810640
NtrC54311
OmpR97522
PdhR22201
PurR22151238
RpoN64313
TorR412846
TyrR17131305
Total345311 20310899


In a separate study, IRSA was successfully applied for analysis, in particular, of the GlpR-regulon in alpha-, beta-, gamma-proteobacteria. This regulon encodes proteins required for utilization of glycerol-3-phosphate and its precursors. New palindromic binding signals were identified in -proteobacteria:

TGTTCGATAACGAACA in Enterobacteriaceae,
wTTTTCGTATACGAAAAw in Pseudomonadaceae,
AATGCTCGATCGAGCATT in Vibrionaceae.

In - and -proteobacteria the signal consisted of 3-4 direct repeats of TTTCGTT with a spacer of 3-4 base pairs [Danilova et al., in press].

The exe-file of the program is available from L.V. Danilova.




Appendix: the parallel implementation

Like other signal identification algorithms, IRSA is quite time-consuming, as it requires O(m3n2l) time. Thus, large-scale application of this algorithm requires an efficient computational scheme, e. g. using a parallel computer.

Parallelization of the existing algorithm is complicated by two factors which are typical for many algorithms in genomics: cyclic (i. e. consecutive) structure of the algorithm and the model of the shared memory it is implicitly based on. However, the majority of modern parallel computers divide the random access memory more or less equally among processors interacting with each other via the internal network. The communication is accomplished by short messages conforming to the MPI or a similar standard. Apparently, this method of data sharing is more time-consuming than direct access to the shared memory. Now we shall discuss the first of these difficulties.

The sequential algorithm has the structure outlined in Figure 1. Here each repetition of the loop searches the for quasioptimal solution of the problem for some fixed enumeration of the given sequence fragments. For brevity this enumeration we call permutation, and the process of the problem solving is called assembling of the given permutation. Once a permutation has been assembled, the algorithm generates the next permutation, and so on. The optimal solution would be the best one found from all possible n! permutations, which is of course unattainable.

In reality we choose a termination condition of the loop so that the solution obtained from already considered permutations would be close to the absolute optimum in terms of some fixed quality function. For that we propose two principles and two corresponding conditions of the algorithm termination. The first principle takes into account the quality of the found signal and the relative contribution of each sequence to the joint quality: the higher is the score of a site in a particular fragment, the smaller is the rank of this fragment in the next permutation. Another principle takes into account the completeness of coverage of the input sample by main arcs of graph G which controls the consecutive dichotomy of the sample (see "The Algorithm").

The main burden of calculations (and, therefore, the operating time of the algorithm) falls to the assembling phase. Since this block is located inside a loop (see Figure 1), the only opportunity to parallelize the algorithm is to perform parallel calculations inside the block. Though possible in principle, the assembling process is hardly scalable, as vector-like data structures correspond to the consecutive dichotomy of the input fragments. So dimension of a vector becomes rigidly connected to the number n of sequences, and besides it is changeable at different levels of the graph G, i. e. at different stages of assembling. In addition, the time of computation for separate elements of this vector (i.e. sequence pairs) varies, whereas the algorithm needs to wait until the slowest component is complete prior to processing the next level of the graph. Thus, a naive attempt to link separate stages or similar elements of assembling to processors would not be time effective, nor would it allow us to use all available processors and balance their workload.


Two-dimensional list of permutations and the wavelike scheme of computation

When the algorithm solves an optimization problem for a given quality function, one can observe the following analogies. Each permutation is similar to a point in the space of optimum search, and assembling leads to the result similar to calculation of the function value in this point. The consecutive algorithm generates points of two kinds corresponding to the above mentioned principles. The second principle of the coverage maximization provides new base points to search for the global optimum (we need them as we have no information on the geometry of the response surface). The first principle leads to the best local solution, starting at a chosen base point.

The general idea of the proposed parallelization technique is that all available processors of the parallel computer system (except for one root processor) at every moment are engaged in local optimization, each one starting at its own base point. All data necessary to apply the first principle are kept in the local memory of the processor in charge. The dedicated root branch receives only results of the local calculations (recognized signals along with their quality data). On completion of the local optimization (this process we call extension), released processors receive new base permutations, and so forth. The termination criterion of this algorithm will be exhaustion of the entire set of base points generated using the second principle, provided that all parallel branches have completed their computations.

To implement this idea we generate not a straight queue of permutations like in the consecutive algorithm, but the two-dimensional <P, Q> list (Figure 4). Its backbone is P-list of permutations P1, P2, ..., Pk composed at the start of the algorithm; the length k of this list is determined by the desirable degree of coverage for the set of base points. For example, if we demand that each pair of input fragments appears at least once as a principal arc at the upper level of graph G, i. e. these sequences fall into different halves at the very first dichotomy of the entire set, then, obviously, k = n (n-1)/2.



Figure 4: <P, Q> list of permutations and assembling order.


Further, each element of the P-list initiates so a called Q-list corresponding to the above described process of extension. These Q-lists are constructed dynamically during the work of the algorithm. Specifically, after analysis of quality of the signals already found in the i-th Q-list (including its root Pi), the list either proceeds with a next permutation Qi, j or terminates (symbolically denoted by asterisk in Figure 4). In the latter case the processor earlier serving the i-th Q-list is released and takes the next unprocessed element of the P-list, extending from it a new Q-list.

It is easy to see that in such two-dimensional set of permutations, the parallel processing is conducted like a computing wave propagating from the left top corner of the structure until the complete <P, Q> list is processed. For the sake of simplicity Figure 4 shows successive positions of the wave front for two processors working in parallel and with the assumption of constant duration of assembling for any permutation. However, one can easily imagine how this algorithm works in the general case.

From the parallelization efficiency point of view, the described scheme has many advantages of which we note three most important ones: (1) it allows one to put into operation many processors at once (at least as many as is the length k of the P-list); (2) all secondary branches assembly their given permutations independent of other branches; and (3) low-intensive data exchange between each of secondary branches and the root branch of the algorithm: a permutation itself towards the secondary branch (n values), and the found signal along with a quality of each its of words (2n values) towards the root branch.

These advantages allowed us to create the parallel algorithm independent of any specific parallel computer and the number of available processors; the only prerequisite is that the computer complex supports the MPI standard (at least the first edition). We also avoid using shared memory, as the criteria checking and generation of the permutations are performed by the single root branch. The only common data (a matrix of pair-wise similarities of all words from all sequences) is not changed during the work of the algorithm, so it may be calculated and stored simultaneously in all parallel branches.

The described parallel algorithm has been implemented as a 32 bit console application written in ANSI C. It may be compiled by gcc, pgcc, Borland C++ 5.02 and other compilers without any modification, and works in Windows 9x/NT4/ME/2k/ XP and Linux environments. Debugging and initial testing of the program were carried out on PCs using WMPI 1.54 by Critical Software. The experiments are run on the MBC1000M supercomputer of the Joint Supercomputer Centre of RAS. The results obtained for artificial and real examples show high efficiency of parallelization of the algorithm: the load ratio of secondary processors equals 94-96%. Experiments also confirm the theoretically predicted linear dependencies of the assembling time on the number of input fragments (Figure 5) and of the computation speed on the number of available processors (Figure 6). We did not observe any deceleration of this linear growth of performance.



Figure 5: The assembling time as a function of the number of input sequence fragments.



Figure 6: Linear growth of the parallel algorithm performance (n = 14, m = 200, l = 20).



Acknowledgements

This study was partially supported by grant 55000309 from The Howard Hughes Medical Institute. We are grateful to P. Novichkov for the help with the data, and to K. Yu. Gorbunov, L. I. Rubanov and A. A. Mironov for discussions, to O. N. Laikova for help with analyzing GlpR and to E. Stavrovskaya for processing the results.



References