AliBaba2: Context Specific Identification of Transcription Factor Binding Sites

Niels Grabe





Bioinformatics Group,Institute for Technical and Business Information Systems,
Otto-von-Guericke University Magdeburg,
PB 4120, 39016 Magdeburg
E-mail: grabe@iti.cs.uni-magdeburg.de





Edited by E. Wingender; received December 22, 1999; revised February 28, 2000; accepted April 28, 2000


ABSTRACT

Currently, prediction of transcription factor binding sites is widely done using matrices collected from literature. This leads to several problems. We cannot actively control the conservation of the matrices, we cannot systematically use all binding sites available, we do not know which sites were used and which were discarded in matrix construction, we cannot compare and evaluate matrices easily, we cannot detect redundancy and we cannot control sensitivity and specificity. So we are lacking control during the identification process. In this paper I propose a method overcoming these problems. I assume that each binding site has an unknown context which determines its sequence. This leads to the idea of constructing specific matrices for each sequence we are analysing. To do so we have to regard identification of binding sites as a general process, starting at a dataset of known binding sites and ending with the identification of a potential new binding site. In this paper such a process is presented. Besides overcoming the mentioned problems, the implementation also reaches a significantly higher accuracy than current approaches. Evaluations are done analysing all binding sites of TRANSFAC 3.5 public. The resulting tool AliBaba2 is available at http://wwwiti.cs.uni-magdeburg.de/~grabe/alibaba2.

Keywords: bioinformatics, data mining, clustering, context specific prediction, multiple alignment, nucleotide matrix, promoter analysis, sequence modeling, specificity, sensitivity, transcription factor binding sites



INTRODUCTION

Identification of transcription factor binding sites is a basic step for the analysis of gene regulatory networks. Currently it is done rather unsystematically by comparing matrices taken from literature to an unknown DNA sequence. There are several programs using these matrices for analysing unknown binding sites, e. g. Signal Scan [1], TESS [2], MatInspector [3] and MatrixSearch [4].The current matrix approach has several disadvantages. By using the predefined matrices, we cannot influence the conservation of the matrices. For some factors there are several very specific matrices available, for others we just have one single weak matrix. So we cannot say how accurate a prediction will be. Also, we often do not know how many binding sites were used for constructing a matrix. This is essential for supporting a prediction. As we do not know the construction process of the matrices, we do not know which binding sites are used and which are not. This means we do not know the criteria which led to the decisions taken. Also, we cannot measure the similarity of matrices, which leads to redundancy in predictions. So we can hardly interpret the results of a prediction program. This makes it impossible to process the results in further in silico experiments. And, absolutely important, we cannot actively control specificity and sensitivity of the predictions. To overcome all these questions arising from the current state of the art, I propose a new approach for the identification of binding sites: a context-specific process starting at the known binding sites and ending with the identification of a potential new site. The basic idea for this process is the idea of context.


Context

A context comprises all conditions which determine the DNA sequence where a transcription factor binds.


Context specific identification of sites

Context specific identification of binding sites means the construction of individual matrices for each sequence to be analysed.

Principal idea and algorithms presented in this paper are not restricted to the matrix approach, an oligonucleotide approach is possible as well. But the matrix approach is supported by the Berg and von Hippel theory which yields a clear distance measure as we will see later. There are several motivations leading to the assumption of a context. I do not want to define the semantics of context in detail. I just assume each binding site possesses a context given by the role it plays in gene regulation.

  1. Suppose we use the current approach of predefined matrices and therefore construct a new matrix. Constructing a matrix means grouping sites. So what is our grouping criterion? As there are more and more sites for one factor available, this is an important question. The problem of grouping sites exceptionally shows in the well known gap problem which is partly caused by flexible binding domains [5]. Introducing gaps in matrices is not only a difficult task but also leads to unidentifiable binding sites not containing the gap. We can avoid the gap problem if we construct individual matrices for each number of gaps we observe. If we do this, we already perform an individual context specific analysis of the sequence under study. We see, the gap problem is just a specialisation of the question of how to group sites. We have to find a clear algorithm to decide which sites to use for constructing a matrix and which not.

  2. The idea of context specificity can be seen as an extension to the Berg and von Hippel [6] theory. In their model, Berg and von Hippel construct a matrix of energy values from a multiple alignment. The matrix allows to calculate a relative energy needed to bind a transcription factor to DNA. The energy is calculated relatively to the binding constant of the factor. For the consensus sequence of the matrix a relative energy of zero is needed. The more a DNA sequence differs from the consensus sequence the more energy is needed. The question never asked by Berg and von Hippel is where the alignment comes from. Berg and von Hippel made the assumption of one factor – one ideal alignment. But already now for several factors there are too many binding sites available to construct only one matrix. The idea of context specificity addresses this problem which means to derive the alignment from the sequence analysed. For each sequence analysed the algorithm builds a matrix from known binding sites. Each matrix allows us to calculate the energy needed to bind the factor. If not too much energy is needed, a binding of the factor can be predicted. In the ideal case we build matrices with consensus sequences equal to our analysed sequence. Then, according to Berg and von Hippel, there is no relative energy needed.

  3. The extension of the Berg and von Hippel model has a significant theoretical advantage. For each known binding site the Berg and von Hippel model gives the energetic distance to the consensus of the matrix. The approach presented here allows to build matrices with consensus sequences equal to our unknown sequence. In this case the individual energy distances do not only represent the distances to the consensus sequence. They also give the distances to the unknown sequence analysed. With the Berg and von Hippel energies we have an energy based measure to calculate the distances between the known binding sites and our unknown sequence. Thus, with the extended Berg and von Hippel model we can measure the confidence of a binding site prediction in a physical way.

  4. An example of context specificity can be found in the organisation of actin alpha cardiac promoters [7]. If we compare the promoters of human, chicken, frog and mouse, we see four CArG Boxes (CArG1-4) in each single promoter. Although all sites in all promoters bind SRF, the binding sequences differ inside each promoter. However, the sequences of CArG1 in all four species are highly similar. The same applies for CArG2-4. So we can directly see that each binding site of SRF is determined not by random but by a constant context force.

  5. An operational motivation for context specificity can be derived from Wassermann and Fickett [7]. They show that by using muscle-specific matrices a much higher specificity can be achieved in promoter recognition. Similarly Frech et al. [8] use muscle-specific TATA und Inr matrices. This supports the idea of tissue as a context determining parameter.


METHODS

The method for a context specific process to identify binding sites consists of five steps. Steps 1 and 2 form a preprocessing phase. The context dependency begins with step 3.

  1. Binding sites are extracted from TRANSFAC [9] and expanded to a standard length using EMBL Database [10].
  2. The binding sites are grouped by transcription factors (Txxxxx) assigned to a class of the classification domains from [11]. Steps 1 and 2 are trivial and not shown here.
  3. Pairwise alignment: the known sites are pairwise aligned to the sequence to be analysed.
  4. Segmentation: the aligned known binding sites are grouped according to their function and position in the unknown sequence. Each segment ideally contains just one binding site. However, in case that positions overlap, we may get segments containing more than one site.
  5. Model construction: For each determined segment a matrix is constructed. If a segment contains more than one potential site, the algorithm constructs multiple matrices. As already stated above, matrices are used for modelling.


Pairwise alignment

In this step the known binding sites are pairwise aligned to the unknown sequence to be analysed, u. Firstly, a score which constantly rises by h for runs of successive matching nucleotides is given. Secondly, the score of the last matching nucleotide is assigned to all nucleotides of the run. Each run of length l gets a score of h*(l-1). This ensures that a successive run of length l is always scored higher than two runs of length a and b with l = a + b. With n as the number of matching runs in a binding sequence s and the unknown sequence u we get:

pairwisesim(u, s) := åi=1..n h* length_of_run(i)

Example with h=2:

a a c a g a c g t t t

t a c a g t c g t c a

0 8 8 8 8 0 6 6 6 0 0

In step 1 binding sites have been expanded using EMBL to a length of, e.g., 20 bp. Only a relatively short part of the 20 bp sequence will be physically necessary for binding the factor. Therefore, the pairwise alignment is done considering only a window of the extracted binding sites of a predefined length (e.g. 12 bp). This window is called the carrier of the alignment. The length of the carriers is chosen equal to the length of the matrices constructed later on. Result of this step is a set P of pairwise aligned binding sites with pairwisesim exceeding a given threshold pairsim.


Segmentation

To construct matrices from P, we first have to structure it by function and position. We can do this as follows. TRANSFAC supports a classification of DNA-binding domains organised in six levels (Tab. 1).

Table 1: Classification of Transcription Factors taken from [11]

Level Name Criterion Example
1 Superclass General topology of DBD Zinc-coordinating
2 Class Structural blueprint of DBD Zinc finger nucl. recept.
3 Family Functional criteria T3R/RAR
4 Subfamily Sequence similarity of DBD RAR
5 Genus According to factor gene RAR-ß
6 Factor "species" Initiation/splice variants RAR-ß1, RAR-ß2

So we can set up a relation between two domain classes c1 und c2 with regard to a considered level k:

c1 <k c2 Û c1 <k-1 c2 with c1 <1 c2  Û c1 < c2.

This enables us to sort P = {p} when a level k is given.We then build blocks B of P. Each block contains sites belonging to just one binding domain structure.

Example: A pairwise alignment p is described by a pair <site_id, site_class>. There is P = { <R0001, 4.2.3.1>, <R0002, 4.2.3.2.>, <R0003, 4.2.3.5.>, <R0004, 4.2.4.1.>, <R0005, 4.2.4.2.> }. The user of the program sets level k to 3. So we get the following blocks B:

    B1 = { <R0001, 4.2.3.1>, <R0002, 4.2.3.2.>}

    B2 = { <R0004, 4.2.4.1.>, <R0005, 4.2.4.2.> }.

Then we split each B into segments G using the position of each p Î B regarding the analysed sequence. We introduce a split when two p of B are further apart than half the desired matrix width. So the segmentation step ends with a set of Segments G. Each G contains p Î P from binding sites bound to similarly structured domains and found at similar positions.


Matrix Construction

Matrix construction is done during a gap free heuristic alignment. Gaps are not needed because only similar sites are aligned to the sequence to be analyzed, u. The heuristic alignment is inevitable for adequate performance. It aligns the binding sites of each segment G in the order given by each site’s similarity to u. The problem now is that we cannot simply construct one single matrix out of the sites found in one segment. As explained above, it is possible that there are multiple sites in one segment. This is the case when binding sites overlap. In such a case we have to construct multiple matrices for our segment G. This is exceptionally the case when analysing promoters because neighbouring or alternative binding sites are essential for promoter function. So how can we deal with multiple sites in a segment ?

The most obvious approach constructs one common matrix in the beginning and tries to split it afterwards. This is illustrated in Fig. 1. Out of four binding sites we construct one common matrix. The common single matrix should contain two distinct parts (shaded). Due to the overlapping of sites noise is introduced. To avoid this kind of complications, we adopt the strategy of hierarchical clustering. We successively merge binding sites of G to an increasingly larger matrix until the matrix’s quality is below our requirements.

 
Figure 1: Using overlapping sites in matrix construction leads to noise.


Each site that we merge to a matrix has to be checked. I propose three requirements for a site to be used in a matrix:

  1. The matrix’s average conservation should be above a given minimum conservation, called cons

  2. The matrix should have a minimum similarity to u. We therefore introduce the similarity measure promsim in the following.

  3. There should be a minimum number of sites in a matrix. The number of sites used to construct the matrix is called the support of the matrix.

The conservation can be easily calculated [12; 13] by the average information content. For similarity calculation between matrix and u the Berg and von Hippel [6] measure gives best results (not shown here). In their model Berg and von Hippel first calculate the relative binding energy r(u) which is needed for a protein to bind to an unknown DNA sequence u. Multiplication of relative binding energy with the factor’s binding constant gives the total binding energy. This relative energy is calculated from a matrix of base-pair probabilities plB and a correction parameter l. The relative reduction in binding energy is

where elB is the energy needed for binding to each single nucleotide. The probability plB is taken from the matrix to which the unknown sequence u is compared. pl0 is the matrix entry for the corresponding nucleotide from the unknown sequence. We set l = 1 since we just process pure sequences. As we now know the reduction of energy, we can calculate the resulting binding energy of the factor binding to the sequence u:

energy (u, m) := K0 * exp (- r(u) )

where K0 is the binding constant of the transcription factor and the matrix m = (plB). We have to set K= 1 because we have no information about it in our sequence databases. We get our similarity measure promsim:

promsim (u, m) = energy(u, m) with K0 = 1 and l = 1.

The following shows the resulting algorithm for matrix construction. For each site the algorithm decides whether to use it in an already existing matrix or to create a new one. Poorly supported matrices are discarded at the end. For matrix construction we have to specify the width of each matrix. The algorithm presented uses a standard width given by the parameter matwidth. This may seem not appropriate since different factors possess binding sites of a different length. But in the general case we do not know the ‘real’ length of binding sites. With the parameter cons we require a minimal conservation, which means we require a minimal amount of information needed for a decision about the existence of a binding site. By varying matwidth we also vary the requested amount of information. We do this indirectly by reducing the amount of information in the matrix. As we do not need two parameters for the same thing, we can set matwidth constant and vary only cons.


Algorithm matrixconstruction

for all p Î G

              ownmatrix := false

              for all m Î already constructed matrices M

                                 if p already used for any M

                                               do nothing, continue with next p

                                 m’ := m merged with p

                                 if promsim (u, m’) < promsim

                                               ownmatrix := true

                                 else if avgconservation(m’) < cons

                                               ownmatrix := true

                                 else

                                               m := m’ // Keep merger

                                               continue with next p

              if ownmatrix

                                 M := M È initial_matrix_from(p)

for all m Î M

              if support(m) < minsupp // support = number of sites a matrix is derived from

                                 discard m

Ñ

Fig. 2 illustrates the interaction of the introduced concepts: binding sites, pairwise alignment, segments and matrices in an example. There are 12 sites given in s1 to s12. They lead to 12 pairwise alignments p1 to p12. These are aggregated in two segments of which each contains two matrices.

 
Figure 2: Interaction of concepts.


IMPLEMENTATION

The method was implemented as the program AliBaba2 (alignments for analysis of binding sites) using C++ with the Standard Template Library (STL) on Windows NT and Solaris. For convenient usage an interactive web interface was developed (http://wwwiti.cs.uni-magdeburg.de/~grabe/alibaba2). There is a two step user interface for viewing the prediction results. At first only the names of segments are given as the prediction output. Then each segment can be clicked to show the sites used for the according alignment. All sites, factors, and classes are linked to TRANSFAC and can be individually explored.


RESULTS

For evaluation of the presented approach several tests have been done. As a source for binding sites TRANSFAC 3.5 public has been used. EMBL Release 60 was used to expand the sites. As a reference tool MatInspector 2.2 in the public version was used for comparison. As a negative test set the exon sequences (AF009652, AFPYRG_5, AFUOXGENE_4) with length 2 kb have been used. To check if these sequences are a representative sample, the sample as well as the following 40 kb of exon sequences have been analysed (AAD7712_8, AALFUL_9, ABPGKA_6, AF011341_6, AFMETALLO_9, AFMETPRO_8, AFU132504_10, AN03278_8, ANDNAUAY_6, ANGPDAA_13, ANNIIA_8, ANPACA_6, ANPGP1_8, ANPGP2_6, ANPRNB1_4, ANQUTR_3, ANQUTR_5, ANUAPA_10, AOD895_7, AOPGKA_8, BCPGP1_6, CCD11_5, CCHTS1G_9, CCY15418_7, CHY15839_8, CPGNRPA_7, CSD540_9, ED18SRE_5, ED18SRR1_4, EN1836_8, ENAJ11563_8, ENNIRA_5, ENU62482_9, GCGLN_8, GFNIAD_7, GFP450I_7, GFU17243_8, HSGAMP_9, MCPYRG_11, NC11565_8, NCBLI4GEN_6, NCCPC3_10, NCPKNCHOM_5, NCSPII_6, NHPDA61_9, PAFPR1_7, PBAB3043_7, PBR6073_15, PCCBHI2A_6, PCCELLUG_5, PCCELLUG_11, PCFACAG_5, RNPYR4_9, RRPAL_8, SCPFY_13, SCSMIP_7, SMRAD9_8, SNO9826_11, SP18SRE_5, SPALP1_7 ).

The number of binding sites normalised to 500 bp predicted for the sample and the 40 kb set are shown in Tab. 2 with varying parameter conservation. The sample can thus be considered representative.


Table 2: Accuracy of the chosen 2 kb exon sample

cons predicted sites per 500 bp in 2 kb predicted sites per 500 bp in 40 kb
65 % 180 180
70 % 111 111
75 % 34 39
80 % 12 12
85 % 3 2
90 % 0 0

The data set resulting from steps 1 (expansion) and 2 (class assignment) results in 3196 sites grouped to 742 factors, where each factor contains at least 4 non redundant binding sequences. All tests are done in respect to the following reference set of parameters.


Reference parameter set


Parameter evaluation

In the following the influence of the single parameters on the binding site recognition accuracy is evaluated. Recognition accuracy is described by a two dimensional point with sensitivity and specificity as coordinates. Both, sensitivity and specificity are measured in absolute numbers to simulate a realistic application.

Fig. 3 shows the influence of parameter pairsim on the recognition accuracy. Specificity is given in logarithmic scale to address its approximately linear correlation to sensitivity.

Figure 3: Influence of pairsim on recognition accuracy.


We see that the selection of the pairsim parameter has little influence on recognition accuracy. But it has a big influence on runtime (Fig. 4). Shown is the runtime of segmentation and matrix construction steps. They are measured together as matrix construction is called for each segment identified in segmentation step. As segmentation is cheap in runtime, matrix construction is the time consuming step of both. The lower pairsim the more pairwise alignments are found and thus the time for matrix construction rises. Therefore we should keep pairsim at 64.


Figure 4: Influence of pairsim on runtime.

The influence of matrixwidth is checked in the following. Reasonable values comprise 8, 10, 12 and 14 bp. We clearly see that 10 and 12 bp give the best results (Fig. 5). The runtime is also influenced by matrixwidth. A matrixwidth of 12 bp is nearly 50 % more time consuming than a width of 10 bp (Fig. 6). Still, if runtime is not most important, 12 bp should be chosen as the standard parameter.

Figure 5: Influence of matwidth on recognition accuracy.
Figure 6: Influence of matwidth on runtime.


Next, the parameter promsim is evaluated. A promoter similarity of 100 % reflects that the matrix consensus is identical to the analysed sequence. If an unknown sequence contains a nucleotide different from a maximally conserved nucleotide in a matrix the Berg and von Hippel score will yield a similarity of 0 %. Thus, the Berg and von Hippel measure is rather strict. Other measures could be used which would enable a comparison of measures. In Fig. 7 promsim is varied from 1% to 100 %.


Figure 7: Influence of promsim on recognition accuracy.


As expected, a 100 % value of promsim gives highly specific results in the very high specificity area above – 15 FP / 500 bp. For the general application case 1 or 10 % are optimal values. Runtime for promsim is not shown as no significant dependency could be measured.

The last parameter evaluated is the classification level. During segmentation binding sites are aggregated by “function” using the domain classes of transcription factors. Which class level yields optimal results? Fig. 8 compares the levels 3, 4, 5 and “3 to 5”. “3 to 5” means the whole range of levels is chosen: AliBaba2 then constructs matrices for all levels. We see that considering solely level 4 yields optimal results. Runtime correlation is shown in Fig. 9. It can be seen that building models for multiple levels has additive time costs where level 4 shows average runtime.


Figure 8: Influence of classification levels on recognition accuracy.
Figure 9: Influence of classification levels on runtime.


From the evaluation done here we derive an optimal parameter set for AliBaba2 regarding the defined absolute sensitivity and specificity.



Optimal parameter set



Evaluation of Algorithmic Complexity

To evaluate the algorithmic complexity of the developed method pairwise alignment and matrix construction in combination with segmentation have been investigated (Fig. 10). I constructed an artificial long promoter sequence by concatenating 30 kb of promoters of class 6.1.2. from EPD Release 60. Class 6.1.2. stands for vertebrate genes / chromosomal genes / structure proteins. The runtime of the pairwise alignment depends on the number of sites found similar. The runtime of the matrix construction and segmentation depends on the number of matrices built (Fig. 11). By normalising the runtime to an average promoter length we get a constant runtime over the whole promoter sequence. Thus we see a linear complexity referring to the length of the sequence analysed.


Figure 10: Runtime of pairwise alignment normalised to 500 bp while varying sequence length.
Figure 11: Runtime of matrix construction and segmentation normalised to 500 bp while varying sequence length.


In Fig. 12 the runtime dependence from the number of binding sites processed is evaluated. We also see a principally linear dependence. Thus we conclude the time complexity to be:

O (n, l) = n * l

with n as number of sites and l as sequence length. Space complexity depends naturally solely on the number of processed sites:

O (n) = n.


Figure 12: Comparison of not normalised runtimes while varying the number of used sites.


Comparison with MatInspector

To compare AliBaba2 with MatInspector public those binding sites which are linked to a matrix are selected from TRANSFAC 3.5 publilc. We get 1829 sites from 366 factors. Each site is linked to a matrix which is contained in MatInspector’s library because MatInspector uses the TRANSFAC 3.5 matrix library. Therefore MatInspector has the potential to recognize all of the 1829 sites. MatInspector is mainly controlled by matscore, the central threshold for matrix comparisons. To perform automatic testing of the 1829 sites, a test bed has been constructed. In it, MatInspector is wrapped with an adapter. The adapter writes a binding site to a file, starts MatInspector and reads the results afterwards. Parsing the results, the adapter checks if the right matrix has been assigned to the binding site tested. For each threshold of matscore we get a pair of true positives (TP) and false positive (FP) predictions. Varying matscore in a given range, we get a graph of sensitivity and specificity. The test bed is shown in the Fig. 13. Fig. 14 compares sensitivity and specificity for MatInspector and AliBaba2 with the reference parameters.


Figure 13: MatInspector test bed.
Figure 14: MatInspector public vs. AliBaba2 on TRANSFAC 3.5 public.


MatInspector knows two settings: "selected matrices" and "all matrices". In "selected matrices" mode only better conserved matrices with a ‘random expectation’ below a certain threshold are used (see MatInspector documentation for more details). This increases recognition accuracy but limits the number of sites which can be recognized. Outcome of the comparison is that AliBaba2 clearly has as well a higher sensitivity as higher sensitivity / specificity ratio. In all specificity ranges it constantly can detect about 500 binding sites more than MatInspector public which means about a doubled sensitivity in the average. It should be mentioned that the recognition accuracy of AliBaba2 is measured via Jack-Knife test. For the MatInspector this was not possible as we did not repeat the construction of all matrices of TRANSFAC 3.5 public. Thus, for MatInspector there is no separation of training and test data. But we take the results from MatInspector as an upper boundary of recognition accuracy.


DISCUSSION

I have shown a new approach for identification of transcription factor binding sites. The approach is the first which sees the prediction of binding sites as a process starting directly at the known binding sites and ending at newly identified potential binding sites. So all binding sites available can be used in a systematic way. The main idea of the approach shown here is to construct individual matrices for each binding site predicted. The matrix construction algorithm clearly defines which sites are used to do this. The main criterion for grouping sites for matrix construction is the structure of the DNA-binding domains of the transcription factors. The construction process ensures that for each matrix the given conservation will be reached. There is a clearly assigned function to each matrix and thus redundancy can be avoided. Finally it is possible to define a minimum number of sites for each matrix. Thus, we get a maximum control over the identification process. I evaluated the effect of all parameters on sensitivity and specificity which leads to an optimal parameter set regarding the defined overall sensitivity and specificity. Generally the selection of parameters is not critical. In comparison to MatInspector public, AliBaba2 exhibited an approximately two-fold higher sensitivity which means a large improvement. All determined experimental results rely on the testing procedure. It is certainly possible to apply different measures of success. In this paper the absolute number of sites recognized is counted. An alternative measure is to count not the number of sites, but the number of potentially binding factors. Such a measure would strengthen MatInspector because only one single site for each matrix would have to be recognized. Results not shown here indicate that the number of factors recognized by MatInspector and AliBaba2 are similar. But the main task for programs like AliBaba2 and MatInspector is to recognize a single binding site with the highest accuracy possible. So it is sensible to measure how many sites can be recognized by a program. There is also a second argument for measuring sites and not factors. The number of sites known per factor can be expected to increase in future. So the task will be not to suggest one factor in general but to recognize factor binding in association with the organism or expression conditions of the corresponding genes. This is the idea of context. So we need a site based approach and therefore a site based measurement of success. Concluding, I think the measure of success used here is appropriate. All tests have been done based on the freely available tools and data. There is also a commercial version of TRANSFAC 3.5 with many more sites and a commercial version of MatInspector with more matrices and individual matrix thresholds available. Evaluations regarding commercial TRANSFAC and MatInspector versions will be done in the future.


ACKNOWLEDGMENTS

Thanks to Gesine Folkers for proof-reading.


REFERENCES


  1. Prestridge, D. S. (1991). Signal Scan: a computer program that scans DNA sequences for eukaryotic transcriptional elements. Comput. Applic. Biosci. 7, 203-206.
  2. Schug, J. and Overton, G. C. (1997). TESS: Transcription Element Search Software on the WWW, Computational Biology and Informatics Laboratory, University of Pennsylvania.
  3. Quandt, K., Frech, K., Karas, H., Wingender, E. and Werner, T. (1995). MatInd and Matinspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23, 4878-4884.
  4. Chen, Q. K., Hertz, G. Z. and Stormo, G. D. (1995). Matrix-Search 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Applic. Biosci. 11, 563-566.
  5. Spolar, R. S. and Record Jr., M. T. (1994). Coupling of Local Folding to Site Specific Bindung of Proteins to DNA. Science 263, 777-784.
  6. Berg, O. G. and von Hippel, P. H. (1987). Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 193, 723-750.
  7. Wasserman, W. W. and Fickett, J. W. (1998). Identification of Regulatory Regions which Confer Muscle-Specific Gene Expression. J. Mol. Biol. 278, 167-181.
  8. Frech, K., Quandt, K. and Werner, T. (1998). Muscle actin genes: A first step towards computational classification of tissue specific promoters. In Silico Biol. 1, 0005.
  9. Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I. and Schacherer, F. (2000). TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28, 316-319.
  10. Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G. and Tuli, M. A. (2000). The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 28, 19-23.
  11. Wingender, E. (1997). Classification Scheme of Eukaryotic Transcription Factors. Mol. Biol. 31, 483-497.
  12. Schneider, T. D., Stormo, G. D., Gold, L. and Ehrenfeucht, A. (1986). Information content of binding sites on nucleotide sequences. J.Mol. Biol. 188, 415-431.
  13. Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. 27, 379-423 and 623-656.