On the Effects of Transcription Factor Properties on the Information Content of Binding Sites

Jan T. Kim, Thomas Martinetz and Daniel Polani




Institut für Neuro- und Bioinformatik
Seelandstraße 1a,
23569 Lübeck, Germany
Phone: +49 451 3909-585,
Fax: +49 451 3909-545
Email: {kim, martinetz, polani}@informatik.mu-luebeck.de






INTRODUCTION

Networks of genes which encode transcription factors (regulatory networks) play a central role in the realization of phenotypic traits based on genetic information. Sequence-specific recognition of DNA subsequences by proteins is a key mechanism in constituting regulatory networks. Understanding the information theoretic principles underlying the evolution of transcription factors and their binding sites is therefore a major challenge in bioinformatics [1]. Advances in this field are expected to provide a basis for improving algorithmic binding site identification and promoter analysis [2], and for deciphering regulatory codes.

Previous studies [3] have suggested that the information content deduced from binding site sequence sets (Rsequence) approximately equals the information content deduced from relative binding site abundance (Rfrequency). Here, we investigate the relation between these two information quantities using a maximum entropy approach.


OUTLINE OF THE MODEL

We formally model genomes of length N by vectors g = (g_1, ..., g_N) of words, where g_i ∈ {w_1, ..., w_K}. Transcription factors are represented by binary vectors t = (t_1, ..., t_K), where K = 4^l is the number of possible words of length l, and t_j = 1 if the factor binds to w_j and t_j = 0 otherwise. The number of binding sites in the genome is denoted by n and the number of words recognized by the transcription factor is denoted by k. Within this modelling framework, Rsequence and Rfrequency are given by the equations

$$R_{\mathrm{sequence}} = \log_2 \frac{K}{k}, \qquad R_{\mathrm{frequency}} = \log_2 \frac{N}{n}. \tag{1}$$

Thus, if Rsequence = Rfrequency we expect

$$k = \frac{K n}{N}. \tag{2}$$

From a probabilistic point of view, the expectation to find Rsequence ≈ Rfrequency is to be understood to mean that genome-factor tuples in which k ≈ Kn/N are the most common type for a given value of n. We therefore derived a formula for calculating η(n, k), the number of tuples composed of a transcription factor binding to k different words and recognizing n sites on the genome:

As a motivation of this formula, note that

A more detailed derivation and discussion of this equation will appear in a forthcoming paper.
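As a numerical illustration of the quantities introduced in this section, the following Python sketch assumes the forms Rsequence = log2(K/k) and Rfrequency = log2(N/n), which are consistent with the condition k ≈ Kn/N discussed above. The function names are hypothetical and chosen only for this example.

```python
import math


def r_sequence(k: float, K: int) -> float:
    """Assumed form: information content (bits) when k of K words are bound."""
    return math.log2(K / k)


def r_frequency(n: float, N: int) -> float:
    """Assumed form: information content (bits) when n of N positions are sites."""
    return math.log2(N / n)


def k_expected(n: float, N: int, K: int) -> float:
    """Value of k at which the two assumed quantities coincide (k = K n / N)."""
    return K * n / N


# Example parameters taken from the RESULTS section (N = 10^6, K = 65536);
# the number of sites n = 1000 is an arbitrary illustrative choice.
N, K, n = 10**6, 65536, 1000
k = k_expected(n, N, K)
print("Rfrequency:", r_frequency(n, N))
print("Rsequence at k = Kn/N:", r_sequence(k, K))
```

Under these assumed forms, the two printed values agree up to floating-point error, which is simply equation-level bookkeeping for the equality condition rather than a statement about which k is most probable.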



RESULTS

Fig. 1 shows results of an analysis based on η for genome length N = 10^6 and binding site word length l = 10 (hence, K = 65536). The surface plot in Fig. 1 may appear quite even-levelled, but one should notice the logarithmic scale: the η values span four million decimal orders of magnitude. Thus, the probability of observing k values other than the one maximizing η for a given n practically vanishes. For each value of n, the maximal η value is highlighted by a diamond.
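To illustrate how the maximizing k can be located for a given n, the sketch below assumes that each genome position may hold any of the K words independently, so that the tuple count takes the closed form C(K, k) · C(N, n) · k^n · (K − k)^(N − n). This closed form is an assumption made for illustration, not a formula quoted from the text, and the function names are hypothetical. Logarithms are used throughout because the counts far exceed floating-point range.

```python
import math

LN10 = math.log(10.0)


def log10_binom(a: int, b: int) -> float:
    """log10 of the binomial coefficient C(a, b), computed via log-gamma."""
    return (math.lgamma(a + 1) - math.lgamma(b + 1)
            - math.lgamma(a - b + 1)) / LN10


def log10_eta(n: int, k: int, N: int, K: int) -> float:
    """log10 of the ASSUMED count C(K,k) * C(N,n) * k^n * (K-k)^(N-n).

    Valid for 0 < k < K and 0 <= n <= N.
    """
    return (log10_binom(K, k) + log10_binom(N, n)
            + n * math.log10(k) + (N - n) * math.log10(K - k))


def argmax_k(n: int, N: int, K: int) -> int:
    """Exhaustively scan 0 < k < K for the k maximizing the assumed count."""
    return max(range(1, K), key=lambda k: log10_eta(n, k, N, K))


if __name__ == "__main__":
    N, K = 10**6, 65536  # parameter values quoted in the text
    for n in (10, 100, 1000, 10000):  # illustrative choices of n
        k_star = argmax_k(n, N, K)
        print(f"n={n}: maximizing k={k_star}, K*n/N={K * n / N:.2f}")
```

Comparing the maximizing k with Kn/N for each n gives a direct numerical check of how far the most common tuples lie from the equality line under this assumed count.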


Figure 1: Top: plot of log10(η) for a = 4, N = 10^6 and l = 10; bottom left: coordinates of maximal η values for each n value (diamonds) and graph of k = Kn/N (line); bottom right: plot of the (Rfrequency, Rsequence) values calculated from the coordinates plotted in the bottom left panel according to equation 1 (asterisks) and graph of Rsequence = Rfrequency (line).

The bottom left plot displays the coordinates of these maxima on the n, k plane, showing a clear and significant deviation from the line expected if Rsequence = Rfrequency (equation 2). The bottom right plot in Fig. 1 reveals the discrepancies between Rsequence and Rfrequency directly. Here, the n values shown in the bottom left plot were translated into Rfrequency values and the corresponding k values that maximize η were translated into Rsequence values according to equation 1. The deviation from Rsequence = Rfrequency is particularly prominent in the range of larger Rfrequency values. This finding is especially interesting, as binding site frequencies are usually on the order of 10^-3 or below, so cases of Rfrequency > 8 are biologically most relevant.

In summary, for genome and binding site sizes in the order of magnitude encountered in prokaryotic systems, our model predicts substantial deviations from Rsequence = Rfrequency.


DISCUSSION

Our results call for explanations in two respects. In a theoretical respect, the question arises why previous analyses implied that Rsequence ≈ Rfrequency was to be expected. Unlike previous models, our model explicitly includes the space of protein binding behaviours within the state space. The deviations from Rfrequency = Rsequence which we have observed with our model are to be ascribed to evolutionary effects originating from the protein side. More detailed analyses of these effects are currently underway.

In an empirical respect, our findings call for revisiting the cases in which Rsequence ≈ Rfrequency was observed, paying particular attention to deviations from equality and possible regularities detectable therein. Such analyses may provide information about the biological structure of the influence which DNA binding proteins have on the information content of their binding sites.

In a longer perspective, we expect this direction of research to lead to a deepened understanding of the evolutionary biological forces shaping protein-DNA interactions, which in turn may serve as a basis for developing tools with improved performance for the detection of biologically significant binding sites and for the analysis and characterization of regulatory mechanisms and networks.


REFERENCES

  1. Gary D. Stormo and Dana S. Fields. Specificity, free energy and information content in protein-DNA interactions. TIBS, 23: 109-113, 1998.
  2. Kornelie Frech, Kerstin Quandt and Thomas Werner. Software for the analysis of DNA sequence elements of transcription. CABIOS, 13: 89-97, 1997.
  3. Thomas D. Schneider. Evolution of biological information. Nucleic Acids Research, 28: 2794-2799, 2000.