In Silico Biology 4, 0040 (2004); ©2004, Bioinformation Systems e.V.  


Annotating significant pairs of transcription factor binding sites in regulatory DNA

Katja Rateitschak1*, Tobias Müller2 and Martin Vingron




Max Planck Institute for Molecular Genetics
Ihnestr. 73, D-14195 Berlin, Germany
Email: Katja.Rateitschak@molgen.mpg.de, Tobias.Mueller@molgen.mpg.de, Martin.Vingron@molgen.mpg.de

1 Present address: Max Planck Institute of Colloids and Interfaces, D-14424 Potsdam, Germany
2 Present address: Department of Bioinformatics, Biozentrum, Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany




*  Corresponding author





Edited by E. Wingender; received May 21, 2004; revised and accepted August 10, 2004; published August 20, 2004



Abstract

In the presented work we search for transcription factor binding sites (BS) by including additional information about typical BS patterns. The new proposed score combines the ordinary profile score based on TRANSFAC-matrices together with a score based on pairs of BS. The latter score positively weights pairs of BS that tend to occur together in many regulatory DNA-sequences, in contrast to a random background model. The empirical BS pair frequencies result from our evaluation of a large dataset of orthologous genes.

Key words: recognition of genes and regulatory elements, transcription factor binding site, score matrix



Introduction

Vital processes in cells require a coordinated regulation of gene transcription. Gene transcription is controlled by protein complexes embedding transcription factors which bind to specific sequence patterns in regulatory DNA. These transcription factor binding sites (BS) occur in combinations of core sites and sites that are specific for groups of genes in consideration of order, distance and strand orientation. Therefore investigating the organization of BS in regulatory DNA is essential for understanding transcriptional regulation.

Binding sites are patterns of conserved and non conserved nucleotides, and can be described by TRANSFAC-matrices [1]. Searching BS with TRANSFAC-matrices in regulatory DNA leads to many putative sites covering biological BS and random hits. A current line of research analyzes combinations of BS to improve the discrimination of biological signals from noise. Pairs of BS have been identified in expression data [2, 3, 4] and in tissue data [5]. Combinations of more than two BS have been studied by refining an initial model of a BS-combination [6, 7, 8] and by applying a genetic algorithm [9]. More general approaches applicable to all transcription factors and to whole genomes have been analyzed in [10, 11]. In [10] pairs of BS in a fixed window are scored against a random background model and in [11] combinations of BS are scored according to their distance.

Today many genomes have been completely sequenced which allows to apply the concept of phylogenetic footprinting on orthologous genes of different organisms [12]. This concept identifies DNA with regulatory function as conserved subsequences in noncoding DNA due to the conjecture that for orthologous genes the transcriptional regulation should be conserved. Thus conserved BS in noncoding DNA have a higher significance for biological function than nonconserved BS. This is an advantage in higher organisms where the noncoding DNA can have a length up to 15kB and conserved subsequences form about 10 percent. Combinations of BS in conserved regulatory DNA have been analyzed in [13, 14, 15, 16].

In this work we search for significant pairs of BS in a large sequence set of conserved regulatory DNA of man and mouse. We obtain the set of significant pairs of BS by the following steps: First we label conserved BS in these sequences with TRANSFAC-matrices. Then we search in this set for pairs. We propose a score to quantify which pairs of BS prefer to occur together in comparison with a random background model. As a result we obtain a set of significant pairs of BS which is of general validity for human and mouse genes. Finally we apply our results to annotate significant pairs of BS in regulatory DNA with labelled putative BS. The annotated significant pairs are stronger signals for biological function than the single sites.

Significant pairs of BS in the whole human genome have been analyzed also in [10]. However, our approach differs by the following properties: We search for pairs in conserved DNA, we take into account pairs which are formed by the same factors, we perform the counting of pairs such that pairs are significant for many genes, we use a different definition of the scores for binding sites and pairs of binding sites and we apply our results to annotate significant pairs of BS in regulatory DNA. We describe our approach in detail under "Data and method". Under "Annotation of significant pairs", we apply our approach to muscle-specific genes [13]. At the end, conclusions will be drawn from the results obtained.



Data and method


Transcription factor binding sites

The TRANSFAC database [1] contains matrices which describe profiles of known BS of transcription factors. With these matrices one can search for putative BS in regulatory DNA. A quantitative measure for the similarity between a sequence pattern s = (s1,...,sn) in the regulatory DNA and a TRANSFAC-profile is the score SBS defined as
(1)
where pi(si) is the probability for nucleotide si at position i based on the TRANSFAC-profile and b(si) is the background probability for nucleotide si. The b(si) have been assumed to be equidistributed so that the score is independent from the total number of sequences and can also be compared with scores in sequences which do not belong to our sequence set 3.

The score SBS can be transformed to a score
(2)
with zero mean and unit variance for all matrices. For each matrix and have been computed from a random background sequence with an equidistribution for the nucleotides probabilities. This roughly allows to compare scores of different matrices.


Conserved sequences and putative binding sites

Our sequence set is composed of conserved regulatory sequences of man-mouse orthologs from the CORG database [17]. First we removed redundant sequence-pairs, i. e. all pairs where a gene occurred several times. Then we searched with all human and mouse TRANSFAC-matrices (release 6.2) for BS in the resulting 7574 sequence-pairs. We collected candidate sites which score higher than a fixed threshold according to Eq. (2) and which are conserved between man and mouse. The threshold has the value 4. For this threshold we found the majority of a set of biological sites in human muscle genes as shown in Table 1. The number of putative BS increases by more the twofold for a threshold of 3.8.

Table 1: Experimental and putative BS in human muscle genes for a score threshold = 4.
TF Experimental BS
found
Experimental BS
not found
Putative BS
SRF 10 2 14
MEF 9 2 41
MYOGENIN 9 6 28



Pairs of binding sites

Next we search in the collection of candidate sites for pairs of BS. We are interested in significant coocurrences of pairs. We therefore introduce a score as
(3)
with
  and   (4)
where #pairi, j is the number of pairs of the BS i and j, mij is the pair probability and πi and πj are the associated marginals. A positive score Sij > 0 indicates that the two factors i and j prefer to occur together.

We search the pairs of BS in a window of 10 to 50 base pairs4. The minimal distance of 10 bp is necessary to avoid pairs formed by overlapping BS. We have chosen as maximal distance 50 bp due to the fact that biological BS frequently form closely neighbored clusters. We are interested in pairs which regulate many genes. Thus we count the number of genes in which the pair has been found instead of counting the total number of occurrences of a pair. By this way of counting we avoid that a highly scoring pair could result from a high number of pairs in only one gene. Furthermore we take into account that a factor could prefer to occur without a neighbor. These frequencies divided by two are stored in the last column and last row of the matrix mij. We distinguish the order of the TFs in a pair, i. e. upstream–TF1–TF2–downstream is another pairs as upstream–TF2–TF1–downstream. The strand orientation is neglected. The pair frequency mi, j has been annotated only for pairs occurring more often than five times to avoid a high value for Sij from a low pair frequency and low marginals.

Figure 1 shows a score-matrix which summarizes the scores Sij for all pairs of transcription factors obtained from our analysis for the following parameters: organism human, conserved BS, threshold for SBS is equal 4, ordered pairs, window 10 to 50 base pairs. The dark blue background is related to pairs observed less than six times which form 99 percent of all possible pairs. Ninety percent of 580 frequently occurring pairs have a score Sij > 0. The figure depicts that some transcription factors form pairs with only one or few other factors, like SRF. Other transcription factors form pairs with many different factors, like MAZ, MEF2 and POU3F2. Frequently a pair is formed by twice the same factor. The twin-pairs (SRF,SRF), (NRF1,NRF1) and (CHX10,CHX10) belong to the pairs with the highest scores Sij > 2. The BS in all pairs are nonoverlapping because their minimal distance is 10 bp. Looking at regulatory sequences annotated with twin-pairs revealed that they partially have three, four or more than ten binding sites of this TF. However, such clusters of BS exists also for a TF where the twin-pair has a low score as in the case of (MYOGENIN,MYOGENIN). The names for the factors we use in our paper are exactly the names of the TRANSFAC-matrices, because in several cases a matrix describes binding site profiles of different factors.



Figure 1: Score-matrix for pairs of BS.


The lines in Figure 2 connect all pairs of transcription factors with Sij > 1. Ten pairs which are connected by red colored lines are known interacting factors. For Sij > 0 we found in total 40 pairs of known interacting factors. The majority of factors is connected by black lines showing that our method has identified many new preferred pairs. Several factors form pairs with different partners and in some cases the partners form pairs too. Thus one could also check for preferred triples. Note that in these figures there is still redundancy because some TRANSFAC-matrices have similar profiles which could be removed by clustering of the matrices.



Figure 2: Graph of pairs of BS with Sij > 1



Comparison of pair scores for different parameters

In this section we compare the results for the pair score Sij for different sets of parameters.

The top-ranking pairs for the parameters: organism human, conserved BS, threshold for SBS is equal 4, ordered pairs, window 10 to 50 bp which were discussed in the previous subsection are listed in the first column of Table 2.


Table 2: Top-ranking pairs in human and mouse genome with pair score.
Conserved sequences
man
Pair score Whole sequences
man
Pair score Whole sequences
mouse
Pair score
SRF,SRF 2.7 NRF1,NRF1 2.8 TB,COUP 2.9
NRF1,NRF1 2.2 MTF1,NRF1 2.2 NRF1,NRF1 2.7
CHX10,CHX10 2.1 CF,HOX13 2.0 BRN2,COUP 2.6
ATF6,CREB 1.9 GATA4,RSRFC4 1.9 COUP,TM 2.3
CREB,CREB 1.9 MTF1,MTF1 1.8 PITX2,ERR1 2.3
ATF,MYCMAX 1.8 GABP,NRF1 1.8 COUP,TB 2.2
ATF6,MYCMAX 1.8 NRSF,PAX5 1.8 ERR1,PITX2 2.2
ATF6,MAX 1.7 STAT1,CREB 1.8 PITX2,GNCF 2.0
GABP,GABP 1.7 ELK,GABP 1.8 ARP1,GATA1 1.9
MAX,SREBP1 1.7 RREB1,CMYB 1.7 MAZR,RREB1 1.8
Abbreviations: CB, COREBINDINGFACTOR; CF, CACCCbindingfactor; CJ, CREBP1CJUN; CP, CACBINDINGPROTEIN; TB, TALIBETAITF2; TM, TCF11MAFG; CS, CETS1P54.


The top-ranking pairs in the nonconserved noncoding DNA (281 sequence pairs) are different from the top ranking pairs in the conserved DNA as shown in column two and three of Table 2. This is a further evidence that top-ranking pairs in the conserved regulatory DNA have a conserved regulatory function because pairs in nonconserved DNA are less significant for regulatory function. However, top-ranking pairs in nonconserved noncoding DNA could be formed by organism specific factors, like MTF1 for man genes and COUP for mouse genes but TRANSFAC indicates activity of these TFs in both organisms. An exception is the pair (NRF1,NRF1) which belongs to the top-ranking pairs in all columns5. Note that two of the nine pairs of (NRF1,NRF1) in the nonconserved regions occur in the same genes of man.

The top-ranking pairs disregarding the BS-order6 are shown in the first column of Table 3. One partially recovers the pairs of the first column of table one but also many new pairs. The pair score Sij is lower than above. Only fourteen percent of 1064 annotated pairs have a positive pair score Sij. Twin pairs (i,i) are rare in the top-ranking pairs in contrast to above. Neglecting the order of pairs leads to an increase of mij for i j and πi whereas mii remains constant. As a consequence Sii decreases stronger than Sij.


Table 3: Top-ranking pairs in human and mouse genome with pair score, comparison of different parameters.
No order Pair score Threshold = 3.8 Pair score Window 200 bp Pair score
AP1,TM 1.6 NRF1,NRF1 2.4 NRF1,NRF1 2.3
ARNT,ATF 1.4 CS,ELK1 2.3 NFE2,NFE2 2.3
SRF,SRF 1.4 SP3,TAXCREB 2.2 CJ,ATF6 2.0
CJ,MAX 1.2 CREB,ARNT 2.1 MAX,ARNT 2.0
CJ,CREB 1.2 ATF,ARNT 2.0 MYCMAX,ARNT 2.0
CJ,MYCMAX 1.1 SREBP1,CJ 1.9 SOX5,ALX4 2.0
ATF6,MAX 1.0 GABP,CS 1.9 SRF,SRF 1.9
NRF1,NRF1 1.0 CS,GABP 1.9 CJ,CREB 1.9
ATF6,CREB 0.9 ARNT,CREB 1.8 ATF3,ATF6 1.9
ATF,MYCMAX 0.9 CS,STAT1 1.8 OCT,ARP1 1.9
Abbreviations see Table 2.


A TF can have a weaker protein-DNA binding pattern caused by an interaction with other transcription factors. We have sought for such pairs by reducing the BS-threshold to 3.8. Many new top-pairs arise for this lower score-threshold as shown in the second column of Table 3. The best candidate for a weaker DNA binding pattern are the pairs (CREB,ARNT) where the number of pairs increased by a factor 4 from 3 to 12 pairs whereas the marginals increase by a factor 3 and (ARNT,CREB) where the number of pairs increase by a factor 8 from 1 to 8 pairs and the marginals by factors 3 and 4. We have not found any information in pubmed that these two factors can interact.

In addition we have studied whether there are pairs which prefer to occur in a larger distance. Significant pairs in a window 10-200 bp are shown in the third column of Table 3 and contain many new pairs. We have not found a candidate for a preferred larger distance among the top-ten pairs because for all pairs the number of the pairs and the marginals increase with approximately the same rate.

We have compared our sets of significant pairs with the results for significant pairs in the whole human genome [10]. Pairs formed by the same factor which are significant in our results were not taken into account in a definition of a pair in [10]. We have checked whether the significant pairs reported in Table 1 and 2 in [10] occur also in our sets of significant pairs. We recovered only a small number of these pairs in our significant pairs as summarized in Table 4. A reason for these differences could be the different scoring of the BS and the pairs and the different counting of pairs in the two methods.


Table 4: Significant pairs of Hannenhalli and Levy [10] in our results.
No order Pair score Threshold = 3.8 Pair score Window 200 bp Pair score
SREB,TAXCREB 0.86 SREBP1,CJ
TAXCREB,SREB
CJ,SREBP1
CREB,SREBP1
SREBP1,CREB
CREBP1,SREBP1
SREBP1,TAXCREB
SREBP1,CREBP1
1.93
1.39
1.36
1.09
1.00
0.79
0.70
0.46
CREBP1,MEF2
MEF2,CREBP1
0.52
0.02
Abbreviations see Table 2.




Annotation of significant pairs


First we have annotated pairs of BS with Sij > 0 in our sequence-set of conserved regulatory DNA. The results are shown on our web-page.

Furthermore our results can be used to annotate pairs of BS in an arbitrary regulatory DNA with labelled putative BS. We show this for muscle specific genes in this section. The threshold for BS has the value 3.8 in the following however the BS-threshold for the significant pairs has the more conservative value 4 according to the results in the section "Pairs of binding sites", see Figure 1 and Table 1. The other parameters are the same as in the section "Pairs of binding sites".


Muscle-specific genes

Wassermann et al. have studied muscle specific conserved regulatory sequences of man and mouse [13]. Their results for BS of muscle specific factors are in good agreement with experimentally confirmed BS and in addition they have predicted new sites.

We have searched for putative BS in these sequences with all TRANSFAC-matrices. Then we have annotated the pairs of BS which are significant in the whole human and mouse conserved DNA shown in Figure 1 in the set of putative BS in the muscle sequences. Our results are summarized in Table 5. The TF AP4 occurs in the original data instead of MYOGENIN in the genes 3, 10, 53 and 60. Frequently the BS of AP4 and MYOGENIN overlap due to similar matrix-profiles. Thus we have noted for these genes the muscle-specific factor in the table. In addition experimentally confirmed BS are marked boldface.


Table 5: Human muscle-specific genes: annotated pairs of BS in conserved regulatory sequence.
Gene Significant pairs
2 AREB6, STAT1
3 MYOGENIN, MYOGENIN
7 SRF,SRF; SRF,SRF
8 HMEF2,MYOGENIN
10 MAZ,ATF3; HMEF2,MYOGENIN; MYOD,MAZ
15 RSRFC4,MAZ
19 ALX4,MEF2
20 HEB,MEF2; MEF2,MYOGENIN
26 SRF,SRF
32 MYOGENIN, MYOGENIN
36 TFIII,AREB
39 ALPHACP1,AP1
50 MEF2,MEF2
53 MYOGENIN, MYOGENIN; SRF,SRF
60 CP,E47; HMEF2,MYOGENIN; RORA1,AREB6
Bold: experimentally confirmed binding site.


Our method has found known experimental sites and it could predict new sites. We found conserved BS in 25 genes and we could annotate significant pairs in 15 genes. The 22 annotated pairs contain 13 experimentally confirmed BS. Significant pairs of 10 genes are composed of muscle specific factors. The muscle-specific pairs are (SRF,SRF), (MYOGENIN,MYOGENIN), (MEF2,MEF2) and (MEF2,MYOGENIN).



Conclusions

In this work we have introduced a refined method to search for BS by calculating a score for pairs of BS. This score positively weights a pair of BS which prefer to occur together in a large sequence set of conserved regulatory DNA of human and mouse in comparison with a random background model. As a result we have obtained a set of significant pairs of BS which contains known interacting pairs, but also many new pairs. It can be used to annotate significant pairs of BS in any arbitrary regulatory DNA with labelled putative BS. This refined searching for BS improves the discrimination between biological sites and random sites.

We have applied our results to annotate significant pairs in muscle-specific genes. The 22 annotated pairs contain 13 experimentally confirmed binding sites. In future work we will improve the scoring of putative BS [18] and will augment the scoring scheme to include information on the position, on strand orientation, and on the length of the intermediate sequence between BS. The challenges are the functional form of the score and the weighting of the individual pieces of information.

Availability: Full list of results at http://www.molgen.mpg.de/~rateit_k/data.html.




Acknowledgements

We acknowledge support from DFG in Sonderforschungsbereich 618.




References


  1. Wingender, E., Dietze, P., Karas, H. and Knüppel, R. (1996). TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24, 238-241.

  2. Pilpel, Y., Sudarsanam, P. and Church, G. M. (2001). Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet. 29, 153-159.

  3. GuhaThakurta, D. and Stormo, G. D. (2001). Identifying target sites for cooperatively binding factors. Bioinformatics 17, 608-621.

  4. Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B., Wyrick, J. J., Tagne, J. B., Volkert, T. L., Fraenkel, E., Gifford, D. K. and Young, R. A. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799-804.

  5. Wasserman, W. W. and Fickett, J. W. (1998). Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol. 278, 167-181.

  6. Klingenhoff, A., Frech, K., Quandt, K. and Werner, T. (1999). Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity. Bioinformatics 15, 180-186.

  7. Frith, M. C., Li, M. C. and Weng, Z. (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Res. 31, 3666-3668.

  8. Halfon, M. S., Grad, Y., Church, G. M. and Michelson A. M. (2002). Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res. 12, 1019-1028.

  9. Kel-Margoulis, O. V., Ivanova T. G., Wingender, E. and Kel, A. E. (2002). Automatic annotation of genomic regulatory sequences by searching for composite clusters. Pac. Symp. Biocomp. 7, 187-198.

  10. Hannenhalli, S. and Levy, S. (2002). Predicting transcription factor synergism. Nucleic Acids Res. 30, 4278-4284.

  11. Wagner, A. (1999). Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics 99, 776-784.

  12. Hardison, R. C., Oeltjen, J. and Miller, W. (1997). Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res. 8, 959-966.

  13. Wasserman, W. W., Palumbo, M., Thompson, W., Fickett, J. W. and Lawrence, C. E. (2000). Human-mouse genome comparisons to locate regulatory sites. Nat. Genet. 26, 225-228.

  14. Levy, S., Hannenhalli, S. and Workman, W. (2001). Enrichment of regulatory signals in non-coding genomic sequences. Bioinformatics 17, 871-877.

  15. Loots, G. G., Ovcharenko, I., Pachter, L., Dubchak, I. and Rubin, E. M. (2002). rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res. 12, 832-839.

  16. Sharan, R., Ovcharenko, I., Ben-Hur, A. and Karp, R. M. (2003). CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 19 Suppl. 1, i283-i291.

  17. Dieterich, C., Wang, H., Rateitschak, K., Luz, H. and Vingron, M. (2003). CORG: A database for COmparative Regulatory Genomics. Nucleic Acids Res. 31, 55-57.

  18. Rahmann, S., Müller, T. and Vingron, M. (2003). On the power of profiles for transcription factor binding site detection. Stat. Appl. Genet. Mol. Biol. 2, Article 7.






Footnotes:

3 Besides the nucleotide distribution in our sequence set is close to an equidistribution (pA, pC, pG, pT)= (0.26, 0.23, 0.24, 0.27)

4 As distance we have defined the distance between the centers of the matrices.

5 The consensus sequence of the matrix NRF1_Q6 is CGCRTGCGCR. It could be possible that these sites are non-masked CpG islands.

6 (TF1,TF2) is counted as the same as (TF2,TF1).