| In Silico Biology 4, 0040 (2004); ©2004, Bioinformation Systems e.V. |
Max Planck Institute for Molecular Genetics
Ihnestr. 73, D-14195 Berlin, Germany
Email: Katja.Rateitschak@molgen.mpg.de, Tobias.Mueller@molgen.mpg.de, Martin.Vingron@molgen.mpg.de
1 Present address: Max Planck Institute of Colloids and Interfaces, D-14424 Potsdam, Germany
2 Present address: Department of Bioinformatics, Biozentrum, Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany
* Corresponding author
Edited by E. Wingender; received May 21, 2004; revised and accepted August 10, 2004; published August 20, 2004
In the presented work we search for transcription factor binding sites (BS) by including additional information about typical BS patterns. The new proposed score combines the ordinary profile score based on TRANSFAC-matrices together with a score based on pairs of BS. The latter score positively weights pairs of BS that tend to occur together in many regulatory DNA-sequences, in contrast to a random background model. The empirical BS pair frequencies result from our evaluation of a large dataset of orthologous genes.
Key words: recognition of genes and regulatory elements, transcription factor binding site, score matrix
Vital processes in cells require a coordinated regulation of gene transcription. Gene transcription is controlled by protein complexes embedding transcription factors which bind to specific sequence patterns in regulatory DNA. These transcription factor binding sites (BS) occur in combinations of core sites and sites that are specific for groups of genes in consideration of order, distance and strand orientation. Therefore investigating the organization of BS in regulatory DNA is essential for understanding transcriptional regulation.
Binding sites are patterns of conserved and non conserved nucleotides, and can be described by TRANSFAC-matrices [1]. Searching BS with TRANSFAC-matrices in regulatory DNA leads to many putative sites covering biological BS and random hits. A current line of research analyzes combinations of BS to improve the discrimination of biological signals from noise. Pairs of BS have been identified in expression data [2, 3, 4] and in tissue data [5]. Combinations of more than two BS have been studied by refining an initial model of a BS-combination [6, 7, 8] and by applying a genetic algorithm [9]. More general approaches applicable to all transcription factors and to whole genomes have been analyzed in [10, 11]. In [10] pairs of BS in a fixed window are scored against a random background model and in [11] combinations of BS are scored according to their distance.
Today many genomes have been completely sequenced which allows to apply the concept of phylogenetic footprinting on orthologous genes of different organisms [12]. This concept identifies DNA with regulatory function as conserved subsequences in noncoding DNA due to the conjecture that for orthologous genes the transcriptional regulation should be conserved. Thus conserved BS in noncoding DNA have a higher significance for biological function than nonconserved BS. This is an advantage in higher organisms where the noncoding DNA can have a length up to 15kB and conserved subsequences form about 10 percent. Combinations of BS in conserved regulatory DNA have been analyzed in [13, 14, 15, 16].
In this work we search for significant pairs of BS in a large sequence set of conserved regulatory DNA of man and mouse. We obtain the set of significant pairs of BS by the following steps: First we label conserved BS in these sequences with TRANSFAC-matrices. Then we search in this set for pairs. We propose a score to quantify which pairs of BS prefer to occur together in comparison with a random background model. As a result we obtain a set of significant pairs of BS which is of general validity for human and mouse genes. Finally we apply our results to annotate significant pairs of BS in regulatory DNA with labelled putative BS. The annotated significant pairs are stronger signals for biological function than the single sites.
Significant pairs of BS in the whole human genome have been analyzed also in [10]. However, our approach differs by the following properties: We search for pairs in conserved DNA, we take into account pairs which are formed by the same factors, we perform the counting of pairs such that pairs are significant for many genes, we use a different definition of the scores for binding sites and pairs of binding sites and we apply our results to annotate significant pairs of BS in regulatory DNA. We describe our approach in detail under "Data and method". Under "Annotation of significant pairs", we apply our approach to muscle-specific genes [13]. At the end, conclusions will be drawn from the results obtained.
Transcription factor binding sites
The TRANSFAC database [1] contains matrices which describe profiles of known BS of transcription factors. With these matrices one can search for putative BS in regulatory DNA. A quantitative measure for the similarity between a sequence pattern s = (s1,...,sn) in the regulatory DNA and a TRANSFAC-profile is the score SBS defined as
![]() | (1) |
The score SBS can be transformed to a score
![]() | (2) |

and 
have been computed from a random background sequence with an equidistribution for the nucleotides probabilities. This roughly allows to compare scores of different matrices.
Conserved sequences and putative binding sites
Our sequence set is composed of conserved regulatory sequences of man-mouse orthologs from the CORG database [17]. First we removed redundant sequence-pairs, i. e. all pairs where a gene occurred several times. Then we searched with all human and mouse TRANSFAC-matrices (release 6.2) for BS in the resulting 7574 sequence-pairs. We collected candidate sites which score higher than a fixed threshold according to Eq. (2) and which are conserved between man and mouse. The threshold has the value 4. For this threshold we found the majority of a set of biological sites in human muscle genes as shown in Table 1. The number of putative BS increases by more the twofold for a threshold of 3.8.
| Table 1: Experimental and putative BS in human muscle genes for a score threshold = 4. |
| TF | Experimental BS found |
Experimental BS not found |
Putative BS |
|---|---|---|---|
| SRF | 10 | 2 | 14 |
| MEF | 9 | 2 | 41 |
| MYOGENIN | 9 | 6 | 28 |
Pairs of binding sites
Next we search in the collection of candidate sites for pairs of BS. We are interested in significant coocurrences of pairs. We therefore introduce a score as
![]() | (3) |
![]() and | (4) |
We search the pairs of BS in a window of 10 to 50 base pairs4. The minimal distance of 10 bp is necessary to avoid pairs formed by overlapping BS. We have chosen as maximal distance 50 bp due to the fact that biological BS frequently form closely neighbored clusters. We are interested in pairs which regulate many genes. Thus we count the number of genes in which the pair has been found instead of counting the total number of occurrences of a pair. By this way of counting we avoid that a highly scoring pair could result from a high number of pairs in only one gene. Furthermore we take into account that a factor could prefer to occur without a neighbor. These frequencies divided by two are stored in the last column and last row of the matrix mij. We distinguish the order of the TFs in a pair, i. e. upstream–TF1–TF2–downstream is another pairs as upstream–TF2–TF1–downstream. The strand orientation is neglected. The pair frequency mi, j has been annotated only for pairs occurring more often than five times to avoid a high value for Sij from a low pair frequency and low marginals.
Figure 1 shows a score-matrix which summarizes the scores Sij for all pairs of transcription factors obtained from our analysis for the following parameters: organism human, conserved BS, threshold for SBS is equal 4, ordered pairs, window 10 to 50 base pairs. The dark blue background is related to pairs observed less than six times which form 99 percent of all possible pairs. Ninety percent of 580 frequently occurring pairs have a score Sij > 0. The figure depicts that some transcription factors form pairs with only one or few other factors, like SRF. Other transcription factors form pairs with many different factors, like MAZ, MEF2 and POU3F2. Frequently a pair is formed by twice the same factor. The twin-pairs (SRF,SRF), (NRF1,NRF1) and (CHX10,CHX10) belong to the pairs with the highest scores Sij > 2. The BS in all pairs are nonoverlapping because their minimal distance is 10 bp. Looking at regulatory sequences annotated with twin-pairs revealed that they partially have three, four or more than ten binding sites of this TF. However, such clusters of BS exists also for a TF where the twin-pair has a low score as in the case of (MYOGENIN,MYOGENIN). The names for the factors we use in our paper are exactly the names of the TRANSFAC-matrices, because in several cases a matrix describes binding site profiles of different factors.
The lines in Figure 2 connect all pairs of transcription factors with Sij > 1. Ten pairs which are connected by red colored lines are known interacting factors. For Sij > 0 we found in total 40 pairs of known interacting factors. The majority of factors is connected by black lines showing that our method has identified many new preferred pairs. Several factors form pairs with different partners and in some cases the partners form pairs too. Thus one could also check for preferred triples. Note that in these figures there is still redundancy because some TRANSFAC-matrices have similar profiles which could be removed by clustering of the matrices.
Comparison of pair scores for different parameters
In this section we compare the results for the pair score Sij for different sets of parameters.
The top-ranking pairs for the parameters: organism human, conserved BS, threshold for SBS is equal 4, ordered pairs, window 10 to 50 bp which were discussed in the previous subsection are listed in the first column of Table 2.
| Table 2: Top-ranking pairs in human and mouse genome with pair score. |
| Conserved sequences man |
Pair score | Whole sequences man |
Pair score | Whole sequences mouse |
Pair score |
|---|---|---|---|---|---|
| SRF,SRF | 2.7 | NRF1,NRF1 | 2.8 | TB,COUP | 2.9 |
| NRF1,NRF1 | 2.2 | MTF1,NRF1 | 2.2 | NRF1,NRF1 | 2.7 |
| CHX10,CHX10 | 2.1 | CF,HOX13 | 2.0 | BRN2,COUP | 2.6 |
| ATF6,CREB | 1.9 | GATA4,RSRFC4 | 1.9 | COUP,TM | 2.3 |
| CREB,CREB | 1.9 | MTF1,MTF1 | 1.8 | PITX2,ERR1 | 2.3 |
| ATF,MYCMAX | 1.8 | GABP,NRF1 | 1.8 | COUP,TB | 2.2 |
| ATF6,MYCMAX | 1.8 | NRSF,PAX5 | 1.8 | ERR1,PITX2 | 2.2 |
| ATF6,MAX | 1.7 | STAT1,CREB | 1.8 | PITX2,GNCF | 2.0 |
| GABP,GABP | 1.7 | ELK,GABP | 1.8 | ARP1,GATA1 | 1.9 |
| MAX,SREBP1 | 1.7 | RREB1,CMYB | 1.7 | MAZR,RREB1 | 1.8 |
| Abbreviations: CB, COREBINDINGFACTOR; CF, CACCCbindingfactor; CJ, CREBP1CJUN; CP, CACBINDINGPROTEIN; TB, TALIBETAITF2; TM, TCF11MAFG; CS, CETS1P54. |
The top-ranking pairs in the nonconserved noncoding DNA (281 sequence pairs) are different from the top ranking pairs in the conserved DNA as shown in column two and three of Table 2. This is a further evidence that top-ranking pairs in the conserved regulatory DNA have a conserved regulatory function because pairs in nonconserved DNA are less significant for regulatory function. However, top-ranking pairs in nonconserved noncoding DNA could be formed by organism specific factors, like MTF1 for man genes and COUP for mouse genes but TRANSFAC indicates activity of these TFs in both organisms. An exception is the pair (NRF1,NRF1) which belongs to the top-ranking pairs in all columns5. Note that two of the nine pairs of (NRF1,NRF1) in the nonconserved regions occur in the same genes of man.
The top-ranking pairs disregarding the BS-order6 are shown in the first column of Table 3. One partially recovers the pairs of the first column of table one but also many new pairs. The pair score Sij is lower than above. Only fourteen percent of 1064 annotated pairs have a positive pair score Sij. Twin pairs (i,i) are rare in the top-ranking pairs in contrast to above. Neglecting the order of pairs leads to an increase of mij for i
j and πi whereas mii remains constant. As a consequence Sii decreases stronger than Sij.
| Table 3: Top-ranking pairs in human and mouse genome with pair score, comparison of different parameters. |
| No order | Pair score | Threshold = 3.8 | Pair score | Window 200 bp | Pair score |
|---|---|---|---|---|---|
| AP1,TM | 1.6 | NRF1,NRF1 | 2.4 | NRF1,NRF1 | 2.3 |
| ARNT,ATF | 1.4 | CS,ELK1 | 2.3 | NFE2,NFE2 | 2.3 |
| SRF,SRF | 1.4 | SP3,TAXCREB | 2.2 | CJ,ATF6 | 2.0 |
| CJ,MAX | 1.2 | CREB,ARNT | 2.1 | MAX,ARNT | 2.0 |
| CJ,CREB | 1.2 | ATF,ARNT | 2.0 | MYCMAX,ARNT | 2.0 |
| CJ,MYCMAX | 1.1 | SREBP1,CJ | 1.9 | SOX5,ALX4 | 2.0 |
| ATF6,MAX | 1.0 | GABP,CS | 1.9 | SRF,SRF | 1.9 |
| NRF1,NRF1 | 1.0 | CS,GABP | 1.9 | CJ,CREB | 1.9 |
| ATF6,CREB | 0.9 | ARNT,CREB | 1.8 | ATF3,ATF6 | 1.9 |
| ATF,MYCMAX | 0.9 | CS,STAT1 | 1.8 | OCT,ARP1 | 1.9 |
| Abbreviations see Table 2. |
A TF can have a weaker protein-DNA binding pattern caused by an interaction with other transcription factors. We have sought for such pairs by reducing the BS-threshold to 3.8. Many new top-pairs arise for this lower score-threshold as shown in the second column of Table 3. The best candidate for a weaker DNA binding pattern are the pairs (CREB,ARNT) where the number of pairs increased by a factor 4 from 3 to 12 pairs whereas the marginals increase by a factor 3 and (ARNT,CREB) where the number of pairs increase by a factor 8 from 1 to 8 pairs and the marginals by factors 3 and 4. We have not found any information in pubmed that these two factors can interact.
In addition we have studied whether there are pairs which prefer to occur in a larger distance. Significant pairs in a window 10-200 bp are shown in the third column of Table 3 and contain many new pairs. We have not found a candidate for a preferred larger distance among the top-ten pairs because for all pairs the number of the pairs and the marginals increase with approximately the same rate.
We have compared our sets of significant pairs with the results for significant pairs in the whole human genome [10]. Pairs formed by the same factor which are significant in our results were not taken into account in a definition of a pair in [10]. We have checked whether the significant pairs reported in Table 1 and 2 in [10] occur also in our sets of significant pairs. We recovered only a small number of these pairs in our significant pairs as summarized in Table 4. A reason for these differences could be the different scoring of the BS and the pairs and the different counting of pairs in the two methods.
| Table 4: Significant pairs of Hannenhalli and Levy [10] in our results. |
| No order | Pair score | Threshold = 3.8 | Pair score | Window 200 bp | Pair score |
|---|---|---|---|---|---|
| SREB,TAXCREB | 0.86 | SREBP1,CJ TAXCREB,SREB CJ,SREBP1 CREB,SREBP1 SREBP1,CREB CREBP1,SREBP1 SREBP1,TAXCREB SREBP1,CREBP1 |
1.93 1.39 1.36 1.09 1.00 0.79 0.70 0.46 |
CREBP1,MEF2 MEF2,CREBP1 |
0.52 0.02 |
| Abbreviations see Table 2. |
First we have annotated pairs of BS with Sij > 0 in our sequence-set of conserved regulatory DNA. The results are shown on our web-page.
Furthermore our results can be used to annotate pairs of BS in an arbitrary regulatory DNA with labelled putative BS. We show this for muscle specific genes in this section. The threshold for BS has the value 3.8 in the following however the BS-threshold for the significant pairs has the more conservative value 4 according to the results in the section "Pairs of binding sites", see Figure 1 and Table 1. The other parameters are the same as in the section "Pairs of binding sites".
Muscle-specific genes
Wassermann et al. have studied muscle specific conserved regulatory sequences of man and mouse [13]. Their results for BS of muscle specific factors are in good agreement with experimentally confirmed BS and in addition they have predicted new sites.
We have searched for putative BS in these sequences with all TRANSFAC-matrices. Then we have annotated the pairs of BS which are significant in the whole human and mouse conserved DNA shown in Figure 1 in the set of putative BS in the muscle sequences. Our results are summarized in Table 5. The TF AP4 occurs in the original data instead of MYOGENIN in the genes 3, 10, 53 and 60. Frequently the BS of AP4 and MYOGENIN overlap due to similar matrix-profiles. Thus we have noted for these genes the muscle-specific factor in the table. In addition experimentally confirmed BS are marked boldface.
| Table 5: Human muscle-specific genes: annotated pairs of BS in conserved regulatory sequence. |
| Gene | Significant pairs |
| 2 | AREB6, STAT1 |
| 3 | MYOGENIN, MYOGENIN |
| 7 | SRF,SRF; SRF,SRF |
| 8 | HMEF2,MYOGENIN |
| 10 | MAZ,ATF3; HMEF2,MYOGENIN; MYOD,MAZ |
| 15 | RSRFC4,MAZ |
| 19 | ALX4,MEF2 |
| 20 | HEB,MEF2; MEF2,MYOGENIN |
| 26 | SRF,SRF |
| 32 | MYOGENIN, MYOGENIN |
| 36 | TFIII,AREB |
| 39 | ALPHACP1,AP1 |
| 50 | MEF2,MEF2 |
| 53 | MYOGENIN, MYOGENIN; SRF,SRF |
| 60 | CP,E47; HMEF2,MYOGENIN; RORA1,AREB6 |
| Bold: experimentally confirmed binding site. |
Our method has found known experimental sites and it could predict new sites. We found conserved BS in 25 genes and we could annotate significant pairs in 15 genes. The 22 annotated pairs contain 13 experimentally confirmed BS. Significant pairs of 10 genes are composed of muscle specific factors. The muscle-specific pairs are (SRF,SRF), (MYOGENIN,MYOGENIN), (MEF2,MEF2) and (MEF2,MYOGENIN).
In this work we have introduced a refined method to search for BS by calculating a score for pairs of BS. This score positively weights a pair of BS which prefer to occur together in a large sequence set of conserved regulatory DNA of human and mouse in comparison with a random background model. As a result we have obtained a set of significant pairs of BS which contains known interacting pairs, but also many new pairs. It can be used to annotate significant pairs of BS in any arbitrary regulatory DNA with labelled putative BS. This refined searching for BS improves the discrimination between biological sites and random sites.
We have applied our results to annotate significant pairs in muscle-specific genes. The 22 annotated pairs contain 13 experimentally confirmed binding sites. In future work we will improve the scoring of putative BS [18] and will augment the scoring scheme to include information on the position, on strand orientation, and on the length of the intermediate sequence between BS. The challenges are the functional form of the score and the weighting of the individual pieces of information.
Availability: Full list of results at http://www.molgen.mpg.de/~rateit_k/data.html.
We acknowledge support from DFG in Sonderforschungsbereich 618.
Footnotes:
3 Besides the nucleotide distribution in our sequence set is close to an equidistribution (pA, pC, pG, pT)= (0.26, 0.23, 0.24, 0.27)
4 As distance we have defined the distance between the centers of the matrices.
5 The consensus sequence of the matrix NRF1_Q6 is CGCRTGCGCR. It could be possible that these sites are non-masked CpG islands.
6 (TF1,TF2) is counted as the same as (TF2,TF1).