MATCH™ - a tool for searching transcription factor binding sites in DNA sequences. Application for the analysis of human chromosomes.

Ellen Goessling¹, Olga V. Kel-Margoulis, Alexander E. Kel and Edgar Wingender

BIOBASE Biological Databases GmbH,
Halchtersche Strasse 33,
D-38304 Wolfenbüttel, Germany,
¹ To whom correspondence should be addressed
Phone: +49 (0) 5331-858422
Fax: +49 (0) 5331-858470

Today, with a number of complete genomes at hand, one of the major tasks of bioinformatics is to develop methods for the identification of gene regulatory regions.
Here, we present MATCH™ - a weight matrix-based tool for searching putative transcription factor binding sites in DNA sequences. MATCH™ is closely interconnected with, and distributed together with, the TRANSFAC® database [Wingender et al., 2001]. In particular, MATCH™ uses a matrix library derived from the matrices collected in TRANSFAC® and can therefore search for a great variety of transcription factor binding sites. We have developed a WWW interface and a graphical representation of the program output. Users may construct and save their own profiles, which are selected subsets of matrices with default or user-defined cut-off values.

A public version of the MATCH™ tool is available at

MATCH™ Algorithm.   The algorithm uses two values to score putative hits: the matrix similarity score and the core similarity score, resembling in this respect the previously published MatInspector algorithm [Quandt et al., 1995]. The matrix similarity score measures the quality of a match between the sequence and the whole matrix, whereas the core similarity score measures the quality of a match between the sequence and the core of the matrix, which consists of its five most conserved consecutive positions. Both scores range from 0 to 1, where 1 denotes an exact match.
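The two scores can be illustrated with a minimal Python sketch. It is not the MATCH™ implementation: the scoring formula (a min-max normalised sum of matrix frequencies), the use of the highest column frequency as a measure of conservation, and the names `similarity`, `core_start`, `score_site` and `TOY_MATRIX` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Column = Dict[str, float]  # nucleotide -> frequency at one matrix position

# Hypothetical 7-position matrix, not taken from TRANSFAC.
TOY_MATRIX: List[Column] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def similarity(columns: List[Column], site: str) -> float:
    """Assumed score: min-max normalised sum of frequencies, in [0, 1]."""
    current = sum(col[base] for col, base in zip(columns, site))
    lowest = sum(min(col.values()) for col in columns)
    highest = sum(max(col.values()) for col in columns)
    if highest == lowest:  # matrix carries no information
        return 1.0
    return (current - lowest) / (highest - lowest)


def core_start(columns: List[Column], core_len: int = 5) -> int:
    """Start of the most conserved run of `core_len` consecutive positions.

    Conservation of a position is approximated by its highest frequency.
    """
    conservation = [max(col.values()) for col in columns]
    starts = range(len(columns) - core_len + 1)
    return max(starts, key=lambda s: sum(conservation[s:s + core_len]))


def score_site(columns: List[Column], site: str,
               core_len: int = 5) -> Tuple[float, float]:
    """Return (matrix similarity, core similarity) for one candidate site."""
    start = core_start(columns, core_len)
    core = similarity(columns[start:start + core_len],
                      site[start:start + core_len])
    return similarity(columns, site), core
```

For this toy matrix the consensus site `ATGACTA` scores 1 on both measures, whereas a mismatch inside the core (for example `ATCACTA`) lowers the core similarity more strongly than the matrix similarity.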

The core similarity allows a pre-selection of possible matches: the matrix similarity score is calculated only for those matches whose core similarity score exceeds a certain cut-off. This increases the speed of the algorithm. Setting the core similarity cut-off to 0 turns this feature off.
When searching a sequence with MATCH™, a cut-off for the matrix similarity score is used to filter significant matches out of the large number of possible matches.
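The pre-selection step can be sketched as follows, assuming a min-max normalised score, a forward-strand scan of an upper-case A/C/G/T sequence, and a core position supplied by the caller. A core score equal to the cut-off is accepted here, so that a core cut-off of 0 lets every window through. The function names and the toy matrix are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

Column = Dict[str, float]  # nucleotide -> frequency at one matrix position

# Hypothetical 7-position matrix, not taken from TRANSFAC.
TOY_MATRIX: List[Column] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def similarity(columns: List[Column], site: str) -> float:
    """Assumed score: min-max normalised sum of frequencies, in [0, 1]."""
    current = sum(col[base] for col, base in zip(columns, site))
    lowest = sum(min(col.values()) for col in columns)
    highest = sum(max(col.values()) for col in columns)
    return 1.0 if highest == lowest else (current - lowest) / (highest - lowest)


def scan(sequence: str, columns: List[Column], core_start: int,
         core_cutoff: float, matrix_cutoff: float,
         core_len: int = 5) -> Iterator[Tuple[int, float, float]]:
    """Yield (position, core score, matrix score) for every accepted window."""
    width = len(columns)
    core_cols = columns[core_start:core_start + core_len]
    for pos in range(len(sequence) - width + 1):
        window = sequence[pos:pos + width]
        core = similarity(core_cols, window[core_start:core_start + core_len])
        if core < core_cutoff:
            continue  # cheap rejection: the full matrix score is never computed
        full = similarity(columns, window)
        if full >= matrix_cutoff:
            yield pos, core, full
```

Calling `list(scan("CCATGACTACC", TOY_MATRIX, 1, 0.9, 0.9))` reports the single window that starts at position 2; with both cut-offs set to 0 every window is returned.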
Selecting an appropriate cut-off is very important and depends largely on the user's objectives. An exact match between matrix and sequence can lack any biological relevance, while some transcription factors have low-affinity binding sites that are biologically significant. We have therefore calculated three different kinds of cut-offs, each serving a different purpose.

  1. Cut-offs to minimize the false negative rate, i.e. the number of biologically relevant binding sites that are missed by MATCH™.

  2. Cut-offs to minimize the false positive rate, i.e. the number of random matches found.

  3. Cut-offs to minimize the sum of both error rates.

Matrix Similarity cut-off estimation.

Cut-offs minimizing the false negative rate (minFN).

We have estimated the cut-offs that minimize false negative matches in two different ways. In the first approach, we used the weight matrices themselves to calculate the probability of each nucleotide occurring at each position of a binding site. Based on these probabilities we generated a set of oligonucleotides for each matrix of TRANSFAC® 5.1 and applied MATCH™ to this set without using any cut-offs. We then set the cut-off to a value that provides recognition of at least 90% of the oligonucleotides in the set. We decided to tolerate an error rate of ten percent, taking into account that the respective set of oligonucleotides might contain weak representatives. We call these cut-offs minFN10 cut-offs.
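The first approach can be sketched in Python as follows. The sketch assumes that oligonucleotides are drawn independently per position from the matrix frequencies and that a min-max normalised score is used; the sample size, the seed, the toy matrix and all function names are hypothetical and do not come from MATCH™.

```python
import random
from typing import Dict, List

Column = Dict[str, float]  # nucleotide -> frequency at one matrix position

# Hypothetical 7-position matrix, not taken from TRANSFAC.
TOY_MATRIX: List[Column] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def similarity(columns: List[Column], site: str) -> float:
    """Assumed score: min-max normalised sum of frequencies, in [0, 1]."""
    current = sum(col[base] for col, base in zip(columns, site))
    lowest = sum(min(col.values()) for col in columns)
    highest = sum(max(col.values()) for col in columns)
    return 1.0 if highest == lowest else (current - lowest) / (highest - lowest)


def sample_scores(columns: List[Column], n_sites: int = 1000,
                  seed: int = 0) -> List[float]:
    """Draw oligonucleotides from the matrix frequencies and score each one."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sites):
        site = "".join(
            rng.choices(list(col.keys()), weights=list(col.values()))[0]
            for col in columns
        )
        scores.append(similarity(columns, site))
    return scores


def cutoff_for_miss_rate(scores: List[float], miss_rate: float = 0.10) -> float:
    """Highest cut-off that still recognises at least (1 - miss_rate) of the set.

    A site counts as recognised when its score is >= the cut-off.
    """
    ranked = sorted(scores)
    allowed_misses = min(int(miss_rate * len(ranked)), len(ranked) - 1)
    return ranked[allowed_misses]
```

With `miss_rate=0.10` the returned value plays the role of a minFN10 cut-off for the sampled set; with `miss_rate=0` it is simply the lowest score in the set.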
In the second approach we mainly followed [Pickert et al., 1998]. We have applied individual matrices to sets of corresponding genomic binding sites collected in TRANSFAC® 5.1. In contrast to [Pickert et al., 1998], our test sets included not only the sites that were used to calculate a matrix, but all genomic binding sites contained in TRANSFAC® 5.1 for the binding factor of the respective matrix. Each of these binding sites was extended by ten base pairs at each end with the help of the corresponding EMBL entry. Only sets containing at least 15 individual binding sites were used, which allowed us to apply this approach to 111 matrices. Here, we took as the minFN cut-off the cut-off that leads to 0% false negative matches, i.e. the one that allows all TRANSFAC® sites of the sample to be recognized.
It should be noted that the minFN cut-offs estimated on the sets of genomic binding sites are lower than the minFN10 cut-offs estimated on the sets of computed oligonucleotides. This result agrees with the known fact that many genomic binding sites are low-affinity binding sites. Transcription factors may be anchored to their low-affinity binding sites through contacts with other transcription factors in addition to the factor-DNA contacts. When applying the minFN cut-offs, the user will find all genomic binding sites, but a high rate of false positives must be expected as well. The minFN cut-offs are useful for the detailed analysis of relatively short DNA fragments.
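A minimal Python sketch of the second approach is given below. It assumes that each extended genomic site is scanned on the forward strand, that the best-scoring window represents the site, and that the minFN cut-off is the lowest of these best scores. The scoring formula, the toy matrix and the two example sequences are hypothetical.

```python
from typing import Dict, List

Column = Dict[str, float]  # nucleotide -> frequency at one matrix position

# Hypothetical 7-position matrix, not taken from TRANSFAC.
TOY_MATRIX: List[Column] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

# Hypothetical "genomic" sites, already extended by flanking bases.
EXTENDED_SITES = ["CCATGACTACC", "GGATCACTAGG"]


def similarity(columns: List[Column], site: str) -> float:
    """Assumed score: min-max normalised sum of frequencies, in [0, 1]."""
    current = sum(col[base] for col, base in zip(columns, site))
    lowest = sum(min(col.values()) for col in columns)
    highest = sum(max(col.values()) for col in columns)
    return 1.0 if highest == lowest else (current - lowest) / (highest - lowest)


def best_score(columns: List[Column], sequence: str) -> float:
    """Best matrix similarity over all windows of one extended site."""
    width = len(columns)
    return max(
        similarity(columns, sequence[pos:pos + width])
        for pos in range(len(sequence) - width + 1)
    )


def min_fn_cutoff(columns: List[Column], extended_sites: List[str]) -> float:
    """Highest cut-off at which every site in the set is still recognised."""
    return min(best_score(columns, site) for site in extended_sites)
```

In this toy set the second site deviates from the consensus, so it alone determines the cut-off; this mirrors the observation that weak genomic sites pull the minFN cut-offs down.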

Cut-offs minimizing the false positive rate (minFP).

Again we followed [Pickert et al., 1998] in estimating cut-offs to minimize the false positive rate. We have applied MATCH™ to the sequences of the second exons (6×10^6 bp). When minFP cut-offs are applied for searching a DNA sequence, MATCH™ will return a relatively low number of matches per nucleotide. In the output the user will find only putative sites with a high similarity to the weight matrix; however, some known genomic binding sites will not be recognized. This kind of cut-off is useful, for instance, for searching the most promising potential binding sites in extended genomic DNA sequences.

Cut-offs minimizing the sum of both errors (minSum).

These thresholds were also computed as described earlier [Reuter, 2000].
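The idea of a minSum cut-off can be illustrated with a short Python sketch. The details of the cited procedure are not given here, so the definitions below are assumptions: the false negative rate is taken as the fraction of known sites scoring below the cut-off, the false positive rate as the fraction of background matches scoring at or above it, and only observed scores are tried as candidate cut-offs. All names and numbers are hypothetical.

```python
from typing import List, Tuple


def error_rates(cutoff: float, site_scores: List[float],
                background_scores: List[float]) -> Tuple[float, float]:
    """Return (false negative rate, false positive rate) at a given cut-off."""
    false_neg = sum(s < cutoff for s in site_scores) / len(site_scores)
    false_pos = sum(s >= cutoff for s in background_scores) / len(background_scores)
    return false_neg, false_pos


def min_sum_cutoff(site_scores: List[float],
                   background_scores: List[float]) -> float:
    """Observed score that minimises the sum of both error rates.

    Ties are resolved in favour of the lowest candidate cut-off.
    """
    candidates = sorted(set(site_scores) | set(background_scores))
    return min(
        candidates,
        key=lambda c: sum(error_rates(c, site_scores, background_scores)),
    )


# Hypothetical matrix similarity scores, for illustration only.
SITE_SCORES = [0.95, 0.90, 0.85, 0.80]
BACKGROUND_SCORES = [0.50, 0.55, 0.70, 0.88]
```

For these hypothetical scores `min_sum_cutoff(SITE_SCORES, BACKGROUND_SCORES)` selects 0.80: every known site is still recognised and only one of the four background matches remains.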

Application of MATCH™ for the analysis of human chromosomes.

We have applied MATCH™ to estimate the frequency of putative binding sites for various transcription factors in extended DNA sequences. For this analysis, 25 matrices were selected which cover all types of DNA binding domains. We selected cut-off values which allow the recognition of 50% of the corresponding genomic binding sites collected in TRANSFAC® release 5.1.
We have compared the frequencies of the putative binding sites between the human chromosomes 22 (23×10^6 bp), 21 (34×10^6 bp) and X (64×10^6 bp), and random sequences with equal nucleotide distribution (20×10^6 bp).
Our results show a significant variation between different matrices in the number of matches found within the same DNA sequence. For some matrices, matches could be found with relatively high frequency within chromosomal and random sequences - up to 7 matches per 1000 bp. This subset of matrices includes those for the TATA binding protein (M00252), for the glucocorticoid receptor (M00192), for the C/EBP family (M00116), and for the AP-1 family (M00199) (results are shown in Table 1). Another subset of matrices is characterized by a significantly lower frequency - 8 or fewer matches per 100,000 bp. Four matrices of this subset are shown in Table 1: those for NF-Y (M00185), for the oestrogen receptor (M00191), for NF-kappaB (M00054), and for the serum response factor (M00215). The frequency of the matches does not correlate with the GC-content of the matrix, and possibly reflects some other characteristics.

Table 1: Frequency of the putative binding sites found by the MATCH™ tool (matches per 1 kb). The cut-off for each matrix is set to find 50% of genomic binding sites.

Accession  Matrix          Chr. 22     Chr. 21     Chr. X      Random
M00252     V$TATA_01       4.29        7.29        8.20        2.30
M00192     V$GR_Q6         4.97        5.11        5.24        3.94
M00116     V$CEBPA_01      1.69        2.60        2.80        1.60
M00199     V$AP1_C         1.95        2.50        2.52        1.76
M00185     V$NFY_Q6        5.1×10^-2   7.0×10^-2   8.1×10^-2   6.2×10^-2
M00191     V$ER_Q6         6.5×10^-2   5.1×10^-2   4.3×10^-2   4.8×10^-2
M00054     V$NFKAPPAB_01   3.3×10^-2   2.7×10^-2   3.0×10^-2   2.9×10^-2
M00215     V$SRF_C         1.1×10^-2   1.4×10^-2   1.4×10^-2   1.9×10^-2
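The text quotes the low-frequency subset per 100,000 bp, whereas Table 1 lists matches per 1 kb. The conversion, and the approximate number of matches implied by a frequency and a sequence length, can be checked with a few lines of Python. The frequencies are copied from Table 1 and the chromosome length from the text; the function and variable names are hypothetical.

```python
from typing import Dict

# Low-frequency rows of Table 1, in matches per 1 kb.
LOW_FREQUENCY_PER_KB: Dict[str, Dict[str, float]] = {
    "V$NFY_Q6": {"chr22": 5.1e-2, "chr21": 7.0e-2, "chrX": 8.1e-2, "random": 6.2e-2},
    "V$ER_Q6": {"chr22": 6.5e-2, "chr21": 5.1e-2, "chrX": 4.3e-2, "random": 4.8e-2},
    "V$NFKAPPAB_01": {"chr22": 3.3e-2, "chr21": 2.7e-2, "chrX": 3.0e-2, "random": 2.9e-2},
    "V$SRF_C": {"chr22": 1.1e-2, "chr21": 1.4e-2, "chrX": 1.4e-2, "random": 1.9e-2},
}


def per_100kb(per_kb: float) -> float:
    """Convert matches per 1 kb into matches per 100,000 bp."""
    return per_kb * 100.0


def implied_matches(per_kb: float, length_bp: float) -> float:
    """Approximate number of matches implied by a frequency and a length."""
    return per_kb * length_bp / 1000.0


# Highest value in the low-frequency subset, expressed per 100,000 bp.
highest_low_frequency = max(
    per_100kb(value)
    for row in LOW_FREQUENCY_PER_KB.values()
    for value in row.values()
)

# V$TATA_01 on chromosome 22: 4.29 matches per kb over about 23×10^6 bp.
tata_chr22_matches = implied_matches(4.29, 23e6)
```

The highest entry of the low-frequency subset comes out at about 8 matches per 100,000 bp, in line with the figure quoted in the text, and the TATA matrix corresponds to roughly 10^5 matches on chromosome 22.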

Along with the differences between matrices within the same DNA sequence, our results show a significant variation in the frequency of matches between chromosomal and random sequences for the same matrix. We have grouped the analysed matrices into the following three subsets:

  1. the frequency of the matches within the chromosomal sequences is lower than within the random sequences ("chr<r"): matrices for CREBP1, USF, EGR, SRF and E2F;

  2. the frequency of the matches within the chromosomal sequences is approximately equal to the frequency within the random sequences ("chr≈r"): matrices for NF-1, NF-kappaB, c-Myb, NF-Y, ER, GR, HNF-4 and PPAR;

  3. the frequency of the matches within the chromosomal sequences is higher than within the random sequences ("chr>r"): matrices for HNF3B, OCT, NF-AT, MEF2, TBP, AP-1, YY1 and HNF1.

Thus, our results show that MATCH™ can be applied to the analysis of extended genomic sequences, both to find overall genomic regularities in the distribution of binding sites and to define interesting DNA fragments for further analysis.