In Silico Biology 9, 0002 (2008); ©2008, Bioinformation Systems e.V. |
Russ College of Engineering and Technology, Ohio University, Athens, Ohio, USA
^{1} Bioinformatics Laboratory, School of Electrical Engineering and Computer Science, Ohio University,
Biomolecular Engineering Program, Ohio University, and
Molecular and Cellular Biology Program, Ohio University, Athens, Ohio
* Corresponding author
Email: {ecker/lichtenj/welch}@ohio.edu
Edited by H. Michael; received April 21, 2008; revised November 03, 2008; accepted November 05, 2008; published December 30, 2008
In this paper we describe some utilizing conditions of a recently published tool that offers two basic functions for the classical problem of discovering motifs in a set of promoter sequences. For the first it is assumed that not necessarily all of the sequences possess a common motif of given length l. In this case, CHECKPROMOTER allows an exact identification of maximal subsets of related promoters. The purpose of this program is to recognize putatively co-regulated genes. The second, CHECKMOTIF, solves the problem of checking if the given promoters have a common motif. It uses a fast approximation algorithm for which we were able to derive non-trivial low performance bounds (defined as the ratio of Hamming distance of the obtained solution to that of a theoretically best solution) for the computed outputs. Both programs use a novel weighted Hamming distance paradigm for evaluating the similarity of sets of l-mers, and we are able to compute performance bounds for the proposed motifs. A set of At promoters were used as a benchmark for a comparative test against five known tools. It could be verified that SiteSeeker significantly outperformed these tools.
Keywords: weighted Hamming distance, motif discovery, co-regulation, Arabidopsis thaliana
Experiments for identifying transcription factor binding sites are very complex, expensive and time consuming. On the other hand it has been verified that computational motif discovery tools based on methods known from data mining can facilitate experimental search by providing small numbers of putative cis-regulatory structures in promoter sequences left for experimental verification. A number of such tools have been developed during the last few decades (see [1] for an overview).
A computationally identified motif can be expected to have biological meaning if it is statistically unlikely to appear in randomly chosen sequences. A number of common methods for quantifying the credibility of words as putative binding sites are based on entropy [2] or log likelihood [3]. Due to the nature of statistical methods, they can be expected to lead to useful results only for larger number of promoters. However, with having promoter sequences of up to approx. 3000 base pairs (bp) in mind, methods searching the solution space exhaustively for motifs of length already around 10 exclude themselves from consideration because of computational restrictions. Consequently available alignment approaches are heuristic in nature. Though usually very fast, the known heuristic algorithms, as for example Gibbs sampling [4, 5], do not give any clue of the optimality of the produced outputs, or have derived performance bounds.
Other approaches are based on the observation that genes being co-regulated by the same transcription factor have the same or similar binding sites in their promoter regions. Of some help is the fact that such binding sites are usually identical or differ by mismatches in only few positions. Several known tools such as Teiresias [6], cWINNOWER [7], Weeder [8], and LP/DEE [9] follow this idea and search promoter regions for multiple occurrences of motifs that differ at most by a small number of mismatches, without allowing deletions or inclusions. The creditability of words as putative binding sites is then measured by their degree of similarity. A simple method of measuring the distance of words is the Hamming distance which counts the number of mismatch positions in the words. Good motivation for using Hamming distance was, for example, given by Lanctot et al. [10].
Distance-based approaches allow avoiding the computational explosion because they may start a search on small numbers (e. g., two) of sequences and combine the intermediate results in subsequent steps.
The recently developed motif discovery tool SiteSeeker deals particularly with two basic biological questions. For the first problem it is assumed that not necessarily all of the sequences possess a common motif, in which case maximal subsets of related promoters together with the corresponding motifs are identified. The second tool solves the problem of checking if a given set of promoters have a common motif of a given length l. It uses an algorithm that is heuristic in nature but - in contrast to known heuristic motif discovery tools - the authors are able to derive low performance bounds (defined as the ratio of Hamming distance of the obtained solution to that of a theoretically best solution) for the computed output. The programs presented there differ from the aforementioned by the way the results are evaluated. We are using a Hamming distance approach that is refined in two directions: firstly we use the sum of Hamming distances of word pairs for evaluating blocks of two or more words. Secondly we introduce the notion of weighted Hamming distance where the distance of two nucleotides can be any value between 0 and 1. To the best of our knowledge, such variation of the Hamming distance model has never been discussed in literature. As an interesting fact, we were able to show that with this modified distance paradigm most of the known Arabidopsis thaliana motifs could be verified (see section "Experience" for details). This fact gave strong motivation to continue the development of the SiteSeeker tools.
The purpose of this paper is to illuminate some of the program features, give practical guidelines for efficient use of the programs, and prove through an assessment that the tool performs significantly better than MEME [11], AlignAce [12], MDscan [13], Weeder [8], and Gibbs Centroid Sampler [5].
The SiteSeeker tools
Depending on the kind of problem the user is confronted with, he may want to know which sequences of a particular set of promoters have some yet unknown word in common. The user would try out searches for words of different lengths, and, for given word length, look for all promoters that share a word. The motivation is that promoters sharing a highly scored word can be considered as putatively co-regulated. SiteSeeker meets this demand by offering the program CHECKPROMOTER which allows analyzing a set of promoters with respect to co-regulation. CHECKPROMOTER is based on the assumption that not necessarily all of the sequences possess a common motif, in which case maximal subsets of related promoters and the corresponding motifs are identified. The program accepts a lower bound for the number of promoter sequences that have either a common exactly matching word or have similar words whose Hamming distance is below a given threshold.
Another important question is identifying words that can be found in all promoter sequences. CHECKMOTIF helps the user to identify such words which may be considered as putative cis-regulatory binding sites. Further he needs a criterion that allows selecting words that offer a high confidence level for having biological significance. CHECKMOTIF solves the problem by scoring motif candidates with the sum of weighted Hamming distances of pairs. The algorithm is fast and heuristic in nature and - in contrast to known heuristic motif discover tools - we are able to derive low performance bounds (which is the ratio of Hamming distances of the obtained solution and a theoretically best solution) for the computed output.
CHECKMOTIF is designed to work cooperatively with CHECKPROMOTER. An interesting way to deal with the programs is to first apply CHECKPROMOTER on the input set of promoters. After a subset of promoters has been identified as putatively co-regulatory, applying CHECKMOTIF on this subset would then be a logical subsequent step in order to learn more about structures hidden in these promoters.
Hamming-based evaluation of blocks of words
Hamming distance as used in known tools penalizes character mismatches with 1, whereas matching positions add 0 to the total Hamming distance. This distance seems to be a too coarse measure as it is well known that the frequency of nucleotides in genomic sequences varies greatly. For example, we see that in some Arabidopis thaliana (At) promoters nucleotides A and G are each overrepresented with over 30 %. On the other hand, the At motifs known in the AGRIS database [14] show a considerably larger frequency of C and G. Following Zaslavsky and Singh [9], to capture such words in a computer-based tool it would make sense to penalize matching pairs of nucleotides with a small value greater than 0. We adopted this idea and chose pr(X) for the penalty of matching characters, where, pr(X) is the probability of finding nucleotide X in the sequences. Notice that this allows background distribution to be considered in the - originally background-independent - Hamming distance model.
There is another observation that can be made from known At motifs. Counting allowed alternatives in the known motifs showed a surprising imbalance, as most variations found are A/T (see Tab. 1).
Table 1: | Number of variations in known At motifs. |
A/T | A/C | C/T | A/G | C/G | A/G/T | G/T | A/C/T | C/G/T |
35 | 8 | 5 | 5 | 3 | 3 | 2 | 2 | 1 |
An easy way to take care of these is to penalize A/T mismatches by a value smaller than 1; experiments showed that 0.7 is a reasonable choice for At.
Motivated by these facts, we introduce a weight function ω: {A, C, G, T}^{2} → [0,1] (the closed interval of length 1), and define the weighted Hamming distance of l-mers x = a^{1}...a^{l} and y = b^{1}...b^{l} by d(x,y) = Σ ω (a^{i}, b^{i}). For a block (w_{1},...,w_{N}) of N ≥ 2 l-mers, the sum of weighted Hamming distances of pairs is
We have several good reasons to be convinced that the sum of pairs distance is an appropriate and meaningful way to evaluate the similarity of a set of l-mers:
The tools we are presenting in this paper use the weighted Hamming distance paradigm for evaluating a set of equal length words. Details of the running conditions and about the structure of CHECKPROMOTER and CHECKMOTIF are discussed in the next two sections.
Input parameters
CHECKPROMOTERS requires the specification of a number of parameters,
(1) range of motif lengths; CHECKPROMOTER will produce its output for the word lengths in the range,
(2) Hamming distances of nucleotide pairs can be set in the range from 0.0 to 1.0,
(3) upper bound on Hamming distance (acceptance threshold) for considering a pair of words as similar,
(4) the number of promoter sequences having a common word,
(5) the reverse strands can be chosen to be included in the search.
The user can choose between a standard output that lists the found words, or an extended output that in addition shows the word positions in the sequences.
The acceptance threshold allows controlling the similarity of the block words: The smaller the threshold, the closer will be the words among themselves. Especially threshold 0 in case of standard (0/1) Hamming scores will lead to exact matches (if they exist).
An example output
The user can choose between two output formats. The standard output produces a list of words that are found in a minimum number of sequences. For each word, it tells the frequency in each sequence:
Motif length = 8 Number of solution words found 2 Common words of length 8 in the sequences: Sequences 0 1 2 3 4 AAAAATTG: 3 1 0 1 1 in 4 sequences; total: 6 CTGAGAAG: 0 1 1 2 0 in 3 sequences; total: 4
The extended output shows in addition the positions of the words:
Printing position details (negative: reverse strand): AAAAATTG is found in the following sequences at the positions: seq_0: 425; 477; -522; seq_1: 376; seq_3: -782; seq_4: 667; CTGAGAAG is found in the following sequences at the positions: seq_1: 1464; seq_2: -1324; seq_3: -599; -405;
Input parameters of CHECKMOTIF
CHECKMOTIF needs the specification of the same parameters as CHECKMOTIF except (4), which is replaced by
(4') number of solutions to be presented.
The user can again choose between a standard output that lists the found words, or an extended output that in addition shows the word positions in the sequences.
Output of CHECKMOTIF
The output is a list of motifs sorted increasingly of their Hamming distance. Each is a set of N words of length l (one word from each promoter). Denoting w_{i} as the l-mer found by CHECKMOTIF in the i-th promoter, a result is described by a block w = (w_{1},...,w_{N}) of l-mers. For scoring a computed result the sum of weighted Hamming distances of word pairs is used.
For each motif the output shows the consensus word, a sum of distances, a regular expression, mean Hamming distance, and a worst case performance estimate. In case of extended output, the block words and their position in the promoters are given.
The regular expression captures the variations in each block column by using the code recommended by NC-IUB (1984). Regarding the Hamming distance it should be noticed that the output may have results with different motif lengths (motif lengths range from LOWER_MOTLEN to UPPER_MOTLEN), while longer motif length will generally lead to higher Hamming distances. It can be expected that the average Hamming distance per column allows better comparison of motifs of different lengths. Consequently, the mean Hamming distance, defined as d(w_{1},...,w_{N})/l, is shown in the outputs.
A consensus word w^{c} is determined from w = (w_{1},..., w_{N}) by the selecting in each column the most frequent character. As for the regular expression, the sum of distances is also computed as an average per block column,
Finally, the worst performance estimate compares the Hamming distance of the computed block w of l-mers to that of a block u with theoretically minimum possible Hamming distance. Following [15], the upper bound for this ratio is given by
with
and where ω_{min} ≥ 0 is the smallest distance of a pair of nucleotides. A performance bound is useful as it allows estimating the quality of the computed results. Interestingly there seems to be no other motif discovery tool that gives the user an idea of how far the presented motifs are away from the optimum.
An example output
Results for Motif length = 10 1: Consensus word CCTTTTTTGG Sum of distances = 26.00 Regul. Expression CCWWWWWWGG Mean Hamming Distance = 37.58 2: Consensus word TTATTGGGCC Sum of distances = 21.00 Regul. Expression KWDDTGGKY* Mean Hamming Distance = 40.03 3: Consensus word ATTATTGGGC Sum of distances = 22.00 Regul. Expression HWWDWWKGKY Mean Hamming Distance = 40.48
For each result the extended output shows detailed information about the block words and their positions in the promoters. For example the extended output for the first motif is:
1: Consensus word CCTTTTTTGG Sum of distances = 26.00 Regul. Expression CCWWWWWWGG Mean Hamming Distance = 37.58 ------------------------------- At1g01930.1 CCTTATATGG found on + strand (position 74) At1g01940.1 CCTTATATGG found on - strand (position 86) ... At1g07290.1 CCTTTTTAGG found on - strand (position 1226)
In AGRIS (http://arabidopsis.med.ohio-state.edu/) a list of 95 promoter sets is available, which we used as a benchmark for performing a comparative test of CHECKMOTIF against the five well-known tools MEME [11], AlignAce [12], MDscan [13], Weeder [8], and Gibbs Centroid Sampler [5]. We chose the first 40 sets for an assessment. Sets 6 and 27-29 were discarded because they have only a single sequence, while SiteSeeker as well as some of the other tools require at least two sequences for input.
Search parameters: Throughout the test we used the same weighted Hamming distance model, which penalized matching base pairs (A-A, C-C, etc.) with their statistical frequency, as described in section "The SiteSeeker Tools". The motif length was chosen between 5 and 20, and the search was done on both promoter strands. For the web tools we have chosen the same parameters as far as possible.
Results: In each tool we asked for up to the 5 best scored words. A word with a 50% or higher overlap with the known motif was considered a successful hit. Tab. 2 lists the obtained results, either on the positive strand or the reverse complement. The number in the first column refers to the set number in AGRIS. The second column shows the known motif of the set. Columns 3-7 present the highest scored words (highest on top, in decreasing score order) found by SiteSeeker, MEME, AlignAce, MDscan, Weeder, and Gibbs Centroid Sampler. Matches with the known motifs are emphasized.
The weighted Hamming distance used in this assessment turned out to perform extremely well, and SiteSeeker outperformed the other tools on Arabidopsis thaliana promoters.
Table 2: | Comparison of the outputs of six tools. |
Set | Agris: motif | SiteSeeker | MEME | AlignAce | MDScan | Weeder | Gibbs Centr | |
1 | CACGTGGC | CACGTGGC | TGCCACGTG | AAAANAAANNNNA CACGTGGC | GACACGTG CACGTGGC | CACGTG CACGTGGC | CACGTGGC | |
2 | YACGTGGC | ACGTGGC | AAAAAAAGAAGAAAA CACGTGGC | ANAAAAANAA | AAAAAAAA CACGTGGC | YACGTGGC | ACGTGDH | |
3 | BACGTGKM | ACGTG | TGCCACGTG | AAANAAAAA TNNNAAATNTTG TANNGGGCC TNCACGTGG | TTTTTTTT TATATATA | TACGTG ATCGTACC ACGTGT | --- | |
4 | GACACGTAGA | GACACGTAGA | TCTACGTGTC | ANNNAAAAAAAAA GACACGTAGA | GACACGTAGA | CACGTAGA | GACACGTAGA | |
5 | CCATTTTTAGT | CCATTTTTAG | CCTAAAAATGG | CTAAAAATGG | GAAGAAGAAG TTTCTCTCTTT TCTCTCTCTTT AGAGAGAGAGA | TCGATA ATATTCGT TATTCGTA CGTCGATA TATTCG | CATTTTTAG | |
6 | Only one sequence available ==> SiteSeeker could not be applied |
|||||||
7 | CCATTTTTGG | CCATTTTTGG | CCATTTTTGG | CAAAAAAAAA CCAAAAATGG AAANNNANAAANAAA AAAAANGNNAAAA AAAAANANANNNANNNN-NANA | CCATTTTTGG | ATTTTTGG GGCCCT CCATTTTT | CATTTTTGG | |
8 | NTTDCCWWWWNN-GGWAAN | TWDCCWWWWWWGRHWA | TTGCCAAAAAAGGAAA | ANNNNAANNNNAANAAANA-AANAANNATG TGTCACGGGTGAGTTA CAANAANANNANANAAAAA-AANNNA | GATCGGATCGGATCAG GGATCGGATCCGATCA CGGATCAGATCGGATC CCGATCGGATCAGATC TGATCGGATCGGATCA | TTACGAAA TCGATA TATTCGAT AATCGA TCGATTTA | WDCCWWWWWG | |
9 | CCAATTTAATGG | CCAATTTAATGG | CCAATTTAATGGGC | ACAAAAAAAAAC GNNNNGAAAGNNANAN-ANANAA ANANAGAGAGAGANNNA ANGNCAAAAANNNNAN-NNAGA ANACAAAANTNNATANNNNT | TCCTCCTCCTCC CTCCTCCTCCTC TCCTCCCCCCCC | CCAATTTAATGG | --- | |
10 | TTWCYAWWWWTR-GWAA | TTWCYAWWWWTRGWAA | TTACTAAAAATGGAAA | CAAAAAAATAANAANAANNA ANANNAAAANNNNAGANNANANANNNAAANT ATACCAAAAATGAAAA | CTTCTTCTTCTTCT-TC | ATGGAAAT | --- | |
11 | CCATTTTTAG | CTAAAAATGG | CTAAAAATGG | AAAAAAANAAA GNGAAGAAGNNNNAG AANNATNNANNAAAANT TNTTCAANAAAA AAGAAANNNNAAAA | TTTTTTTTTT | ATTTTTGG | CTAAAAATGG | |
12 | CCATTTTTGG | CCATTTTTGG | CCATTTTTGG | CAAAAAAAAA CCAAAAATGG AAAAGAAAAAAAA | TCTCTCTCTC GAGAGAGAGA AGAGAGAGAG | ATTTTTGG | ATTTTTGG | |
13 | TGTCTC | TGTCTC | AAAAGAAGAAAGAGAAAGAA GGGGTAGGCGGGGGGGGCTG CTCTCTCTCTCCCCC AGAAGAGAGAG GTGGTTGC | --- | CTCTCT TGGGCC TGGACC TGGTCC TGCATG | GGCGTG TTAACG GTCGTG ATGTCG GTAGAT | --- | |
14 | TGTCTC | TGTCTC | TTTTTTCTCTCTCTTCTTTT CCCCTCTGTCTCTCTCCCTC TGGATGGCGGCGTCGTTGTT TTATAGCTTTCTCTCTTTAA GTGTCGGTGCTCCGTGAAGT | GAGNGAG GNTGNNGNNGNNG GGNGGNNGNNG | TGGGCC TGGACC TGGTCC GGGCCA TCTCTC | GTGTCT | --- | |
15 | ACTCAT | ATGAGT | CTCCCTCTCCCTCTT ACGGCCCAGCAGCTT AGACAAAAAAAAAAG GTTTGGGTTTT GGCCCA | ANNAAAANAAAA AAGTTTTGT AANNANANANCNANANA ATATATAANA ANANNTATATNTA | TATATA ATATAT TTTTTT TAATTA TTAATT | ATGAGT | --- | |
16 | CAATWATTG | CAATTATTG | CTTCTTTTTCTTTTTCTCT CGGGACCGCCT TCAATTATTGTTCCG GATTTTCTTTTTTTT CAAACCGAGCCAGACCGGCT | AAAATAAAATAA CAATAATTG | CAATTATTG | GGGCCC CGTGGG CGCGCC GCCCGG TGGGCC | CAATWATT | |
17 | CAATSATTG | TTCTTTT AATTAAA | AGGGAGAGAGAG TTTTTTTTCTTT CCCCAAGGACC TGGGCCAC GTTTCTTTTTT | ANAAAAAAAAAAANAAA AGGNCCANGANCGTAANANG AAAAANAAAGAGAGTA AGNACCGAGAAGTCTT AGGCCCANGANCGNAANANG | CTCTCTCTC | TACGTGGG GGTAATTA CTCACGCG CACACGTA CGCGTGAT | --- | |
18 | CAATNATTG | CAATNATTG | ATTCAATTATTG | AAATAAAATAAAA GAGAGAGAGA AATGGGCCTGA AATAAAATAATTCATTCTCAT AACACAATGATTGGCT | CTCTCTCTC | AATCATTG | CAATNATTR | |
19 | CAATTATTA | CAATTATTA | CTCCCTCTCTCTCCC CAGAAAAAAAAAAAA AAAAAAAAAAGA GTCTCCCTCTACCCC CAATTATTAAAG | AAAANAAAANNA GAGAGGAAG TNTTGGGCCNNA ATNNNNANATATNTNNA CAATCATTG | CTCTCTCTC AGAGAGAGA | CAATTATT | AATTAT | |
20 | CTAACCA | CTAACCA | TGGTTAG | AACAAAA AAAGAAA GAGAGAG | AAAAAAAA ATATATAT CTCTCTCT | TATTCGAT TCGATA ATTCGATA CGATAA | --- | |
21 | CACATG | CACATG | TCTCTCTCTCTC CTTTTTTCTTTT AATGGGCC CCCATGTG | TGGGCC | AGAAAG GGGCCA GGACCA GGCCCA GACCCA | ATATCG ATTAATCG TCGATA TATCGA ATCGATAT | --- | |
22 | AAATTAAA | AAATTAAA | CTTCCTCTTCTT TGGGCC GATTTTTTTTTC TTTTGGTGGGTT GGGGGGGCGGGG | AAAANNAAANNNA ANNNNNNTNNANATATA GAGAGAGA AATGGGCC GNNGNAGNCGNNGNNG | TTTTTTTT TATATATA | TATATTCG CGTATA CGTATATT | --- | |
23 | AAATTAGT | ACTAATTT | AAAAAGAAGAAG TTTTTTTTTTTG AAGAAGGAACCA TGGGTTTTGCTG TTGTCGCCGGT | AAAANAAAA GAAAAAAA GANGNNGNNGNNGNNNNN-GNA AAAGAGAA GGNGGNGGGNG | CTTCTTCT ATACATAT TCTCTCTC CATATATA | ACTAATTT | --- | |
24 | ACTAATTT | YTCTWCT ACTAATTT | AAAAAGAAGAAG TTTTTTTTTTTG AAGAAGGAACCA TGGGTTTTGCTG TTGTCGCCGGT | AAAANAAAA ANNAANNNANNAANNANA AAATTAGT | CTTCTTCT TCTCTCTC CATATATA | ACTAATTT | CTCAAATTAGT | |
25 | GGTTAA | GGCCMA GGCHCA | AACCGACCAACA ATGGGCC CCCCATCTCCCC GTGGCCCAACAT ACGACGCCGCC | TGGGCC | ATATAT TTTGTT ATAAAT AAAAAA | GGTCCA CATATGCG GGGCCC TACGCA GGGTCC | --- | |
26 | CCWWWWWWGG | TGGDYHDT WWWWWTTSG | GGCCCAACAA AACAAAAAAACAAAA GCTCTCACCGCCAC CGTGGGCCAT TTGGGGTTTATCAT | AAANNAAANAAAA TNTTGGGCCAT CNNNAAAGNCTNAANNNNNNA CGGCGANGNAGA AAANTTTNNAAAA | GACGGCGATG CATCGCCGCC GACGGCGATG CCGACGGTGG | GGGCCA TGGGCCAT TAGCCCGGAT TAGCCCGGATGA TGGGCC | WWGGGCC | |
27-29 | Only one sequence available ==> SiteSeeker could not be applied |
|||||||
30 | TGGCCGAC | TGGCCGAC | TGGCCGAC | ANAAAAAAA GTCGGCCA | TTTTTTTT | TGGCCGAC | TCGGCC | |
31 | CCACGTGG | CCACGTGG | GCCACGTGGGGC | ANNANNAAAANAA CCACGTGG | CCACGTGG | CCACGTGG | CCACGTGG | |
32 | AAMAATCT | CCAGTT RGWTTTT | CGAAGAAGAAG AGAAAGAAGAGG AAAAAAAAAAAT CTGCCACGTGA CCATCCTCCGGC | AAAAAAAA ANGAAGANGA AAAANNNAAAA AGGCCCAA GAGAGAGA | AAAAAAAA | GACGCTCC GCGGGG | GGGCC | |
33 | AAACAATCTA | TAGATTGTTT | AAACAATCTA | AAAANAAAANAA AAACAATCTA | TCTCTCTCTC CCCCCCCCCC | TAGATTGT | RAACAATY | |
34 | AAAAAAAATCTATGA | AAAAAAAATCTATGA | CCACAAGTCCGCTTCTCCCT CCCAAAAAAAATCTATGA | AANCTATGANNNNN- NNNNNGNNACNNNA- NANANNNANNNT | ACCAATTCCAACGGC CCAATTCCAACGGCA ACCAATATCAACGGC GTCATGGTCATAGTC GTTGTTATTGGCTAG | TGGCGAAACACC GGCGAAACACCC GCTATTGGCTAG GGTCAGGGTCAT AAAAATCTATGA | --- | |
35 | ACACNNG | TGKTGG TTTGGT TDTTGGT WGTTGGH GTTYKG*T | GTTGGTGGGGTT TCTTCTTCTTCT ATGGGCC GGTCCAAAA ACAAAAAAAAAA | AAANNAAAA AANNNNNNANANANANA ANAANAAGA AAGAANAA GAAAANANA | CTCTCTC ATGGGCC ACACATA ACAGAGA | GTCGTG ATGCCAAC TACGCA ATACGC GGTCGTCC | --- | |
36 | TACCGACAT | TACCGACAT | ATGTCGGTA | ANAAAANAAAA ATGTCGGTA | ATGTCGGTA | TGTCGGTA | TACCGACAT | |
37 | DRCCGACNW | GTCGG | GGCCCATT TTTTTTTTTTCT AACCGACCAACC | ANNAAAANAAAA AATGGGCCA CGNGNANNNCANGTG GNCNANNNNCGNNGNGGNNT GNCGTGGGNTC | ATTGA ATTTA TGAAT AATTA TTAAG | ACCGACCA | (AG)GCCCA | |
38 | TACCGACAT | ATGTCGGTA | ATGTCGGTA | AAAACAAAA ANNNANAAANANANANA GNNGNNGNNGNNGNNGNTG-NNG ATATATNTANA AAGTTTTGG | ATGTCGGTA | TGTCGGTA | ATGTCGGTA | |
39 | TTTCCCGC | TTTCCCGC | TTTCCCGC | GGNNNNGGNNNNGTNNNNGG ANAANAANANAA AAAANAAANA GCGGGAAA | GGTGGTGG GGTGGAGG GGAGGAGG GGCGGAGG ATTATTAT | TTTCCCGC | TCCCGC | |
40 | TCTCCCGCC | TCTCCCGCC | GGCGGGAGAC | ANNNAAANAAAANNNNA GGCGGGAGA | GGCGGGAGA TCTCCCGCC | CTCCCGCC | --- |
The boxed words are the patterns found by the particular tools. |
Click on the thumbnail to enlarge the picture |
Figure 1: Comparing six motif discovery tools on 36 At promoter sets. |
We also experimented with different Hamming penalties. The 0-1 Hamming scoring allows easily finding longest exactly matching words, particularly if the word length is >6. In all these cases SiteSeeker was successful whereas tools working with position weight matrices often had difficulties finding the right motif. An already mentioned fact is that the known At motifs have a high rate of A-T mismatches. This could be taken care by choosing a mismatch penalty smaller than 1. A mismatch penalty of 0.7 proved to work well and, for example, exactly identified the - generally difficult to identify - motif CCWWWWWWGG in set 26.
We presented two programs, CHECKPROMOTER for identifying putatively co-regulated genes, and CHECKMOTIF for discovering motifs in a set of promoter sequences. The programs are designed to work cooperatively in the way that after a subset of promoters has been identified by CHECKPROMOTER as putatively co-regulatory, CHECKMOTIF would elucidate detailed structures in this subset. In both programs the weighted Hamming distance is used for evaluating pairs of words, and the sum of Hamming distances of word pairs is applied for blocks of N ≥ 2 words. On sets of promoters from Arabidopsis thaliana we were able to verify that SiteSeeker performs significantly better than some best motif discovery tools.