In Silico Biology 9, 0002 (2008); ©2008, Bioinformation Systems e.V.  


The SiteSeeker motif discovery tool


Klaus Ecker*, Jens Lichtenberg and Lonnie Welch1




Russ College of Engineering and Technology, Ohio University, Athens, Ohio, USA
1 Bioinformatics Laboratory, School of Electrical Engineering and Computer Science, Ohio University,
   Biomolecular Engineering Program, Ohio University, and
   Molecular and Cellular Biology Program, Ohio University, Athens, Ohio



* Corresponding author

   Email: {ecker/lichtenj/welch}@ohio.edu





Edited by H. Michael; received April 21, 2008; revised November 03, 2008; accepted November 05, 2008; published December 30, 2008



Abstract

In this paper we describe some utilizing conditions of a recently published tool that offers two basic functions for the classical problem of discovering motifs in a set of promoter sequences. For the first it is assumed that not necessarily all of the sequences possess a common motif of given length l. In this case, CHECKPROMOTER allows an exact identification of maximal subsets of related promoters. The purpose of this program is to recognize putatively co-regulated genes. The second, CHECKMOTIF, solves the problem of checking if the given promoters have a common motif. It uses a fast approximation algorithm for which we were able to derive non-trivial low performance bounds (defined as the ratio of Hamming distance of the obtained solution to that of a theoretically best solution) for the computed outputs. Both programs use a novel weighted Hamming distance paradigm for evaluating the similarity of sets of l-mers, and we are able to compute performance bounds for the proposed motifs. A set of At promoters were used as a benchmark for a comparative test against five known tools. It could be verified that SiteSeeker significantly outperformed these tools.

Keywords: weighted Hamming distance, motif discovery, co-regulation, Arabidopsis thaliana



Introduction

Experiments for identifying transcription factor binding sites are very complex, expensive and time consuming. On the other hand it has been verified that computational motif discovery tools based on methods known from data mining can facilitate experimental search by providing small numbers of putative cis-regulatory structures in promoter sequences left for experimental verification. A number of such tools have been developed during the last few decades (see [1] for an overview).

A computationally identified motif can be expected to have biological meaning if it is statistically unlikely to appear in randomly chosen sequences. A number of common methods for quantifying the credibility of words as putative binding sites are based on entropy [2] or log likelihood [3]. Due to the nature of statistical methods, they can be expected to lead to useful results only for larger number of promoters. However, with having promoter sequences of up to approx. 3000 base pairs (bp) in mind, methods searching the solution space exhaustively for motifs of length already around 10 exclude themselves from consideration because of computational restrictions. Consequently available alignment approaches are heuristic in nature. Though usually very fast, the known heuristic algorithms, as for example Gibbs sampling [4, 5], do not give any clue of the optimality of the produced outputs, or have derived performance bounds.

Other approaches are based on the observation that genes being co-regulated by the same transcription factor have the same or similar binding sites in their promoter regions. Of some help is the fact that such binding sites are usually identical or differ by mismatches in only few positions. Several known tools such as Teiresias [6], cWINNOWER [7], Weeder [8], and LP/DEE [9] follow this idea and search promoter regions for multiple occurrences of motifs that differ at most by a small number of mismatches, without allowing deletions or inclusions. The creditability of words as putative binding sites is then measured by their degree of similarity. A simple method of measuring the distance of words is the Hamming distance which counts the number of mismatch positions in the words. Good motivation for using Hamming distance was, for example, given by Lanctot et al. [10].

Distance-based approaches allow avoiding the computational explosion because they may start a search on small numbers (e. g., two) of sequences and combine the intermediate results in subsequent steps.

The recently developed motif discovery tool SiteSeeker deals particularly with two basic biological questions. For the first problem it is assumed that not necessarily all of the sequences possess a common motif, in which case maximal subsets of related promoters together with the corresponding motifs are identified. The second tool solves the problem of checking if a given set of promoters have a common motif of a given length l. It uses an algorithm that is heuristic in nature but - in contrast to known heuristic motif discovery tools - the authors are able to derive low performance bounds (defined as the ratio of Hamming distance of the obtained solution to that of a theoretically best solution) for the computed output. The programs presented there differ from the aforementioned by the way the results are evaluated. We are using a Hamming distance approach that is refined in two directions: firstly we use the sum of Hamming distances of word pairs for evaluating blocks of two or more words. Secondly we introduce the notion of weighted Hamming distance where the distance of two nucleotides can be any value between 0 and 1. To the best of our knowledge, such variation of the Hamming distance model has never been discussed in literature. As an interesting fact, we were able to show that with this modified distance paradigm most of the known Arabidopsis thaliana motifs could be verified (see section "Experience" for details). This fact gave strong motivation to continue the development of the SiteSeeker tools.

The purpose of this paper is to illuminate some of the program features, give practical guidelines for efficient use of the programs, and prove through an assessment that the tool performs significantly better than MEME [11], AlignAce [12], MDscan [13], Weeder [8], and Gibbs Centroid Sampler [5].



The concept of SiteSeeker


The SiteSeeker tools

Depending on the kind of problem the user is confronted with, he may want to know which sequences of a particular set of promoters have some yet unknown word in common. The user would try out searches for words of different lengths, and, for given word length, look for all promoters that share a word. The motivation is that promoters sharing a highly scored word can be considered as putatively co-regulated. SiteSeeker meets this demand by offering the program CHECKPROMOTER which allows analyzing a set of promoters with respect to co-regulation. CHECKPROMOTER is based on the assumption that not necessarily all of the sequences possess a common motif, in which case maximal subsets of related promoters and the corresponding motifs are identified. The program accepts a lower bound for the number of promoter sequences that have either a common exactly matching word or have similar words whose Hamming distance is below a given threshold.

Another important question is identifying words that can be found in all promoter sequences. CHECKMOTIF helps the user to identify such words which may be considered as putative cis-regulatory binding sites. Further he needs a criterion that allows selecting words that offer a high confidence level for having biological significance. CHECKMOTIF solves the problem by scoring motif candidates with the sum of weighted Hamming distances of pairs. The algorithm is fast and heuristic in nature and - in contrast to known heuristic motif discover tools - we are able to derive low performance bounds (which is the ratio of Hamming distances of the obtained solution and a theoretically best solution) for the computed output.

CHECKMOTIF is designed to work cooperatively with CHECKPROMOTER. An interesting way to deal with the programs is to first apply CHECKPROMOTER on the input set of promoters. After a subset of promoters has been identified as putatively co-regulatory, applying CHECKMOTIF on this subset would then be a logical subsequent step in order to learn more about structures hidden in these promoters.


Hamming-based evaluation of blocks of words

Hamming distance as used in known tools penalizes character mismatches with 1, whereas matching positions add 0 to the total Hamming distance. This distance seems to be a too coarse measure as it is well known that the frequency of nucleotides in genomic sequences varies greatly. For example, we see that in some Arabidopis thaliana (At) promoters nucleotides A and G are each overrepresented with over 30 %. On the other hand, the At motifs known in the AGRIS database [14] show a considerably larger frequency of C and G. Following Zaslavsky and Singh [9], to capture such words in a computer-based tool it would make sense to penalize matching pairs of nucleotides with a small value greater than 0. We adopted this idea and chose pr(X) for the penalty of matching characters, where, pr(X) is the probability of finding nucleotide X in the sequences. Notice that this allows background distribution to be considered in the - originally background-independent - Hamming distance model.

There is another observation that can be made from known At motifs. Counting allowed alternatives in the known motifs showed a surprising imbalance, as most variations found are A/T (see Tab. 1).


Table 1: Number of variations in known At motifs.
A/T A/C C/T A/G C/G A/G/T G/T A/C/T C/G/T
35 8 5 5 3 3 2 2 1

An easy way to take care of these is to penalize A/T mismatches by a value smaller than 1; experiments showed that 0.7 is a reasonable choice for At.

Motivated by these facts, we introduce a weight function ω: {A, C, G, T}2 → [0,1] (the closed interval of length 1), and define the weighted Hamming distance of l-mers x = a1...al and y = b1...bl by d(x,y) = Σ ω (ai, bi). For a block (w1,...,wN) of N ≥ 2 l-mers, the sum of weighted Hamming distances of pairs is

We have several good reasons to be convinced that the sum of pairs distance is an appropriate and meaningful way to evaluate the similarity of a set of l-mers:

The tools we are presenting in this paper use the weighted Hamming distance paradigm for evaluating a set of equal length words. Details of the running conditions and about the structure of CHECKPROMOTER and CHECKMOTIF are discussed in the next two sections.



Input and output of CHECKPROMOTERS


Input parameters

CHECKPROMOTERS requires the specification of a number of parameters,

  (1) range of motif lengths; CHECKPROMOTER will produce its output for the word lengths in the range,
  (2) Hamming distances of nucleotide pairs can be set in the range from 0.0 to 1.0,
  (3) upper bound on Hamming distance (acceptance threshold) for considering a pair of words as similar,
  (4) the number of promoter sequences having a common word,
  (5) the reverse strands can be chosen to be included in the search.

The user can choose between a standard output that lists the found words, or an extended output that in addition shows the word positions in the sequences.

The acceptance threshold allows controlling the similarity of the block words: The smaller the threshold, the closer will be the words among themselves. Especially threshold 0 in case of standard (0/1) Hamming scores will lead to exact matches (if they exist).


An example output

The user can choose between two output formats. The standard output produces a list of words that are found in a minimum number of sequences. For each word, it tells the frequency in each sequence:

 Motif length = 8
 Number of solution words found 2
 Common words of length 8 in the sequences:
 Sequences       0     1     2     3     4
  AAAAATTG:      3     1     0     1     1   in 4 sequences;  total:  6
  CTGAGAAG:      0     1     1     2     0   in 3 sequences;  total:  4
  

The extended output shows in addition the positions of the words:

 Printing position details (negative: reverse strand): 
 AAAAATTG is found in the following sequences at the positions: 
   seq_0:    425; 477; -522;
   seq_1:    376;
   seq_3:   -782;
   seq_4:    667;
 CTGAGAAG is found in the following sequences at the positions: 
   seq_1:   1464;
   seq_2:  -1324;
   seq_3:   -599; -405;



Input and output of CHECKMOTIF


Input parameters of CHECKMOTIF

CHECKMOTIF needs the specification of the same parameters as CHECKMOTIF except (4), which is replaced by

  (4') number of solutions to be presented.

The user can again choose between a standard output that lists the found words, or an extended output that in addition shows the word positions in the sequences.


Output of CHECKMOTIF

The output is a list of motifs sorted increasingly of their Hamming distance. Each is a set of N words of length l (one word from each promoter). Denoting wi as the l-mer found by CHECKMOTIF in the i-th promoter, a result is described by a block w = (w1,...,wN) of l-mers. For scoring a computed result the sum of weighted Hamming distances of word pairs is used.

For each motif the output shows the consensus word, a sum of distances, a regular expression, mean Hamming distance, and a worst case performance estimate. In case of extended output, the block words and their position in the promoters are given.

The regular expression captures the variations in each block column by using the code recommended by NC-IUB (1984). Regarding the Hamming distance it should be noticed that the output may have results with different motif lengths (motif lengths range from LOWER_MOTLEN to UPPER_MOTLEN), while longer motif length will generally lead to higher Hamming distances. It can be expected that the average Hamming distance per column allows better comparison of motifs of different lengths. Consequently, the mean Hamming distance, defined as d(w1,...,wN)/l, is shown in the outputs.

A consensus word wc is determined from w = (w1,..., wN) by the selecting in each column the most frequent character. As for the regular expression, the sum of distances is also computed as an average per block column,

Finally, the worst performance estimate compares the Hamming distance of the computed block w of l-mers to that of a block u with theoretically minimum possible Hamming distance. Following [15], the upper bound for this ratio is given by

with

and where ωmin ≥ 0 is the smallest distance of a pair of nucleotides. A performance bound is useful as it allows estimating the quality of the computed results. Interestingly there seems to be no other motif discovery tool that gives the user an idea of how far the presented motifs are away from the optimum.


An example output

 Results for Motif length = 10
   1: Consensus word       CCTTTTTTGG   Sum of distances = 26.00
      Regul. Expression    CCWWWWWWGG   Mean Hamming Distance = 37.58
   2: Consensus word       TTATTGGGCC   Sum of distances = 21.00
      Regul. Expression    KWDDTGGKY*   Mean Hamming Distance = 40.03
   3: Consensus word       ATTATTGGGC   Sum of distances = 22.00
      Regul. Expression    HWWDWWKGKY   Mean Hamming Distance = 40.48
	

For each result the extended output shows detailed information about the block words and their positions in the promoters. For example the extended output for the first motif is:

 1: Consensus word       CCTTTTTTGG   Sum of distances = 26.00
    Regul. Expression    CCWWWWWWGG   Mean Hamming Distance = 37.58
    -------------------------------
    At1g01930.1          CCTTATATGG   found on + strand (position 74)
    At1g01940.1          CCTTATATGG   found on - strand (position 86)
    ...
    At1g07290.1          CCTTTTTAGG   found on - strand (position 1226)
	


Experience

In AGRIS (http://arabidopsis.med.ohio-state.edu/) a list of 95 promoter sets is available, which we used as a benchmark for performing a comparative test of CHECKMOTIF against the five well-known tools MEME [11], AlignAce [12], MDscan [13], Weeder [8], and Gibbs Centroid Sampler [5]. We chose the first 40 sets for an assessment. Sets 6 and 27-29 were discarded because they have only a single sequence, while SiteSeeker as well as some of the other tools require at least two sequences for input.

Search parameters: Throughout the test we used the same weighted Hamming distance model, which penalized matching base pairs (A-A, C-C, etc.) with their statistical frequency, as described in section "The SiteSeeker Tools". The motif length was chosen between 5 and 20, and the search was done on both promoter strands. For the web tools we have chosen the same parameters as far as possible.

Results: In each tool we asked for up to the 5 best scored words. A word with a 50% or higher overlap with the known motif was considered a successful hit. Tab. 2 lists the obtained results, either on the positive strand or the reverse complement. The number in the first column refers to the set number in AGRIS. The second column shows the known motif of the set. Columns 3-7 present the highest scored words (highest on top, in decreasing score order) found by SiteSeeker, MEME, AlignAce, MDscan, Weeder, and Gibbs Centroid Sampler. Matches with the known motifs are emphasized.

The weighted Hamming distance used in this assessment turned out to perform extremely well, and SiteSeeker outperformed the other tools on Arabidopsis thaliana promoters.


Table 2: Comparison of the outputs of six tools.
SetAgris: motifSiteSeekerMEMEAlignAceMDScanWeederGibbs Centr
1CACGTGGCCACGTGGCTGCCACGTGAAAANAAANNNNA
CACGTGGC
GACACGTG
CACGTGGC
CACGTG
CACGTGGC
CACGTGGC
2YACGTGGCACGTGGCAAAAAAAGAAGAAAA
CACGTGGC
ANAAAAANAAAAAAAAAA
CACGTGGC
YACGTGGCACGTGDH
3BACGTGKMACGTGTGCCACGTGAAANAAAAA
TNNNAAATNTTG
TANNGGGCC
TNCACGTGG
TTTTTTTT
TATATATA
TACGTG
ATCGTACC
ACGTGT
---
4GACACGTAGA GACACGTAGATCTACGTGTCANNNAAAAAAAAA
GACACGTAGA
GACACGTAGACACGTAGAGACACGTAGA
5CCATTTTTAGTCCATTTTTAGCCTAAAAATGGCTAAAAATGGGAAGAAGAAG
TTTCTCTCTTT
TCTCTCTCTTT
AGAGAGAGAGA
TCGATA
ATATTCGT
TATTCGTA
CGTCGATA
TATTCG
CATTTTTAG

6

Only one sequence available ==> SiteSeeker could not be applied

7CCATTTTTGGCCATTTTTGGCCATTTTTGGCAAAAAAAAA
CCAAAAATGG
AAANNNANAAANAAA
AAAAANGNNAAAA
AAAAANANANNNANNNN-NANA
CCATTTTTGGATTTTTGG
GGCCCT
CCATTTTT
CATTTTTGG
8NTTDCCWWWWNN-GGWAANTWDCCWWWWWWGRHWATTGCCAAAAAAGGAAAANNNNAANNNNAANAAANA-AANAANNATG
TGTCACGGGTGAGTTA
CAANAANANNANANAAAAA-AANNNA
GATCGGATCGGATCAG
GGATCGGATCCGATCA
CGGATCAGATCGGATC
CCGATCGGATCAGATC
TGATCGGATCGGATCA
TTACGAAA
TCGATA
TATTCGAT
AATCGA
TCGATTTA
WDCCWWWWWG
9CCAATTTAATGGCCAATTTAATGGCCAATTTAATGGGCACAAAAAAAAAC
GNNNNGAAAGNNANAN-ANANAA
ANANAGAGAGAGANNNA
ANGNCAAAAANNNNAN-NNAGA
ANACAAAANTNNATANNNNT
TCCTCCTCCTCC
CTCCTCCTCCTC
TCCTCCCCCCCC
CCAATTTAATGG---
10TTWCYAWWWWTR-GWAATTWCYAWWWWTRGWAATTACTAAAAATGGAAACAAAAAAATAANAANAANNA
ANANNAAAANNNNAGANNANANANNNAAANT
ATACCAAAAATGAAAA
CTTCTTCTTCTTCT-TCATGGAAAT---
11CCATTTTTAGCTAAAAATGGCTAAAAATGGAAAAAAANAAA
GNGAAGAAGNNNNAG
AANNATNNANNAAAANT
TNTTCAANAAAA
AAGAAANNNNAAAA
TTTTTTTTTTATTTTTGGCTAAAAATGG
12CCATTTTTGGCCATTTTTGGCCATTTTTGGCAAAAAAAAA
CCAAAAATGG
AAAAGAAAAAAAA
TCTCTCTCTC
GAGAGAGAGA
AGAGAGAGAG
ATTTTTGGATTTTTGG
13 TGTCTCTGTCTCAAAAGAAGAAAGAGAAAGAA
GGGGTAGGCGGGGGGGGCTG
CTCTCTCTCTCCCCC
AGAAGAGAGAG
GTGGTTGC
---CTCTCT
TGGGCC
TGGACC
TGGTCC
TGCATG
GGCGTG
TTAACG
GTCGTG
ATGTCG
GTAGAT
---
14TGTCTCTGTCTCTTTTTTCTCTCTCTTCTTTT
CCCCTCTGTCTCTCTCCCTC
TGGATGGCGGCGTCGTTGTT
TTATAGCTTTCTCTCTTTAA
GTGTCGGTGCTCCGTGAAGT
GAGNGAG
GNTGNNGNNGNNG
GGNGGNNGNNG
TGGGCC
TGGACC
TGGTCC
GGGCCA
TCTCTC
GTGTCT---
15ACTCAT ATGAGTCTCCCTCTCCCTCTT
ACGGCCCAGCAGCTT
AGACAAAAAAAAAAG
GTTTGGGTTTT
GGCCCA
ANNAAAANAAAA
AAGTTTTGT
AANNANANANCNANANA
ATATATAANA
ANANNTATATNTA
TATATA
ATATAT
TTTTTT
TAATTA
TTAATT
ATGAGT---
16CAATWATTGCAATTATTGCTTCTTTTTCTTTTTCTCT
CGGGACCGCCT
TCAATTATTGTTCCG
GATTTTCTTTTTTTT
CAAACCGAGCCAGACCGGCT
AAAATAAAATAA
CAATAATTG
CAATTATTGGGGCCC
CGTGGG
CGCGCC
GCCCGG
TGGGCC
CAATWATT
17CAATSATTGTTCTTTT
AATTAAA
AGGGAGAGAGAG
TTTTTTTTCTTT
CCCCAAGGACC
TGGGCCAC
GTTTCTTTTTT
ANAAAAAAAAAAANAAA
AGGNCCANGANCGTAANANG
AAAAANAAAGAGAGTA
AGNACCGAGAAGTCTT
AGGCCCANGANCGNAANANG
CTCTCTCTCTACGTGGG
GGTAATTA
CTCACGCG
CACACGTA
CGCGTGAT
---
18CAATNATTGCAATNATTGATTCAATTATTGAAATAAAATAAAA
GAGAGAGAGA
AATGGGCCTGA
AATAAAATAATTCATTCTCAT
AACACAATGATTGGCT
CTCTCTCTCAATCATTGCAATNATTR
19CAATTATTACAATTATTACTCCCTCTCTCTCCC
CAGAAAAAAAAAAAA
AAAAAAAAAAGA
GTCTCCCTCTACCCC
CAATTATTAAAG
AAAANAAAANNA
GAGAGGAAG
TNTTGGGCCNNA
ATNNNNANATATNTNNA
CAATCATTG
CTCTCTCTC
AGAGAGAGA
CAATTATTAATTAT
20CTAACCACTAACCATGGTTAGAACAAAA
AAAGAAA
GAGAGAG
AAAAAAAA
ATATATAT
CTCTCTCT
TATTCGAT
TCGATA
ATTCGATA
CGATAA
---
21CACATGCACATGTCTCTCTCTCTC
CTTTTTTCTTTT
AATGGGCC
CCCATGTG
TGGGCCAGAAAG
GGGCCA
GGACCA
GGCCCA
GACCCA
ATATCG
ATTAATCG
TCGATA
TATCGA
ATCGATAT
---
22AAATTAAAAAATTAAACTTCCTCTTCTT
TGGGCC
GATTTTTTTTTC
TTTTGGTGGGTT
GGGGGGGCGGGG
AAAANNAAANNNA
ANNNNNNTNNANATATA
GAGAGAGA
AATGGGCC
GNNGNAGNCGNNGNNG
TTTTTTTT
TATATATA
TATATTCG
CGTATA
CGTATATT
---
23AAATTAGT ACTAATTTAAAAAGAAGAAG
TTTTTTTTTTTG
AAGAAGGAACCA
TGGGTTTTGCTG
TTGTCGCCGGT
AAAANAAAA
GAAAAAAA
GANGNNGNNGNNGNNNNN-GNA
AAAGAGAA
GGNGGNGGGNG
CTTCTTCT
ATACATAT
TCTCTCTC
CATATATA
ACTAATTT---
24ACTAATTT YTCTWCT
ACTAATTT
AAAAAGAAGAAG
TTTTTTTTTTTG
AAGAAGGAACCA
TGGGTTTTGCTG
TTGTCGCCGGT
AAAANAAAA
ANNAANNNANNAANNANA
AAATTAGT
CTTCTTCT
TCTCTCTC
CATATATA
ACTAATTTCTCAAATTAGT
25GGTTAA GGCCMA
GGCHCA
AACCGACCAACA
ATGGGCC
CCCCATCTCCCC
GTGGCCCAACAT
ACGACGCCGCC
TGGGCCATATAT
TTTGTT
ATAAAT
AAAAAA
GGTCCA
CATATGCG
GGGCCC
TACGCA
GGGTCC
---
26CCWWWWWWGGTGGDYHDT
WWWWWTTSG
GGCCCAACAA
AACAAAAAAACAAAA
GCTCTCACCGCCAC
CGTGGGCCAT
TTGGGGTTTATCAT
AAANNAAANAAAA
TNTTGGGCCAT
CNNNAAAGNCTNAANNNNNNA
CGGCGANGNAGA
AAANTTTNNAAAA
GACGGCGATG
CATCGCCGCC
GACGGCGATG
CCGACGGTGG
GGGCCA
TGGGCCAT
TAGCCCGGAT
TAGCCCGGATGA
TGGGCC
WWGGGCC

27-29  

Only one sequence available ==> SiteSeeker could not be applied

30TGGCCGACTGGCCGACTGGCCGACANAAAAAAA
GTCGGCCA
TTTTTTTTTGGCCGACTCGGCC
31CCACGTGGCCACGTGGGCCACGTGGGGCANNANNAAAANAA
CCACGTGG
CCACGTGGCCACGTGGCCACGTGG
32AAMAATCT CCAGTT
RGWTTTT
CGAAGAAGAAG
AGAAAGAAGAGG
AAAAAAAAAAAT
CTGCCACGTGA
CCATCCTCCGGC
AAAAAAAA
ANGAAGANGA
AAAANNNAAAA
AGGCCCAA
GAGAGAGA
AAAAAAAAGACGCTCC
GCGGGG
GGGCC
33AAACAATCTA TAGATTGTTTAAACAATCTAAAAANAAAANAA
AAACAATCTA
TCTCTCTCTC
CCCCCCCCCC
TAGATTGTRAACAATY
34AAAAAAAATCTATGAAAAAAAAATCTATGACCACAAGTCCGCTTCTCCCT
CCCAAAAAAAATCTATGA
AANCTATGANNNNN-
NNNNNGNNACNNNA-
NANANNNANNNT
ACCAATTCCAACGGC
CCAATTCCAACGGCA
ACCAATATCAACGGC
GTCATGGTCATAGTC
GTTGTTATTGGCTAG
TGGCGAAACACC
GGCGAAACACCC
GCTATTGGCTAG
GGTCAGGGTCAT
AAAAATCTATGA
---
35ACACNNG TGKTGG
TTTGGT
TDTTGGT
WGTTGGH
GTTYKG*T
GTTGGTGGGGTT
TCTTCTTCTTCT
ATGGGCC
GGTCCAAAA
ACAAAAAAAAAA
AAANNAAAA
AANNNNNNANANANANA
ANAANAAGA
AAGAANAA
GAAAANANA
CTCTCTC
ATGGGCC
ACACATA
ACAGAGA
GTCGTG
ATGCCAAC
TACGCA
ATACGC
GGTCGTCC
---
36TACCGACATTACCGACATATGTCGGTAANAAAANAAAA
ATGTCGGTA
ATGTCGGTATGTCGGTATACCGACAT
37DRCCGACNW GTCGGGGCCCATT
TTTTTTTTTTCT
AACCGACCAACC
ANNAAAANAAAA
AATGGGCCA
CGNGNANNNCANGTG
GNCNANNNNCGNNGNGGNNT
GNCGTGGGNTC
ATTGA
ATTTA
TGAAT
AATTA
TTAAG
ACCGACCA(AG)GCCCA
38TACCGACATATGTCGGTAATGTCGGTAAAAACAAAA
ANNNANAAANANANANA
GNNGNNGNNGNNGNNGNTG-NNG
ATATATNTANA
AAGTTTTGG
ATGTCGGTATGTCGGTAATGTCGGTA
39TTTCCCGCTTTCCCGCTTTCCCGCGGNNNNGGNNNNGTNNNNGG
ANAANAANANAA
AAAANAAANA
GCGGGAAA
GGTGGTGG
GGTGGAGG
GGAGGAGG
GGCGGAGG
ATTATTAT
TTTCCCGCTCCCGC
40TCTCCCGCCTCTCCCGCCGGCGGGAGACANNNAAANAAAANNNNA
GGCGGGAGA
GGCGGGAGA
TCTCCCGCC
CTCCCGCC---
The boxed words are the patterns found by the particular tools.



Click on the thumbnail to enlarge the picture
Figure 1: Comparing six motif discovery tools on 36 At promoter sets.

We also experimented with different Hamming penalties. The 0-1 Hamming scoring allows easily finding longest exactly matching words, particularly if the word length is >6. In all these cases SiteSeeker was successful whereas tools working with position weight matrices often had difficulties finding the right motif. An already mentioned fact is that the known At motifs have a high rate of A-T mismatches. This could be taken care by choosing a mismatch penalty smaller than 1. A mismatch penalty of 0.7 proved to work well and, for example, exactly identified the - generally difficult to identify - motif CCWWWWWWGG in set 26.



Summary

We presented two programs, CHECKPROMOTER for identifying putatively co-regulated genes, and CHECKMOTIF for discovering motifs in a set of promoter sequences. The programs are designed to work cooperatively in the way that after a subset of promoters has been identified by CHECKPROMOTER as putatively co-regulatory, CHECKMOTIF would elucidate detailed structures in this subset. In both programs the weighted Hamming distance is used for evaluating pairs of words, and the sum of Hamming distances of word pairs is applied for blocks of N ≥ 2 words. On sets of promoters from Arabidopsis thaliana we were able to verify that SiteSeeker performs significantly better than some best motif discovery tools.



References


  1. GuhaThakurta, D. (2006). Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res. 34, 3585-3598.

  2. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist. 22, 79-86.

  3. Hertz, G. and Stormo, G. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 5, 563-577.

  4. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. and Wootton, J. C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208-214.

  5. Thompson, W. A., Newberg, L. A., Conlan, S., McCue, L. A. and Lawrence, C. E. (2007). The Gibbs Centroid Sampler. Nucleic Acids Res. 35, W232-W237. Accessible at http://bayesweb.wadsworth.org/gibbs/gibbs.html

  6. Rigoutsos, I. and Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14, 55-67.

  7. Liang, S., Samanta, M. and Biegel, B. (2004). cWINNOWER algorithm for finding fuzzy DNA motifs. J. Bioinform. Comput. Biol. 2, 47-60.

  8. Pavesi, G., Mereghetti, P., Mauri, G. and Pesole, G. (2004). Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes, Nucleic Acids Res. 32, W199-W203.

  9. Zaslavsky, E. and Singh, M. (2006). A combinatorial optimization approach for diverse motif finding applications. Algorithms Mol. Biol. 1, 13.

  10. Lanctot, J. K., Li, M., Ma, B., Wang, S. and Zhang, L. (2003). Distinguishing string selection problems. Information and Computation 185, 41-55.

  11. Bailey, T. L. and Gribskov, M. (1998). Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14, 48-54.

  12. Hughes, J. D., Estep, P. W., Pawazoie, S. and Church, G. M. (2000). Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205-1214.

  13. Liu, X. S., Brutlag, D. L. and Liu, J. S. (2002). An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat. Biotechnol. 20, 835-859. Accessible at http://ai.stanford.edu/~xsliu/ MDscan/

  14. Davuluri, R. V., Sun, H., Palaniswamy, S. K., Matthews, N., Molina, C., Kurtz, M. and Grotewold, E. (2003). AGRIS: Arabidopsis Gene Regulatory Information Server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics 4, 25.

  15. Ecker, K. H. and Welch, L. (2008). Motif finding with weighted Hamming distance minimization. Technical Report, Russ College of Engineering and Technology, Ohio University.