In Silico Biology 4, 0039 (2004); ©2004, Bioinformation Systems e.V.  


A DNA motif lexicon: cataloguing and annotating sequences

Betsey D. Dyer1, Mark D. LeBlanc2*, Stephen Benz2, Peter Cahalan2, Brian Donorfio2, Patrick Sagui2, Adam Villa2 and Gregory Williams2




1 Department of Biology, Wheaton College, Norton, MA 02766 USA
2 Department of Mathematics & Computer Science, Wheaton College, Norton, MA 02766




*  Corresponding author
    Email: mleblanc@wheatoncollege.edu





Edited by E. Wingender; received March 30, 2004; accepted August 13, 2004; published August 20, 2004



Abstract

The rapid proliferation of genomic DNA sequences has created a significant need for software that can both focus on relatively small areas (such as within genes or promoters) and provide wide-zoom views of patterns across entire genomes. We present our DNA Motif Lexicon, which enables users to perform genome-wide searches for motifs of interest and create customizable results pages, where results differ in the degree and extent of annotation. Searching for a particular motif is akin to a word search in a natural language; our motif lexicon anticipates a time when we will increasingly rely upon DNA dictionaries that offer rich types of annotation. Indeed, the concept of "lexomics", introduced in this paper, may be appropriate to the types of meta-analyses relevant to the deciphering of regulatory information. Currently supporting five genomes, our web-based lexicon allows users to look up motifs of interest and build user-defined result pages that include the following: (1) all base pair locations where a motif is found, with links to further search the "neighborhoods" near each of these locations, and whether each location of the motif is genic (within a gene), intergenic, or a bridging sequence (overlapping a gene boundary); (2) NCBI hot-links to the nearest upstream and downstream genes for each location; (3) statistical information about the query; (4) whether the motif is a certain type of repeat; (5) links for the reverse, complement and reverse-complement of the motif of interest; and (6) hot-links to PubMed abstracts that mention the motif of interest. A software framework facilitates the continual development of new annotation modules. The tool is located at: http://genomics.wheatoncollege.edu/cgi-bin/lexicon.exe.

Key words: genomes, genomic DNA, lexicon, motifs, lexomics, regulatory sequences



Introduction

When Charles Darwin set out on the HMS Beagle as ship's naturalist, he had been charged with the collection and cataloguing of all of the organisms that he could find during the lengthy voyage along the coasts of Central and South America. He did not set forth to collect evidence for natural selection and he was relatively unencumbered by hypotheses. The theory of natural selection emerged only years later, after the diverse organisms and observations had been properly classified and annotated. It was only then that underlying themes and trends suggesting natural selection revealed themselves. Collecting, cataloguing, and annotation have somewhat fallen from favor as research goals in and of themselves. They seem to fall more into the purview of "naturalists" rather than "scientists". However, the emerging field of genomics seems to be characterized, in part, by exactly that sort of free exploration and collection of information. The billions of base pairs of DNA sequences conveniently stored at GenBank at the National Center for Biotechnology Information (NCBI) represent uncharted territories. Gene identification is still an uncertain process, especially in navigating through the possibilities of alternative splicings and pseudogenes. However, the intergenic regions (the sequences between the genes) are truly mysterious terrains, loaded with information for gene regulation and as yet mostly undeciphered and unexplained.

The need for databases, catalogues or dictionaries of DNA motifs has been repeatedly acknowledged and many versions have been implemented. Leading the way were [Trifonov and Brendel, 1986], who built a book-format DNA motif dictionary, Gnomic, by laboriously searching 400 publications for mentions of sequences. More recent works online include searchable databases such as RegScan [Ponomarenko et al., 1999], the Eukaryotic Promoter Database [Praz et al., 2002], and TRANSFAC [Wingender et al., 2000]. Two of many examples of targeted databases for certain categories of motifs are ones for zinc finger binding [Bulyk et al., 2001] and for inverted repeats [LeBlanc et al., 2000]. Motifs of particular organisms are the focus of others such as Wormbase [Markstein et al., 2002] and DBTBS (for B. subtilis) [Ishii et al., 2001]. There have been many algorithmic approaches, including linguistic approaches, to the building of motif databases [Bussemaker et al., 2000; Liu et al., 2001], some of which may result in lexicon-like databases.

The approach we have taken toward the unraveling of information in the intergenic regions is one of collection and annotation, and, because DNA sequences (or motifs) can be thought of metaphorically as "words", the format of the results is presented as a DNA Motif Lexicon. The power of computers allows this to be done on a grand scale, in that it is possible to locate (as well as to store and retrieve) all possible motifs of all possible sizes in a given genome. The annotation of those motifs can then be done in modular increments, also on a global scale. For example, one might, with an inverted repeat module, locate all motifs in a genome that fit the particular criteria. The field of linguistics has guided some of our initial choices and designs for annotations. Indeed, the Oxford English Dictionary (OED) has influenced the format of our presentation. Annotations include "etymologies", in which PubMed abstracts are searched for all mentions of a particular motif; its alternative "spellings" (similar, possibly related, sequences) are presented in an OED-style sidebar for quick searching. The fact that all possible motifs can be located by starting base pair is roughly analogous to the OED's presentation of several literary quotations. However, the location module is closer to being a "concordance" in that it essentially shows all possible "quotations" of a particular motif in a genome.

The deciphering of regulatory information will require extensive meta-analyses facilitated, in part, by tools that provide annotation. We are referring to this concept as "lexomics", defined by us as:

The study of the texts of genomes with the goal of integrating all of the information coded within (the genes, the regulatory sequences, the topology, the dynamic changes in relationships), analogous to a literate reading of genomes with comprehension and appreciation for the complex, combinatorial synergies required to build whole organisms.

Our motif lexicon enables users to perform genome-wide searches on five organisms: Caenorhabditis elegans, Saccharomyces cerevisiae, Escherichia coli, Pyrococcus furiosus, and Treponema pallidum. Motifs of interest are presented on customizable result pages, where results differ in the degree and extent of annotation desired. Given a motif of any length and a chromosome or base pair (bp) range of interest, result pages for the motif lexicon can include the following types of annotation:

  1. All base pair locations where the motif is found (via the web these are displayed ten at a time); the user has the option of downloading all the results in a comma-separated-value (.csv) file.
  2. For each location, whether this location of the motif is genic (within a gene), intergenic, or a bridging sequence (overlapping a gene boundary). Options exist to limit a search to one or all of the areas (e. g., a user can easily limit a search of the motif to genic-only regions).
  3. For each location, hot-links to NCBI's detailed pages for the nearest upstream and downstream genes.
  4. Statistical information about the query and results including the total number of occurrences of the motif that were found vs. an expected number of occurrences of the motif if one were to search in a region of random sequence with similar A-C-G-T proportions.
  5. If the query motif is a repeat, the type of repeating sequence pattern, including Direct Repeats (DR, second half of the sequence is an exact repeat of the first half), Mirror Repeats (MR, the second half of the sequence is a mirror image of the first half), Inverted Repeats (IR, "palindromic", the second half of the sequence is the reverse-complement of the first half), Virtual Repeat (defined by us to be [Direct and Mirror, DM] or [Direct and Inverted, DI]) or none of these, defined by us as an Un-Repeat (UR).
  6. The reverse, complement, and reverse-complement of the motif of interest. All are fashioned as hot-links for immediate access to like-queries, e. g., clicking on the reverse motif will simulate a search for the original motif on the complementary strand.
  7. Hot-links for immediate access to alternative "spellings" (similar, possibly related, sequences).
  8. An option to obtain hot-links to each of the National Library of Medicine's PubMed abstracts that mention the motif of interest.
  9. An opportunity to zoom-in and search each motif location "neighborhood" of interest.
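
The repeat types in item 5 and the related forms in item 6 can be illustrated with a minimal sketch. It is written in Python for brevity (the authors' tools are in C++ and Perl), all function names are hypothetical, and it assumes that only even-length motifs are split into halves; it is not the lexicon's actual implementation.

```python
# Illustrative sketch only; names are hypothetical, not taken from the lexicon's code.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse(motif):
    """Return the motif read backwards."""
    return motif[::-1]

def complement(motif):
    """Return the base-by-base complement of the motif."""
    return "".join(COMPLEMENT[base] for base in motif)

def reverse_complement(motif):
    """Return the complement of the motif, read backwards."""
    return complement(motif)[::-1]

def classify_repeat(motif):
    """Label a motif DR, MR, IR, DM, DI, or UR by comparing its two halves.

    Assumption: odd-length (or very short) motifs have no equal halves and are
    reported here as UR; the paper does not say how such motifs are handled.
    """
    motif = motif.upper()
    if len(motif) < 2 or len(motif) % 2 != 0:
        return "UR"
    half = len(motif) // 2
    first, second = motif[:half], motif[half:]
    direct = second == first
    mirror = second == reverse(first)
    inverted = second == reverse_complement(first)
    if direct and mirror:
        return "DM"  # virtual repeat: direct and mirror
    if direct and inverted:
        return "DI"  # virtual repeat: direct and inverted
    if direct:
        return "DR"
    if mirror:
        return "MR"
    if inverted:
        return "IR"
    return "UR"
```

For instance, `classify_repeat("GTGACGTCAC")` compares GTGAC with GTCAC, which is the reverse-complement of GTGAC, and so labels the motif an inverted repeat.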



Materials and methods


Providing fast responses to queries for all the locations of a motif across an entire chromosome requires preprocessed searches, storage of results in a server-side database, and a robust and efficient algorithm to return results quickly. In particular, to facilitate quick retrieval of all locations of any entered motif, we preprocess each chromosome by finding and saving the locations of all 4-, 5-, 6-, and 7-mers. Queries are then handled by partitioning the original query into concatenated segments of 7- through 4-mers. For example, a 9-mer query of ACGTACCGT is partitioned into a 5-mer (ACGTA) followed by a 4-mer (CCGT). A 17-mer is partitioned into three sections: a 7-mer, a 6-mer, and a 4-mer.
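
One partitioning rule that reproduces both examples above (9 = 5 + 4 and 17 = 7 + 6 + 4) is sketched below in Python. The rule itself is an assumption inferred from those two examples, not the documented eLmer algorithm, and the function names are hypothetical: take a 7-mer whenever at least a 4-mer would remain, otherwise shorten the current segment so that exactly a 4-mer is left.

```python
# Illustrative sketch; the greedy rule is an assumption consistent with the
# paper's two examples, not the documented eLmer algorithm.
def partition_lengths(n, min_k=4, max_k=7):
    """Split a motif length n into segment lengths between min_k and max_k."""
    if n < min_k:
        raise ValueError("motif is shorter than the smallest indexed k-mer")
    lengths = []
    remaining = n
    while remaining > 0:
        if remaining <= max_k:
            take = remaining            # the rest fits in one segment
        elif remaining - max_k >= min_k:
            take = max_k                # a full 7-mer leaves a usable remainder
        else:
            take = remaining - min_k    # shorten so that a 4-mer is left over
        lengths.append(take)
        remaining -= take
    return lengths

def partition_motif(motif):
    """Cut a motif into consecutive segments using partition_lengths."""
    segments, start = [], 0
    for length in partition_lengths(len(motif)):
        segments.append(motif[start:start + length])
        start += length
    return segments
```

Under this rule every segment has a length between 4 and 7, so each one corresponds to a preprocessed k-mer file.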


Preprocessing

The program "eLmer Jr." is a command-line application written in C++ that accepts FASTA format DNA (.fna) files as input and produces directories of files that contain the locations of all 4-, 5-, 6-, and 7-mers. DNA files for each of the five organisms were downloaded from NCBI (May 2002). For each organism, one chromosome at a time, eLmer Jr. opens the .fna file and records the locations of each 4- through 7-mer in a file named after that k-mer; for example, every location of the 4-mer AAAA in yeast Chromosome I is stored in the file Yeast_I_AAAA.dat. For each length L (4 ≤ L ≤ 7), this results in 4^L files, e.g., 4^5 or 1024 files to hold the 5-mers.
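
The preprocessing step can be sketched as follows. This is a simplified Python illustration, not eLmer Jr. itself: it keeps the index in memory rather than writing one .dat file per k-mer, it assumes 1-based base pair positions, and it assumes that windows containing characters other than A, C, G, and T (such as N) are skipped. The function name is hypothetical.

```python
# Illustrative sketch of k-mer preprocessing; not the eLmer Jr. source.
from collections import defaultdict

def index_kmers(sequence, k_values=(4, 5, 6, 7)):
    """Map every k-mer (k in k_values) to the sorted list of its start positions.

    Assumptions: positions are 1-based, and windows containing anything other
    than A, C, G, or T are ignored.
    """
    sequence = sequence.upper()
    valid = set("ACGT")
    index = defaultdict(list)
    for k in k_values:
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if set(kmer) <= valid:
                index[kmer].append(start + 1)
    return index
```

In a file-based layout like the one described above, each key of this index would correspond to one file, giving at most 4^4 + 4^5 + 4^6 + 4^7 = 21,760 files per chromosome.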


Database

At present, the output files from eLmer Jr. that contain the locations of every 4- through 7-mer for each chromosome of each organism are stored in a flat-file database on the server hosting the motif lexicon. This simple storage scheme provides fast access to the locations for the Common Gateway Interface (CGI) script that builds the customizable motif lexicon result pages.


Finding any motif

Our CGI program "eLmer" is a server-side script written in C++ that partitions a user query into segments of 7- through 4-mers, locates appropriate files in the database representing these segments, and concatenates ending and starting locations of all the segments to find all occurrences of the motif in question. For example, from a web browser, a user enters a motif of interest and selects an organism and chromosome, e. g., GTGACTCAC, C. elegans, and chromosome III.

eLmer partitions the 9-mer GTGACTCAC into two segments, a 5-mer and a 4-mer, and opens the files in the database containing all locations of GTGAC and TCAC in chromosome III of C. elegans (GTGAC.dat and TCAC.dat, respectively). For each starting location of the 5-mer GTGAC, the position immediately following it (starting location + 5) is checked against the starting locations of the 4-mer TCAC. As shown in Figure 1, eLmer would find an instance of the 9-mer GTGACTCAC starting at bp 2,476,629 since an instance of the 5-mer GTGAC is found at bp 2,476,629 and an instance of the following 4-mer TCAC begins at bp 2,476,634. For motifs of greater lengths, this end-bp to start-bp matching continues for as many partitions as needed. In addition to finding all locations, eLmer caches all the results of this query on the server, so future queries of this motif on this chromosome of this organism will be immediately available.
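
The end-to-start matching described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than eLmer's C++ code: the function name is hypothetical, location lists are passed in as a dictionary instead of being read from .dat files, and caching is omitted. In the usage example, only bp 2,476,629 and 2,476,634 come from the text; the other positions are made up to show non-matching entries.

```python
# Illustrative sketch of joining segment locations; not the eLmer source.
def join_segments(segments, locations):
    """Return the sorted start positions where the segments occur back to back.

    segments:  consecutive pieces of the query motif, e.g. ["GTGAC", "TCAC"]
    locations: dict mapping each segment to a list of its start positions
    """
    candidates = set(locations[segments[0]])
    offset = len(segments[0])
    for segment in segments[1:]:
        starts = set(locations[segment])
        # Keep a candidate only if the next segment starts right after the
        # part of the motif matched so far.
        candidates = {pos for pos in candidates if pos + offset in starts}
        offset += len(segment)
    return sorted(candidates)

# Usage: positions other than 2476629 and 2476634 are invented for illustration.
example_locations = {
    "GTGAC": [1200, 2476629, 3000000],
    "TCAC": [500, 2476634],
}
hits = join_segments(["GTGAC", "TCAC"], example_locations)
```

Here `hits` contains only 2476629, because 2,476,629 + 5 = 2,476,634 is the one position where a TCAC starts immediately after a GTGAC.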



Figure 1: Two files holding the results of preprocessing from eLmer Jr., showing every known location (bp) of the 5-mer GTGAC and of the 4-mer TCAC in Chromosome III of C. elegans. The lexicon handles any user query by concatenating the locations of 4- through 7-mers. A 9-mer GTGACTCAC is known to start at bp 2,476,629 since the 4-mer TCAC is located immediately following the 5-mer GTGAC (2,476,629 + 5 = 2,476,634).



Software framework

The DNA Motif Lexicon is a software framework written in C++ and Perl for the creation of customizable lexicon pages. The framework is a collection of extendable C++ classes that facilitates the modular addition of unique sources of annotation in a lexicon result page. Situated within HTML wrapper classes that hide the details of a lexicon result page, the framework allows developers to insert new sources of annotation in a module-by-module fashion (e.g., links to upstream and downstream genes, labels indicating if motifs are found in genic vs. intergenic regions of the chromosome, statistical information). New modules inherit common functionality from base classes in the framework and when requested by the user, modules participate in the creation of a lexicon result page.
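
The module-by-module design can be illustrated with a small sketch. The actual framework consists of C++ classes whose names and interfaces are not given here, so everything below (class names, method names, and the HTML produced) is hypothetical and written in Python only to show the general idea: a base class supplies the shared page-building behavior, and each annotation module overrides a single method.

```python
# Illustrative sketch of a modular annotation framework; all names are hypothetical.
class AnnotationModule:
    """Base class: shared rendering logic inherited by every annotation module."""
    title = "Untitled"

    def annotate(self, motif, locations):
        raise NotImplementedError("each module supplies its own annotation")

    def render(self, motif, locations):
        # Common wrapper that hides the page markup from individual modules.
        return f"<h2>{self.title}</h2>\n<p>{self.annotate(motif, locations)}</p>"

class LocationModule(AnnotationModule):
    title = "Locations"

    def annotate(self, motif, locations):
        return ", ".join(str(bp) for bp in locations)

class CountModule(AnnotationModule):
    title = "Statistics"

    def annotate(self, motif, locations):
        return f"{motif} found {len(locations)} time(s)"

def build_result_page(motif, locations, modules):
    """Assemble a result page from only the modules the user requested."""
    body = "\n".join(module.render(motif, locations) for module in modules)
    return f"<html><body>\n<h1>{motif}</h1>\n{body}\n</body></html>"
```

Adding a new source of annotation in this sketch means writing one more subclass and including it in the list passed to `build_result_page`; the existing modules and the page wrapper stay unchanged.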



Results and discussion

Assuming a server-side database of locations for all 4- through 7-mers of every chromosome in an organism, a web-based motif lexicon allows users to build customizable pages of results for any motif of interest. Our desire for personal choices in the types and extent of annotations was influenced by the online version of the Oxford English Dictionary. Like the OED Online, our motif lexicon does not presume a fixed set of annotations to display for a particular motif. Rather, our "DNA Dictionary" allows users to add or subtract types of annotation, where each type of annotation is implemented as a unique module.


Application

Figure 2 shows a result page for a query for the 10-mer motif (1) GTGACGTCAC on Chromosome I of C. elegans where the user requested (2) the bp locations of each (3) intergenic location, and NCBI links to the nearest (4) upstream and (5) downstream gene information. In addition, this customizable result page indicates (6) that this motif is an inverted repeat (IR) with (7) corresponding statistical information. Like-misspellings (8) of one substitution and alternate forms with one deletion or one insertion are provided, in a fashion resembling an online dictionary's listing of words that are alphabetically similar to a given word. The Etymology Module (9) produces a list of hot-links to all PubMed abstracts that mention this particular motif. A Neighborhood Module (10) allows a user to zoom into a particular region for further searches on a given location.
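
The like-misspellings (8) can be enumerated mechanically. The Python sketch below is one straightforward way to do so, assuming a "like-spelling" means exactly one substitution, one deletion, or one insertion; the function name is hypothetical, and the lexicon's own scheme may filter or rank the variants differently.

```python
# Illustrative sketch of generating single-edit "like-spellings"; names are hypothetical.
def like_spellings(motif, alphabet="ACGT"):
    """Return all variants of the motif that differ by exactly one edit."""
    motif = motif.upper()
    substitutions = {
        motif[:i] + base + motif[i + 1:]
        for i in range(len(motif))
        for base in alphabet
        if base != motif[i]
    }
    deletions = {motif[:i] + motif[i + 1:] for i in range(len(motif))}
    insertions = {
        motif[:i] + base + motif[i:]
        for i in range(len(motif) + 1)
        for base in alphabet
    }
    return {
        "substitutions": sorted(substitutions),
        "deletions": sorted(deletions),
        "insertions": sorted(insertions),
    }
```

For GTGACGTCAC, the substitution list includes GTGACGTGAC, the single change that turns this inverted repeat into a direct repeat.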



Figure 2: A sample result page when searching C. elegans Chromosome I for the motif GTGACGTCAC. (1) The requested motif GTGACGTCAC on Chromosome I of C. elegans where the user requested (2) the bp locations of each (3) intergenic (only) location, and NCBI links to the nearest (4) upstream and (5) downstream gene information. This motif is (6) an inverted repeat (IR) and (7) corresponding statistical information is an option. (8) Hot-links for like-misspellings of one substitution (e. g., the substitution that would make this motif a direct repeat, GTGACGTGAC) and alternate forms with one deletion or one insertion are shown, as are (9) the option to request a list of hot-links to all PubMed abstracts that mention this particular motif and (10) opportunities to perform a more focused search of a particular neighborhood near a specific motif location.


Table 1 shows the Etymology Module's results for the query GTGACGTCAC: hot-links to all six PubMed abstracts that mention the motif.


Table 1: A sample Etymology result page showing hot-links to each PubMed abstract that mentions the motif GTGACGTCAC.

Abstracts for GTGACGTCAC
There are 6 PubMed abstracts with this motif.

| Author | Title | Citation |
| --- | --- | --- |
| Wang Z., Deak M., Free S. | A cis-acting region required for the regulated expression of grg-1, a Neurospora glucose-repressible gene. Two regulatory sites (CRE and NRS) are required to repress grg-1 expression. | J Mol Biol. 1994 Mar 18; 237(1):65-74 |
| Kitagawa Y., Shima H., Sasaki K., Nagao M. | Identification of the promoter region of the rat protein phosphatase 2A alpha gene. | Biochim Biophys Acta. 1991 Jul 23; 1089(3):339-44 |
| Widen S., Wilson S. | Mammalian beta-polymerase promoter: large-scale purification and properties of ATF/CREB palindrome binding protein from bovine testes. | Biochemistry. 1991 Jun 25; 30(25):6296-305 |
| Kedar P., Widen S., Englander E., Fornace A., Wilson S. | The ATF/CREB transcription factor-binding site in the polymerase beta promoter mediates the positive effect of N-methyl-N'-nitro-N-nitrosoguanidine on transcription. | Proc Natl Acad Sci U S A. 1991 May 1; 88(9):3729-33 |
| Kedar P., Lowy D., Widen S., Wilson S. | Transfected human beta-polymerase promoter contains a ras-responsive element. | Mol Cell Biol. 1990 Jul; 10(7):3852-6 |
| Widen S., Kedar P., Wilson S. | Human beta-polymerase gene. Structure of the 5'-flanking region and active promoter. | J Biol Chem. 1988 Nov 15; 263(32):16992-8 |


More specifically, information revealed by queries for GTGACGTCAC in intergenic regions of Chromosome I of C. elegans includes:

  1. that the motif, an inverted repeat, is expected to be found 1 or fewer times in a random dataset with the same base pair composition as Chromosome I. However, this search reveals 100 intergenic locations for GTGACGTCAC.
  2. that the nearest downstream genes to the motif are mostly of unknown identity (as per links to NCBI); however, in some cases, genes for C2H2-type zinc finger proteins are downstream. In one case, the promoter region of the gene for NCBI protein ID#2088759 has four copies of the motif about 500-1500 bp upstream, three copies of which are spaced about 100 bp apart.
  3. that GTGACGTCAC is mentioned verbatim in six PubMed abstracts. GTGACGTCAC has been identified as being a "CRE" (cyclic AMP-responsive element), found upstream of certain housekeeping genes such as the gene for beta-polymerase.
  4. that searches of the complement of the motif reveal only 5 occurrences in intergenic regions of Chromosome I of C. elegans, all upstream of genes of unknown function.
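
The expected count in item 1 can be approximated with a simple model, sketched below in Python. The sketch assumes that bases are independent and drawn with the observed A-C-G-T proportions, so that the expected number of occurrences on one strand is the number of possible start positions, (N - L + 1), multiplied by the product of the frequencies of the motif's bases. This is a common first approximation and an assumption on our part; it is not necessarily the exact statistic the lexicon reports, and the function names are hypothetical.

```python
# Illustrative sketch of an expected-occurrence estimate under an independent-base model.
from collections import Counter

def base_frequencies(sequence):
    """Return the proportion of A, C, G, and T in a sequence (other symbols ignored)."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def expected_occurrences(motif, frequencies, region_length):
    """Expected count of the motif on one strand of a random region.

    Assumes independent bases with the given frequencies.
    """
    probability = 1.0
    for base in motif.upper():
        probability *= frequencies[base]
    windows = max(region_length - len(motif) + 1, 0)
    return windows * probability
```

With uniform base frequencies, for example, a 4-mer is expected about 997 / 256 ≈ 3.9 times in a 1,000 bp region, and the expectation falls by roughly a factor of four for each additional base.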

The Neighborhood tool allows a user to browse specific areas, or "neighborhoods", of DNA sequence. It is linked from the Motif Lexicon and can also be used on its own (de novo) to browse areas of interest.

Short (10 bp or fewer), highly conserved motifs are the focus of many computational analyses of promoter regions. Such motifs are known to bind transcription factors, but it is often acknowledged that there are many more to be found and annotated and that computational methods will facilitate the search. For example, Mariño-Ramírez et al., 2004, used computational methods to find overrepresented 8-mer motifs in promoter regions of the human genome. They then used the TRANSFAC database of known transcription factors and binding sites [Wingender et al., 2000] to begin to annotate and decipher the meanings or putative meanings of their 8-mers. In a similar study, Cliften et al., 2003, searched the promoters of six species of Saccharomyces and found over 75 motifs in the upstream regions of sets of genes with similar functions. These motifs were of length eight or less and were considered to be likely candidates for further study to determine whether they function as transcription factor binding sites. Quick, flexible annotation of short motifs, in a searchable format like the DNA Motif Lexicon, is an important extension or supplement to studies such as these. To demonstrate, we used the advanced search mode of the Motif Lexicon to pursue one of the 8-mer motifs considered by Cliften et al., 2003, to be of interest. The 8-mer, CTAAACGA, was found by Cliften et al. upstream of a significant number of genes involved with lipid, fatty acid or isoprenoid synthesis. According to the DNA Motif Lexicon, this 8-mer appears 45 times in the intergenic regions of Saccharomyces. There are, therefore, up to 90 up- or downstream genes flanking these intergenic locations (one upstream and one downstream gene per location). According to the links from the DNA Motif Lexicon to NCBI, ten out of a possible 90 up- or downstream genes have descriptions suggesting a role in lipid, fatty acid, or isoprenoid synthesis. These include a lipase, a squalene synthetase, and a gene product involved with isoprenoid and sterol biosynthesis.
Thus the function suggested by Cliften et al. is supported. However, the Motif Lexicon further revealed that three genes up- or downstream of the intergenic motif CTAAACGA code for ribosomal proteins (two for 60S and one for 40S) and three genes code for ABC (ATP-binding cassette) transporters. Note that there are sixteen genes for ABC transporters described for Saccharomyces cerevisiae [Rogers et al., 2001]. Are these appearances of CTAAACGA in association with ABC transporters and ribosome genes also significant? Might some ABC transporters be involved in lipid synthesis? The extent of the importance of context, either with other motifs or with other genes, is still relatively unknown for any small promoter motif. It is also unknown whether over-representation of a motif in a promoter or in a selected set of promoters is the key. Subtle regulation may require relatively rare motifs used in special combinations. Therefore, under-representation may be intriguing. For example, the DNA Motif Lexicon found no occurrences of CTAAACGA in the intergenic regions of S. cerevisiae chromosomes 11 and 13. At this point, more annotation is better, even if it seems to complicate results. The study of putative transcription factor binding sites still benefits greatly from collecting all possible descriptions rather than searching for simple conclusions.

Annotated dictionaries and concordances are everyday tools for the study of natural languages. Similarly, genomic and proteomic dictionaries are needed to facilitate exploratory work in intergenic regions as well as coding regions, and to include investigations of small motifs, e. g., those starting with 4-mers. Our DNA Motif Lexicon annotates chromosome-wide queries of particular motifs of interest with information such as base pair locations (context), uniqueness (statistics), and links to the published literature that mentions the motif (etymology). A software framework suggests a mechanism for developers of DNA dictionaries and concordances whereby the user controls the types of annotation for a particular result. New modules "snap into place" once implemented. Our most recent annotations include access to a motif's "neighborhood" or "context" (analogous to knowing a street address (bp) and asking who lives on this street, or knowing a name and asking who lives near this person). Planned future annotations include (i) more advanced statistical models to compute the expected number of occurrences of a motif [e. g. Sinha and Tompa, 2002], and (ii) a module with a more sophisticated scheme for finding like-spellings of the motif of interest.



Availability

The Motif Lexicon is fully available at http://genomics.wheatoncollege.edu/cgi-bin/lexicon.exe.



Acknowledgements

We thank our colleague Robert Obar of Critical Therapeutics, Inc. for helpful insight during design and development, and we thank past and present members of the Wheaton Genomics Group. Partial support for this work was provided by the National Science Foundation’s Course, Curriculum, and Laboratory Improvement program (CCLI-EMD) under grant 0340761, Wheaton alumna Anne Neilson, and the Wheaton College Mars Fund.




References