In Silico Biology 6, 0023 (2006); ©2006, Bioinformation Systems e.V.
Institut für Genetik, Technische Universität Braunschweig
Spielmannstr. 7, D-38106 Braunschweig, Germany
1 present address: Software Systems Engineering, Technische Universität Braunschweig
Mühlenpfordtstr. 23, D-38106 Braunschweig, Germany
* Corresponding author
Email: R.Hehl@tu-braunschweig.de
phone: +49-531-391 5772; fax: +49-531-391 5765
Edited by E. Wingender; received January 10, 2006; revised March 23 & April 07, 2006; accepted April 09, 2006; published April 21, 2006
AthaMap generates a map of cis-regulatory sequences for the whole Arabidopsis thaliana genome. AthaMap was initially developed by matrix-based detection of putative transcription factor binding sites (TFBS), mostly determined from random binding site selection experiments. Experimentally verified TFBS have now also been included for 48 different Arabidopsis thaliana transcription factors (TF). Based on these sequences, 89,416 very similar putative TFBS were determined within the genome of A. thaliana and annotated to AthaMap. Matrix- and single sequence-based binding sites can be included in colocalization analysis for the identification of combinatorial cis-regulatory elements. As an example, putative target genes of the WRKY18 transcription factor, which is involved in plant-pathogen interaction, were determined. New functions of AthaMap include descriptions for all annotated Arabidopsis thaliana genes and direct links to TAIR, TIGR and MIPS. Transcription factors used in the binding site determination are linked to the TAIR and TRANSFAC® databases. AthaMap is freely available at http://www.athamap.de.
Keywords: Arabidopsis thaliana, database, gene expression, plant, pathogen, transcription factor
Positional information on transcription factor binding sites in whole genomes is useful for identifying target genes of specific TFs. Furthermore, such information helps to generate models of the regulation of genes under investigation. AthaMap generates a positional map of TFBS in the Arabidopsis thaliana genome [1]. It was developed with publicly available binding sites that were mostly identified by random binding site selection experiments. The sites from these experiments were used to generate alignment matrices, which are employed by the program PATSER to identify genomic positions of TFBS within the genome of A. thaliana [1, 2]. The matrix-based searches were performed with transcription factors from many different plant species, based on the rationale that sequence recognition is not species-specific but similar for members of the same plant TF family. Positional information was imported into the AthaMap database and can be displayed online by entering either a specific chromosomal position or the commonly used gene model number (AGI) that can be found in the TAIR database [3]. The genomic sequence around the position entered and all putative TFBS identified in this region are displayed online. The previous version of AthaMap contained more than 7.4 × 10⁶ putative binding sites for 36 different transcription factors representing 16 different TF families [4]. Furthermore, more than 1.8 × 10⁵ combinatorial cis-regulatory elements were annotated to the database [4].
A significant improvement of AthaMap is a transcription factor binding site map for Arabidopsis thaliana that is also based on in vivo and experimentally verified binding sites in target genes. To this end, AthaMap has now been complemented with 89,416 TFBS based on publications describing experimentally determined sites for 48 Arabidopsis thaliana TFs that comprise 13 TF families. Furthermore, all annotated genes in AthaMap have been linked to TAIR, TIGR and MIPS [3, 5], since these databases constitute the most important information resources for A. thaliana genes. In addition, for the study of plant-pathogen interactions and to identify target genes regulated by plant pathogens, links have also been established from the PathoPlant® database to AthaMap [6]. All transcription factors used in the binding site determination are linked to the TAIR and TRANSFAC® databases [3, 7, 8].
For the annotation of TFBS, publications on transcription factor binding studies with A. thaliana factors were screened for at least one experimentally verified binding site, and the sequences were extracted. Where the binding site directly corresponds to an A. thaliana sequence, the published sequence was used to identify the site in the genome. The length of such a screening sequence permitted the detection of only the single binding site within the target gene. To identify additional putative sites, all binding sites were shortened around the core sequence of the TFBS to yield sequences for genomic screenings.
It is highly likely that shorter sequences identify additional binding sites because in many experimental setups short oligonucleotides will bind the respective TF at least in vitro. For example, DREB1A target sites have been identified by comparing regulatory regions in genes upregulated in A. thaliana overexpressing DREB1A [9]. A conserved cis-acting sequence was identified and experimentally verified in vitro as a binding site for DREB1A in the rd29A gene promoter. An 8 bp long double-stranded oligonucleotide (ACCGACAT) was used for competition experiments showing that this oligonucleotide can compete for binding in an electrophoretic mobility shift assay [9]. Therefore, this shorter sequence is a putative binding site at all genomic positions matching this sequence. To identify these positions, a screening sequence was employed that covers the region of the experimentally determined binding site together with two more nucleotides from the rd29A promoter at either side of the core sequence (CTACCGACAT, Tab. 1). With this screening sequence, 70 additional genomic positions were identified.
This low number of predicted binding sites in the above example demonstrates the high specificity when employing a screening sequence with a length of 10 bp. A 10 bp screening sequence with a 50% GC content theoretically detects only 151.2 sites in the A. thaliana genome. This binding sequence shortening was performed for all TFBS to identify additional putative binding sites. For those binding sequences that contain a 3 or 5 bp conserved core sequence, a 9 bp screening sequence was employed to maintain symmetry around the core sequence. A 9 bp screening sequence (55.6% GC-content) theoretically detects only 472.7 binding sites in the A. thaliana genome.
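The order of magnitude of these theoretical values can be checked with a simple independence model. The following is a minimal sketch and not necessarily the calculation used for the numbers above: it assumes a genome of roughly 119 Mb screened on both strands and a genomic GC content of about 36%. Both values and the function name `expected_hits` are assumptions that are not given in the text.

```python
def expected_hits(n_gc, n_at, genome_bp=119_000_000, gc_content=0.36, strands=2):
    """Expected number of chance matches of one fixed sequence.

    Assumes independent bases with P(G) = P(C) = gc_content / 2 and
    P(A) = P(T) = (1 - gc_content) / 2. The genome size and GC content
    defaults are rough assumed values, not figures taken from the text.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    # Probability that one position matches a sequence with the given composition.
    p_match = (p_gc ** n_gc) * (p_at ** n_at)
    return strands * genome_bp * p_match


ten_mer = expected_hits(n_gc=5, n_at=5)   # 10 bp sequence, 50% GC
nine_mer = expected_hits(n_gc=5, n_at=4)  # 9 bp sequence, 55.6% GC
print(f"10 bp: {ten_mer:.1f} expected sites, 9 bp: {nine_mer:.1f} expected sites")
```

Under these assumptions the model gives values on the order of 150 and 470 sites, close to the theoretical numbers quoted above. The result is sensitive to the assumed GC content, because a GC-rich screening sequence is rarer in an AT-rich genome than a uniform base model would suggest.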
Because of its high specificity, this screening method may not uncover all putative sites. However, with these parameters, sensitivity was still high enough to detect functional W-boxes of WRKY binding sites in many genes, as demonstrated in the example given below.
To detect binding sites, a Perl script was written to perform pattern-based screenings of the Arabidopsis thaliana genome (TIGR release 5.0, January 21, 2004). Both strands of the annotated genome were screened, resulting in records harboring absolute positional information and orientation. Tab. 1 shows a compilation of all A. thaliana TFs with experimentally verified binding sites that have been annotated to the AthaMap database. The sequences used in the pattern-based screening are indicated with the corresponding core sequences being underlined. The most current name and earlier synonyms for the factors are displayed. All factors were assigned to a specific TF family according to Riechmann et al. [10]. The number of sites detected in the Arabidopsis thaliana genome, the AGI number and the reference are listed. All positional information determined with the TFBS of these factors was imported into the AthaMap database. It is important to note that overlapping sites were not eliminated. All TFs that bind or putatively bind a site are shown on the AthaMap web site [4]. This is important because TFs themselves are regulated, and the expression of two factors that recognize the same sequence can differ spatially or temporally. For example, DREB1A and DREB2A bind to the same target site but are upregulated either by low temperature (DREB1A) or by NaCl (DREB2A) [11]. This illustrates the importance of identifying all TFs that can potentially bind to the same target site.
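The pattern-based screening of both strands can be sketched as follows. The original script was written in Perl and is not reproduced here. This Python version is an illustrative assumption of how such a screening might work, and the names `scan` and `reverse_complement` are hypothetical. It supports the IUPAC ambiguity codes that occur in some screening sequences (for example N and Y), reports overlapping matches, and returns 1-based start positions on the top strand together with the orientation.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of an upper-case sequence, IUPAC codes included."""
    return seq.translate(COMPLEMENT)[::-1]


def scan(genome, pattern):
    """Return sorted (start, strand) hits; start is 1-based on the top strand.

    Bottom-strand sites are found by searching the top strand for the
    reverse complement of the pattern. The lookahead makes the search
    report overlapping matches as well.
    """
    hits = set()
    for strand, pat in (("+", pattern), ("-", reverse_complement(pattern))):
        regex = re.compile("(?=(" + "".join(IUPAC[base] for base in pat) + "))")
        for match in regex.finditer(genome):
            hits.add((match.start() + 1, strand))
    return sorted(hits)


# Toy sequence (not genomic data): one site on each strand.
toy = "GG" + "CTACCGACAT" + "AA" + reverse_complement("CTACCGACAT") + "CC"
for start, strand in scan(toy, "CTACCGACAT"):
    print(start, strand)
```

Note that a palindromic screening sequence is reported once per strand at the same position in this sketch, which matches the idea of storing orientation with every record. Whether the original script treats palindromes this way is not stated.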
| Table 1: | Arabidopsis thaliana transcription factors and screening sequences, with the corresponding core sequences being underlined, used for binding site determination by pattern-based screenings and numbers of predicted sites annotated to the AthaMap database. |
| Family | Factor | Synonyms | AGI | Screening sequences | No. of sites | Reference |
| ABI3/VP1 | ABI3 | | At3g24650 | GCATGCATTA CCATGCAAAT GCATGCATGG | 912 | [18] |
| ABI3/VP1 | FUS3 | | At3g26790 | CCATGCATGC GCATGCATTA CCATGCAAAT GCATGCATGG | 1,163 | [18] |
| AP2/EREBP | AtERF-1 | | At4g17500 | GAGCCGCCA TAGCCGCCA | 649 | [19] |
| AP2/EREBP | AtERF-2 | | At5g47220 | GAGCCGCCA TAGCCGCCA | 649 | [19] |
| AP2/EREBP | AtERF-3 | | At1g50640 | GAGCCGCCA GTGCCGCCA GAGCTGCCA GAGCCGTCA TAGCCGCCA | 1,809 | [19] |
| AP2/EREBP | AtERF-4 | | At3g15210 | GAGCCGCCA GTGCCGCCA GAGCTGCCA GAGCCGTCA GAGCCGCTA TAGCCGCCA | 1,983 | [19] |
| AP2/EREBP | AtERF-5 | | At5g47230 | GAGCCGCCA TAGCCGCCA | 649 | [19] |
| AP2/EREBP | DREB1A | CBF3 | At4g25480 | CTACCGACAT AAGCCGACAC TGGCCGACCT | 213 | [9, 20, 21] |
| AP2/EREBP | DREB1B | CBF1 | At4g25490 | TGGCCGACCT CTACCGACAT | 150 | [21, 22] |
| AP2/EREBP | DREB1C | CBF2 | At4g25470 | TGGCCGACCT CTACCGACAT | 150 | [21] |
| AP2/EREBP | DREB2A | | At5g05410 | CTACCGACAT AAGCCGACAC | 134 | [20] |
| bZIP | ABI5 | GIA1, EEL, DPBF1 | At2g36270 | CAACGTGTCA CCACGTAGCA GACACGTGGC TATACGTCAG | 686 | [23, 24] |
| bZIP | AREB1 | ABF2 | At1g45249 | CATACGTGTC | 82 | [20] |
| bZIP | AREB2 | ABF4 | At3g19290 | CATACGTGTC | 82 | [20] |
| bZIP | bZIP12 | EEL, DPBF4 | At2g41070 | CAACGTGTCA CCACGTAGCA | 181 | [23] |
| bZIP | HY5 | TED5 | At5g11260 | TCCACGTGGC GACACGTGGC CCCACGTGTC | 820 | [25] |
| C2C2(Zn) GATA | GATA-1 | | At3g24050 | GTGGATTGA GTGGATTCA ATAGATAAA AGAGATCTA TATGATAAGG ATGGATCGCG CTCGATTTCA GTGGATTTCA TATTATCGTC GGGTATCGAA | 9,894 | [26] |
| C2C2(Zn) GATA | GATA-2 | | At2g45050 | GTGGATTGA GTGGATTCA AGAGATCTA TATGATAAGG | 4,290 | [26] |
| C2C2(Zn) GATA | GATA-3 | | At4g34680 | GTGGATTGA GTGGATTCA AGAGATCTA TATGATAAGG | 4,290 | [26] |
| C2C2(Zn) GATA | GATA-4 | | At3g60530 | GTGGATTGA GTGGATTCA AGAGATCTA TATGATAAGG | 4,290 | [26] |
| C2H2(Zn) | SUP | FLO10, FON1 | At3g23130 | GACAGTGTC | 501 | [27] |
| E2F/DP | E2Fa | E2F3 | At2g36010 | TTTTCCCGCG AGCGGGAAAA ATTCCCGCCAAT | 396 | [28, 29] |
| E2F/DP | E2Fb | E2F1 | At5g22220 | ATTTCCCGCT ATTTCCCGCC TTTTCCCGCG ATTCCCGCCAAT | 605 | [28-30] |
| E2F/DP | E2Fc | E2F2 | At1g47870 | CGCGCCAAA CCCGCCAAA TTTTCCCGCG AGCGGGAAAA ATTCCCGCCAAT | 2,752 | [28, 29, 31] |
| E2F/DP | E2Fd | E2L1, DEL2 | At5g14960 | CGCGCCAAA CCCGCCAAA TTTTCCCGCG AGCGGGAAAA ATTCCCGCCAAT | 2,754 | [28, 29, 31] |
| E2F/DP | E2Fe | E2L3, DEL1 | At3g48160 | TTTTCCCGCG AGCGGGAAAA ATTCCCGCCAAT | 396 | [28, 29] |
| E2F/DP | E2Ff | E2L2, DEL3 | At3g01330 | TTTTCCCGCG | 114 | [28] |
| GARP/ARR-B | ARR1 | | At3g16857 | TANGATTGT TAGGATYGT | 8,752 | [32] |
| GARP/ARR-B | ARR2 | | At4g16110 | TANGATTGT TAGGATYGT TTTGATTGT | 13,767 | [32, 33] |
| HD-Zip | ATML1 | | At4g21750 | GTAAATGCAC | 130 | [34] |
| HD-Zip | PDF2 | | At4g04890 | GTAAATGCAC | 130 | [35] |
| MYB | AtMYB44 | AtMYBR1 | At5g67300 | TCAGTTAGGG AGTTAGTTAC | 485 | [36] |
| MYB | MYB1 | | At3g09230 | CCTAACTGA TCTAACTGC | 962 | [37] |
| MYB | MYB2 | | At2g47190 | GAAAACCAA AGCAACGCC CCTAACTGA TCTAACTGC | 5,400 | [36, 37, 38] |
| NAC | ANAC019 | | At1g52890 | TAACACGCAT | 104 | [39] |
| NAC | ANAC055 | NAC3 | At3g15500 | TAACACGCAT | 104 | [39] |
| NAC | ANAC072 | RD26 | At4g27410 | TAACACGCAT | 104 | [39] |
| NAC | NAM | | At1g52880 | AAGGGATGA | 982 | [40] |
| SBP | SPL1 | | At2g47070 | CCGTACAAT | 382 | [41] |
| SBP | SPL3 | | At2g33810 | CCGTACAAT TCGTACAAC | 772 | [41, 42] |
| SBP | SPL4 | | At1g53160 | CCGTACAAC CCGTACAAT | 717 | [41, 43] |
| SBP | SPL5 | | At3g15270 | CCGTACAAT | 382 | [41] |
| SBP | SPL7 | | At5g18830 | CCGTACAAC | 335 | [43] |
| Trihelix | GT-1 | | At1g13450 | TGGTTAATA AGGTAAATC AATGATATAG | 3,702 | [44] |
| Trihelix | GT-2 | | At1g76890 | CGGTAATTA | 513 | [45] |
| Trihelix | GT-3b | | At2g38250 | AAGAAAAATA | 4,914 | [46] |
| WRKY(Zn) | WRKY18 | | At4g31800 | TTTTGACAG CATTGACGA CCTTGACTT TTGACTTGAC TTGACNNTTGAC | 5,063 | [12, 16, 47] |
| WRKY(Zn) | WRKY6 | | At1g62300 | GTTGACTAT | 1,122 | [48] |
| Total: | | | | | 89,416 | |
Fig. 1 displays a web interface screen shot showing binding sites of one of the new AthaMap database entries, WRKY18, within the sequence window in the region of the NPR1 gene. AthaMap identifies three WRKY18 binding sites that had previously been determined experimentally [12]. The transcribed region is underlined. The gene is encoded on the bottom strand. As a new feature, a short description of the gene shown in the sequence window together with links for additional information leading to the corresponding records in the external databases TIGR, TAIR, and MIPS are provided below the sequence window (Fig. 1).
Figure 1: Screen shot of the AthaMap web interface showing a specific search result together with selected links. The screen shot was generated by entering the AGI of the NPR1 gene (At1g64280) in the search field and by entering 50 to restrict the display to highly conserved binding sites [4]. The screen shot includes a tool tip box displaying the position of binding sites and a pop-up window for a transcription factor database entry (WRKY18).
Exact positional information of the individual binding site is shown in a tool tip box that opens by moving the mouse over the arrow heads which indicate the orientation of the sites (Fig. 1). General information on the transcription factor is provided in a separate pop-up window that opens by clicking on the factor's name. In this window (Fig. 1), the factor family, binding and screening sequences, and references are displayed. For further information, external links (AGI, TRANSFAC ID) to the corresponding records in the TAIR and TRANSFAC databases are provided [3, 8].
In addition to the newly annotated binding sites determined by pattern search, 638,144 predicted matrix-based transcription factor binding sites for four new transcription factors representing the NAC, MYB, GARP/ARR-B, and AP2/EREBP families were determined and imported into AthaMap. Matrix-based searches were performed as described earlier [1]. Tab. 2 lists the factors, the factor families and the references from which the sequences were extracted.
| Table 2: | New transcription factor binding sites predicted by matrix-based screenings annotated to the AthaMap database. |
| Factor | Family | Species | No. of sites | Reference for alignment matrix |
| TaNAC69 | NAC | Triticum aestivum | 114 | [49] |
| TaMYB80 | MYB | T. aestivum | 19,023 | [49] |
| ARR10 (a) | GARP/ARR-B | A. thaliana | 153,308 | [50] |
| NtERF2 | AP2/EREBP | Nicotiana tabacum | 465,699 | [51] |
| Total: | | | 638,144 | |
| (a) AGI: At4g31920 |
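The principle behind the matrix-based screenings can be illustrated with a small position weight matrix example. This is a simplified log-odds sketch and does not reproduce the scoring, background model or thresholds of PATSER. The aligned sites, the pseudocount, the uniform background and the function names `build_pwm` and `scan_pwm` are all hypothetical choices made for illustration.

```python
import math


def build_pwm(sites, background=None, pseudocount=1.0):
    """Log-odds weight matrix from equal-length, aligned binding sites."""
    background = background or {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        # log2 of (smoothed base frequency / background frequency)
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background[base])
            for base in "ACGT"
        })
    return pwm


def scan_pwm(seq, pwm, threshold):
    """Return (start, score) for top-strand windows scoring >= threshold.

    Start positions are 1-based. The sequence may only contain A, C, G, T.
    """
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((i + 1, round(score, 2)))
    return hits


# Hypothetical aligned sites, not taken from any of the cited publications.
aligned_sites = ["TTGACT", "TTGACC", "TTGACA", "TTGACT"]
pwm = build_pwm(aligned_sites)
print(scan_pwm("AAATTGACTAAA", pwm, threshold=5.0))
```

In contrast to the pattern-based screening, a matrix tolerates mismatches at weakly conserved positions, which is one reason why matrix-based searches yield far more predicted sites per factor than single screening sequences.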
The new data presented here increase the number of transcription factors in the database from 36 to 88. These belong to 21 different families and detect more than 8 × 10⁶ TFBS in the Arabidopsis thaliana genome.
The screen shot in Fig. 1 shows in vivo binding sites of WRKY18, a member of the WRKY transcription factor family. Several plant WRKY transcription factor genes are known to be induced by pathogen infection, elicitors, or treatment with salicylic acid (SA) [13-16]. WRKY18 from A. thaliana is involved in the induction of defense-related genes like NPR1 [12]. NPR1 is a key regulator of SA-dependent systemic PR-protein induction and is regulated by binding of WRKY18 to multiple W-boxes present in the NPR1 gene (At1g64280) (Fig. 1). Therefore, a colocalization analysis of WRKY18 binding sites harboring W-boxes was performed in AthaMap. This analysis resulted in 61 colocalizations of at least two WRKY18 binding sites with a maximum distance of 50 bp (data not shown). The colocalizations are in the vicinity of 51 individual genes. In 30 of these genes, colocalizations are present in the region upstream of the translation start. Many of these genes are directly involved in plant defense responses and/or signal transduction and gene regulation. Tab. 3 lists these genes with WRKY18 binding site colocalizations upstream of the translation start. Four genes contained more than two colocalizing WRKY18 binding sites in their upstream regions, i.e. NPR1 (At1g64280), RLK4 (At4g23180), an undefined expressed protein (At3g24065), and the WRKY18 gene itself (At4g31800). The RLK4 gene had previously been shown to be induced by bacterial pathogens and SA treatment and to be regulated by WRKY18 [17]. This example demonstrates the use of the AthaMap database resource as a tool to predict putative target genes of specific TFs in A. thaliana.
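The idea of a colocalization analysis with a maximum distance of 50 bp can be sketched as a simple clustering of site positions. This is an illustrative assumption and not the AthaMap implementation: here the distance is measured between the start positions of neighboring sites on one chromosome, whereas the web tool may define the spacer differently. The function name `colocalizations` and the example positions are hypothetical.

```python
def colocalizations(positions, max_distance=50, min_sites=2):
    """Group binding site start positions on one chromosome into clusters.

    Neighboring sites fall into the same cluster when their start
    positions differ by at most max_distance. Only clusters with at
    least min_sites members are returned.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_distance:
            # Gap too large: close the current cluster and start a new one.
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters


# Hypothetical start positions, not taken from AthaMap.
sites = [1000, 1030, 1075, 5000, 9000, 9050]
print(colocalizations(sites))
```

In this toy input the isolated site at position 5000 is dropped, while the remaining sites form one cluster of three and one cluster of two. Assigning such clusters to genes would additionally require the gene coordinates and strand, which are omitted from this sketch.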
| Table 3: | Putative target genes of WRKY18 determined by colocalization analysis of WRKY18 binding sites. |
| AGI | Function | No. of colocalizing WRKY18 binding sites |
| At1g07530 | scarecrow-like transcription factor 14 (SCL14) | 2 |
| At1g29720 | protein kinase family protein | 2 |
| At1g43150 | non-LTR retrotransposon family | 2 |
| At1g52680 | late embryogenesis abundant protein-related / LEA protein-related | 2 |
| At1g63740 | disease resistance protein (TIR-NBS-LRR class) | 2 |
| At1g63750 | disease resistance protein (TIR-NBS-LRR class) | 2 |
| At1g64280 | regulatory protein (NPR1), nonexpresser of PR genes 1 | 3 |
| At1g64440 | UDP-glucose 4-epimerase | 2 |
| At1g66910 | protein kinase, putative similar to receptor serine/threonine kinase PR5K | 2 |
| At1g68740 | EXS family protein / ERD1/XPR1/SYG1 family protein | 2 |
| At1g76260 | transducin family protein / WD-40 repeat family protein contains 6 WD-40 repeats | 2 |
| At2g22490 | cyclin delta-2 (CYCD2) | 2 |
| At2g29010 | pseudogene, receptor protein kinase | 2 |
| At3g24065 | expressed protein ; expression supported by MPSS | 3 |
| At3g46280 | protein kinase-related | 2 |
| At3g50150 | expressed protein, plant protein of unknown function; expression supported by MPSS | 2 |
| At3g60630 | scarecrow transcription factor family protein scarecrow-like 6 | 2 |
| At4g06631 | pseudogene, hypothetical protein | 2 |
| At4g15520 | tRNA/rRNA methyltransferase (SpoU) family protein | 2 |
| At4g23000 | calcineurin-like phosphoesterase family protein | 2 |
| At4g23180 | receptor-like protein kinase 4 (RLK4) | 4 |
| At4g31800 | WRKY 18, WRKY family transcription factor | 3 |
| At4g34180 | cyclase family protein | 2 |
| At4g35310 | calcium-dependent protein kinase, putative / CDPK | 2 |
| At5g39480 | F-box family protein | 2 |
| At5g41140 | expressed protein | 2 |
| At5g45730 | DC1 domain-containing protein | 2 |
| At5g54230 | myb family transcription factor (MYB49) | 2 |
| At5g55040 | DNA-binding bromodomain-containing protein | 2 |
| At5g64360 | DNAJ heat shock N-terminal domain-containing protein | 2 |
The AthaMap resources are freely available for non-commercial users at http://www.athamap.de.
We would like to thank Gülsen Okunakul for help with the literature screening and data extraction. This work was supported by the German Ministry of Education and Research (BMBF grant no. 031U110C/031U210C) and was carried out in the Intergenomics Center at Braunschweig.