Suzuki et al., In Silico Biol. 4, 0036 (2004), Supplement



Supplement: Database description of DBTSS for comparative studies

Search database by simple queries

As shown in Supplementary Information Figure S1A, the query fields appear on the top page of DBTSS. To retrieve the TSSs/promoters of a gene of interest, users can enter simple queries such as RefSeq IDs (NMs), Ensembl IDs (ENSTs) or gene definitions for either human or mouse genes. Alternatively, users can submit a sequence for a BLAST search. For human genes, users can also search for genes within a particular distance of SNPs of interest, which should be useful for identifying functional SNP candidates located in promoter regions (regulatory SNPs; rSNPs; for further information on this issue, see Brookes (1999) Gene 234, 177-186; Ponomarenko et al. (2002) Hum. Mutat. 20, 239-248).
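The SNP-based query can be pictured as a distance filter around each TSS. The following is a minimal sketch, not DBTSS code: the function name `genes_near_snps`, the data layout, and all gene names and coordinates are hypothetical, and it assumes that TSSs and SNPs are given as positions in the same chromosome coordinate system.

```python
from typing import Dict, List, Tuple

Locus = Tuple[str, int]  # (chromosome, position)


def genes_near_snps(
    tss_by_gene: Dict[str, Locus],
    snps: List[Locus],
    max_distance: int,
) -> Dict[str, List[Locus]]:
    """Map each gene to the SNPs lying within max_distance bp of its TSS."""
    hits: Dict[str, List[Locus]] = {}
    for gene, (chrom, tss) in tss_by_gene.items():
        nearby = [
            (snp_chrom, snp_pos)
            for snp_chrom, snp_pos in snps
            if snp_chrom == chrom and abs(snp_pos - tss) <= max_distance
        ]
        if nearby:
            hits[gene] = nearby
    return hits


# Made-up genes and coordinates, for illustration only.
tss_by_gene = {"geneA": ("chr1", 10_000), "geneB": ("chr2", 50_000)}
snps = [("chr1", 9_800), ("chr1", 12_500), ("chr2", 50_900)]
near = genes_near_snps(tss_by_gene, snps, max_distance=500)
print(near)
```

Only geneA is reported here, because the other two SNPs lie more than 500 bp from the nearest TSS on their chromosome.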



Figure S1: Screen shots from the search pages (A-C) of DBTSS. Forms that are used for search by queries (A), by putative TF binding sites and their combinations (B), and from the gene list (C) are illustrated. As exemplified in (B), the fields to specify the TF binding sites are represented by red, yellow and blue boxes (Factor 1-3). Using these boxes, users can retrieve target promoters that include all of the Factors 1, 2 and 3. For each of the boxes, users can choose the search method between exact sequence match and matrix search using PWMs. As exemplified in the case of "Factor 3", users can also specify alternative TF binding sites, any one of which must be contained in the targets, by creating additional boxes. This search can be done for either human or mouse promoters individually or for the promoter elements conserved between human and mouse.


Clicking on "search" brings up two graphical views of the result, as illustrated in Supplementary Information Figure S2A and B. The result is shown as a genomic view of the gene, separated into two panels. In the first panel, an overview of the genomic organization of the gene hit is displayed in terms of the exon-intron structure of the corresponding RefSeqs, Ensembl transcripts and mapped full-length cDNAs. The annotated positions of the protein coding regions are also illustrated. To simplify the view, the items to be displayed can be selected in the "viewer controller". In the second panel, the exact sequence around the TSSs is displayed. Experimentally characterized TF binding sites, if any, and SNPs registered in public databases are displayed according to the information recorded in TRANSFAC Public (ver. 6.0) and dbSNP [Sherry et al. (2001) Nucleic Acids Res. 29, 308-311], respectively. In addition, a promoter sequence of arbitrary length, measured from an arbitrarily designated standard point, can be retrieved as text in this viewer.
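Retrieving a promoter sequence of arbitrary length around a designated standard point amounts to slicing the genomic sequence in the orientation of the transcript. The sketch below is illustrative only: `extract_region`, its parameters and the toy sequence are hypothetical, and it assumes a 0-based position on the forward strand of an in-memory chromosome string.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def extract_region(
    chrom_seq: str, ref_pos: int, upstream: int, downstream: int, strand: str = "+"
) -> str:
    """Return `upstream` bases before and `downstream` bases after ref_pos.

    ref_pos is a 0-based forward-strand index; the reference base itself is
    included. The result is written 5'->3' in transcript orientation and is
    truncated at the ends of chrom_seq.
    """
    if strand == "+":
        start = max(0, ref_pos - upstream)
        end = min(len(chrom_seq), ref_pos + downstream + 1)
        return chrom_seq[start:end].upper()
    # On the minus strand, upstream lies at higher forward-strand coordinates.
    start = max(0, ref_pos - downstream)
    end = min(len(chrom_seq), ref_pos + upstream + 1)
    return reverse_complement(chrom_seq[start:end])


chrom_seq = "AAACCCGGGTTT"  # toy sequence, not real genomic data
plus = extract_region(chrom_seq, ref_pos=6, upstream=3, downstream=2, strand="+")
minus = extract_region(chrom_seq, ref_pos=6, upstream=3, downstream=2, strand="-")
print(plus, minus)  # CCCGGG ACCCGG
```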

For the comparative analysis of the promoters, users can enter the "comparative view of the promoters" page from either the human or the mouse promoter viewer described above. Whenever information on a mouse/human counterpart is available, the "Go mouse/human counterpart" button appears in the upper left corner of the first panel. Alternatively, users can directly enter the comparative promoter viewer for the genes of interest by specifying the human and mouse gene pairs from the correlation tables (Supplementary Information Figure S1C).



Comparative viewer of the promoters

The results of the sequence comparison of the promoters between human and mouse counterparts can be browsed as shown in Supplementary Information Figure S2D. On this page, the sequence alignment calculated using LALIGN is displayed. The positions of the aligned sequences are represented by boxes and each pair of corresponding nucleotides is connected by a line. The TSSs identified by the full-length cDNAs and the 5'-ends of the RefSeqs are represented by red and blue arrows on the human and mouse promoters, respectively. In the lower panel, the sequence match is displayed and the TSSs identified by the full-length cDNAs and the 5'-ends of RefSeqs are marked on the nucleotides. Users can also dynamically change the standard positions of the alignment by specifying the TSSs of their choice. The default alignment is computed with LALIGN, a local alignment program, but it can be switched to ClustalW, which is designed for global alignment [Thompson et al. (1994) Nucleic Acids Res. 22, 4673-4680].
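The practical difference between a local and a global alignment can be seen with a textbook dynamic-programming score. This sketch does not reimplement LALIGN or ClustalW; the function `align_score`, the scoring values (match +1, mismatch -1, linear gap -2) and the toy sequences are assumptions chosen only to show that a conserved block embedded in unrelated flanks scores well locally but poorly globally.

```python
def align_score(a: str, b: str, local: bool,
                match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of a and b with a linear gap penalty.

    local=True  -> Smith-Waterman style (best-scoring pair of substrings).
    local=False -> Needleman-Wunsch style (both sequences end to end).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Toy sequences sharing the block ACGTACGT inside unrelated flanks.
seq_a = "TTTTACGTACGT"
seq_b = "ACGTACGTCCCC"
local_score = align_score(seq_a, seq_b, local=True)
global_score = align_score(seq_a, seq_b, local=False)
print(local_score, global_score)
```

For promoters, where conserved elements may be short and surrounded by diverged sequence, this is one reason a local method is a reasonable default.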



Figure S2: Screen shots from the result pages (A-D) of DBTSS. Results of the search will be displayed as the genomic viewer (A) and the sequence viewer (B). Examples from the comparative viewer of the promoters are also displayed in (C) and (D).


Search database by putative TF binding sites and their combinations

The most important feature added to this version of DBTSS is the engine that enables searching for promoters by putative TF binding sites. Combined with the comparative promoter data, this should give experimental biologists a practical and powerful tool for promoter analysis. Users can search with keys such as "promoters containing a putative TF binding site(s) of particular kinds, which is conserved between human and mouse". To narrow down the targets, users can perform combinatorial searches of the TF sites.
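A combinatorial search of this kind reduces to set operations. As described in the legend of Figure S1, all Factor boxes must be satisfied, while alternatives created within one box are interchangeable. The sketch below illustrates that logic under stated assumptions: the function names, the gene identifiers, the hit sets and the ortholog map are all hypothetical, and conservation is simplified to "the mouse counterpart is also a hit".

```python
from typing import Dict, List, Set


def combinatorial_search(
    hits_by_factor: Dict[str, Set[str]], query: List[List[str]]
) -> Set[str]:
    """Return promoters matching every box in the query.

    Each inner list is one box holding alternative factors: a promoter
    satisfies the box if it has a hit for at least one of them (OR).
    A promoter is returned only if it satisfies all boxes (AND).
    """
    result = None
    for box in query:
        box_hits: Set[str] = set()
        for factor in box:
            box_hits |= hits_by_factor.get(factor, set())
        result = box_hits if result is None else result & box_hits
    return result if result is not None else set()


def conserved(human: Set[str], mouse: Set[str], orthologs: Dict[str, str]) -> Set[str]:
    """Keep human genes whose mouse counterpart is also among the mouse hits."""
    return {gene for gene in human if orthologs.get(gene) in mouse}


# Made-up hit lists, for illustration only.
human_hits = {"Sp1": {"hA", "hB", "hC"}, "NF-Y": {"hA", "hB"}, "USF": {"hA"}, "YY1": {"hB"}}
mouse_hits = {"Sp1": {"mA", "mC"}, "NF-Y": {"mA"}, "USF": {"mA"}, "YY1": set()}
orthologs = {"hA": "mA", "hB": "mB", "hC": "mC"}

query = [["Sp1"], ["NF-Y"], ["USF", "YY1"]]  # Factor 3 has two alternatives
human = combinatorial_search(human_hits, query)
mouse = combinatorial_search(mouse_hits, query)
both = conserved(human, mouse, orthologs)
print(sorted(human), sorted(mouse), sorted(both))
```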

For this search, users can create an arbitrary number and combination of search fields for putative TF binding sites. For each of the position weight matrices (PWMs), which define the consensus sequences of the TF binding sites [Matys et al. (2003) Nucleic Acids Res. 31, 374-378], users can specify arbitrary cut-offs, target regions and the strand to be searched (the default parameters are those of "minSUM64.prf", which is documented for the Match tool of TRANSFAC as minimizing both false negatives and false positives). Users can also choose an exact sequence match for the query instead of the PWMs, so that they can search for target sites of newly discovered TFs or for consensus sequences for which the pre-existing PWMs are less reliable.
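How a PWM search with two cut-offs works can be sketched as follows. This is a simplified illustration, not the Match algorithm: the similarity here is a plain min-max normalized sum of matrix counts, without the per-position weighting that Match applies, and the toy matrix, the core interval and the function names are assumptions. Only the forward strand is scanned.

```python
from typing import Dict, List, Tuple

Matrix = List[Dict[str, float]]  # one {base: count} column per site position


def similarity(matrix: Matrix, site: str) -> float:
    """Min-max normalized score of `site` against `matrix`, in [0, 1]."""
    score = sum(col[base] for col, base in zip(matrix, site))
    lowest = sum(min(col.values()) for col in matrix)
    highest = sum(max(col.values()) for col in matrix)
    return (score - lowest) / (highest - lowest)


def scan(seq: str, matrix: Matrix, core: Tuple[int, int],
         matrix_cutoff: float, core_cutoff: float) -> List[Tuple[int, str, float]]:
    """Report (offset, site, matrix similarity) for windows passing both cut-offs.

    `core` is the (start, end) slice of matrix positions treated as the core.
    """
    start, end = core
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        site = seq[i:i + width]
        if set(site) - set("ACGT"):
            continue  # skip windows containing ambiguous bases
        if similarity(matrix[start:end], site[start:end]) < core_cutoff:
            continue  # the core must match before the full matrix is scored
        m_sim = similarity(matrix, site)
        if m_sim >= matrix_cutoff:
            hits.append((i, site, round(m_sim, 3)))
    return hits


# Toy 4-bp matrix favouring TATA; counts are invented for illustration.
toy_matrix = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 2, "C": 0, "G": 0, "T": 8},
    {"A": 9, "C": 0, "G": 1, "T": 0},
]
promoter = "GGTATAGGTAAAGG"
strict = scan(promoter, toy_matrix, core=(0, 2), matrix_cutoff=0.9, core_cutoff=1.0)
relaxed = scan(promoter, toy_matrix, core=(0, 2), matrix_cutoff=0.8, core_cutoff=1.0)
print(strict)
print(relaxed)
```

Lowering the matrix cut-off from 0.9 to 0.8 admits the weaker TAAA site in addition to TATA, which shows the usual trade-off between false negatives and false positives that the cut-off profiles are meant to balance.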

The results of the search by putative TF binding sites can be browsed as exemplified in Supplementary Information Figure S1B. Since multiple hits are expected in most cases, users can first review the results as a list, in which the hits are shown by their gene names. When users choose a hit of interest from the list, the viewer showing the exact sequences around the TSSs and the positions of the predicted TF binding sites appears. When the search has been performed with the "human and mouse conserved" option, the "Go mouse/human comparison" button appears next to the information table of the predicted TF binding sites. From there, users can open the comparative promoter viewer to directly examine the conservation of the predicted TF binding sites between human and mouse at the sequence level.

Table S1: List of the predicted TF binding sites.
Matrix              TF definition         Matrix sim.  Core sim.  Human  Mouse  Conserved
V$AFP1_Q6           AFP1                  1            0.947      35     25     0
V$AHR_01            AhR                   1            0.958      2      0      0
V$AMEF2_Q6          aMEF-2                1            0.928      44     36     0
V$AML_Q6            AML                   1            1          405    386    23
V$AP1_C             AP-1                  0.989        0.991      678    638    64
V$AP4_01            AP-4                  1            0.954      47     39     1
V$AR_Q2             AR                    1            0.955      7      7      0
V$ATF_B             ATF                   1            0.985      447    327    82
V$BACH2_01          Bach2                 1            0.987      97     44     0
V$CDP_01            CDP                   0.829        0.832      55     35     0
V$CDX2_Q5           Cdx-2                 1            0.982      17     10     0
V$COUP_01           COUP-TF / HNF-4       0.988        0.964      30     34     2
V$COUP_DR1_Q6       COUP direct repeat 1  1            0.951      33     43     2
V$CP2_01            CP2                   0.987        0.992      125    95     1
V$CRX_Q4            Crx                   1            0.961      3480   1415   368
V$E2F_01            E2F                   1            0.884      84     45     4
V$E4F1_Q6           E4F1                  1            0.985      64     48     4
V$EGR1_01           Egr-1                 0.885        0.854      111    74     2
V$ER_Q6             ER                    1            0.963      120    103    2
V$FXR_Q3            FXR                   1            0.971      1      2      0
V$GRE_C             GR                    1            0.87       86     81     0
V$HFH4_01           HFH-4                 1            0.906      170    185    3
V$HIF1_Q5           HIF-1                 1            0.968      191    106    14
V$HNF1_01           HNF-1                 1            0.935      125    80     9
V$HNF3ALPHA_Q6      HNF-3alpha            0.972        0.962      1473   1982   113
V$HNF6_Q6           HNF-6                 1            0.991      54     35     2
V$HOX13_01          Hox-1.3               1            0.924      5      2      0
V$HP1SITEFACTOR_Q6  HP1 site factor       0.941        0.944      55     39     2
V$HSF_Q6            HSF                   1            0.986      3      5      1
V$IPF1_Q4           IPF1                  1            0.965      218    176    3
V$IRF7_01           IRF-7                 0.976        0.958      181    112    14
V$ISRE_01           ISRE                  1            0.988      5      1      0
V$LEF1_Q6           LEF-1                 1            0.95       695    717    53
V$LHX3_01           Lhx3                  1            0.979      318    236    5
V$LXR_Q3            LXR                   1            0.902      44     26     0
V$MAF_Q6            MAF                   1            0.977      2      0      0
V$MTF1_Q4           MTF-1                 1            0.961      40     21     1
V$MYCMAX_B          c-Myc/Max             1            0.966      990    772    114
V$MYOD_01           MyoD                  1            0.979      163    107    12
V$NF1_Q6            NF-1                  1            0.986      483    468    32
V$NFE2_01           NF-E2                 1            1          22     32     2
V$NFKB_C            NF-kappaB             0.973        0.972      102    70     7
V$NFMUE1_Q6         NF-muE1               1            1          87     48     8
V$NFY_Q6            NF-Y                  1            0.978      323    336    28
V$NRF1_Q6           Nrf-1                 1            0.991      411    276    49
V$OCT1_Q6           Oct-1                 1            0.994      14     22     1
V$OLF1_01           Olf-1                 0.99         0.963      9      7      0
V$P53_01            p53                   0.664        0.753      0      0      0
V$PITX2_Q2          PITX2                 1            0.976      2588   736    187
V$POU1F1_Q6         POU1F1                0.985        0.979      329    202    10
V$PPARA_01          PPARalpha/RXR-alpha   0.899        0.863      6      4      0
V$PTF1BETA_Q6       PTF1-beta             1            0.991      9      8      0
V$RORA1_01          RORalpha1             1            0.969      502    614    29
V$SF1_Q6            SF-1                  1            1          604    495    48
V$SMAD4_Q6          SMAD-4                0.99         0.917      317    256    10
V$SP1_Q6            Sp1                   1            0.969      5937   3316   1486
V$SREBP1_02         SREBP-1               1            0.992      20     11     0
V$SRF_Q6            SRF                   0.99         0.983      36     40     3
V$STAT_01           STATx                 1            0.976      515    421    40
V$TEF_Q6            TEF                   1            0.881      640    448    25
V$TEL2_Q6           Tel-2                 1            1          83     62     5
V$TFIIA_Q6          TFIIA                 0.961        0.959      252    251    9
V$TFIII_Q6          TFII-I                1            1          2225   1627   246
V$USF_Q6            USF                   1            0.959      1127   780    239
V$YY1_02            YY1                   1            0.94       174    117    18
Using the created dataset of the promoters, putative TF binding sites were searched by MATCH with the corresponding cut-offs (the third and the fourth columns). The numbers of hits detected in human and mouse promoters are shown in the fifth and the sixth columns. The numbers of hits detected in both human and corresponding mouse promoters are shown in the seventh column. For the cut-offs, we used very strict values. The statistical significance of the cut-off for each matrix is described in Kel et al. (2003) Nucleic Acids Res. 31, 3576-3579.
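As a reading aid for Table S1, the share of human hits that are also detected in the corresponding mouse promoter can be computed directly from the fifth and seventh columns. This ratio is not reported in the source; the sketch below simply works the arithmetic for three rows copied from the table, and the helper name `conserved_fraction` is hypothetical.

```python
from typing import Dict, Tuple

# (human hits, mouse hits, conserved hits), copied from Table S1.
table_s1: Dict[str, Tuple[int, int, int]] = {
    "Sp1": (5937, 3316, 1486),
    "USF": (1127, 780, 239),
    "ATF": (447, 327, 82),
}


def conserved_fraction(human: int, conserved: int) -> float:
    """Fraction of human hits also found in the mouse counterpart."""
    return conserved / human if human else 0.0


fractions = {
    tf: round(conserved_fraction(human, conserved), 3)
    for tf, (human, _mouse, conserved) in table_s1.items()
}
print(fractions)  # {'Sp1': 0.25, 'USF': 0.212, 'ATF': 0.183}
```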