In Silico Biology 4, 0036 (2004); ©2004, Bioinformation Systems e.V.  

Large-scale collection and characterization of promoters of human and mouse genes

Yutaka Suzuki1*, Riu Yamashita1, Matsuyuki Shirota1, Yuta Sakakibara1,2, Joe Chiba2, Junko Mizushima-Sugano1, Alexander E. Kel3, Takahiro Arakawa4, Piero Carninci4,5, Jun Kawai4,5, Yoshihide Hayashizaki4,5, Toshihisa Takagi1, Kenta Nakai1 and Sumio Sugano1

1 Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan;
2 Department of Biological Science and Technology, Science University of Tokyo, 2641 Yamazaki, Noda-shi, Chiba, 278-8510, Japan;
3 BIOBASE GmbH, Halchtersche Str. 33, D-38304 Wolfenbüttel, Germany;
4 Genome Science Laboratory, Discovery and Research Institute, RIKEN Wako Main Campus, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan;
5 Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan

* corresponding author; email:

Edited by E. Wingender; received June 10, 2004; revised and accepted July 20, 2004; published July 23, 2004


We report the generation and initial characterization of a large-scale collection of sequences of putative promoter regions (PPRs) of human and mouse genes. Based on our unique collection of 400,225 human and 580,209 mouse full-length cDNAs, we determined exact transcriptional start sites (TSSs). Using the positional information of the TSSs, we retrieved the adjacent sequences as PPRs for 8,793 human and 6,875 mouse genes. On average, the PPRs were located 4 kb upstream of the previously reported 5'-ends of cDNAs, demonstrating that full-length cDNA information is indispensable for this purpose. Among the PPRs supported by experimentally validated TSSs, 3,324 could be paired as mutually homologous genes between human and mouse and were used for comprehensive comparative studies. The sequence identity in the proximal regions of the TSSs was 45% on average, and 22,794 putative transcription factor binding sites conserved between human and mouse were identified. The data resource created in the present work and the results of the initial characterization of the sequences should lay a firm foundation for deciphering the transcriptional modulation of human genes. All the data were deposited and made available through a database for comparative studies, DBTSS.

Key words: full-length cDNA, promoter, comparative genomics, transcriptional start sites