| In Silico Biology 4, 0033 (2004); ©2004, Bioinformation Systems e.V. |
1 Aventis Pharmaceuticals Inc., Route 202/206, Bridgewater, NJ 08807 USA
2 Portland State University, Portland, Oregon 97207 USA
3 Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724 USA
* corresponding author
Email: fan.long@aventis.com
Edited by E. Wingender; received April 30, 2004; revised May 27 and June 15, 2004; accepted June 16, 2004; published June 26, 2004
DNA-binding transcription factors play a central role in transcription regulation, and the annotation of transcription-factor binding sites in upstream regions of human genes is essential for building a genome-wide regulatory network. We describe methodology to accurately predict the transcription-factor binding sites in the proximal-promoter region of function-specific genes. In order to increase the accuracy of transcription factor binding-site prediction, we rely on recent genome sequence data, known transcription factor binding-site matrices, and Gene Ontology biological-function-based gene classification. Using TRANSFAC position-frequency matrices, we detected individual and cooperating transcription-factor binding sites in proximal promoters of ENSEMBL annotated human genes. We used the over representation of detected binding sites in the proximal promoters as compared to the second exons to control specificity. We confirmed the majority of transcription-factor binding sites predicted in proximal promoters of immune-response genes with evidence from existing literature. We validated the predicted cooperation between transcription factors NF-κB and IRF in the regulation of gene expression with microarray transcript profiling data and literature-derived protein-protein interaction network. We also identified over-represented individual and pairs of transcription-factor binding sites in the proximal promoters of each Gene Ontology biological-process gene group. Our tools and analysis provide a new resource for deciphering transcription regulation in different biological paradigms.
Keywords: transcription regulation, transcription factor
binding site, Gene Ontology
Transcription regulation through differential action of factors bound to cis-acting elements is a key mechanism of gene-expression regulation. The identification of transcription-factor binding sites (TFBSs) in the cis-regulatory regions (promoter and enhancer regions) of genes is critical for understanding their transcription regulation patterns, and allows to predict gene regulatory network behavior during development, homeostasis, and disease. Genome-wide annotation of transcription regulation modules was successfully undertaken in the Drosophila [Berman et al., 2002] and the yeast [Vilo et al., 2000; Pilpel et al., 2001] model systems. However, such studies of the human genome remain challenging, mainly because of the difficulty to accurately determine the transcription start site (TSS) location for many genes and because TFBSs in human promoters are highly degenerate.
Genes that are involved in the same biological process are often regulated by similar transcription machinery, and their proximal promoters are likely to contain similar TFBSs [Wasserman and Fickett, 1998; Pilpel et al., 2001; Sudarsanam et al., 2002; Elkon et al., 2003]. Coding sequences make for about 3% of the human genome [Birney et al., 2001], but most TFBSs lie in the proximal promoter region within 1 kb upstream and surrounding a TSS [Zhang, 1998]. For these reasons, a priory knowledge about gene-co-regulation from microarray gene expression data can be used to increase the quality of TFBS prediction [van Helden et al., 1998; Roth et al., 1998; Curran et al., 2003]. However, since not all co-regulated gene promoters share common TFBSs, and since the same TFBS can be found in the promoter regions of genes that are weakly co-regulated, there is no invariant correlation between co-regulated genes and TFBSs in their proximal promoters. Since transcription regulation is complex and condition-dependent, microarray analysis can only capture snapshots of its operation.
In order to better increase the accuracy of TFBS detection in the entire human genome, we used gene classification as given by Gene Ontology (GO) [Ashburner et al., 2000] to group genes by their biological function. GO classifies genes according to three separate categories of function: molecular function, biological process and cellular component. The biological process category agrees best with the hypothesis that similar expression patterns indicate a functional relationship [Hvidsten et al., 2003; Brown et al., 2000]. We identified statistically over-represented individual and cooperating TFBSs in the upstream sequence set of each biological-process gene group in GO. Focusing on immune-response genes, we cross validated the biological significance of one of the statistically over-represented cooperating TFBS pairs in the immune-response gene group with microarray transcript profiling data and the protein-protein interaction network inferred from existing literature.
Data sources
TFBSs are commonly modeled using position frequency matrices (PFMs) and position weight matrices (PWMs). TRANSFAC [Wingender et al., 2000] has a large collection of in vitro identified TFBS PFMs and their corresponding PWMs. We extracted all 610 TFBS matrices from TRANSFAC matrix table release 6.4, December 2002. Among them 442 were vertebrate derived. Each gene in the ENSEMBL gene collection is supported by sequence-homology evidence including blast matches to proteins, vertebrate mRNA and UniGene clusters [Hubbard et al., 2002]. We used ENSEMBL TSS prediction to extract 1kb upstream sequences of human genes, and ENSEMBL second exon prediction to extract second exons. The 1kb sequences upstream from TSS and the second exon sequences of human genes and their GO annotations were obtained from ENSEMBL release 10 databases, March 2003. These include 33715 unique proximal-promoter sequences and 26746 second exon sequences; of these, 13033 genes had GO annotations. The mapping of genes to probes on the Affymetrix gene chips was also given by ENSEMBL.
TFBS identification method
Given a PWM P of length l, both strands of a sequence were scanned by sliding a window of length l along the sequence. At each position of the window, a similarity score was computed between P and the corresponding subsequence using the MatInspector algorithm [Quandt et al., 1995] implemented in Perl (v5.6.0). We scanned the upstream sequences for the presence of all the vertebrate derived TFBSs. The second exon sequences, believed to have few intrinsic biological relevant cis-regulatory elements, were used to establish a cutoff score for the presence/absence call of a TFBS within a searched sequence for the following two reasons: first, 2nd exon sequences are representation of gene coding sequences in contrast to the random genomic sequences, and second, almost all known TFBSs are either upstream of the TSS, or located in introns or other non-coding regions of the genome, especially enhancer sequences. For each TFBS matrix, we first generated a similarity score at each position of the second exon sequences, then we sorted all the similarity scores generated from the highest to the lowest and chose the top 0.01% highest score as the cutoff score for this matrix. However, for a given matrix, if the top 0.01% highest score in the second exon sequences is less than 0.75, a low cutoff score that has been used in literature [Marino-Ramirez et al., 2004; Loots et al., 2002; Sharan et al., 2003], we chose 0.75 as the cutoff score for this matrix. Fifteen out of the 442 matrices fell into the latter case and their cutoff scores were defined as 0.75. With these established cutoff scores for each matrix, we searched for their presences in the upstream sequences. A TFBS was called present in the upstream sequence if the two following criteria were met, (1) the matrix similarity score was greater than its cutoff score, and (2) instead of using MatInspector core cutoff, the experimentally determined completely conserved nucleotides (with frequency = 1 in the PWM) were conserved in the searched sequence.
Definition of TFBS pair
TFBS pair is a pair of TFBSs which are present individually on the same sequence (e. g. passed the cutoff and conservation criteria respectively), the distance from the 3' end of the left TFBS to the 5' end of the right TFBS is less than 50bp, and their overlap is less than half of the length of the shorter TFBS if the two TFBSs overlaps.

Identification of statistically over-represented individual and pairs of TFBSs
We used the Fisher exact probability test implemented in SAS to identify individual and pairs of TFBS that are over-represented in one group of sequences against another group. We used the test with one-sided probability cutoff of 0.05 to compare the number of occurrences and non-occurrences in the proximal promoter sequences against the second exon sequences. We used the test with one-sided probability cutoff of 0.001 to compare the number of genes containing individual TFBSs and TFBS pairs in the upstream sequences of genes belonging to one group of GO category against genes from all other categories.
Assessment of the biological significance of the NF-κB-IRF (or ISRE) TFBS pair
We used following two public microarray data sets to obtain IFN stimulated genes and NF-κB regulated genes:
We used the Fisher exact probability test to assess the statistical significance of the enrichment of IFN or NF-κB regulated genes in genes containing the NF-κB-IRF (or ISRE) TFBS pair in the upstream sequences, compared to genes not containing the pair; see Table 1 for an illustration of the 2x2 contingency table used in the Fisher test.
| Table 1: | 2x2 contingency table for Fisher test. |
| Number of genes regulated by IFN or NF-κB | Number of genes not regulated by IFN or NF-κB | |
| Genes containing NF-κB-IRF (or ISRE) TFBS pair | 33 | 61 |
| Genes not containing NF-κB-IRF (or ISRE) TFBS pair | 2318 | 10214 |
We used Ingenuity Pathway Analysis (http://www.ingenuity.com/products/IPADataSheet.pdf) to explore significant biological networks of genes containing the NF-κB - IRF (or ISRE) TFBS pair in their upstream sequences. The Ingenuity Pathway Analysis is a web-delivered application that takes a list of genes as input and dynamically computes a set of subnetworks of interacting genes (proteins) to describe the inter-relationships between the genes and their possible biological relevance. It is based on protein-protein functional or physical interaction data derived from a large curated collection of individually modeled relationships between proteins (genes) in literature. Each subnetwork is centered on the input genes and includes other genes that connect input genes to each other or specifically interact with multiple input genes. Since not all content in a subnetwork is necessarily directly relevant to a single biological pathway, and to facilitate management and user interaction, Ingenuity limited each subnetwork to 35 genes. Each subnetwork is assigned with a score that indicates the likelihood of the input genes in a network being found together due to random chance.
Comparison of TFBS occurrences in the upstream 1kb sequences and the second exon sequences
For a PWM of length l, MatInspector assigns a score ranged from 0 to 1 to a substring of length l of an input sequence. The better the match between the matrix and the substring, the higher the score. In order to filter out random matches without eliminating real sites, we established a cutoff score for each matrix as described in the Materials and Methods section. With these established cutoff scores, we compared the frequency of positive occurrences in the upstream 1kb sequences of all the human genes to the frequency of positive occurrences in the second exon sequences relative to the total length of the searched sequences (Figure 1). Of the 442 vertebrate TFBS matrices, some motifs occur more frequently in the upstream sequences, while some other motifs occur more frequently in the second exon sequences. In order to identify TFBSs that are present in the upstream sequences with statistical significance, we performed Fisher exact probability tests on the 442 TFBSs occurrences in the upstream sequence set and the second exon sequence set (Figure 2). We found that 207 TFBSs have statistically over-represented occurrences in the upstream 1kb sequences compared to the second exon sequences with a one-sided probability cutoff of 0.001. In order to include TFBS occurrences corresponding to matrices for well-known transcription factors such as NF-κB 65, STAT1, STAT5, and cETS p54, we raised the probability cutoff to 0.05. Under this less stringent cutoff, occurrences of 243 TFBS matrices were identified as statistically over-represented in the upstream sequences. We used these putative TFBSs for all the subsequent analysis.
TFBS matrices with the most significantly over-represented occurrences in the upstream sequences in comparison with the second exon sequences (e. g. with the lowest one-sided probability from Fisher's test) include GC box, TATA box and other GC rich TFBS matrices such as MAZ, Sp1 and major T antigen binding site (Figure 3). The abundance of GC-and TATA-boxes in promoter sequences is consistent with the literature [Kel-Margoulis et al., 2003; Marino-Ramirez et al., 2004]. In our experiments GC-rich TFBSs and AT-rich TFBSs were the most abundant across sequences and had the largest total count. The top 10 frequently appearing TFBSs are listed in Table 2.
|
Figure 1: Distribution of TFBS frequencies in all human upstream 1kb sequences and all second exon sequences. The frequency is relative to the total length of searched sequences. |
|
Figure 2: Distribution of one-sided probabilities from Fisher's test of TFBS frequencies in all human upstream sequences against all second exon sequences. |
The on-line table lists the 442 motifs' occurrence frequencies in the upstream sequences and the second exon sequences, and the one-sided probabilities from the Fisher's test.
| Table 2: | Top 10 most frequent TFBSs in upstream sequences. |
| Motif name(ID) | % of upstream sequences containing the motif | Motif base characteristics |
| GC_01(M00255) | 43.57% | NRGGGGCGGGGNK |
| SP1_Q6(M00196) | 43.19% | NGGGGGCGGGGYN |
| MAZ_Q6(M00649) | 40.39% | GGGGAGGG |
| SP1_01(M00008) | 38.50% | GRGGCRGGGW |
| MAZR_01(M00491) | 38.03% | NSGGGGGGGGMCN |
| HOXA4_Q2(M00640) | 34.98% | AWAATTRG |
| CRX_Q4(M00623) | 34.28% | YNNYTAATCYCMN |
| PAX4_02(M00377) | 34.19% | NAAWAATTANS |
| PITX2_Q2(M00482) | 33.58% | WNTAATCCCAR |
| TANTIGEN_B(M00330) | 32.96% | RGGAGGCAGAGGCAGGYGG |
Some transcription factors are associated with more than one TFBS matrices in TRANSFAC, while some TFBS matrices for different transcription factors are very similar. For example, SP1_01(M00008) and SP1_Q6(M00196) are TFBS matrices for transcription factor Sp1; TFBS matrices for transcription factor STAT1, STAT3, STAT5 are very similar and usually identify same occurrences. In order to preserve the specificity of each TFBS matrices and the corresponding transcription factor, we did not attempt to correct for the dependence between
TFBS matrices albeit our analysis results may contain some redundancy.
TFBSs enriched in the upstream sequences of immune-response genes
Dysfunctions in the immune response are known to be involved in various autoimmune diseases. The optimal functioning of the immune system involves the coordinated efforts of several types of cells, mainly cells of hematopoietic origin. Identification of TFBSs enriched in immune-response-related genes could help indicate roles of the corresponding transcription factors in the regulatory network of the immune system. Among the 13033 GO-annotated human genes in ENSEMBL, 397 genes were annotated as immune-response genes. We found 15 TFBS matrices whose occurrences are enriched in immune-response gene upstream sequences, compared to the upstream sequences of all other annotated genes (Table 3). As expected, most of the transcription factors associated with these TFBS matrices are known to regulate immune-response genes.
| Table 3: | Enriched TFBS matrices in the upstream sequences of immune-response genes. |
| Motif name(ID) | One-sided probability from Fisher's test | Functions related to immune response |
| AP1_Q6(M00174) | 7.66E-04 | AP-1 is a transcription factor that mediates a number of genes including growth factors and cytokines that are central to the inflammatory response. [Macian et al., 2001] |
| BRN2_01(M00145) | 7.61E-04 | The POU (Pit-Oct-Unc) family of transcription factors (including Pit1, Oct1, Oct2, Unc86, Brn2, etc.) plays critical roles in the development and functioning of the nervous system. Oct1 site has been found in the promoters of B cell specific Immunoglobulin (Ig) genes and in the promoters/enhancers of some T cell specific cytokines such as IL2, IL5 and IL4, and regulates the transcription of these genes. [Fujii and Hamada, 1993; Mordvinov et al., 1999; Albright et al., 1995; Lins et al., 2003] |
| OCT1_05(M00161) | 0.001 | |
| OCT_C(M00210) | 2.14E-04 | |
| FOXJ2_02(M00423) | 7.82E-04 | The forkhead transcription factor FoxO regulates transcription in response to IL-2, which plays a critical role in T cell proliferation and survival [Stahl et al., 2002]. XFD2 is a fork head related gene in Xenopus laevis embryos. |
| FOXO3_01(M00477) | 8.27E-04 | |
| XFD2_01(M00268) | 7.65E-04 | |
| GATA1_06(M00347) | 9.56E-04 | The GATA transcription factor family has been found to participate directly in several signal-transduction pathways. GATA3 is involved in T helper cell differentiation and GATA4 is required for normal cardiac morphogenesis. [Zhou and Ouyang, 2003; Suzuki and Evans, 2004] |
| ISRE_01(M00258) | 8.79E-04 | Interferon regulatory factors (IRFs) are a family of transcription factors that play important role in interferon mediated immune response. Both the Interferon-Stimulated Response Element (ISRE) binding protein and Interferon Consensus Sequence Binding Protein (ICSBP or IRF-8) belong to the IRF family. [Barnes et al., 2002; Tamura and Ozato, 2002] |
| ICSBP_Q6(M00699) | 2.60E-04 | |
| NFKAPPAB65_01(M00052) | 7.42E-04 | NF-κB plays a central role in regulating transcription of immunoregulatory genes such as cytokines/chemokines and growth factors, immunereceptors, adhesion molecules and other mediators. [Li and Stark, 2002] |
| NFKAPPAB_01(M00054) | 9.92E-04 | |
| NKX62_Q2(M00489) | 9.90E-05 | Analysis of the expression pattern of Nkx6-2 gene, together with DNA binding assays, suggests that this gene product might be important for differentiated oligodendrocyte function and in the regulation of myelin gene expression. [Lee et al., 2001] |
| SRY_02(M00160) | 1.26E-04 | Sox-4, a gene with homology to the HMG box region of the sex determining gene SRY, is a transcriptional activator in lymphocytes. [van de Wetering et al., 1993] |
| STAT5A_03(M00493) | 7.42E-05 | Signal transducers and activators of transcription (STATs) are crucial intracellular signaling molecules involved in cytokine-dependent and growth factor-dependent signaling, and T helper cell differentiation. [Calo et al., 2003; Kuo and Leiden, 1999] |
Distribution of TFBS pairs in the upstream sequences of immune-response genes
Individual TFBS detection in large sequences such as whole genomes suffers from unacceptably low specificity. Searching for cooperating TFBSs in proximal-promoter sets corresponding to related genes can dramatically increase specificity [Kel et al., 1999]. For each possible pairing of our 243 putative TFBS matrices, we extracted all the genes whose upstream sequences contain a TFBS pair. The number of genes containing each TFBS pair is displayed as a heat map in Figure 4. Four percent of the 29,646 possible TFBS pairs have occurrences in more than five percent of all the genes. TFBS pairs with occurrences in more than ten percent (up to 23.5%) of all the genes (e. g. colored as red in the heat map of Figure 4) are different combinations of GC and AT-rich TFBS matrices.
We found that occurrences of 154 TFBS pairs are enriched in the upstream sequences of immune-response genes with a one-sided probability threshold of 0.001 from Fisher's test. Transcription factor pairs that correspond to some of our putative TFBS pairs such as NF-κB-IRF (or ISRE), NF-κB-HMG and IRF-STAT are known to regulate immune-response genes. For example, Thanos and Maniatis, 1992, showed that both NF-κB and HMG I(Y) are required for viral induction of the human IFNß gene. HMG I(Y) was shown to stimulate the binding of NF-κB to the IFNß promoter, and the cooperation between two NF-κB complexes mediated by HMG I(Y) is essential for cytokine-induced expression of the E-selectin that plays a critical role in mediating adherence of leukocytes to endothelium at sites of inflammation [Lewis et al., 1994]. IFN-α induces the human IL-10 gene via a module consisting of interdependent IRF1 and STAT3 motifs [Ziegler-Heitbrock et al., 2003].
We also checked the enrichment of the occurrences of the 154 TFBS pairs in the sub-processes of the immune response biological process and found that 22 are statistically over-represented in 7 sub-processes of the immune response (Figure 5). Four of the 22 TFBS pairs are made of binding sites matrices for NF-κB and IRF. Our result indicates that NF-κB-IRF (or ISRE) and LYF1-NF-κB are enriched in antigen presentation/processing of endogenous antigen via MHC class I biological processes, while RREB1-NF-κB is enriched in antigen presentation/processing of exogenous antigen via MHC class II biological processes. The finding that NF-κB synergizes with different transcription factor partners is consistent with the theory that it is one of the central mediators of the immune response and plays an important role in cell proliferation and differentiation [Li et al., 2002].
Biological significance of the NF-κB-IRF (or ISRE) TFBS pair
We focused our analysis on the NF-κB-IRF (or ISRE) TFBS pair because of the strong experimental evidence that the corresponding transcription factors regulate gene transcription in the immune response. NF-κB (nuclear factor κB) is a nuclear transcription factor that regulates expression of a large number of genes, including cytokines, chemokines, growth factors, immune-receptors, cellular ligands, and adhesion molecules. IRF (interferon regulatory factors) are a family of transcription factors that play an important role in IFN mediated immune response. The ISRE (interferon-stimulated response element) was originally defined as an IFN-responsive sequence found in a number of IFN-inducible genes. The products of these genes either singly or coordinately mediate the antiviral, growth inhibitory or immunoregulatory activities attributed to IFN.
Ohmori and Hamilton, 1995, showed that the cooperative interaction of ISRE and an NF-κB site mediates synergistic induction of murine IP-10 gene transcription by IFN and TNF. MHC class I gene HLA-F is also reported to be inducible by NF-κB through the κB1 site and is responsive to IFN through the ISRE [Gobin and van den Elsen, 2000]. Sanceau et al., 2002, reported that interferons inhibit TNF-mediated matrix metalloptoteinase-9 activation via IRF1 binding competition with NF-κB, and a recent study [Naschberger et al., 2004] showed that NF-κB and ISRE cooperate in the activation of guanylate binding protein-1 expression by inflammatory cytokines in endothelial cell. These experiments suggest that the co-occurrence of IRF (or ISRE) and NF-κB TFBS in the promoter region of a gene may imply either a co-induction or a competitive inhibition response to IFN and TNF.
We identified genes whose upstream regions contain the binding sites of NF-κB and IRF (or ISRE), and tested the responsiveness of these genes to NF-κB and IFN using existing transcript-profiling data. We opted for the inclusive approach of including both NF-κB regulated genes and IRF regulated genes in the absence of data that directly identifies genes regulated by both NF-κB and IRF. We found that genes containing the TFBS pair have a higher probability to be regulated by the corresponding transcription factors than genes not containing the pair. We retrieved 134 genes whose upstream sequences contain the NF-κB-IRF (or ISRE) TFBS pair. We then compared these genes to the genes that were identified as IFN-stimulated or NF-κB regulated in two public microarray profiling experiments (see Materials and Methods). Among the 134 genes, 94 are mapped to the genes used in the microarray studies, and 33 of the 94 genes (35%) are IFN-stimulated or NF-κB regulated. In total, only 2351 of all 12626 genes (18.6%) on the chips are IFN-stimulated or NF-κB regulated. Thus our selected gene set has a 2-fold enrichment, corresponding to a one-sided probability of 1.09E-04 from Fisher's test. Considering that these two microarray profiling experiments are both based on a single cell type and one time point, and that it is usually difficult to detect the transcription regulation pattern for low-level expressed or regulated genes in microarray experiments, we think that the confirmation rate obtained is very high.
The greater the number of interactions among a set of genes, the more likely those genes act together as a functional unit (a biologically coordinated program). In order to examine whether a specific TFBS pair is over-represented in a specific biological process, we used the Ingenuity Pathway Analysis tool to find whether there are biological networks among the selected 94 genes. Among all the subnetworks generated by Ingenuity, we focused on the subnetwork with the highest score, 25 (Figure 6), e. g. 1E-25 chance that the input genes are together in this subnetwork due to random chance. Ingenuity showed that most genes in this subnetwork are involved in immune response including activation of T lymphocytes, activation of macrophages, antiviral response of tumor cell lines, etc. The other subnetworks with scores higher than 10 were also immune response related (data not shown). This again suggests the functional involvement of NF-κB-IRF (or ISRE) motif pair in the biological process of immune response.
In the subnetwork displayed in Figure 6, there are 17 input genes. Among the 17 genes containing the NF-κB-IRF (or ISRE) TFBS pair, 9 genes (shown in green) are confirmed as IFN or NF-κB regulated genes from the two microarray transcript-profiling datasets. Although the transcription regulations of the other 8 genes (shown in red) are not confirmed by the microarray data, there is evidence to suggest that they directly or indirectly interact with the confirmed 9 genes. Most notably, TP73L (circled in blue) has direct physical interaction with B2M (circled in blue) [Simister and Mostov, 1989]. The two microarray experiments show no evidence for the regulation of IFNB1 by IFN or NF-κB, but IRF3 and NF-κB have been shown to cooperatively enhance the expression of IFNB1 (boxed in pink) [Schafer et al., 1998]. This evidence supports the stipulation that microarray profiling techniques have low sensitivity and are condition specific, further adding value to using in silico TFBS identification as supplemental methods for building gene-regulation networks.
|
Figure 6: A protein-protein physical/functional interaction network generated by Ingenuity Pathway Analysis tool from genes containing the NF-κB-IRF (or ISRE) TFBS pair. Colored boxes are genes containing the NF-κB-IRF (or ISRE) pair with green indicating genes confirmed and red indicating genes not confirmed by the two microarray datasets. TP73L and B2M (circled in blue) have direct physical interaction. IRF3, cooperating with NF-κB, stimulates the expression of IFNB1 (boxed in pink). |
Individual and pairs of TFBSs enriched in various gene classes defined by Gene Ontology
We identified GO-defined gene categories whose upstream sequences contain statistically over-represented occurrences of TFBSs. With one-sided probability threshold as 0.001 from Fisher's test, we found 310 GO gene groups that have at least 1 enriched putative TFBS in their upstream sequences. Many putative TFBSs identified and their corresponding biological processes are supported by existing experimental data. For example, E2F is over-represented in the upstream sequences sets that correspond to cell proliferation, regulation of cell cycle, and DNA replication GO biological-function gene groups. This agrees with the experimental studies showing that the E2F family of transcription factors plays a central role in regulating cellular proliferation by controlling the expression of both the genes required for cell cycle progression, particularly DNA synthesis, and the genes involved in regulation of apoptosis [Müller 1995]. We identified ZF5 binding sites in the upstream regions of oncogenesis and cell cycle genes. This agrees with the evidence that ZF5 is a transcription repressor of c-myc proto-oncogene exhibiting growth-suppressive activity in mouse cell lines [Numoto et al., 1993]. NF-κB is known to regulate various genes that are associated with several biological functions [Li and Stark, 2002]. We found that NF-κB TFBS matrices are enriched in the upstream sequences of genes specific to apoptosis, axon guidance, nucleotide biosynthesis, and several immune-related biological processes.
To reduce the false-positive rate and improve specificity, for each pair of the 243 putative TFBSs, we checked the representations of these pairs in genes belonging to different GO biological processes. We found 140 GO biological process gene groups of which at least one putative TFBS pair is enriched in the upstream sequences with one-sided probability threshold as 0.001 from Fisher's test. Different biological process gene groups have different enriched pairs, which may infer the specific function of a pair in a specific group of genes. Some pairs were identified as over-represented in more than one GO biological process gene group. For example, an experimentally well-validated TFBS pair Sp1-NF-κB [Buttner et al., 2004; Wang et al., 2001] is over-represented in the upstream sequences of 3 GO gene groups: protein phosphorylation, response to stress and ubiquitin-dependent protein catabolism. Thus, regulatory modules can confer a common function to promoters belonging to different gene classes as argued by Klingenhoff et al., 2002.
The accumulation of genome sequence data and experimentally determined transcription factor DNA-binding motif data allow for increasingly more accurate computational identification of TFBSs. TFBS identification remains a challenging problem due to the low specificity of the individual TFBS. The binding affinity of a factor to a binding site is not only determined by the sequence, but is also influenced by other functional contexts such as DNA structure characteristics and the presence of binding sites for other transcription factors. PWM-based searches are considered to be more sensitive, although a major drawback of using PWM directly for identifying TFBSs is its high false-positive rate. While probabilities are used in the generation of PWM, the similarity score itself cannot be interpreted statistically. This leads to the general difficulty of choosing the score cutoff values for each matrix. Various methods have been developed to choose cutoff scores based on the estimation of false positive rate and false negative rate with the use of different background sets including the 2nd or the 3rd exon sequences [Kel et al., 2003], random genomic sequences or randomly generated sequences [Xue et al., 2004; Marino-Ramirez et al., 2004], local genomic contexts [Huang et al., 2004], etc. However, it is difficult to decide which background is better without experimental verification. Due to the low specificity of most TFBSs, it is almost impossible to find an optimal cutoff which would find all "real" binding sites and filter out all random matches.
We described a framework strategy for identifying individual and pairs of TFBSs by scoring the statistical significance of their frequency of occurrences in a family of promoters of genes defined by Gene Ontology. In our study, we applied several procedures to increase the prediction specificity: first, we used genome wide second exon sequences to establish the cutoff values for each matrix; second, we scored the statistical significance of their occurrences in the promoter sequences against the second exon sequences and focused only on those motifs that are statistically over-represented in the promoter sequences; third, we used statistical determination of the enrichment of binding sites in a given set of genes compared to a background set, which can drastically increase specificity. Although different genes can be found as containing a TFBS or not depending on the cutoff values, the over-represented TFBSs in each gene group were identified through comparison of one gene group to all the other genes, so that these identified TFBSs may be of great importance for the regulation of at least part of the genes in the group. In addition, finding motif combinations in close proximity will further ensure a lower rate of false positives in defining genes controlled by each motif.
We found that about 45% of vertebrate TFBS matrices from TRANSFAC are not over-represented in the upstream sequences. This result agrees with Xue's finding [Xue et al., 2004] that in genomes of higher organism such as the Drosophila less than 50% of the potential transcription regulatory sites are over-represented in the upstream sequences. This low detection rate of TFBS matrices may be attributed to the low specificity of the matrices which leads to the non-significant difference in the detections in the promoter sequences and the second exons, and to the fact that in higher organisms TFBSs could appear in the more abundant and larger intronic or intergenic regions away from immediate upstream of the TSS. For example, the amino acid response element (AARE), an enhancer of system A amino acid transporter (SNAT2) gene, is located in the first intron and confers amino acid-dependent regulation of the SNAT2 promoter. It functions in an orientation and position independent manner [Palii et al., 2004]. Our study also confirms that GC- and TATA-boxes are the two most over-represented elements in the upstream promoter sequences [Marino-Ramirez et al., 2004].
Searching different lengths of the upstream regions will yield different results. To reduce noise, we detected putative TFBSs in the upstream 1kb sequences because most experimentally identified TFBSs lie within 1kb upstream or surrounding the TSS. Our analysis also revealed that 61% of all the genes predicted containing the NF-κB-IRF (or ISRE) pair and 84% of the microarray confirmed genes contain the pair within upstream 500 bp (data not shown). However, in human genome, remote regulators, such as enhancer or repressor, could be located in great distance from the TSS. In addition, some transcription factors may cooperate in regulating gene expression even though their primary distance is greater than 50 bp. In our analysis, we used the second exon sequences as negative control mainly because the second exon sequences are most likely gene coding sequences while almost all known experimentally identified TFBSs are located in non-coding regions of the genome [Kel et al., 2003]. Applying other background sets which are even less validated, such as randomly generated sequences, or randomly selected genomic sequences, may result in difference in the identification of putative TFBSs in the upstream sequence of each gene. Nonetheless, even though the results of our analysis for the over-represented TFBSs in different biological functional gene groups are not all-inclusive, and other methods may identify additional TFBSs, these results are valid and merit further confirmation by additional experiments. In conclusion, we describe a systematic search for TFBSs that are over-represented in proximal promoters of genes with common biological function in the human genome. We hope that our results will aid in the design of future experiments and help open up new views on the global regulatory network of transcription.
We thank for the supports from the colleagues in the Immunology Platform, Aventis. P. Sumazin is supported by the NSF grant DBI-0306152.