| In Silico Biology 2, 0025 (2002); ©2002, Bioinformation Systems e.V. |
| GCB '01 |
¹Department of Life Science, ²Department of Computer Science and Information Engineering,
National Central University, Taiwan
³Institute of Biochemistry, ⁴Bioinformatics Center,
National Yang-Ming University, Taiwan
E-mail: horng@db.csie.ncu.edu.tw
Edited by E. Wingender; received November 30, 2001; revised and accepted April 04, 2002; published May 26, 2002
The availability of genome-wide gene expression data provides a unique set of genes from which we can decipher the mechanisms underlying the common transcriptional response. Transcription factors, which can bind to specific DNA sites, cooperatively regulate the transcription of genes. This study attempts to mine putative binding sites by investigating how combinations of sites predicted from known sites and over-represented repetitive elements are distributed in the promoter regions of groups of functionally related genes. The over-represented repetitive elements appearing in the associations are possible transcription factor binding sites. The deduced association rules help to predict putative regulatory elements and to identify genes that are potentially co-regulated by them. Our proposed approach is applied to the promoter regions of Saccharomyces cerevisiae ORFs.
Key words: regulatory elements, data mining, promoter, repetitive oligonucleotide
Identification of transcriptional regulatory elements within promoter regions is of great interest to biologists since these elements govern the regulation of gene expression. Transcription factors, which are proteins, play a major role in gene regulation of eukaryotic organisms. The factors can bind to specific sites, termed transcription factor binding sites or regulatory sites, in the promoter region of particular genes and interact with RNA polymerase and other factors to regulate the transcription of a gene. Transcription factors are said to cooperatively regulate the transcription of genes.
Many experimentally identified transcription regulatory sites have been collected in TRANSFAC [Wingender et al., 2001], which is the most complete and well-maintained database on transcription factors, their genomic binding sites and DNA-binding profiles. Notably, consensus patterns or nucleotide distribution matrices can be used to describe transcription factor binding sites. While describing binding sites, Brazma et al. [1997] stated "The matrix representation is generally considered as the best available means for representing the consensus, however, at present most consensus descriptions are unreliable in the sense that they tend to give many false positives when compared against the genome sequences of even modest length". Despite these limitations, this study describes the binding sites using consensus patterns. Brazma et al. [1997, 1998] developed a general software tool to find and analyze combinations of transcription factor binding sites that occur often in gene upstream regions in the yeast genome. In addition to analyzing the association rules in the combinations, their work analyzed the appearance of these combinations in promoter and random regions. Their tool can find all the combinations satisfying the given parameters with respect to the given set of gene promoter regions, its counterset, and the chosen set of sites.
All repetitive sequences in this study are obtained from RSDB [Horng et al., 2001a]. To find the repetitive elements located within the promoter regions of a set of genes, our approach analyzes the frequencies of the repetitive sequences with a length between four and twenty-five nucleotides.
To handle large amounts of data, data mining plays a prominent role in knowledge extraction. Frequently used data mining approaches include association rules, statistics, neural networks, clustering, classification, and genetic algorithms. Srikant et al. [1995] introduced the problem of mining association rules over basket data. Data mining techniques may produce an enormous number of associations, which makes it extremely difficult to identify the useful or interesting ones. The Chi-square test is one approach to removing the insignificant ones. In statistics, the Chi-square test statistic (χ²) is extensively applied for testing independence and correlation [Liu et al., 1999].
This study initially identifies the combinations of the sites predicted from known sites in TRANSFAC and over-represented repetitive oligonucleotides from RSDB in the promoter regions of a particular set of selected genes. The data mining approach, mining association rules, is then applied to discover the most interesting associations from the combinations of over-represented repeats and sites predicted from known ones. We then prune the associations statistically by the Chi-square test to find significant correlations of sites. Those repetitive sequences in the significant association rules which are over-represented [van Helden et al., 1998] and correlated with homologs to known sites [Horng et al., 2001b] are candidates for putative regulatory sites.
We first preprocess the set of promoter sequences to find the combinations of homologs to known sites and over-represented repetitive oligonucleotides in the promoter regions of the groups of functionally related genes. Next, the AprioriAll algorithm [Srikant et al., 1995] is applied to mine the association rules by combining the predicted sites and over-represented repeats. The Chi-square test is then used to select interesting and significant rules. Finally, the over-represented repeats that map to items in the association rules are selected as putative regulatory sites [Horng et al., 2001b].
Materials
Before the associations of known binding sites and over-represented repeats located in promoter regions are analyzed, the whole yeast genome sequence and the gene annotations are obtained from NCBI. The experimentally identified transcription factor binding sites are obtained from TRANSFAC. The TRANSFAC database (professional 5.4) [Wingender et al., 2001] contains 11,537 site sequences in total, of which 285 are yeast sites. Most sites are also consensus patterns. The data in TRANSFAC have the following features. A transcription factor binding site accession number may have different consensus sequences. Different binding site accession numbers may have the same consensus sequence. Wildcard characters such as 'M' or 'W' used in TRANSFAC make a single consensus cover multiple sequences. Short consensus sequences may appear within longer ones. The repetitive sequences of the target genome, i.e., yeast, are obtained from the repetitive sequence database (RSDB) [Horng et al., 2001a]. In MIPS [Mewes et al., 1999], 6,350 yeast genes and ORFs are documented, and 3,529 genes are classified into at least one functional catalogue.
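To make the role of the wildcard characters concrete, the following minimal Python sketch expands a consensus written with the standard IUPAC nucleotide codes into the concrete sequences it covers. The function names and the example consensus strings are hypothetical and chosen for illustration only; they are not taken from TRANSFAC.

```python
import re
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Compile a consensus with wildcard characters into a regular expression."""
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))

def expand_consensus(consensus):
    """List every concrete sequence covered by the consensus."""
    choices = [IUPAC[symbol] for symbol in consensus.upper()]
    return ["".join(combo) for combo in product(*choices)]

# Hypothetical consensus strings, for illustration only.
print(consensus_to_regex("CCGM").pattern)
print(expand_consensus("TWA"))
```

A consensus with several wildcard positions expands multiplicatively, which is one reason short or highly degenerate consensus patterns match many positions by chance.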
Preprocessing and mapping
The transcription factor binding sites categorized in yeast from TRANSFAC and the repetitive oligonucleotides in RSDB are first prepared. For each group of functionally related genes, all of the known site homologs in yeast as well as the repetitive oligonucleotides are located in the promoter regions from position 0 to -800 bp. The occurrences of each known site homolog and repeat are counted and subjected to statistical analysis. The occurrences of all combinations of the known site homologs and repeats within each promoter region are stored for the data mining process.
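The mapping step can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes plain (wildcard-free) site strings, searches both strands, and reports positions as negative offsets counted from the 3' end of the promoter string, which is an assumed coordinate convention. The promoter sequence in the example is made up.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def find_occurrences(promoter, site):
    """Return sorted (position, strand) pairs for every match of `site`.

    Positions are negative offsets from the 3' end of `promoter`
    (an assumed convention); overlapping matches are reported.
    """
    promoter = promoter.upper()
    forward = site.upper()
    hits = []
    for strand, pattern in (("+", forward), ("-", reverse_complement(forward))):
        if strand == "-" and pattern == forward:
            continue  # palindromic site: do not report the same hit twice
        start = promoter.find(pattern)
        while start != -1:
            hits.append((start - len(promoter), strand))
            start = promoter.find(pattern, start + 1)
    return sorted(hits)

def build_transaction(promoter, items):
    """Return the set of items (sites or repeats) present in one promoter."""
    return {item for item in items if find_occurrences(promoter, item)}

# Made-up promoter fragment; upper case marks a known site, lower case repeats.
promoter = "GGTACATTCCGATT"
print(find_occurrences(promoter, "aatgta"))
print(build_transaction(promoter, ["CCGA", "aatgta", "ttgaaa"]))
```

Each promoter region then contributes one such item set, which is the form the association mining step consumes.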
Statistical analysis of over-represented repetitive oligonucleotides
Nucleotide succession is not random, and some oligonucleotides are clearly over-represented, notably chains of poly(A), poly(T), and poly(AT). An additional bias results from the fact that oligonucleotides are represented differently in coding regions and non-coding sequences [van Helden et al., 1998]. A specific expected frequency thus has to be used for each oligonucleotide sequence. Van Helden et al. [1998] proposed a method to estimate statistical significance by computing the probability of observing exactly a certain number of occurrences of an oligonucleotide within the promoter regions of a gene family. The lowest probability value, i.e., the highest statistical significance, indicates the most over-represented oligomer. The advantage of this significance estimate is that a single threshold can be applied to it and that it can be interpreted independently of oligonucleotide size, upstream sequence size, and number of genes within the family. The over-represented repetitive sequences of yeast are obtained by applying the statistical method in [Horng et al., 2001a]. The repetitive oligonucleotides whose significance values exceed the threshold are selected as significantly over-represented ones.
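As a rough illustration of how such a significance value can be computed, the sketch below uses a binomial model. It rests on several assumptions that are not stated in the text above: it uses the upper-tail probability of observing at least the given number of occurrences, it treats positions as independent trials, and it corrects for the number of oligonucleotide patterns tested by scoring -log10(P × number of patterns). The function names are hypothetical.

```python
import math

def log_binomial_pmf(n, k, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def upper_tail(n, k, p):
    """P(X >= k): probability of at least k occurrences in n positions."""
    total = sum(math.exp(log_binomial_pmf(n, i, p)) for i in range(k, n + 1))
    return min(1.0, total)

def significance(tail_probability, n_patterns):
    """Assumed score: -log10 of the tail probability times the patterns tested."""
    return -math.log10(tail_probability * n_patterns)

# Toy numbers: 4 positions, expected frequency 0.5 per position, 2 occurrences.
print(upper_tail(4, 2, 0.5))
```

Working in log space avoids overflow of the binomial coefficients when the number of positions runs into the thousands, as it does for 800 bp promoters across a gene family.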
Mining associations
In the following, we describe how to mine associations from the combinations of the transcription factor binding sites and over-represented repetitive sequences. Consider a large database of transactions, where each transaction consists of a set of items. An association rule is an expression of the form A=>B, where A and B are sets of items. The meaning of an association rule is that a transaction in the database that contains A also tends to contain B. For example, 90% of the people who purchase beer also purchase diapers. Here, 90% is called the confidence of the rule. The support of the rule A=>B is the percentage of transactions that contain both A and B.
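The two measures can be computed directly from their definitions. The following minimal Python sketch uses a made-up set of five transactions; the item names are borrowed from the results tables purely as labels, and the counts are not real data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_a = [t for t in transactions if set(antecedent) <= t]
    if not with_a:
        raise ValueError("antecedent does not occur in any transaction")
    return sum(set(consequent) <= t for t in with_a) / len(with_a)

# Hypothetical transactions (one item set per promoter region).
transactions = [
    {"CCGA", "ttgaaa"},
    {"CCGA", "ttgaaa", "TTATC"},
    {"CCGA"},
    {"ttgaaa"},
    {"CCGA", "ttgaaa"},
]
print(support(transactions, {"CCGA", "ttgaaa"}))
print(confidence(transactions, {"CCGA"}, {"ttgaaa"}))
```

In this toy set the rule CCGA=>ttgaaa is supported by three of five transactions and holds in three of the four transactions that contain CCGA.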
The formal statement of the problem is described below. Let S = {s1, s2, ... , sm} be a set of transcription factor binding sites in TRANSFAC and R = {r1, r2, ... , rn} be a set of over-represented repetitive sequences from RSDB. The union I = S ∪ R is called the 'item set', and its elements are called items. Let G = {g1, g2, ... , gk} be a group of functionally related genes, and let D be the set of their promoter regions. Each promoter region T in D corresponds to a transaction and contains a set of transcription factor binding sites and over-represented repeats, such that T ⊆ I.
A promoter region T is said to contain A, a set of items of I, if A ⊆ T. An association rule is an implication of the form A=>B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A=>B holds in the set of promoter regions D with confidence conf if c% of the promoter regions in D that contain A also contain B. The rule A=>B has support sup in D if s% of the promoter regions in D contain A ∪ B. In our experiments, the minimum support is set to 10%. An association rule is generated if its support and confidence are higher than the minimums specified by the user. Apriori and AprioriTid [Srikant et al., 1995] are then applied to mine association rules.
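The level-wise search behind Apriori-style mining can be sketched in a few lines. The Python code below is a simplified illustration under stated assumptions, not the implementation used in this study: it recounts support by scanning all transactions at each level, it generates only rules with a single item on the right-hand side, and the transactions are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        current = {}
        for candidate in level:
            sup = sum(candidate <= t for t in transactions) / n
            if sup >= min_support:
                current[candidate] = sup
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

def single_consequent_rules(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            conf = sup / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, consequent, sup, conf))
    return rules

# Hypothetical transactions, one per promoter region.
transactions = [
    frozenset({"CCGA", "ttgaaa", "TTATC"}),
    frozenset({"CCGA", "ttgaaa"}),
    frozenset({"CCGA", "ttgaaa"}),
    frozenset({"TTATC"}),
    frozenset({"CCGA", "TTATC"}),
]
frequent = frequent_itemsets(transactions, min_support=0.6)
for rule in single_consequent_rules(frequent, min_confidence=0.7):
    print(rule)
```

The prune step relies on the fact that every subset of a frequent itemset is itself frequent, which is also why the antecedent's support is always available when a rule is scored.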
Table 1 shows the numbers of transcription factor binding sites from TRANSFAC and over-represented repeats from RSDB in different functional categories of the yeast genome. For example, the first row in Table 1 indicates that 693 over-represented repeats are selected after applying statistical analysis in the functional category of "Glycolysis and gluconeogenesis". In addition, 39 sites predicted from known sites in TRANSFAC can be located in the gene promoter regions in this category. We then mine the associations from the combinations of these over-represented repeats and known regulatory sites.
Table 1: The amount of known site homologs and over-represented repetitive oligonucleotides located in the promoter regions of each gene functional category.
| MIPS functional category | MIPS numeric code | Amount: over-represented repeats | Amount: known site homologs |
| Glycolysis and gluconeogenesis | 02.01 | 693 | 39 |
| DNA synthesis and replication | 03.25 | 382 | 36 |
| Ribosomal RNA synthesis | 04.01.01 | 562 | 42 |
| tRNA synthesis | 04.03.01 | 307 | 35 |
| Ion transporters | 07.04 | 1,092 | 44 |
| Purine and pyrimidine transporters | 07.16 | 157 | 34 |
| ABC transporters | 07.25 | 345 | 36 |
| Drug transporters | 07.28 | 604 | 39 |
Table 2 shows the associations mined by our proposed approach in each group of functionally related genes. The minimum support and confidence are set to 60%. In the functional category of "Glycolysis and gluconeogenesis", 2,459 associations are discovered in 34 promoter regions, where on average 214 predicted TF sites or over-represented repeats are located in each promoter region and the maximum number is 415. After pruning by statistical test, 272 significant associations are found.
Table 2: The associations of predicted sites and repeats mined in each functional category. The "Average" and "Maximum" indicate the average and maximum numbers of TFs or over-represented repeats in the promoter regions, respectively.
| MIPS functional category | ORFs | Average | Maximum | Associations (before pruning) | Significant associations |
| DNA synthesis and replication | 32 | 160.25 | 188 | 189 | 81 |
| Ribosomal RNA synthesis | 39 | 204.46 | 246 | 558 | 180 |
| tRNA synthesis | 24 | 136.92 | 164 | 180 | 92 |
| Ion transporters | 75 | 302.59 | 346 | 96 | 47 |
| Purine and pyrimidine transporters | 15 | 72.53 | 85 | 247 | 103 |
| ABC transporters | 28 | 141.04 | 171 | 491 | 163 |
| Drug transporters | 35 | 193.11 | 284 | 730 | 128 |
Figure 1 shows an example of the occurrence of an association, e.g., "aatgta gtataa UAS 1 => ataaat". UAS 1 symbolizes the sequence CCGA found in the binding site of RAF within the CYC1 gene (TRANSFAC entry Y$CYC1_03, R00256). The gene YCR019W with the annotation of "MAK32 sugar kinase" is categorized as "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS" in MIPS [Mewes et al., 1999]. The association rule shown in Figure 1 consists of three repeats, i.e., "aatgta", "gtataa", and "ataaat", and one known site homolog, i.e., "CCGA/Y$CYC1_03". Reading the site occurrences from 5' to 3' in the promoter region, "aatgta" is located at -335 and is followed by "gtataa" at a distance of 22 bp. The underlined "aatgta" is located on the reverse strand of the promoter region.
Figure 1: An illustrative example of prediction of putative regulatory elements. The gene YCR019W with the annotation of "MAK32 sugar kinase" is categorized in "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS" in MIPS [Mewes et al., 1999]. The association rule, "aatgta gtataa Y$CYC1_03 => ataaat", consists of one known site homolog, "CCGA/Y$CYC1_03", and three repeats, i.e., "aatgta", "gtataa", and "ataaat". The site occurrence positions from 5' to 3' in the promoter region are "aatgta" at -335 bps, which is followed by "gtataa" with a distance of 22 bps. The underlined "aatgta" is meant to be located on the reverse strand of the promoter region.
Several interesting and significant association rules in each functional catalogue are shown in Table 3. The first column in Table 3 lists the functional categories defined by MIPS [Mewes et al., 1999]; the second one shows the associations between regulatory sites which are homologous to known ones in uppercase and over-represented repetitive oligomers in lowercase; the third one is the confidence of the association; the fourth one is the support value; the fifth one is the Chi-square value; the last one is similar to the second one but gives the transcription factor names. For instance, the association of "CCGA => ttgaaa" is discovered in the category of "Ribosomal RNA synthesis", where the support value is 0.744, confidence value is 0.853, and χ² value is 5.938. "CCGA" is a known site with the symbol UAS 1 (TRANSFAC SITE entry Y$CYC1_03, R00256), and "ttgaaa" is a significantly over-represented repetitive oligonucleotide.
Table 3: Partial significant associations mined in each functional category in MIPS [Mewes et al., 1999]. The second column shows the associations of known site homologs (in uppercase alphabet) and over-represented repeats (in lowercase alphabet). The third column, fourth column, and fifth one are the confidence values, support values, and the Chi-square values, respectively. The last one is similar to the second one except that the site designations with TRANSFAC site identifiers are given instead of sequences of known site homologs.
| MIPS functional category | Site combinations | conf | sup | χ² | Site combinations (names of homologous known site with TRANSFAC ID) |
| Ribosomal RNA synthesis | CCGA=>ttgaaa | 0.853 | 0.744 | 5.938 | Y$CYC1_03 => ttgaaa |
| | CCGA=>attgaa | 0.765 | 0.667 | 6.190 | Y$CYC1_03 => attgaa |
| | TTATC=>attgaa | 0.758 | 0.641 | 7.690 | Y$ARS1_05 => attgaa |
| | TATAAA=>TTATC | 0.889 | 0.615 | 12.362 | Y$CUP1_07 => Y$ARS1_05 |
| | GATAA=>GGGG | 0.767 | 0.590 | 8.044 | Y$GAL1_09 => Y$GAL1_11 |
| | GAGGA=>ttgaaa | 0.920 | 0.590 | 16.891 | Y$GAL1_07 => ttgaaa |
| tRNA synthesis | aaaggc=>ttgaaa | 1.000 | 0.667 | 9.600 | aaaggc => ttgaaa |
| | CCGA=>ttgaaa | 0.842 | 0.667 | 4.491 | Y$CYC1_03 => ttgaaa |
| | CCGA=>aaattt | 0.842 | 0.667 | 4.491 | Y$CYC1_03 => aaattt |
| | ATATAA=>ttgaaa | 0.824 | 0.583 | 8.000 | Y$GAL1_12 => ttgaaa |
| | TTATC=>attgaa | 0.778 | 0.583 | 6.159 | Y$ARS1_05 => attgaa |
| | GATAA=>attgaa | 0.778 | 0.583 | 6.159 | Y$GAL1_09 => attgaa |
| | GGGG=>TATAAA | 0.823 | 0.583 | 5.818 | Y$GAL1_11 => Y$CUP1_07 |
| Purine and pyrimidine transporters | TATAAA=>ctttga | 0.833 | 0.667 | 7.500 | Y$CUP1_07 => ctttga |
| | tacata=>GGGG | 0.900 | 0.600 | 7.350 | tacata => Y$GAL1_11 |
| | agaaat,ataaag=>CCGA | 1.000 | 0.600 | 5.625 | agaaat, ataaag => Y$CYC1_03 |
| | TATAAA,ctttga=>TTATC | 0.900 | 0.600 | 5.625 | Y$CUP1_07, ctttga => Y$ARS1_05 |
| | TATAAA,ctttga=>aaatag | 0.900 | 0.600 | 5.625 | Y$CUP1_07, ctttga => aaatag |
| | GATAA,ataaag=>CCGA | 1.000 | 0.600 | 5.625 | Y$GAL1_09, ataaag => Y$CYC1_03 |
| ABC transporters | TTATC=>aaattg | 0.792 | 0.679 | 4.929 | Y$ARS1_05 => aaattg |
| | taattg=>acatat | 0.800 | 0.571 | 7.529 | taattg => acatat |
| | aatata=>TATAAA | 0.944 | 0.607 | 5.200 | aatata => Y$CUP1_07 |
| | TATAAA=>aatata | 0.739 | 0.607 | 5.200 | Y$CUP1_07 => aatata |
| | GGGG=>tacgaa | 0.714 | 0.536 | 7.000 | Y$GAL1_11 => tacgaa |
| | aatttc=>agatat | 0.790 | 0.534 | 15.304 | aatttc => agatat |
| Drug transporters | taaata=>CCGA | 0.917 | 0.629 | 4.172 | taaata => Y$CYC1_03 |
| | TTATC=>aatttc | 0.658 | 0.600 | 4.921 | Y$ARS1_05 => aatttc |
| | TTATC=>atttca | 0.656 | 0.600 | 4.910 | Y$ARS1_05 => atttca |
| | TTATC=>atatta | 0.594 | 0.543 | 3.894 | Y$ARS1_05 => atatta |
| | TTATC=>atatag | 0.594 | 0.543 | 3.894 | Y$ARS1_05 => atatag |
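To illustrate how a Chi-square value for a single rule A=>B can be obtained, the sketch below computes the standard Pearson statistic for the 2×2 contingency table of "contains A" against "contains B". This is an assumed formulation; the exact variant used for pruning is not spelled out in the text. The counts in the example are hypothetical: 24 promoter regions, 16 containing A, 20 containing B, and 16 containing both.

```python
def chi_square_2x2(n, count_a, count_b, count_ab):
    """Pearson Chi-square statistic (1 degree of freedom) for a rule A=>B.

    n         total number of transactions (promoter regions)
    count_a   transactions containing A
    count_b   transactions containing B
    count_ab  transactions containing both A and B
    """
    a = count_ab                            # A and B
    b = count_a - count_ab                  # A without B
    c = count_b - count_ab                  # B without A
    d = n - count_a - count_b + count_ab    # neither
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a margin is empty, so no association can be tested
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical counts, for illustration only.
print(chi_square_2x2(n=24, count_a=16, count_b=20, count_ab=16))
```

For one degree of freedom the critical value at the 5% level is 3.841, so a rule whose statistic exceeds that value would be kept under a 95% significance criterion; the significance level actually applied is an assumption here.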
The detailed positions of known site homologs and putative regulatory sites in the association "Y$CYC1_03, Y$ARS1_05 => ataaag" (equivalent to "CCGA, TTATC => ataaag") in a set of ORFs are shown in Table 4. The first column gives the yeast ORFs in the set and the second one shows the detailed positions of known sites and putative regulatory sites in each ORF. For example, in the first row of Table 4, "YER056C" is the ORF name, and
"[511]-TTATC-[96]-TTATC-[111]-TTATC-[52]-TTATC-[1]-ataaag-[20]-CCGA-[60]-CCGA-[11]-TTATC-[117]-CCGA-"
is the arrangement of the sites predicted from known sites and the putative regulatory sites. The first number, "[511]", denotes the offset of the site "TTATC" from the start position of the coding region, on either the direct or the reverse strand; the distance between the first and the second "TTATC" is "[96]" bp, and so on.
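The occurrence notation described above is regular enough to be parsed mechanically. The following Python sketch, with hypothetical function names, splits such a string into (number, site) pairs, where the number is the bracketed value that precedes each site, and counts how often each site occurs. It deliberately does not convert the distances into absolute coordinates, because the text does not say whether distances are measured between site starts or between site ends.

```python
import re
from collections import Counter

# A bracketed number followed by a site name, e.g. "[96]-TTATC".
TOKEN = re.compile(r"\[(\d+)\]-([A-Za-z]+)")

def parse_occurrences(line):
    """Return a list of (preceding_number, site) pairs in 5' to 3' order."""
    return [(int(number), site) for number, site in TOKEN.findall(line)]

def site_counts(line):
    """Count how many times each site appears in the occurrence string."""
    return Counter(site for _, site in parse_occurrences(line))

row = "[535]-TTATC-[107]-CCGA-[78]-CCGA-[2]-TTATC-[1]-ataaag-"
print(parse_occurrences(row))
print(site_counts(row))
```

Parsed this way, rows can be compared for recurring local arrangements, such as "TTATC" immediately followed by "ataaag" at a distance of 1.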
Table 4: The occurrence of known and putative regulatory sites in the association "Y$CYC1_03, Y$ARS1_05 => ataaag" (or "CCGA, TTATC => ataaag") in the functional category of "Purine and pyrimidine transporters." The ORFs in yeast involving the association are shown in the first column "ORFs". The occurrence positions of known site homologs and repeats are shown in column two. The first number denotes the leftmost occurrence of the sites, the number between adjacent sites is the distance between them.
| ORFs | Occurrences of sites in "UAS 1, SITE II => ataaag" |
| YER056C | [511]-TTATC-[96]-TTATC-[111]-TTATC-[52]-TTATC-[1]-ataaag-[20]-CCGA-[60]-CCGA-[11]-TTATC-[117]-CCGA- |
| YMR056C | [535]-TTATC-[107]-CCGA-[78]-CCGA-[2]-TTATC-[1]-ataaag- |
| YBL030 | [376]-CCGA-[21]-CCGA-[2]-TTATC-[240]-TTATC-[47]-ataaag- |
| YGL186C | [568]-TTATC-[12]-ataaag-[121]-TTATC-[12]-CCGA-[10]-CCGA-[379]-ataaag-[20]-CCGA |
| YBR192W | [507]-CCGA-[137]-ataaag-[90]-CCGA-[106]-ataaag-[66]-CCGA-[100]-TTATC-[1]-ataaag |
| YGR096W | [587]-CCGA-[37]-TTATC-[65]-TTATC-[336]-CCGA-[59]-ataaag-[53]-ataaag- |
| YER060W | [522]-ataaag-[14]-TTATC-[23]-ataaag-[51]-TTATC-[54]-TTATC-[69]-TTATC-[17]-CCGA-[78]-TTATC-[178]-CCGA- |
| YER060W | [523]-CCGA-[204]-TTATC-[25]-TTATC-[14]-TTATC-[38]-CCGA-[17]-ataaag-[33]-TTATC-[105]-TTATC- |
| YPL134C | [546]-TTATC-[1]-ataaag-[58]-CCGA-[180]-CCGA-[113]-CCGA-[123]-TTATC- |
We also applied our approach to the previously characterized regulatory families in [Blaiseau et al., 1997; Hinnebusch, 1992; Oshima et al., 1996], which were also investigated in [van Helden et al., 1998]. For each family in Table 5, we extracted 600 bp of each upstream sequence and applied our proposed approach to discover the associations of known site homologs and over-represented repeats. In Table 6, Ms (matching sequences) denotes the number of genes from the family containing at least one occurrence of the site; Occ denotes the number of occurrences of the site in all promoter regions from the family; Exp is the expected number of occurrences; Sig denotes the significance index, as calculated in [van Helden et al., 1998]. For instance, the Sig values of the sites "ATATAA", "GATAAG", and "ATAAGA" in the NIT family are -0.20, 8.40, and 1.13, respectively. The consensus sequence "AKATAAGA" is deduced from these three aligned sites and is similar to the previously characterized consensus "GATAAG" [Magasanik, 1992]. Similarly, the consensus "CGCACG" is derived from the putative sites "CGCAC" and "CGCACG" in the PHO family and is similar to the Pho4p consensus "GCACGTGGG" characterized in [Oshima et al., 1996]. Moreover, other consensus sites such as "GGCACA" and "TGTGCC" are discovered in the NIT family, and the consensus "ACGTATATA" is discovered in the PHO family.
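The deduction of a consensus from aligned sites can be illustrated with a short Python sketch. It takes each site together with an alignment offset, collects the bases seen in every column, and writes each column with the corresponding IUPAC code. The offsets in the example (0, 1, and 2) are an assumption about how the three NIT sites overlap; they are not given in the text, and the function name is hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC.items()}

def merge_aligned_sites(sites):
    """Merge (sequence, offset) pairs into one IUPAC consensus string.

    Columns not covered by any site are written as "N".
    """
    length = max(offset + len(seq) for seq, offset in sites)
    columns = [set() for _ in range(length)]
    for seq, offset in sites:
        for i, base in enumerate(seq.upper()):
            columns[offset + i].add(base)
    return "".join(CODE_FOR_BASES.get(frozenset(col), "N") for col in columns)

# Assumed alignment of the three NIT sites.
print(merge_aligned_sites([("ATATAA", 0), ("GATAAG", 1), ("ATAAGA", 2)]))
```

With these offsets the second column holds both T and G, which is what the wildcard K in "AKATAAGA" encodes.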
Table 5: The regulatory families and their regulatory property [12].
| Family | Genes | Common regulatory property | Reference |
| NIT | DAL5, DAL80, GAP1, MEP1, MEP2, MEP3, PUT4 | Repressed when good nitrogen sources (glutamine, glutamate, ammonia) are present in the medium | Magasanik (1992) |
| PHO | PHO5, PHO11, PHO8, PHO84, PHO81 | Repressed by Pi | Oshima et al. (1996) |
| MET | MET3, MET2, MET14, MET6, SAM1, SAM2, MET1, MET30, MUP3 | Repressed by methionine | Hinnebusch (1992), Blaiseau et al. (1997) |
Table 6: Detection of regulatory sites by combinations of known site homologs and over-represented repeats. For each family, the repeats are indicated and significance values higher than 0 are highlighted in bold. The last two columns show the sites previously characterized. The symbol Ms (matching sequences) denotes the number of genes from the family containing at least one occurrence of the site; Occ denotes the number of occurrences of the site in all promoter regions from the family; Exp is the expected number of occurrences; Sig denotes the significance index.
| Gene family | Putative regulatory elements | Ms | Occ | Exp | Sig | Consensus | Site previously characterized: consensus | Site previously characterized: bound factors |
| NIT | ATATAA | 6 | 10 | 12.05 | -0.20 | AKATAAGA | GATAAG | Gln3p, Nil1p, Gzf3p, Uga43p (Zn finger) |
| | GATAAG | 6 | 25 | 4.04 | 8.40 | | | |
| | ATAAGA | 6 | 19 | 6.34 | 1.13 | | | |
| | GGCAC | 5 | 10 | 6.89 | -0.91 | GGCACA | -- | -- |
| | GCACA | 5 | 11 | 9.46 | -0.26 | | | |
| | TGTGC | 5 | 11 | 9.46 | -0.29 | TGTGCC | -- | -- |
| | GTGCC | 5 | 10 | 6.89 | -0.92 | | | |
| PHO | CGCAC | 4 | 9 | 3.015 | -0.43 | CGCACG | GCACGTGGG | Pho4p (bHLH) |
| | CGCACG | 4 | 5 | 0.52 | 0.37 | | | |
| | ACGTATA | 4 | 6 | 0.72 | 0.07 | ACGTATATA | -- | -- |
| | ACGTATATA | 4 | 4 | 0.13 | -0.14 | | | |
| MET | TCACGT | 8 | 17 | 2.71 | 5.00 | TCACGTGA | TCACGTG | Cbf1p-Met4p-Met28p complex (Zn finger) |
| | TCACG | 8 | 21 | 8.80 | 0.77 | | | |
| | CACGTG | 8 | 11 | 0.83 | 5.51 | | | |
| | CACGT | 8 | 23 | 8.83 | 1.59 | | | |
| | ACGTGA | 8 | 17 | 2.71 | 5.00 | | | |
| | CGTGA | 8 | 21 | 8.80 | 0.77 | | | |
This study identified combinations of known site homologs and over-represented repetitive oligonucleotides located within the promoter regions of groups of functionally related genes. Each promoter region is mapped to a "transaction"; known site homologs and over-represented repetitive oligonucleotides are mapped to items of a transaction. Data mining techniques are then applied to mine the associations. The enormous number of associations makes it extremely difficult to identify those that are interesting and useful. Therefore, the redundant rules are pruned, and putative regulatory elements are obtained from the remaining associations.
Our proposed approach can mine putative regulatory elements of any complete genome such as yeast in this study. The parameters to identify over-represented repetitive sequences within promoter regions of genes can be specified by users according to their needs. The discovered associations of known site homologs and putative regulatory elements can also provide effective information to researchers studying the mechanisms of gene transcriptional regulation.
It is noteworthy that the occurrence of repetitive sequences in association with homologs predicted from known TF binding sites suggests that these repetitive elements are putative regulatory elements, because groups of transcription factors usually occur in combination and act cooperatively, and some of them could be correlated with known site homologs. However, we also find several associations that do not contain any known site homologs. The meaning and function of these signals are of interest and need to be verified experimentally. Future work will consider longer sequences as well as site matrices.
The authors would like to thank the National Science Council of the Republic of China for financially supporting this research under Contract No. NSC 89-2213-E-008-061. In addition, we would like to thank Professors Cheng-Yen Kao at National Taiwan Univ. and Chi-Gong Tong at National Central Univ. for their helpful suggestions. We would also like to thank our referees for their helpful comments and suggestions.