In Silico Biology 2, 0024 (2002); ©2002, Bioinformation Systems e.V.
GCB '01
1 Sobolev Institute of Mathematics SB RAS, Acad. Koptyug prospect, 4,
Novosibirsk, 630090, Russia
E-mail: vityaev@math.nsc.ru
2 Institute of Cytology and Genetics SB RAS, Acad. Lavrentiev ave., 10,
Novosibirsk, 630090, Russia
E-mail: orlov@bionet.nsc.ru,
oleg@bionet.nsc.ru,
mike@bionet.nsc.ru,
kol@bionet.nsc.ru
Edited by E. Wingender; received December 20, 2001; revised and accepted February 08, 2002; published March 15, 2002
This paper presents an implementation of Data Mining and Knowledge Discovery techniques for finding regularities in tables of context features of DNA sequences involved in the regulation of transcription. The goal is to discover regularities that relate nucleotide sequences to their functional classes. The search patterns for regularities are constructed in first-order logic augmented by probabilistic estimates. For this purpose, the PC software system "Gene Discovery" has been designed. The system accepts molecular genetic data retrieved from a database by SQL queries. Nucleotide sequences of promoters of several functional systems were extracted from the TRRD database (http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/) and analysed. The data include nucleotide sequences of erythroid-specific gene promoters, endocrine system gene promoters, promoter regions of genes controlling the cell cycle, promoters of genes regulating lipid metabolism, and muscle-specific gene promoters. For each functional class, several regularities were found that relate the nucleotide sequences in the regulatory DNA, and their location relative to the transcription start, to that class.
Key words: Machine learning, knowledge discovery, data mining, bioinformatics, eukaryotic promoter recognition, transcription factors binding sites, oligonucleotide patterns
Analysis of promoter structure is of great interest for understanding the molecular mechanisms of gene transcription. The presence and location of transcription factor binding sites in the 5' regulatory regions of genes correspond to the tissue- and stage-specific features of gene expression in an organism. The control of eukaryotic gene expression is primarily determined by relatively short sequences (signals/motifs) in the region surrounding a gene. These sequences vary in length, position, redundancy, orientation in the DNA chain, and base composition. Eukaryotic promoters are characterised by the absence of exact localisation of context signals and by the weakness of such signals [1]. The diversity of promoters is the main difficulty in developing recognition programs [2]. In recent years, large-scale data mining, knowledge discovery, and other computational approaches of Machine Learning have been used intensively in bioinformatics [3, 4, 5]. Several computational approaches have recently been suggested to address the challenges of combinatorial regulation of transcription [6, 7]. In particular, they concern computer selection of specific oligonucleotides [8] and mining associations between them [9].
Our approach, based on Data Mining methods, selects specific oligonucleotide patterns that describe the functional class of a gene [10]. The program is trained on a sample of nucleotide sequences of promoter regions. It is hard to describe all eukaryotic promoter sequences by a common pattern because of the huge variability of transcription factor binding sites. To overcome this difficulty, sets of promoters of genes performing a similar function were extracted from the TRRD database (http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/) [11]. However, even such functional sets lack a single oligonucleotide pattern describing all sequences. A distinctive feature of the algorithm is the use of specific feature patterns, each describing a subgroup of the training set. The search patterns for regularities are constructed in first-order logic augmented by probabilistic estimates.
A demo version of the program is available from the authors upon request (vityaev@bionet.nsc.ru).
The computer program "Gene Discovery" was developed for the analysis of the structural organisation of eukaryotic promoters using information on experimentally verified and computer-predicted sites. The system "Gene Discovery" is an adaptation of the system "Discovery" [12, 13, 14] to the tasks of molecular biology. "Gene Discovery" consists of three main modules:
(1) the module for on-line representation of the context signals from DNA sequence in a standard table form;
(2) the module "Discovery" aimed at searching for regularities;
(3) the module for recognition of the sequence class by using the regularities found.
The program is written in C++ and is supplied with a user-friendly interface.
The system "Discovery" [14] implements a machine learning method that reveals statistically significant first-order logic rules for the functional annotation of regulatory gene regions. Learning systems based on first-order representations have been successfully applied to many problems in psychology, physics, medicine, finance, and other fields [12, 13]; see also the "Scientific Discovery" web-site (http://www.math.nsc.ru/LBRT/logic/vityaev/), section "applications". Because the technique is based on logic rules, it yields human-readable forecasting rules that can be interpreted in biological terms and hence also support promoter recognition (functional annotation) [15]. An expert in biology may evaluate both the correctness of the recognition and that of the rules themselves.
An example of an oligonucleotide motif in the 15-letter alphabet is CWGNRGCN. Consider the following example of a forecasting rule:
If: CWGNRGCN<NGSYMTAM<MAGKSHCN
Then: Sequence class = promoter.
The symbol "<" here designates that positions of corresponding oligonucleotides are ordered relative to the transcription start.
This rule means: if the motifs CWGNRGCN, NGSYMTAM, and MAGKSHCN are all present in the sequence under analysis, in this order and without overlapping, then the sequence contains a promoter of an endocrine system gene. In this way, all statistically significant oligonucleotide patterns are constructed in the form S1 & S2 & S3 ... & Sk, where k > 1. The program automatically determines the number of signals in such a pattern [10].
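The check expressed by such a rule can be illustrated with a minimal Python sketch. It is not the "Gene Discovery" implementation (which is written in C++); the function names and the greedy leftmost search are assumptions made here for illustration. The table is the standard 15-letter IUPAC nucleotide code.

```python
# Standard 15-letter IUPAC nucleotide code: symbol -> bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_matches_at(seq, motif, pos):
    """Return True if the degenerate motif matches seq starting at index pos."""
    if pos < 0 or pos + len(motif) > len(seq):
        return False
    return all(seq[pos + i] in IUPAC[sym] for i, sym in enumerate(motif))


def find_ordered_signal(seq, motifs):
    """Find non-overlapping occurrences of the motifs in the given order.

    Hypothetical helper: takes the leftmost admissible hit for each motif,
    which is enough to decide whether an ordered chain exists.
    Returns the list of 0-based start indices, or None if the rule
    does not apply to the sequence.
    """
    seq = seq.upper()
    positions = []
    start = 0
    for motif in motifs:
        hit = None
        for pos in range(start, len(seq) - len(motif) + 1):
            if motif_matches_at(seq, motif, pos):
                hit = pos
                break
        if hit is None:
            return None
        positions.append(hit)
        start = hit + len(motif)  # the next motif must begin after this one ends
    return positions


if __name__ == "__main__":
    rule = ["CWGNRGCN", "NGSYMTAM", "MAGKSHCN"]
    toy = "TTCAGTAGCTGGAGCCATAATAAGGCACT"  # made-up sequence, not real data
    print(find_ordered_signal(toy, rule))
```

Gap lengths between the motifs are left free in this sketch; only the order and the absence of overlap are enforced.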
The computer system "Gene Discovery" applies the methods described above to the analysis of nucleotide sequences of regulatory regions. The principal scheme is given in Figure 1.
The learning sample of nucleotide sequences of two alternative classes is used as input to the system. The learning sample consists of the sequences of promoters specific to the functional system (positive set) and some random sequences (negative set). The latter is a computer-generated random set of sequences with the same nucleotide frequencies or the set of real sequences within the neighbouring regions, which do not perform the particular regulatory function.
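As a rough illustration of the first kind of negative set, the sketch below draws random sequences whose expected base frequencies match those of the positive set. The function name, the fixed seed, and the choice to pool frequencies over the whole positive set are assumptions, not details taken from the paper.

```python
import random
from collections import Counter


def random_negative_set(positives, n_sequences, seed=0):
    """Generate random sequences with the pooled base frequencies of `positives`.

    Hypothetical helper: each generated sequence has the length of the first
    positive sequence; only A, C, G and T are counted.
    """
    if not positives:
        raise ValueError("the positive set is empty")
    rng = random.Random(seed)
    counts = Counter("".join(positives).upper())
    bases = "ACGT"
    weights = [counts[b] for b in bases]
    if sum(weights) == 0:
        raise ValueError("no A, C, G or T found in the positive set")
    length = len(positives[0])
    return [
        "".join(rng.choices(bases, weights=weights, k=length))
        for _ in range(n_sequences)
    ]


if __name__ == "__main__":
    toy_positives = ["ACGT" * 30, "AACC" * 30]  # made-up data
    for s in random_negative_set(toy_positives, 3):
        print(s)
```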
In what follows, we consider degenerate oligonucleotides as context signals specific for promoters.
As an example, let us describe an analysis of the endocrine system gene promoters. A sample of 40 sequences was extracted from the TRRD database (http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/). The sequences were 120 bp long (from -100 bp to +20 bp relative to the transcription start). The level of homology between any pair did not exceed 60%.
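The sample preparation described above might be sketched as follows. The paper does not say how homology was measured; the sketch assumes ungapped percent identity between sequences aligned at the transcription start, and all function names are hypothetical.

```python
def promoter_window(seq, tss_index, upstream=100, downstream=20):
    """Cut the region from -upstream to +downstream around the transcription start.

    `tss_index` is the 0-based index of position +1; with the defaults the
    window is 120 bp long (100 upstream bases plus positions +1..+20).
    """
    start = tss_index - upstream
    end = tss_index + downstream
    if start < 0 or end > len(seq):
        raise ValueError("sequence too short for the requested window")
    return seq[start:end]


def percent_identity(a, b):
    """Ungapped identity (in percent) between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def filter_redundant(seqs, max_identity=60.0):
    """Greedily keep sequences whose identity to every kept one is <= max_identity."""
    kept = []
    for s in seqs:
        if all(percent_identity(s, k) <= max_identity for k in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    print(len(promoter_window("A" * 200, tss_index=150)))
    print(filter_redundant(["AAAA", "AAAT", "TTTT"]))
```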
The program accepts any sequence set in FASTA format as input. A functional sample can be extracted from EPD, TRANSFAC [16] (http://www.gene-regulation.de/), TRRD (http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/), or EpoDB (http://www.cbil.upenn.edu/EpoDB/release/version_2.2/epodb.html).
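For readers unfamiliar with the input format, a minimal FASTA reader can be sketched in a few lines of Python. This is an illustration only and says nothing about how "Gene Discovery" itself parses its input; the function name is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}.

    Header lines start with '>'; sequence lines that follow are
    concatenated and upper-cased. Blank lines are ignored.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


if __name__ == "__main__":
    example = ">promoter_1 toy record\nACGTAC\ngtacgt\n>promoter_2\nTTTTAA\n"
    for header, seq in parse_fasta(example).items():
        print(header, seq)
```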
The program ARGO (http://wwwmgs.bionet.nsc.ru/mgs/programs/argo/) was used to select specific oligonucleotides 8 bp in length [17]. The term "degenerate oligonucleotides" denotes oligonucleotides written in the 15-letter IUPAC nucleotide code.
Analogously, other functional sets of promoters extracted from the TRRD database were analysed, including erythroid-specific gene promoters, promoter regions for the cell cycle controlling genes, promoters of genes controlling lipid metabolism, and promoters of genes expressed in muscle.
A large number of regularities describing the joint occurrence of context signals in promoter regions were found by the "Gene Discovery" search. The number of regularities depends on two user-defined search parameters: the conditional probability threshold (greater than 0.5) and the significance level for the Fisher criterion (less than 0.05). The regularities found can be analysed as unique complex signals (patterns of oligonucleotide motifs) significant for proper promoter functioning.
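The two thresholds can be made concrete with a small sketch. It assumes that the "Fisher criterion" is the one-sided Fisher exact test on the 2x2 table of pattern presence against sequence class; that reading, the function names, and the toy counts are assumptions rather than details given in the paper.

```python
from math import comb


def conditional_probability(pos_hit, neg_hit):
    """Estimate P(class = promoter | pattern present) from raw counts."""
    total = pos_hit + neg_hit
    return pos_hit / total if total else 0.0


def fisher_one_sided(pos_hit, neg_hit, pos_miss, neg_miss):
    """One-sided Fisher exact test p-value for enrichment in the positive set.

    Sums hypergeometric probabilities of seeing at least `pos_hit`
    positives among all sequences that carry the pattern.
    """
    n_pos = pos_hit + pos_miss
    n_neg = neg_hit + neg_miss
    n_hit = pos_hit + neg_hit
    denom = comb(n_pos + n_neg, n_hit)
    tail = sum(
        comb(n_pos, k) * comb(n_neg, n_hit - k)
        for k in range(pos_hit, min(n_pos, n_hit) + 1)
    )
    return tail / denom


def is_regularity(pos_hit, neg_hit, pos_miss, neg_miss,
                  min_prob=0.5, alpha=0.05):
    """Apply both thresholds named in the text to one candidate pattern."""
    prob = conditional_probability(pos_hit, neg_hit)
    p_value = fisher_one_sided(pos_hit, neg_hit, pos_miss, neg_miss)
    return prob > min_prob and p_value < alpha


if __name__ == "__main__":
    # Toy counts: pattern found in 12 of 40 positives and 1 of 40 negatives.
    print(conditional_probability(12, 1))
    print(fisher_one_sided(12, 1, 28, 39))
    print(is_regularity(12, 1, 28, 39))
```

Using `math.comb` keeps the sketch free of third-party dependencies; exact integer arithmetic is adequate for samples of the size discussed here.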
Let us consider the selected signal CWGNRGCN<NGSYMTAM<MAGKSHCN. Here the symbol "<" means that the positions of the corresponding oligonucleotides are ordered relative to the transcription start. An example of the location of this complex signal is presented in Figure 2.
Figure 2: Schematic localisation of the complex signal CWGNRGCN<NGSYMTAM<MAGKSHCN in promoters of genes of the endocrine system.
The promoter sequences are aligned relative to the transcription start (position +1 bp), indicated by arrows. The EMBL identifiers of the promoters studied are given in parentheses on the left. The eight-bp oligonucleotide motifs composing the complex signal are shown as shaded green rectangles; positions of their first nucleotides are indicated relative to the transcription start. Red rectangles mark the positions of the TATA-boxes as indicated in the TRRD database; positions of their first and last nucleotides are italicised. Interestingly, only a single oligonucleotide in the complex signal corresponds to a real annotated site, whereas the others could correspond to potential transcription factor binding sites or to double-stranded DNA regions with specific physicochemical properties.
Thus, the system "Gene Discovery" makes it possible to find complex signals in promoter regions. All these signals may be interpreted using knowledge about DNA properties and the consensus scheme based on experimental data stored in specialized databases. Any sample of nucleotide sequences could be analysed in a similar way. The functional meaning of a signal could be treated in terms of transcription factor binding sites or conformational properties of DNA [7, 18].
We have considered several tasks: (i) promoter analysis and recognition using specific degenerate oligonucleotides as signals; (ii) transcription factor binding site analysis using short oligonucleotides and individual nucleotide bases; (iii) donor splice site recognition using individual nucleotide bases.
To estimate the accuracy of this approach, we performed cross-validation ("sliding control"), taking 80% of the sites as a training set and the rest as a test set. Given the set of regularities, the system can estimate the weight of each object in the set. Several regularities may apply to an unknown sequence; alternatively, a sequence may have no applicable regularities at all. Based on the set of regularities found, we construct the recognition function for a sequence from the weights of the regularities that apply to it. The prediction procedure based only on oligonucleotide motifs was described in [19].
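The evaluation scheme can be sketched as follows. The paper does not state how the weights of several applicable regularities are combined; taking the largest weight is an assumption made purely for illustration, as are the function names and the toy rules.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def recognition_score(seq, rules):
    """Score a sequence by the regularities that apply to it.

    `rules` is a list of (applies, weight) pairs, where `applies` is a
    predicate on the sequence. ASSUMPTION: the score is the largest weight
    among applicable rules; None means that no regularity applies, so no
    prediction is made for this sequence.
    """
    weights = [weight for applies, weight in rules if applies(seq)]
    return max(weights) if weights else None


if __name__ == "__main__":
    train, test = split_train_test(range(10))
    print(len(train), len(test))
    toy_rules = [  # made-up rules and weights
        (lambda s: "TATA" in s, 0.9),
        (lambda s: "GC" in s, 0.6),
    ]
    print(recognition_score("GGTATAGC", toy_rules))
    print(recognition_score("AAAA", toy_rules))
```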
We developed the program and applied it to donor splice site prediction in a user-defined set of nucleotide sequences.
Given the regularity weights for all objects in the training and test sets, the first and second type errors can be estimated for each set. For donor splice sites, the first and second type errors on the test data were 4.4% and 4.0%, respectively.
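A sketch of how such error rates are computed from labelled predictions is given below. The paper does not specify which class serves as the null hypothesis, so the sketch reports both rates under neutral names instead of mapping them to "first" and "second" type; the function name and the toy labels are hypothetical.

```python
def error_rates(true_labels, predicted_labels):
    """Return false positive and false negative rates for binary labels.

    Labels are truthy for "site" and falsy for "non-site".
    """
    pairs = list(zip(true_labels, predicted_labels))
    n_pos = sum(1 for t, _ in pairs if t)
    n_neg = sum(1 for t, _ in pairs if not t)
    false_neg = sum(1 for t, p in pairs if t and not p)
    false_pos = sum(1 for t, p in pairs if not t and p)
    return {
        "false_positive_rate": false_pos / n_neg if n_neg else 0.0,
        "false_negative_rate": false_neg / n_pos if n_pos else 0.0,
    }


if __name__ == "__main__":
    truth = [1, 1, 1, 1, 0, 0, 0, 0, 0]      # made-up labels
    predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0]  # made-up predictions
    print(error_rates(truth, predicted))
```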
A distinctive feature of the algorithm is that each pattern is derived from the subset of sequences carrying a complex signal. Thus, prediction is applicable only to sequences that match the oligonucleotide pattern. The lengths of the gaps in a pattern are not fixed. The sequences themselves may have very weak pairwise homology.
The authors are grateful to A. S. Belenok and B. K. Kovalerchuk for help in the research and to G. V. Orlova for translation of the article. The work was supported by the RFBR (grants 00-04-49229, 00-07-90337, 02-07-90355, 01-07-90376, 00-04-49255) and by a grant from the Siberian Division of RAS (Integration grant N65). Y.O. was supported by an INTAS grant (YSF 00-178).