Program "Gene Discovery" for Pattern Matching in Promoter Sequences

Yu. L. Orlov1,*, E. E. Vityaev2, O. V. Vishnevsky1, A. S. Belenok1, B. K. Kovalerchuk3, M. A. Pozdnyakov1 and N. A. Kolchanov1




1Institute of Cytology and Genetics SB RAS,
Acad. Lavrentiev ave., 10,
Novosibirsk, 630090, Russia.
2Sobolev Institute of Mathematics SB RAS,
Acad. Koptyug prospect, 4,
Novosibirsk, 630090, Russia.
3Computer Science Department, Central Washington University,
Ellensburg, WA, 98926-7520, USA.
*Corresponding author
Phone: +7(3832)332971
Fax: +7(3832)331278
E-mail: orlov@bionet.nsc.ru






Keywords: machine learning, knowledge discovery, data mining, bioinformatics, eukaryotic promoter recognition, transcription factor binding sites



Data Mining and Knowledge Discovery methods were applied to search for regularities in tables of context features of DNA sequences involved in transcription regulation. The task was to find regularities that link nucleotide sequences to their functional class. The search patterns were formulated in first-order logic with probabilities. A PC program, "Gene Discovery", was designed to discover such regularities (pattern matching). The program retrieves molecular-genetic data through SQL queries. Sequences of erythroid-specific promoters and of promoters of endocrine system genes from the TRRD database [Kolchanov N.A. et al., 2000] were analysed with this system. Regularities were found that link nucleotide sequences in regulatory DNA, their location relative to the transcription start, and the functional class. A method for recognising the promoter class of regulatory DNA was developed on the basis of these regularities.

Analysis of promoter structure is of great interest for understanding the molecular mechanisms of gene transcription. The core (basal) promoter is the main element of the gene regulatory region necessary for transcription initiation. Promoters in eukaryotic organisms act as molecular "switches" that turn genes on and off. Each gene has at least one promoter upstream of its protein-coding part. A promoter contains transcription factor binding sites: short stretches of DNA that are sufficiently conserved to allow specific recognition by the corresponding protein. The presence and location of transcription factor binding sites in the 5' regulatory regions of genes correspond to the tissue- and stage-specific features of gene expression in the organism. One gene can contain several promoters that determine the expression of different protein products or of proteins with different levels of specific functional activity. Moreover, the context signals in eukaryotic promoters are weak and have no exact localisation. This diversity is the main difficulty in developing recognition programs.

The computer program "Gene Discovery" was developed to analyse the structural organisation of eukaryotic promoters using information on experimentally verified and computer-predicted sites. "Gene Discovery" is an adaptation of the "Discovery" system [Kovalerchuk B. et al., 2001] to molecular biology tasks. It consists of three main modules: (1) a module for on-line representation of context signals from DNA sequences in a standard table form; (2) the "Discovery" module, which searches for regularities; (3) a module that recognises the class of a sequence using the regularities found. The program is written in C++ and has a user-friendly interface.
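The table form produced by the first module is not specified here. As a rough illustration only, the Python sketch below (not the program's C++ code) builds one row per sequence, with a class label and the start positions of each user-defined word. The function name signal_table, the column layout and the example sequences are all hypothetical.

```python
def signal_table(sequences, words):
    """Build one row per sequence: its name, its class label, and for each
    word the list of 0-based start positions where the word occurs.

    `sequences` maps a name to a (sequence, class_label) pair.
    The row layout is an assumption made for illustration.
    """
    table = []
    for name, (seq, label) in sequences.items():
        row = {"name": name, "class": label}
        for word in words:
            # Scan every start position so overlapping occurrences are kept.
            row[word] = [i for i in range(len(seq) - len(word) + 1)
                         if seq.startswith(word, i)]
        table.append(row)
    return table


# Made-up example data, not taken from TRRD.
sequences = {
    "p1": ("GGTATAAAGG", "promoter"),
    "r1": ("GCGCGCGCGC", "random"),
}
table = signal_table(sequences, ["TATA", "GCGC"])
for row in table:
    print(row)
```

A table of this kind gives every sequence the same set of columns, which is what a regularity search over two classes needs as input.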

A training sample of nucleotide sequences from two alternative classes is used as input to the system. It consists of promoter sequences specific to a functional system and of random sequences. The latter can be computer-generated random sequences with the same nucleotide frequencies, or real sequences from neighbouring regions that do not have this regulatory function, such as exons.
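One simple way to obtain computer-generated sequences with the same nucleotide frequencies is sketched below in Python. Two variants are shown: independent sampling from the observed frequencies, and shuffling, which preserves the composition exactly. Which variant (if either) the program uses is not stated; the function names and the example promoter are hypothetical.

```python
import random
from collections import Counter


def random_like(seq, rng):
    """Return a random sequence of the same length as `seq`, with each
    position drawn independently from the nucleotide frequencies of `seq`."""
    counts = Counter(seq)
    letters = sorted(counts)
    weights = [counts[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(seq)))


def shuffled(seq, rng):
    """Return a random permutation of `seq`; composition is kept exactly."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(0)                      # fixed seed for repeatability
promoter = "GGGTATAAAAGGCCGCGATTGGCTCAGT"   # made-up example sequence
background = random_like(promoter, rng)
permuted = shuffled(promoter, rng)
print(background)
print(permuted)
```

Both variants destroy the order of nucleotides while keeping their frequencies, so any word-level signal that survives in the promoter class but not in the background cannot be explained by base composition alone.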

A program block searches for context signals in the sequences of these two classes. A signal can be: contextual (a user-defined short nucleotide word (oligonucleotide) or a functional site described in the specialised molecular biology database TRRD); conformational (a DNA region characterised by particular physico-chemical properties, for example an easily melting DNA region or curved DNA); or structural (Z-DNA, RNA hairpin). All these signals can be recognised using knowledge of DNA properties and consensus schemes based on experimental data stored in specialised databases.

Here we consider degenerate oligonucleotides as context signals specific to promoters. The "Gene Discovery" search found a large number of regularities describing the joint occurrence of context signals in promoter regions. The number of regularities depends on the user-defined search parameters.
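The alphabet used for degenerate oligonucleotides is not specified here. The Python sketch below assumes the standard IUPAC nucleotide codes (for example W = A or T, N = any base) and shows how the occurrences of such a word could be located in a sequence. The function name find_motif and the example sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes; their use here is an assumption.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_motif(motif, seq):
    """Return the 0-based start positions of every (possibly overlapping)
    occurrence of the degenerate word `motif` in `seq`."""
    pattern = "".join("[" + IUPAC[c] + "]" for c in motif.upper())
    # A zero-width lookahead lets the scan report overlapping matches.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", seq.upper())]


positions = find_motif("TATAWA", "GGTATAAAGGTATATA")
print(positions)  # [2, 10]
```

Here TATAWA matches both TATAAA (at position 2) and TATATA (at position 10), which is the sense in which one degenerate word stands for a small family of exact words.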

The regularities found can be analysed by a molecular biology expert as unique complex signals that are significant for proper promoter functioning. Let us consider selected rules for the simultaneous presence of oligonucleotides in a promoter as large complex signals. Additional conditions were used to select a subset of complex signals:

  1. the oligonucleotides in the complex signal do not overlap in the promoter sequence;
  2. the observed number N of promoters containing the complex signal is greater than the expected number N* (N > N*).
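The two conditions can be sketched in Python as follows. How the expected number N* is computed is not stated here; the sketch assumes, purely for illustration, that the two oligonucleotides occur independently, so that N* = n1 * n2 / n, where n is the total number of promoters and n1, n2 are the numbers of promoters containing each oligonucleotide. All function names are hypothetical.

```python
def overlaps(pos1, len1, pos2, len2):
    """True if the half-open intervals [pos1, pos1+len1) and
    [pos2, pos2+len2) share at least one position."""
    return pos1 < pos2 + len2 and pos2 < pos1 + len1


def expected_joint(n_total, n_s1, n_s2):
    """Expected number N* of promoters containing both words, ASSUMING the
    two words occur independently (an assumption of this sketch)."""
    return n_s1 * n_s2 / n_total


def keep_signal(n_total, n_s1, n_s2, n_observed):
    """Condition 2: keep the complex signal only if N > N*."""
    return n_observed > expected_joint(n_total, n_s1, n_s2)


# Condition 1: a 6-mer at position 0 and a 4-mer at position 6 do not overlap.
print(overlaps(0, 6, 6, 4))            # False
# Condition 2 on made-up counts: N* = 40 * 50 / 100 = 20.
print(keep_signal(100, 40, 50, 30))    # True
```

A stricter selection could replace the simple comparison N > N* with a statistical test, but that goes beyond what is stated above.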

Thus, the system found a group of oligonucleotide motifs that display a certain pattern of relative location in promoter sequences (a complex signal). The simplest complex signal (S1,S2) is formed by a pair of oligonucleotides and is specified as follows: (S1,S2) = (Position(S1)<Position(S2)). The presence of such a complex signal in a query sequence can be used as a rule for assigning the sequence to the given promoter class.
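A minimal Python sketch of such a rule is given below. It uses exact words for brevity (degenerate words would need a suitable matcher) and additionally requires that S2 start after the end of S1, so that the two words do not overlap. The function names, the example words and the example sequence are hypothetical, and the any-rule decision in classify is only one possible way to combine rules.

```python
def has_complex_signal(seq, s1, s2):
    """True if `s1` occurs in `seq` and `s2` occurs downstream of it without
    overlapping, i.e. Position(S1) < Position(S2).

    Checking the earliest occurrence of `s1` is sufficient: if any valid
    pair of occurrences exists, the earliest `s1` also precedes that `s2`.
    """
    first = seq.find(s1)
    if first < 0:
        return False
    return seq.find(s2, first + len(s1)) >= 0


def classify(seq, rules):
    """Assign `seq` to the promoter class if at least one rule fires."""
    return any(has_complex_signal(seq, s1, s2) for s1, s2 in rules)


query = "GGCCAATCTTATAAAGG"            # made-up query sequence
rules = [("CCAAT", "TATAAA")]          # made-up rule (S1, S2)
print(has_complex_signal(query, "CCAAT", "TATAAA"))  # True
print(has_complex_signal(query, "TATAAA", "CCAAT"))  # False
print(classify(query, rules))                         # True
```

The second call shows that the rule is order-sensitive: the same two words in the opposite order do not form the signal.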

It should be noted that the system does not overfit (over-learn on) the training samples. Any sample of nucleotide sequences can be analysed in a similar way. Promoter recognition based on the regularities found is a topic for further discussion.

The functional meaning of a signal can be verified by experts in biology. It can be interpreted in terms of transcription factor binding sites or the conformational properties of DNA [Kondrakhin Yu.V. et al., 1995; Klingenhoff A. et al., 1999].


REFERENCES