Construction of stochastic context trees for genetical texts

Yu. L. Orlov1,*, V. P. Filippov1, V. N. Potapov2 and N. A. Kolchanov1




1Institute of Cytology and Genetics SB RAS,
Acad. Lavrentiev ave., 10,
Novosibirsk, 630090, Russia.
2Sobolev Institute of Mathematics SB RAS,
Acad. Koptyug prospect, 4,
Novosibirsk, 630090, Russia.
*Corresponding author
Phone: +7(3832)332971
Fax: +7(3832)331278
E-mail: orlov@bionet.nsc.ru






The method was developed for constructing the source-tree model (the variant of the hidden Markov model) of symbol sequences of DNA and proteins. Estimation of stochastic complexity of the data in the frames of a model serves as a criterion for the model's ascertainment. The software realization of this algorithm enables to reveal statistical properties of sequences, which are informative for functional site recognition. The data extracted from the "Samples" database (http://wwwmgs.bionet.nsc.ru/cgi-bin/mgs/nsamples/) were analyzed by the algorithm suggested. The program developed is available via Internet - http://wwwmgs.bionet.nsc.ru/mgs/programs/.

Let us consider a DNA sequence as a consequence of letters occurring with a definite probability. It is well known that in real sequences, the probability of a letter's occurrence depends upon the preceding set of symbols, that is, genetical texts possess by Markov property. In this connection, hidden Markov models (HMM) are intensively applied for gene structure analysis [Durbin R. et al., 1998]. By describing genetical sequence within the frames of particular Markov model, we may characterize the set of functionally significant regions more flexibly and precisely in comparison with the models applying consensus or weight matrices. We can use context-sensitive grammars to describe sequence more precisely. Let M will be a set of non-overlapping contexts (states of Markov chain), - set of probability distributions to generate symbol X in these states. A stochastic grammar model <M,> generates different strings Xn = X1X2...Xn with probabilities

The probability P(Xi|Sj) of the letter Xi{A,T,G,C} occurrence in the context SjS is defined by the set of parameters as follows:

The set of sources with the similar context tree is called a model M. The sources within a model differ by the values of the set of parameters . Thus, each source is determined by the pair <M,>. The probability of generating the message Xn by the source <M,> is given by the equation = P(X1|S1) P(X2|S2)...P(Xn|Sn), where is a short context containing the letter Xi (to the right) in the word Xn, and the values P(Xi|Sj) are determined by the set of parameters .

Bernoulli sources and Markov sources of the finite order are the particular cases of the source- trees. The model M0 of the set of Bernoulli sources with the alphabet D has a tree consisting of a single root (the sources possess by a single state).

The high order models describe the data set with the most accuracy, however, increase in the order of a model generates numerous excessive parameters. The problem of choosing a model better fitting to the data is equivalent to the question "How many preceding letters really influence the probability of the next in turn letter occurrence?" Here we suggest an approach enabling to reveal significant parameters of Markov model for an adequate description of the sequence under study.

It has been known that the problems of analysis of statistical properties for a consequence of letters are tightly related to the data coding (or data compression) problem. We suggest to apply the method, used extensively in the theory of sources coding, for DNA sequence studying. The method suggested is based on the algorithm "Context" developed by J. Rissanen [Rissanen J., 1999].

It is convenient to represent Markov dependencies in a form of the trees generating the sequences, which have a singled-out root and suspended vertices (leaves of a tree). In such a tree, each direction corresponds to a particular symbol and each vertex - to a definite context. As a context we denote a set of nucleotides occurring on the path connecting the suspended vertex with the root of a tree. An example of the context tree presentation in standard (contexts are from down leaves to the top) and in round-tree forms (from circle leaves to center) are given in the Figure.


Figure: An example of a generative source-tree. The trees are constructed for Sequence of human beta globin region on chromosome 11, 73308 bp (EMBL ID: HSHBB). Each path from a leaf to the root corresponds to the context and has own set of probabilities to generate next symbol in a DNA sequence.

Following J.Rissanen, as a criterion of choosing a model better fitting to corresponding data, we suggest the stochastic complexity of data in a chosen model, which we evaluate numerically [Orlov Yu.L., Potapov V.N., 2000]. The less is complexity of the data, the better these data fit to a model.

Under complexity of a message (DNA sequence) relatively known distribution of probabilities, we set the module of the logarithm of the probability of a message to the base 2, as it is accepted in the theory of the coding sources. A stochastic complexity of the data relatively the model is defined as the minimal sum of complexities of messages, with the known distribution, set by the model's parameters, and complexity of these parameters description. Stochastic complexity of the sequence in Bernoulli model equals to the minimal number of bits, by means of which we may write the frequencies of symbols in a sequence and the sequence itself if the frequencies of symbols are known. The calculation of stochastic complexity allows to discriminate the data, which mostly satisfy to Bernoulli model (e.g., random sequences, non-functional regions), from the data better fitting to a Markov models (functional sites). Moreover, it is possible to discriminate the data corresponding to different Markov models (e.g., functional sites differing by their nature).

The method suggested permits to analyze not only statistical model of a sequence in a whole, but to find statistical heterogeneity within the sequence. Along with choosing of a model for the whole sequence, we will determine suitable models and stochastic complexities for its constituents. This will enable to evaluate statistical homogeneity of a sequence and to make conclusions about functional potential of its constitutive parts.

It is natural to account that the model M at most fits to the data Xn, in case the stochastic complexity of the data Xn in the model M is minimal. Thus, we can estimate it statistically. The graphical presentations are in the Figure. We have analyzed the several data sets of nucleotide sequences stored in the "Samples" database (http://wwwmgs.bionet.nsc.ru/cgi-bin/mgs/nsamples/).

The sequences with the similar functions represented in each of the sets were considered as generated by the same source, the structure of which (generative source-tree) is unique for realization of the function given. The presence of such a tree enables to classify genetic sequences by the groups with the common sources. For example, promoter gene regions are the objects organized in a complex mode accordingly their context. They include different functional signals, e.g., regulatory proteins binding sites, and, in general, they are the result of super-position of several degenerate genetical codes, for example, the nucleosome code. Various regions of extensive regulatory sequence should possess by different physical-chemical properties that could be revealed statistically.

The algorithm for calculation the complexity and constructing the tree is designed for an arbitrary symbolic alphabet. The letters of the alphabet Ak are coded by the numbers k. Practically, we can analise any texr, e.g. in natural language. By applying various coding for nucleotide and amino acid sequences, we may conclude about relative significance of different regularities determining the concrete structure of the source-tree for the data studied.

Empirical approach of such detection of non-random contexts supplements the estimates of under- and over-representation of oligonucleotides (http://www.cs.purdue.edu/homes/stelo/Verbumculus/), which are characteristic for functional genome regions and genomes of various organisms. Besides, all non-random contexts can be detected, not by calculating relative occurrence of the nucleotides, but by means of ordering local organization of symbols after these contexts (relative distribution of symbols' occurrence frequencies).

Determining of dependency upon the context by the method suggested provides a basis for construction of methods necessary for recognition of particular functional regions of regulatory sequence. It was established that Markov chain (dependency upon the preceding context) for functional sites in regulatory regions has, as a rule, the first-second order.


REFERENCES