In Silico Biology 2, 0022 (2002); ©2002, Bioinformation Systems e.V.  
G C B ' 0 1

Construction of stochastic context trees for genetic texts

Yury L. Orlov1, Vladimir P. Filippov1, Vladimir N. Potapov2, Nikolay A. Kolchanov1





1 Institute of Cytology and Genetics SB RAS,
Acad. Lavrentiev ave., 10,
Novosibirsk, 630090, Russia
Tel.: +7(3832)332971
Fax: +7(3832)331278

2 Sobolev Institute of Mathematics SB RAS,
Acad. Koptyug prospect, 4,
Novosibirsk, 630090, Russia.

E-mail: orlov@bionet.nsc.ru, filippov@narod.ru, vpotapov@math.nsc.ru, kol@bionet.nsc.ru





Edited by E. Wingender; received December 10, 2001; revised and accepted March 25, 2002; published April 11, 2002


Abstract

A method has been developed for constructing a tree source model for genetic text generation. Model visualisation in the form of suffix (context) trees provides a new way of context analysis of symbol sequences. Estimation of the stochastic complexity of the data in the frame of the model serves as a criterion for the model's ascertainment. The model and complexity values are used for analysis of genetic texts. The software realisation of this algorithm enables to reveal statistical properties of genetic sequences based on an information measure. The program developed is available via Internet at http://wwwmgs.bionet.nsc.ru/mgs/programs/complexity/.

Key words: complexity, information measure, suffix tree visualisation, variable memory Markov model, genetic texts, statistical modelling



Introduction

Functional annotation of unknown DNA regions based on sequence content is an important task of molecular biology. If a sequence under analysis has no similarity to annotated sequences from databases, then determination of its relationship to known functional classes could be based upon statistical models of these classes. Thus, it is important to model genetic text without incorporating any prior information. Such modelling is tightly connected with information measures of genetic texts.

There have been a considerable number of theoretical studies on information measures and complexity for genetic macromolecules [Ebeling and Jimenez-Montano, 1980]. Compression algorithms applied to DNA sequences were discussed by Grumbach and Tahi [1994]. Entropies of DNA sequences have been discussed by various authors [Mantegna et al., 1994; Schmitt and Herzel, 1997]. The mutual information, a measure intimately related to entropies, has been successfully used to predict protein coding regions in DNA [Grosse et al., 2000]. A more systematic approach is based on the reconstruction of the word frequencies using frequencies of other, shorter sub-words [Gorban et al., 2000]. Calculations made on several genomes detected no significant differences of reconstructed frequencies from real ones. Estimation of algorithmic complexity by Lempel-Ziv was used as a measure of repeat richness for nucleotide sequences in [Gusev et al., 1999]. This algorithmic approach was used also for complexity profile construction (http://wwwmgs.bionet.nsc.ru/mgs/programs/gc_net/).

Similar informational estimations have been done for proteins. Per residue entropy in the SwissProt have been estimated in [Rani and Mitra, 1996; Mitra and Arusharka, 2000]. The complexity of large sets of non-redundant protein sequences was measured in [Weiss et al., 2000]; there it was shown that proteins are fairly close to random sequences.

Another goal of our modelling is to visualize the structure of genetic text in a new way [Dutta and Das, 1992; Zhang, 1997]. New modelling approaches suggest to reveal subtle evolutionary constraints in genetic texts, i. e. different genetic codes as described by Trifonov [Trifonov, 1989; Trifonov, 1997]. Such modelling supports a new way for DNA structure analysis and structure prediction, for example Z-curve application for gene-finding [Yan et al., 1998]. The form of a suffix tree for visual presentation of statistically significant words in symbol sequences was described in [Apostolico et al., 2000] (see also Verbumculus Sequence Analyser: http://wwwdbl.dei.unipd.it/Verbumculus/). Suffix (context) trees incorporating non-overlapping words could be used as source for model text generation.

We provide an Internet-available tool for such modelling based on information measure and suffix tree presentation. The program is an integral part of the system GeneExpress 2.1 (http://wwwmgs.bionet.nsc.ru/mgs/gnw/) for data navigation, data analysis, and analysis of dependencies in the field of gene expression regulation.

Several sets of nucleotide sequences of different functional classes were analysed by the method suggested. These are coding and non-coding sequences, gene regulatory regions extracted from the "Samples" database (http://wwwmgs.bionet.nsc.ru/cgi-bin/mgs/nsamples/) and full genomes of microorganisms. Also amino acid sequences of protein domains extracted from the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/) were analysed. The visualisation of context trees presented in this paper could be reconstructed in GIF-format by the program developed (http://wwwmgs.bionet.nsc.ru/mgs/programs/complexity/).




Variable Memory Markov Model for genetic texts

Let us consider a DNA sequence as a sequence of letters occurring with a definite probability. We could postulate that the probability of a letter's occurrence depends upon the preceding set of symbols. Indeed, we suppose that nucleotide distribution in real genetic sequences does not depend on the position within a sequence, but only on the current local context, that is, genetic texts possess stationary properties. This assumption for long DNA sequences allows describing them by Markov models. More complex hidden Markov models (HMM) are intensively applied for gene structure analysis [Durbin et al., 1998].

The problem is to choose the proper oligonucleotide length for a Markov model description. What should be preferred as preceding contexts - dinucleotides, trinucleotides or only nucleotides A, T, G, C? Longer contexts mean a redundant number of model parameter, but shorter ones may not lead to an accurate data description. To solve this problem it is possible to use several models simultaneously as Dirichlet mixtures [Durbin et al., 1998]. We suggest a suffix tree to determine an unambiguous set of preceding contexts. Such contexts may vary in length, and neither of them could be the ending of another.

Suffix trees as a basis for text generation were studied for proteins. This alternative to the HMM was proposed in [Ron et al., 1996] in the form of a sub-class of probabilistic finite automata, the variable memory Markov models. These models capture longer correlations and higher order statistics of the sequence. It was shown that an optimal model could be developed using a construction called Prediction Suffix Tree (PST) [Ron et al., 1996]. Linear time algorithms allow to construct suffix trees efficiently [Apostolico et al., 2000]. The topology and complexity are determined by the data. The ability to model protein families by this technique has been demonstrated [Bejerano and Yona, 2001].

We will consider Variable Memory Markov Models for the generation of symbols based on a stationary source. The local context defines the current state of the Markov model independent of the position in the text. Long-range correlations in DNA are not considered by the method. It is known that the problems of analysis of statistical properties for a sequence of letters are tightly related to the source coding (data compression) problem. In order to study dependencies of the symbols upon the preceding context in DNA sequences, we have used the method of tree source [Orlov and Potapov, 2000], which was previously developed for the data compression theory [Barron et al., 1998; Rissanen, 1999].

As mentioned above, it is convenient to represent Markov dependencies in a form of the trees generating the sequences, which have a singled-out root and suspended vertices (leaves of a tree). In such a tree each vertex corresponds to a definite context. As a context we denote a set of nucleotides occurring on the path connecting the suspended vertex with the root of a tree. An example of the context tree presentation in standard form is given in Figure 1. The contexts are from the leaves (down) to the top (root).

Figure 1: An example of a generative tree source. The tree is constructed by the program for the sequence of the human -globin region on chromosome 11, 73308 bp (EMBL ID: HSHBB). Each path from a leaf to the root corresponds to the context and has an own set of probabilities to generate the next symbol in a DNA sequence.

The round form for the same tree is given in Figure 2. The contexts are from the circle leaves to the centre.

Figure 2: Round form for generative tree source. The topology corresponds to standard form presented in Figure 1.

The tree shown in Figures 1 and 2 presents only all contexts in variable memory Markov model for text generation. Each context corresponds to a state of the Markov model and defines the probability distribution for the generation of the next symbol. The technical task is to choose the tree that is optimal for the data from the set of all possible trees. Maximal context length is limited by the parameter d. For computational purposes the maximal length d should be less than 10.

As a criterion for choosing a model which optimally fits the corresponding data, we suggest the stochastic complexity of data in a chosen model which we evaluate numerically [Orlov and Potapov, 2000]. The less the complexity of the data estimated by the model is, the better fit these data to this model.

Under complexity of a message (DNA sequence) relative to the known distribution of probabilities, we set the module of the logarithm of the probability of a message to the base 2, as it is accepted in the theory of the coding sources. The stochastic complexity of the data relative to the model is defined as the minimal sum of complexities of messages, with the known distribution, set by the model's parameters, and complexity of these parameters description. Stochastic complexity of the sequence in a Bernoulli model equals to the minimal number of bits, by means of which we may write the frequencies of symbols in a sequence and the sequence itself if the frequencies of symbols are known.

The calculation of stochastic complexity allows to discriminate the data which satisfy best a Bernoulli model (e. g., random sequences, non-functional regions), from the data better fitting to a Markov model (functional sites). Moreover, it is possible to discriminate the data corresponding to different Markov models (e. g., functional sites differing by their nature).



Formal description of suffix tree

According to our statistical model, the probability of the next in turn symbol occurring in a message is determined by the preceding context. The set of probabilities generating the next symbol is determined by the state of a chain, which unambiguously corresponds to the preceding context. Such contexts may vary in length, and neither of them could be the ending of another. A suffix tree is a suitable form for the presentation of such contexts. The term "suffix" refers to the organisation of context in a model. Since we analyse the preceding context in a DNA sequence, one should keep in mind that this is actually a "prefix" tree.

By a message, we denote a sequence (word) X1...Xn = Xn, where Xi {A,T,G,C}. The sub-sequence of letters S = Xm-t...Xm-1 is called the context with the length t of the letter Xm in the word Xn (t<m<n). We consider the DNA sequences as the communications generated by some stationary discrete source. For the Markov model of the k-th order constructed for DNA sequences, we have altogether 4k generating contexts (or states of a chain), which determine the distribution of probabilities of the letters' occurrences after these contexts:
Sj Dk, Di {A,T,G,C}, j = 1,...4k.

For example, for a Markov model of second order k=2 and the total number of possible contexts equals to 16 = 42.

Let M be a suffix set of contexts (states of Markov model), and a set of parameters. We define the probability of message Xn = X1X2...Xn as

P(Xn) = P(X1|S1) P(X2|S2)...P(Xn|Sn),

where the probability P(Xi|Sj) of the occurrence of letter Xi {A,T,G,C} in the context Sj is defined by the set of parameters as follows:

P(Xi|Sj)=ij , and ij  = 1.

A Markov source generates different messages with probabilities P(Xn). A set of sources with equal set of contexts M is called a variable memory Markov model <M, > (further on shortly M).

The maximal length of the context should be limited due to computational restrictions. The maximal context was set to 10. Our experience shows that this is sufficient for real nucleotide sequences.

Bernoulli sources and Markov sources of finite order are particular cases of the tree sources. The model M0 of the set of Bernoulli sources has a tree consisting of a single root (the sources possess by a single state).

The high order models describe the data set with the most accuracy; however, increase in the order of a model generates numerous excessive parameters. The problem of choosing a model better fitting to the data is equivalent to the question "How many preceding letters really influence the probability of the next letter occurrence?". Here we suggest an approach enabling to reveal the significant parameters of a Markov model for an adequate description of the sequence under study.

Let us represent the set of states of the Markov source of k-th order as the leaves of the tree T. Each leaf corresponds to the context with the length k (see Figure 3, k = 3). If the leaves (suspended vertices), corresponding to the states of a source, are the brothers (i. e., they are connected with the same vertex) and if their distributions of probabilities of generating the symbol coincide (the states are equivalent), then we can unify these states into a single one (Figure 4). By unifying the sets of equivalent states, we obtain the minimal tree of states T'. Here, the vertices at the higher levels correspond to shorter contexts. The generating tree source obtained may occur to be not Markov. However, each tree source may be supplied to a Markov source by the reverse procedure.

Figure 3: Markov source of the 3-rd order.
Figure 4: Context tree source.

For example, if a model M has a tree as that illustrated in Fig. 4, then the probability of the next letter occurrence depends only upon the preceding letter in case this letter is A, T or C, and upon the pair of letters AG, TG, GG, CG, in case the preceding letter is G. The contexts are determined by the letters from the routes directed bottom-up, from suspended vertices to the route of the tree.


Procedure for Tree Construction

As complexity of the message Xn relative to the known distribution of probabilities , we set the module of the logarithm of the probability of this message to the base 2,

L(Xn) = – log P(Xn) = log 1/ P(Xn).

In particular, for Bernoulli's model M0, the complexity equals to
L(Xn, , M0) =  Ci log 1/i= – Ci log P(Di), (1)

where Ci is the occurrence of letters Di in the sequence Xn, with Di {A,T,G,C}, P(Di) is the probability of letter Di, and  = {P(Di)} are the parameters of the model.

If to evaluate the probabilities via the frequencies, by setting P(Di) = Ci /n, then the complexity with the accuracy up to normality condition for the length of a sequence n will be equal to Shannon entropy (– p log p), or the sum of products of the symbol probabilities to the logarithm of these probabilities. C. Shannon has demonstrated that the module of the logarithm of the probability of a communication is the length of the shortest (on average) binary code of a communication [Shannon, 1948].

If M is a model with a tree T, then

L(Xn,, M) =  Ci(Sj) log 1/ij (2)

where Ci(Sj) is deleted: a number of the occurrence of the letter Di with the context Sj in the word Xn and ij = P(Di|Sj); |T| is the number of all contexts in the tree T.

As stochastic data complexity L0 relative to the model M, we denote the minimal sum of complexities L of message Xn, under some set parameters of a distribution model , and of the complexity of the description of these parameters Hn (complexity of a model M):
L0(Xn, M) = L(Xn,, M) + Hn(M) (3)

J. Rissanen has obtained an estimate of the complexity Hn (M0) of the description of the model's parameters, this estimate depending upon the sequence length and the alphabet's size [Barron et al., 1998; Rissanen, 1999], i. e.

Hn =|M|/2 log n + Const,

where |M| is the dimension of the model M.

In particular,
Hn(M0) = (k-1)/2 log n/2 + log ( k/2) / (k/2) + o(1) (4)

For a four-lettered alphabet, it equals to

Hn(M0) = 3/2 log n + 1/2 log () – 3/2 (5)

Following J. Rissanen [Rissanen, 1999], we suggest to consider that the model which corresponds best to the data is the one where the data in this model exhibit minimal stochastical complexity. The limitation for the usage of contexts with larger length is the growth of complexity Hn in the description of added parameters.

The algorithm suggested is the following recurrent procedure. Let T be the maximal tree containing all the vertices with the depth of at least d. For all the contexts S corresponding to suspended vertices, we set J(S)= L0(X(S), M0). For inner vertices S, we determine

J(S) = min { L0(X(S), M0), J(DiS) }

If the first element in the brackets is less or equal to the second one, then we remove those vertices from the tree T which are the offsprings of vertex S. (The tree vertex offspring are those vertices, which are connected with this vertex and located downwards). The values L0(X(S), M0) are calculated by formulas (1)-(5). The values J(S) are determined recurrently from suspended vertices to the root of the tree T by removing the vertices corresponding to non-informative contexts. The resulting tree T' corresponds to the required model M of sequence Xn. The final complexity J(R) is determined in the root R corresponding to empty context (lack of dependency upon the preceding symbols). As indicated by Barron in his review [Barron et al., 1998], the value J(R) is close to the minimum of the stochastic complexity of a sequence among all the models with contextual trees with the depth of maximally d.

It is natural that the model M fits best to the data Xn in whichthe stochastic complexity of the data Xn is minimal. Thus, we can estimate the model statistically by frequencies of oligonucleotides.

The maximal order of Markov dependency d, from which the construction of a tree begins, does not influence the structure of a tree if all branches (contexts) are shorter than d. By this procedure, theoretically the contexts with high lengths (more than 10 nucleotides prior to the position analysed) could not be considered. We failed to meet such dependencies during the analysis of large DNA sequences (up to 100 kb). But for super-large sequences and full genomes this limitation remains.

In practice the program demands only frequencies of oligonucleotides by fixed maximal length. Therefore the operation time shows a linear dependence on the sequence length. The time of context suffix tree calculation does not depend on the sequence length.

An important detail is how to add pseudo-counts for absent contexts. If we do not find a context in the training sequence it is quite natural to suppose at least a low non-zero probability. We use a value of 0.5 as pseudo-count for any absent contexts. The program could prune long branches in the resulting tree by using larger pseudo-counts.

An alphabet for sequence analysis could be another but a four-lettered one. For example, we could consider a DNA sequence in a two-lettered alphabet to investigate G/C content. In this case a binary tree will be constructed. The corresponding penalty for parameter description Hn(M0) in formula (4) is changed automatically for k = 2. This method also could be applied to amino acid sequences. Theoretically we can construct a context source tree in any alphabet: 2-lettered, 3-, 4-, 5-, 20-lettered one. But for correct estimation of frequencies it is necessary to have available a sufficiently large data set. Therefore a generalised hydrophobic-hydrophilic alphabet was used for proteins. The program developed (http://wwwmgs.bionet.nsc.ru/mgs/programs/complexity/) supports the following variants of grouping of 20 residues: by hydrophobicity/hydrophilicity, by charge, by location (inner positioning or the surface one). A user may order an own variant of partitioning by inserting in an appropriate window the line indicating how to group the symbols.


Results for DNA sequences

Genetic sequences of several types were analysed: (1) full microbial genomes, (2) coding and non-coding DNA-sequences, (3) regulatory regions extracted from the "Samples" database (http://wwwmgs.bionet.nsc.ru/cgi-bin/mgs/nsamples/), (4) nucleosome binding sites [Levitsky et al., 1999], (5) amino acid sequences of protein domains extracted from the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/) [Murzin et al., 1995].

For full genome sequences a high degree of context dependence was found. The mean length of significant contexts is about 5 nucleotides. An integrated graphical presentation of such a large scheme is a separate technical problem, because the whole image neither fits onto a printed page nor on a computer screen. A compact presentation for Borrelia burgdorferi complete genome is shown in Figure 5.

Figure 5: Context tree source for Borrelia burgdorferi complete genome. Letters are not indicated for contexts longer 4.

If we compare the stochastic complexity L0(Xn, M) with a direct calculation of the normalised classical measure – Ci log(Ci /n) + Hn(M0), then we obtain some reduction in complexity value because of the complicated model. Surprisingly, such entropy reduction for DNA sequences due to correlations is only about 1-2%. This result is in accordance with the review of complexity studies for genetic texts (especially proteins) from [Weiss et al., 2000].

The sequences with similar functions represented in each of the sets were considered as generated by the same source, the structure of which (generative tree source) is unique for the realisation of the function given. The presence of such a tree enables to classify genetic sequences by the groups with common sources. For example, promoter gene regions are the objects organised in a complex mode according to their context. They include different functional signals, e. g., regulatory protein binding sites, and, in general, they are the result of super-position of several degenerate genetic codes, for example, the nucleosome positioning code. Various regions of an extended regulatory sequence should possess different physico-chemical properties that could be revealed statistically.

By the method suggested, both individual sequences and groups of sequences may be analysed, under the supposition that they are generated by a single source generating the symbols. It is interesting to analyse sequences without clearly expressed homology but performing one and the same function. In this case, they should be similar by oligonucleotide content. Such short oligonucleotides (contexts) could form a text repeat. On the other hand, if a text repeat is common for most of the sequences in a set, then we will obtain all the contexts from such a repeat in the suffix tree.

We have found that non-random nucleotide contexts, which are present in all the sequences of the sample, may play a structural role for DNA sequences of the type under analysis.

An analysis has revealed that transcription factor binding sites in most of the samples studied are characterised by a different structure of the tree sources (Figure 6). They vary in the order of the Markov chain, in tree topology, and in the number of branches in the tree source. It is seen that for one and the same site, the contexts of various lengths can be important. These data give evidence about different organisation of the contexts of transcription factor binding sites relative to the source generating these sequences.

Figure 6: An example of generating tree sources for transcription factors binding sites stored in the «Samples» database (http://wwwmgs.bionet.nsc.ru/cgi-bin/mgs/nsamples/). The names of factors are given to the left of each tree.

In order to testify the method, we have generated samples of random nucleotide sequences with the same length and number as the sequences in the studied samples which were extracted from the «Samples» database. We have considered two variants of generating random sequences: 1) with fixed frequencies of single nucleotides; 2) with fixed frequencies of dinucleotides. The trees generated for these random sequences have no branches in the first case considered (no dependency upon preceding context), whereas in the second case, the trees have the branches only at a single level (dependency only from a single preceding nucleotide). Hence, "unsymmetrical" branches in the tree sources constructed really correspond to non-random contexts and, therefore, may be interpreted as the signals specific for the given type of sequences.

In general, the analysis of promoter regions has revealed a rather strong dependency upon the preceding context, that is at least two-three nucleotides are significant (Markov chain of the second-third order). The reasons why these dependencies are weaker in promoters of "housekeeping" genes (Figure 7) are still unknown. As can be seen from the Figure, the tree source of tissue-specific gene promoters is larger and almost totally incorporates the tree source of "housekeeping" gene promoters. We suggest that difference could be explained by the chromatin structure. "Housekeeping" gene promoters as well as regions of active gene transcription should be mainly free of nucleosome packing. Figure 7 presents the context tree for sequences of 400 bp which contain a nucleosome-binding site [Levitsky et al., 1999]. The tree for tissue-specific gene promoters (Figure 7, bottom) is more similar to the "nucleosomal" tree (Figure 8).

Figure 7: Generating tree sources for promoters of "housekeeping" genes (top) and tissue-specific genes (bottom). The similarity and discrepancies of the trees could be noted in dependence on oligonucleotides, which contain G nucleotide for the sets of promoters classified by tissue-specificity.

Figure 8: Generating tree sources for sequences, containing nucleosome-binding site.

Thus, the model generating the letters in DNA sequences containing the site binding a nucleosome is described by the set of contexts with the length of either 3 nucleotides (NAA, NAT, NGT, NTG), or 2 nucleotides (TA, GA, CA, TT, CT, AG, GG, CG), or 1 nucleotide (C). Here we set N = {A,T,G,C}.

Note that by the algorithm for construction of a DNA generating tree in 4-lettered alphabet, a vertex is connected with 4 downward vertices or not connected (is a leaf in the tree). Correspondingly, the length of the preceding context either increases by a unit or stays constant. Nevertheless, such a tree gives information about the statistical significance of oligonucleotides and enables to indicate significant contexts. Interestingly, AA and TT dinucleotides providing the stiffness of double-stranded DNA were found within the set of significant contexts [Trifonov, 1997].

The algorithm for calculating the complexity and constructing the tree is designed for an arbitrary symbolic alphabet. The letters of the alphabet Ak are encoded by the numbers k. Practically, we can analyse any texts, e. g. in natural language. By applying various coding for nucleotide and amino acid sequences, we may conclude the relative significance of different regularities determining the concrete structure of the tree source for the data studied.

Determining the dependency upon the context by the method suggested provides a basis for constructing methods necessary for the recognition of particular functional regions of a regulatory sequence. It was established that Markov chains (dependency upon the preceding context) for functional sites in regulatory regions are, as a rule, of first-second order.


Implementation for proteins

In protein sequences it is much harder to construct and visualize the context tree, as the list of symbols is longer. Since the sequences are shorter than in DNA we should analyse sets of proteins to obtain enough statistical data. Also we reduced the amino-acid alphabet to the two- and three-letter alphabets as shown in the legends to Figures 9 and 10. The complexity of sets of non-redundant protein sequences was measured in [Weiss et al., 2000]. The entropy reduction due to correlations is only about 1%. Compression algorithms led to very similar results. No significant dependency on the alphabet used was previously observed in [Weiss et al., 2000]. So we analysed mainly the topology of the tree for amino acid sequences of protein domains from SCOP database [Murzin et al., 1995]. For amino acid sequences the contexts revealed were interpreted in terms of protein secondary structure [Orlov et al., 2000].

Figure 9: Context tree sources for the -helical protein sequences for two-lettered alphabet O (Outer) ={R,N,D,C,Q,E,G,H,K,S,T,Y}, I (Inner) ={A,I,L,M,F,P,W,V}. The sample contained 262 sequences of different length.

Figure 10: Context tree source for the -helical protein sequences for three-lettered alphabet O (Outer)={R,N,D,Q,E,H,K}, A (Ambivalent)={A,C,G,P,S,T,W,Y}, I (Inner)={I,L,M,F,V}. The sequence set analysed is the same as for Figure 9.

The results of the analysis of the sample compiled of -helical amino acid sequences in two- and three-lettered alphabets are given in Figures 9 and 10.

As can be seen from the figure above, from the point of view of information theory, -helical proteins are characterised by sufficiently continuous repeated blocks of hydrophobic and hydrophilic residues. In -helices, hydrogen bounds are formed between the residues in positions (i, i+4). Thus, the length of a context on which the following symbol depends should be at least 4. This rule is supported by the contexts given in Fig. 9. It is known that for producing a hydrophobic core of a globule, the hydrophobic residues in -helices should alternate with the period of three-four residues. Such alternating patterns of residues could be detected separately. The method simultaneously detects all such patterns that influence the frequency of residues located to the right of them. Note that in a three-lettered alphabet, the -helical sequences are characterised only by dependencies of the second order. That is, in this case, the more letters the alphabet has, the less information can be obtained about the class of a protein.

Context trees for -structure protein sequences are given in Figure 11 and 12.

Figure 11: Context tree for -structure proteins for two-lettered alphabet. The sample contained 316 sequences. (Denotations are as in the text and in Fig. 9).

Figure 12: Context tree for -structure proteins for three-lettered alphabet. The sample contained 316 sequences. (Denotations are as in the text and in Fig. 10).


Discussion

The method suggested permits to analyse not only a statistical model of a sequence as a whole, but to find statistical heterogeneities within the sequence. Along with choosing of a model for the whole sequence, we will determine suitable models and stochastic complexities for its constituents. This will enable to evaluate the statistical homogeneity of a sequence and to draw conclusions about the functional potential of its constitutive parts.

The visualisation of the unique structure of context trees, analogous to those illustrated in Figures 1, 2, 5, 6, 7, and 8, is typical for the samples considered.

Genomic nucleotide sequences have been extensively characterised regarding their oligonucleotide composition [Nussinov, 1984; Karlin and Ladunga, 1994; Scherer et al., 1994] which proved to reflect the evolutionary relationships of organisms. Variations of short oligonucleotide distributions in the genomes and clustering of significant oligonucleotides was studied in [Haring and Kypr, 1999]. But there was no united way of presentation of oligonucleotides specific for genome DNA.

This empirical approach of such a detection of non-random contexts supplements the separately taken estimates of under- and over-representability of oligonucleotide frequencies [Karlin and Burge, 1995] which are characteristic for functional genome regions and genomes of various organisms. Besides of that, all non-random contexts can be detected, not by calculating the relative occurrence of the nucleotides but by means of ordering local organization of symbols according to these contexts (relative distribution of the frequencies of the symbols).

In the last years there has been an increasing number of publications studying statistical features of DNA sequences such as: distribution of dimer repeats [Dokholyan et al.,1997], patchiness [Karlin and Brendel, 1993], local nucleotide distribution in a sliding window [Almirantis, 1999]. Other properties of the biological text have also been investigated, like correlations in doublet distribution [Mani, 1992], imperfect periodicities and their contribution in the long-range order [Herzel and Grosse, 1997] or multiple scaling [Trifonov, 1989].

The task appears now to mark off the regulatory sequences. In this procedure, both the methods aimed to detect the signals (consensus) and to evaluate how local surroundings of these signals correspond to some predetermined statistical model are suitable. Among the examples of such models are Markov models, which are intensively applied [Peshkin and Gelfand, 1999; Yada, 1999]. The method suggested gives some theoretical background for estimation if the data are sufficient for HMM parameters training, nevertheless, the problem of training remains.

Another problem analysed by the method is chromatin structure packaging. This structure is organized hierarchically at several levels. At the first level, double-stranded DNA is packed around a nucleosome, or histone octamer. In some genome regions, DNA is more densely packed into chromatin structure, whereas in the others it is less compacted. The nucleotide sequences which are more preferred for nucleosome positioning can be revealed by statistical methods [Widlund, 1997]. In order to detect a nucleosome code, the search for periodicity of di- or tri-nucleotide occurrences and DNA conformational features were performed [Satchwell et al., 1986; Ioshikhes et al., 1996; Stein and Bina, 1999]. Previously, the ascertainment of such oligonucleotides and the estimation of significance of this ascertainment was made by different methods in various statistical models, but without accounting Markov properties.

There were a lot of approaches of sequence analysis and pattern description based on different DNA presentations [Zhang, 1997; Torney et al., 1999; Dodin et al., 2000]. Reversible binary encoding of sequences into the binary digits was used to study the statistical properties of DNA sequence classes [Torney et al., 1999]. Fourier transformation procedures display regular patterns in DNA sequences. A correlation function that compares each base in a DNA sequence to its various neighbours and which is subsequently processed by Fourier and wavelet transformations has been developed in [Dodin et al., 2000].

We present a new approach for modelling a group of related sequences without incorporating any prior information about the sequences. An empirical approach of detection of non-random contexts by suffix tree supplements the estimates of under- and over-representation of oligonucleotides (http://wwwdbl.dei.unipd.it/Verbumculus/) which are characteristic for functional genome regions and genomes of various organisms. Besides, all non-random signals can be detected, by calculating the relative occurrence of the nucleotides in a suffix tree as was described recently in [Pavesi et al., 2001]. Using relative distributions of the occurrence of symbols after selected contexts provides a model for statistical text analysis. It was shown that entropy reduction for DNA and amino acid sequences estimated by the method suggested is only about 1-2%. Similar results were reported earlier in [Weiss et al., 2000].

The program is included in tools for genomic sequence analysis at the REGSCAN module of the Internet-navigation system GeneExpress 2.1 (http://wwwmgs.bionet.nsc.ru/mgs/gnw/regscan/). We believe that the method suggested as a standard Internet tool for suffix tree presentation and modelling would give new results for protein sequence analysis and genome comparison.


Acknowledgments

The authors are grateful to Victor Levitsky, Eugenii Vityaev and Ralf Hofestaedt for scientific discussion. This work was supported in part by the RFBR (no. 01-07-90376, 00-07-90337, 02-07-90355, 00-04-49229, 02-01-00939), Russian Ministry of Industry, Sciences and Technologies (43.073.1.1.1501), Siberian Branch of Russian Academy of Sciences (Integration Projects 65). Y.O. was supported by INTAS (YSF 00-178).


References