Max-Planck-Institut fuer Chemische Oekologie
Abteilung Genetik und Evolution (Prof. Mitchell-Olds)
Carl-Zeiss-Promenade 10
D-07745 Jena
Tel: 3641-643610
Fax: 3641-643668
Email: goebel@stargate.ice.mpg.de
We present an algorithm which constructs a promoter model from a set of unaligned coregulated POLII promoters.
It rests on the following assumptions: DNA contact points of individual members of the transcription initiation complex are constrained in their ability to tolerate mutations and thus stand out as short (6-10 bp) conserved motifs. The arrangement of the proteins in the initiation complex is reflected by the pattern of the binding sites on the DNA, and it is this pattern which really identifies the promoter. It, too, should be at least partly conserved among members of a family of promoters which are known to confer the same expression pattern. Another aspect which has been shown to be conserved, at least in parts of POLII promoters, is DNA structure, especially bendability and stiffness. Most probably the sequence conservation seen at transcription factor binding sites is just an extreme case of structural conservation (identical sequences have identical structures). It may well be that some sites have drifted apart at the sequence level in different members of a promoter family while still being conserved with respect to some relevant structural property.
Our algorithm first constructs gap-free blocks of sequence segments from
the input sequences. A block can contain zero or multiple segments from
any input sequence. It is maximal with respect to the number of
segments, such that all pairs of segments in a block are SIMILAR.
In contrast to other existing algorithms, SIMILARITY is a relation which
can be freely defined, and in particular can refer to similarity with respect
to DNA structural parameters. In a second phase, the algorithm looks for
an arrangement pattern of these motifs which is common (with variations)
to all input sequences. Motifs which are part of such a pattern are not
only more likely to be truly biologically relevant; the pattern itself
also constitutes a testable hypothesis (a PROMOTER MODEL) about the
input family of promoter sequences.
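
The first phase can be illustrated with a minimal sketch. It is not the
implementation described above, only one possible reading of it: if every
fixed-width segment is a node and every SIMILAR pair an edge, then a block
in which all pairs are SIMILAR and which cannot be extended is a maximal
clique, enumerated here with the basic Bron-Kerbosch recursion. The
function names, the fixed window width, the mismatch threshold, and the GC
fraction used as a stand-in for a structural property are all assumptions
made for illustration; GC fraction is not a bendability or stiffness scale.

```python
from itertools import combinations

def segments(seqs, width):
    """All gap-free windows as (sequence index, offset, text)."""
    return [(i, p, s[p:p + width])
            for i, s in enumerate(seqs)
            for p in range(len(s) - width + 1)]

def hamming_similar(a, b, max_mismatch=1):
    """Sequence-level SIMILAR: at most max_mismatch differing positions."""
    return sum(x != y for x, y in zip(a, b)) <= max_mismatch

def gc_similar(a, b, tol=0.2):
    """Placeholder property-level SIMILAR: GC fractions differ by <= tol."""
    gc = lambda s: sum(c in "GC" for c in s) / len(s)
    return abs(gc(a) - gc(b)) <= tol

def maximal_blocks(segs, similar):
    """Enumerate maximal sets of pairwise SIMILAR segments (size >= 2)."""
    n = len(segs)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if similar(segs[i][2], segs[j][2]):
            adj[i].add(j)
            adj[j].add(i)

    blocks = []

    def expand(clique, candidates, excluded):
        # Bron-Kerbosch without pivoting: report the clique once it can
        # no longer be extended by any candidate or excluded node.
        if not candidates and not excluded:
            if len(clique) > 1:
                blocks.append(sorted(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return [[segs[i] for i in block] for block in blocks]

# Toy input: two short sequences sharing the 6-mer TGACGT.
seqs = ["AATGACGTCC", "GGTGACGTAA"]
exact = lambda a, b: hamming_similar(a, b, max_mismatch=0)
blocks = maximal_blocks(segments(seqs, 6), exact)
```

Because the relation is passed in as a function, `hamming_similar` can be
exchanged for `gc_similar` or any other pairwise test without touching the
block construction. Note that a segment may belong to several blocks, and
that the number of maximal cliques can grow exponentially, so this sketch
is only practical for small inputs.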