Edited by E. Wingender; received December 22, 1999; revised February 28, 2000; accepted April 28, 2000
Currently, prediction of transcription factor binding sites is widely done using matrices collected from literature. This leads to several problems. We cannot actively control the conservation of the matrices, we cannot systematically use all binding sites available, we do not know which sites were used and which were discarded in matrix construction, we cannot compare and evaluate matrices easily, we cannot detect redundancy and we cannot control sensitivity and specificity. So we are lacking control during the identification process. In this paper I propose a method overcoming these problems. I assume that each binding site has an unknown context which determines its sequence. This leads to the idea of constructing specific matrices for each sequence we are analysing. To do so we have to regard identification of binding sites as a general process, starting at a dataset of known binding sites and ending with the identification of a potential new binding site. In this paper such a process is presented. Besides overcoming the mentioned problems, the implementation also reaches a significantly higher accuracy than current approaches. Evaluations are done analysing all binding sites of TRANSFAC 3.5 public. The resulting tool AliBaba2 is available at http://wwwiti.cs.uni-magdeburg.de/~grabe/alibaba2.
Keywords: bioinformatics, data mining, clustering, context specific prediction, multiple alignment, nucleotide matrix, promoter analysis, sequence modeling, specificity, sensitivity, transcription factor binding sites
Identification of transcription factor binding sites is a basic step for the analysis of gene regulatory networks. Currently it is done rather unsystematically by comparing matrices taken from literature to an unknown DNA sequence. There are several programs using these matrices for analysing unknown binding sites, e. g. Signal Scan [1], TESS [2], MatInspector [3] and MatrixSearch [4].The current matrix approach has several disadvantages. By using the predefined matrices, we cannot influence the conservation of the matrices. For some factors there are several very specific matrices available, for others we just have one single weak matrix. So we cannot say how accurate a prediction will be. Also, we often do not know how many binding sites were used for constructing a matrix. This is essential for supporting a prediction. As we do not know the construction process of the matrices, we do not know which binding sites are used and which are not. This means we do not know the criteria which led to the decisions taken. Also, we cannot measure the similarity of matrices, which leads to redundancy in predictions. So we can hardly interpret the results of a prediction program. This makes it impossible to process the results in further in silico experiments. And, absolutely important, we cannot actively control specificity and sensitivity of the predictions. To overcome all these questions arising from the current state of the art, I propose a new approach for the identification of binding sites: a context-specific process starting at the known binding sites and ending with the identification of a potential new site. The basic idea for this process is the idea of context.
Context
A context comprises all conditions which determine the DNA sequence where a transcription factor binds.
Context specific identification of sites
Context specific identification of binding sites means the construction of individual matrices for each sequence to be analysed.
Principal idea and algorithms presented in this paper are not restricted to the matrix approach, an oligonucleotide approach is possible as well. But the matrix approach is supported by the Berg and von Hippel theory which yields a clear distance measure as we will see later. There are several motivations leading to the assumption of a context. I do not want to define the semantics of context in detail. I just assume each binding site possesses a context given by the role it plays in gene regulation.
The method for a context specific process to identify binding sites consists of five steps. Steps 1 and 2 form a preprocessing phase. The context dependency begins with step 3.
Pairwise alignment
In this step the known binding sites are pairwise aligned to the unknown sequence to be analysed, u. Firstly, a score which constantly rises by h for runs of successive matching nucleotides is given. Secondly, the score of the last matching nucleotide is assigned to all nucleotides of the run. Each run of length l gets a score of h*(l-1). This ensures that a successive run of length l is always scored higher than two runs of length a and b with l = a + b. With n as the number of matching runs in a binding sequence s and the unknown sequence u we get:
pairwisesim(u, s) := åi=1..n h* length_of_run(i)
Example with h=2:
a a c a g a c g t t t
t a c a g t c g t c a
0 8 8 8 8 0 6 6 6 0 0
In step 1 binding sites have been expanded using EMBL to a length of, e.g., 20 bp. Only a relatively short part of the 20 bp sequence will be physically necessary for binding the factor. Therefore, the pairwise alignment is done considering only a window of the extracted binding sites of a predefined length (e.g. 12 bp). This window is called the carrier of the alignment. The length of the carriers is chosen equal to the length of the matrices constructed later on. Result of this step is a set P of pairwise aligned binding sites with pairwisesim exceeding a given threshold pairsim.
Segmentation
To construct matrices from P, we first
have to structure it by function and position. We can do this as follows.
TRANSFAC supports a classification of DNA-binding domains organised in six levels
(Tab. 1).
Table 1: Classification of Transcription Factors taken from [11]
| Level | Name | Criterion | Example |
| 1 | Superclass | General topology of DBD | Zinc-coordinating |
| 2 | Class | Structural blueprint of DBD | Zinc finger nucl. recept. |
| 3 | Family | Functional criteria | T3R/RAR |
| 4 | Subfamily | Sequence similarity of DBD | RAR |
| 5 | Genus | According to factor gene | RAR-ß |
| 6 | Factor "species" | Initiation/splice variants | RAR-ß1, RAR-ß2 |
So we can set up a relation between two domain classes c1 und c2 with regard to a considered level k:
c1 <k c2 Û c1 <k-1 c2 with c1 <1 c2 Û c1 < c2.
This enables us to sort P = {p} when a level k is given.We then build blocks B of P. Each block contains sites belonging to just one binding domain structure.
Example: A pairwise alignment p is described by a pair <site_id, site_class>. There is P = { <R0001, 4.2.3.1>, <R0002, 4.2.3.2.>, <R0003, 4.2.3.5.>, <R0004, 4.2.4.1.>, <R0005, 4.2.4.2.> }. The user of the program sets level k to 3. So we get the following blocks B:
B1 = { <R0001, 4.2.3.1>, <R0002, 4.2.3.2.>}
B2 = { <R0004, 4.2.4.1.>, <R0005, 4.2.4.2.> }.
Then we split each B into segments G
using the position of each p Î B
regarding the analysed sequence. We introduce a split when two p of B are
further apart than half the desired matrix width. So the segmentation step ends
with a set of Segments G. Each G contains p Î P from binding sites bound to similarly structured domains and found
at similar positions.
Matrix construction is done during a gap
free heuristic alignment. Gaps are not needed because only similar sites are
aligned to the sequence to be analyzed, u. The heuristic alignment is inevitable for
adequate performance. It aligns the binding sites of each segment G in the
order given by each site’s similarity to u. The problem now is that we cannot
simply construct one single matrix out of the sites found in one segment. As
explained above, it is possible that there are multiple sites in one segment. This
is the case when binding sites overlap. In such a case we have to construct
multiple matrices for our segment G. This is exceptionally the case when
analysing promoters because neighbouring or alternative binding sites are
essential for promoter function. So how can we deal with multiple sites in a
segment ?
The most obvious approach constructs one common matrix in the beginning and tries to
split it afterwards. This is illustrated in Fig. 1. Out of four binding sites we construct one common matrix. The common single matrix should
contain two distinct parts (shaded). Due to the overlapping of sites noise is
introduced. To avoid this kind of complications, we adopt the strategy of hierarchical clustering. We successively merge binding sites of G to an increasingly larger matrix until the
matrix’s quality is below our requirements.
The conservation can be easily calculated [12; 13] by the average information content. For similarity
calculation between matrix and u the Berg and von Hippel [6] measure gives best results (not shown here). In their
model Berg and von Hippel first calculate the relative binding energy r(u)
which is needed for a protein to bind to an unknown DNA sequence u.
Multiplication of relative binding energy with the factor’s binding constant
gives the total binding energy. This relative energy is calculated from a
matrix of base-pair probabilities plB and a correction parameter l.
The relative reduction in binding energy is
where elB is the energy needed for binding to
each single nucleotide. The probability plB is taken from the matrix
to which the unknown sequence u is compared. pl0 is the matrix entry
for the corresponding nucleotide from the unknown sequence.
We set l = 1 since we just process pure sequences. As we now know the
reduction of energy, we can calculate the resulting binding energy of the
factor binding to the sequence u: energy (u, m) := K0 * exp (- r(u) ) where K0 is the binding
constant of the transcription factor and the matrix m = (plB). We
have to set K0 = 1 because we have no information about it
in our sequence databases. We get our similarity measure promsim:
promsim (u, m) = energy(u, m) with K0 = 1 and l = 1. The following shows the resulting algorithm for matrix construction. For each site the algorithm decides whether
to use it in an already existing matrix or to create a new one. Poorly supported
matrices are discarded at the end. For matrix construction we have to specify
the width of each matrix. The algorithm presented uses a standard width given
by the parameter matwidth. This may
seem not appropriate since different factors possess binding sites of a
different length. But in the general case we do not know the ‘real’ length of
binding sites. With the parameter cons we
require a minimal conservation, which means we require a minimal amount of
information needed for a decision about the existence of a binding site. By
varying matwidth we also vary the
requested amount of information. We do this indirectly by reducing the amount
of information in the matrix. As we do not need two parameters for the same
thing, we can set matwidth constant
and vary only cons.
for all p Î G ownmatrix :=
false for all m Î already constructed matrices M if p already used for any M do nothing, continue with
next p m’
:= m merged with p if promsim (u, m’) < promsim ownmatrix := true else if avgconservation(m’) < cons ownmatrix
:= true else m
:= m’ // Keep merger continue with next p if ownmatrix M
:= M È initial_matrix_from(p) for all m Î M if support(m)
< minsupp // support = number of sites
a matrix is derived from discard
m Ñ
Fig. 2 illustrates the interaction of the
introduced concepts: binding sites, pairwise alignment, segments and matrices in
an example. There are 12 sites given in s1 to s12. They
lead to 12 pairwise alignments p1 to p12. These are
aggregated in two segments of which each contains two matrices.
The method was implemented as the program
AliBaba2 (alignments for analysis of binding sites) using C++ with the Standard
Template Library (STL) on Windows NT and Solaris. For convenient usage an
interactive web interface was developed
(http://wwwiti.cs.uni-magdeburg.de/~grabe/alibaba2). There is a two step user interface for viewing the prediction results. At first only the names of
segments are given as the prediction output. Then each segment can be clicked to show the sites used for the according alignment. All sites, factors, and classes are
linked to TRANSFAC and can be individually explored.
For evaluation
of the presented approach several tests have been done. As a source for binding
sites TRANSFAC 3.5 public has been used. EMBL Release 60 was used to
expand the sites. As a reference tool MatInspector 2.2 in the public version
was used for comparison. As a negative test set the exon sequences (AF009652, AFPYRG_5, AFUOXGENE_4) with length 2 kb have been used. To check if these sequences are a
representative sample, the sample as well as the following 40 kb of exon
sequences have been analysed
(AAD7712_8, AALFUL_9, ABPGKA_6, AF011341_6, AFMETALLO_9, AFMETPRO_8, AFU132504_10, AN03278_8, ANDNAUAY_6, ANGPDAA_13, ANNIIA_8, ANPACA_6, ANPGP1_8, ANPGP2_6,
ANPRNB1_4,
ANQUTR_3,
ANQUTR_5,
ANUAPA_10,
AOD895_7,
AOPGKA_8,
BCPGP1_6,
CCD11_5,
CCHTS1G_9,
CCY15418_7,
CHY15839_8,
CPGNRPA_7,
CSD540_9,
ED18SRE_5,
ED18SRR1_4,
EN1836_8,
ENAJ11563_8,
ENNIRA_5,
ENU62482_9,
GCGLN_8,
GFNIAD_7,
GFP450I_7,
GFU17243_8,
HSGAMP_9,
MCPYRG_11,
NC11565_8,
NCBLI4GEN_6,
NCCPC3_10,
NCPKNCHOM_5,
NCSPII_6,
NHPDA61_9,
PAFPR1_7,
PBAB3043_7,
PBR6073_15,
PCCBHI2A_6,
PCCELLUG_5,
PCCELLUG_11,
PCFACAG_5,
RNPYR4_9,
RRPAL_8,
SCPFY_13, SCSMIP_7, SMRAD9_8, SNO9826_11, SP18SRE_5, SPALP1_7 ).
The number of binding sites normalised to 500 bp predicted for the
sample and the 40 kb set are shown in Tab. 2 with varying
parameter conservation. The sample can thus be considered representative.
The data set resulting from steps 1
(expansion) and 2 (class assignment) results in 3196 sites grouped to 742
factors, where each factor contains at least 4 non redundant binding sequences.
All tests
are done in respect to the following reference set of parameters.
In the following the influence of the single parameters on the binding site recognition accuracy is evaluated. Recognition accuracy is described by a two dimensional
point with sensitivity and specificity as coordinates. Both, sensitivity and specificity are measured in absolute numbers to simulate a realistic application.
Fig. 3 shows the influence
of parameter pairsim on the
recognition accuracy. Specificity is given in logarithmic scale to address its
approximately linear correlation to sensitivity.
The influence of matrixwidth is checked in the
following. Reasonable values comprise 8, 10, 12 and 14 bp. We clearly see
that 10 and 12 bp give the best results (Fig. 5). The runtime is also influenced by matrixwidth. A matrixwidth of 12 bp is nearly 50 % more time consuming than a
width of 10 bp (Fig. 6). Still, if runtime is not most important, 12 bp should be chosen
as the standard parameter.
The last
parameter evaluated is the classification level. During segmentation binding
sites are aggregated by “function” using the domain classes of transcription
factors. Which class level yields optimal results? Fig. 8
compares the levels 3, 4, 5 and “3 to 5”. “3 to 5” means the whole
range of levels is chosen: AliBaba2 then constructs matrices for all levels. We
see that considering solely level 4 yields optimal results. Runtime correlation
is shown in Fig. 9. It can be seen that building models for multiple levels has
additive time costs where level 4 shows average runtime.
From the evaluation done here we derive an
optimal parameter set for AliBaba2 regarding the defined absolute sensitivity
and specificity.
To evaluate the algorithmic complexity of
the developed method pairwise alignment and matrix construction in combination with
segmentation have been investigated (Fig. 10). I constructed an artificial long promoter sequence
by concatenating 30 kb of promoters of class 6.1.2. from EPD Release 60. Class
6.1.2. stands for vertebrate genes / chromosomal genes / structure proteins.
The runtime of the pairwise alignment depends on the number of sites found
similar. The runtime of the matrix construction and segmentation depends on the
number of matrices built (Fig. 11). By normalising the runtime to an average promoter
length we get a constant runtime over the whole promoter sequence. Thus we see
a linear complexity referring to the length of the sequence analysed.
with n as number of sites and l as
sequence length. Space complexity depends naturally solely on the number of
processed sites:
To compare AliBaba2 with MatInspector
public those binding sites which are linked to a matrix are selected from
TRANSFAC 3.5 publilc. We get 1829 sites from 366 factors. Each site is
linked to a matrix which is contained in MatInspector’s library because
MatInspector uses the TRANSFAC 3.5 matrix library. Therefore MatInspector has
the potential to recognize all of the 1829 sites. MatInspector is mainly
controlled by matscore, the central
threshold for matrix comparisons. To perform automatic testing of the 1829
sites, a test bed has been constructed. In it, MatInspector is wrapped with an
adapter. The adapter writes a binding site to a file, starts MatInspector and
reads the results afterwards. Parsing the results, the adapter checks if the
right matrix has been assigned to the binding site tested. For each threshold
of matscore we get a pair of true
positives (TP) and false positive (FP) predictions. Varying matscore in a given range, we get a graph of sensitivity and specificity. The test bed is shown in the
Fig. 13. Fig. 14 compares sensitivity and specificity for
MatInspector and AliBaba2 with the reference parameters.
I have shown a new approach for
identification of transcription factor binding sites. The approach is the first
which sees the prediction of binding sites as a process starting directly at
the known binding sites and ending at newly identified potential binding sites.
So all binding sites available can be used in a systematic way. The main idea
of the approach shown here is to construct individual matrices for each binding site
predicted. The matrix construction algorithm clearly defines which sites are
used to do this. The main criterion for grouping sites for matrix construction
is the structure of the DNA-binding domains of the transcription factors. The construction
process ensures that for each matrix the given conservation will be reached.
There is a clearly assigned function to each matrix and thus redundancy can be
avoided. Finally it is possible to define a minimum number of sites for each
matrix. Thus, we get a maximum control over the identification process. I evaluated
the effect of all parameters on sensitivity and specificity which leads to an
optimal parameter set regarding the defined overall sensitivity and
specificity. Generally the selection of parameters is not critical. In
comparison to MatInspector public, AliBaba2 exhibited an approximately two-fold higher
sensitivity which means a large improvement. All determined experimental results rely
on the testing procedure. It is certainly possible to apply different measures
of success. In this paper the absolute number of sites recognized is counted.
An alternative measure is to count not the number of sites, but the number of potentially binding
factors. Such a measure would strengthen MatInspector because only
one single site for each matrix would have to be recognized. Results not shown
here indicate that the number of factors recognized by MatInspector and
AliBaba2 are similar. But the main task for programs like AliBaba2 and
MatInspector is to recognize a single binding
site with the highest accuracy possible. So it is sensible to measure how many sites can be recognized by a program.
There is also a second argument for measuring sites and not factors. The number
of sites known per factor can be expected to increase in future. So the task
will be not to suggest one factor in general but to recognize factor binding
in association with the organism or expression conditions of the corresponding
genes. This is the idea of context. So we need a site based approach and therefore a site based measurement of success. Concluding, I think the measure
of success used here is appropriate. All tests have been done based on the
freely available tools and data. There is also a commercial version of TRANSFAC
3.5 with many more sites and a commercial version of MatInspector with more
matrices and individual matrix thresholds available. Evaluations regarding
commercial TRANSFAC and MatInspector versions will be done in the future.
Thanks to Gesine Folkers for proof-reading.
Matrix Construction
Figure 1: Using overlapping sites in matrix construction leads to noise.
Each site that we merge to a matrix has
to be checked. I propose three requirements for a site to be used in a matrix:
Algorithm matrixconstruction
Figure 2: Interaction of concepts.
IMPLEMENTATION
RESULTS
Table 2: Accuracy of the chosen 2 kb exon sample
cons
predicted sites per 500 bp in 2 kb
predicted sites per 500 bp in 40 kb
65 %
180
180
70 %
111
111
75 %
34
39
80 %
12
12
85 %
3
2
90 %
0
0
Reference parameter set
Parameter evaluation
Figure 3: Influence of pairsim on recognition accuracy.
We see that the selection of the pairsim parameter has little influence on
recognition accuracy. But it has a big influence on runtime (Fig. 4).
Shown is the runtime of segmentation and matrix construction steps. They are
measured together as matrix construction is called for each segment identified
in segmentation step. As segmentation is cheap in runtime, matrix construction
is the time consuming step of both. The lower pairsim the more pairwise alignments are found and thus the time
for matrix construction rises. Therefore we should keep pairsim at 64.
Figure 4: Influence of pairsim on runtime.
Figure 5: Influence of matwidth on recognition accuracy.
Figure 6: Influence of matwidth on runtime.
Next, the parameter promsim is evaluated. A promoter similarity of 100 % reflects that
the matrix consensus is identical to the analysed sequence. If an unknown
sequence contains a nucleotide different from a maximally conserved nucleotide
in a matrix the Berg and von Hippel score will yield a similarity of 0 %. Thus, the
Berg and von Hippel measure is rather strict. Other measures could be used
which would enable a comparison of measures. In Fig. 7 promsim is varied from 1% to 100 %.
Figure 7: Influence of promsim on recognition accuracy.
As expected, a 100 % value of promsim gives highly specific results in
the very high specificity area above – 15 FP / 500 bp. For the general
application case 1 or 10 % are optimal values. Runtime for promsim is not shown as no significant dependency could be measured.
Figure 8: Influence of classification levels on recognition accuracy.
Figure 9: Influence of classification levels on runtime.
Optimal parameter set
Evaluation of Algorithmic Complexity
In Fig. 12 the runtime dependence
from the number of binding sites processed is evaluated. We also see a
principally linear dependence. Thus we conclude the time complexity to be:
Figure 12: Comparison of not normalised runtimes while varying the number of used sites.
Comparison with MatInspector
Figure 13: MatInspector test bed.
Figure 14: MatInspector public vs. AliBaba2 on TRANSFAC 3.5 public.
MatInspector knows two settings:
"selected matrices" and "all matrices". In "selected matrices" mode only better
conserved matrices with a ‘random expectation’ below a certain threshold are
used (see MatInspector documentation for more details). This increases
recognition accuracy but limits the number of sites which can be recognized.
Outcome of the comparison is that AliBaba2 clearly has as well a higher
sensitivity as higher sensitivity / specificity ratio. In all specificity
ranges it constantly can detect about 500 binding sites more than MatInspector
public which means about a doubled sensitivity in the average. It should be
mentioned that the recognition accuracy of AliBaba2 is measured via Jack-Knife
test. For the MatInspector this was not possible as we did not repeat the
construction of all matrices of TRANSFAC 3.5 public. Thus, for MatInspector
there is no separation of training and test data. But we take the results from
MatInspector as an upper boundary of recognition accuracy.
DISCUSSION
ACKNOWLEDGMENTS
REFERENCES