AliBaba2 is a method for predicting transcription factor binding sites. Predicting transcription factor binding sites using nucleotide weight matrices is widely used and theoretically supported. Instead of building general weight matrices in advance, we propose to build highly specific matrices that depend on the regulatory DNA sequence to be analysed. We regard the prediction as a process consisting of the steps site selection, alignment of sites to the regulatory sequence, and generation of alignments and matrices. By generating context-specific matrices we can achieve a higher specificity for each matrix. Matrices are constructed using the classification of binding domains of TRANSFAC. The program is controlled by essentially three parameters: specificity, minimum similarity of matrices to the regulatory sequence, and minimum number of sites in each matrix. Using TRANSFAC 3.5 public, AliBaba2 has been compared to MatInspector 2.2 public as an example of a program relying on predefined matrices. Results so far indicate that AliBaba2 has a higher specificity with respect to the total number of binding sites that can be recognised. Aggregating these sites by transcription factor shows that both programs perform similarly regarding the diversity of recognised factors. AliBaba2 is available at http://wwwiti.cs.uni-magdeburg.de/~grabe.
Prediction of transcription factor binding sites is a basic step in the analysis of gene regulatory networks. Currently, several programs are available, e.g. RGSiteScan [1], Signal Scan [2], TESS [3], MatInspector [4], and MatrixSearch [5]. Predicting binding sites using matrices is theoretically supported by Berg and von Hippel [6], who give a model based on statistical mechanics establishing a relationship between matrix scores and the binding energy of a static docking protein. For a comparison of MatInspector, TESS and TFSEARCH in predicting accurate binding energies see Roulet et al. [7]. Schneider et al. [8] analysed the information content of aligned binding sites. Nowadays numerous matrices are available, each for a special group of binding sites. All these matrices have a variable information content and thus variable specificity. Matrices and binding sites have been collected in TRANSFAC [9]. As compensation for the variable specificity, optimised scores for each matrix can be used, e.g. in the commercial version of MatInspector. But regardless of what can be achieved by tuning predefined matrices, we think that the problem of variable specificity is just a symptom of a too narrow view in the field of binding site prediction. Generally, the following requirements apply when predicting binding sites by matrices:
We claim that these goals are best achieved when binding site prediction is viewed as a process starting from the known binding sites and ending at context-specific matrices. In this way specificity, the critical factor in binding site prediction, can be made central to the prediction process, whereas general matrices have to be less specific to ensure recognition of a wider range of cases. The paper of Wasserman and Fickett [10] supports this idea, as a special set of muscle-specific matrices was hand-tailored for the analysis of muscle-specific promoters. Frech et al. [11] use special matrices for the muscle TATA box and the muscle INI box in their actin analysis. From these practical cases we can derive a need for highly specific, context-dependent matrices. Our approach is also supported from the theoretical point of view. In their model, Berg and von Hippel assign to each known binding site an energy value relative to the promoter, which has an energy of zero. All binding sites thus support the promoter and are centred around it. This is what we are trying to achieve with the presented method. By viewing binding site prediction as a process we can develop a method satisfying the above requirements.
The process consists of the steps site selection, alignment of sites to the regulatory sequence, and generation of alignments and matrices. Site selection is done by extracting the known binding sites contained in TRANSFAC. We use the links to EMBL to extract DNA sequences of a predefined length and construct a library of binding sites. A regulatory DNA sequence (in the following called promoter for simplicity) is then analysed in two phases. In phase 1 each site of the library is pairwise aligned to the promoter. In phase 2 heuristic multiple alignments are constructed from the pairwise alignments in order of decreasing similarity to the promoter. Matrices are constructed using the classification of transcription factor binding domains provided by TRANSFAC. In higher levels of the hierarchical classification, binding sites of different but similar factors may be used for one matrix. To ensure the similarity of the matrix to the promoter, a similarity measure between matrix and promoter is applied. At the end of phase 2 the support of each matrix is checked, meaning that at least a user-defined number of known binding sites is required to support the prediction.
Phase 1: pairwise alignment
We construct only gap-free alignments because we want to compute highly specific matrices. If we have to align sites belonging to factors with, e.g., two half sites, we assume there will be at least some examples from which the matrix can be constructed. We have to align each known binding site pairwise to the promoter, meaning that we have to compute a similarity. The more this similarity resembles the similarity calculated later on when comparing the resulting matrix to the promoter, the more accurate the selection of pairwise alignments for constructing matrices will be. In a first step we assign a constantly rising score to the nucleotides of each run of successive matches. In the second step the score of the last matching nucleotide of a run is then assigned to all nucleotides of the run. Each run of length l thus gets a score of l * l = l^2, which ensures that a single run of length l is always scored higher than two runs of lengths a and b with l = a + b. For n matching runs of lengths l_1, ..., l_n between promoter p and site s we get:

similarity(p, s) = sum_{i=1..n} l_i^2
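The run-based scoring can be sketched as follows; the function name and interface are our own, only the scoring rule (each run of l successive matches contributes l^2) is taken from the text:

```python
def run_score_similarity(promoter, site):
    """Score a gap-free alignment of `site` against `promoter` (equal length).

    Each run of l successive matches contributes l * l, so one run of
    length l = a + b always outscores two separate runs of lengths a and b.
    """
    score = 0
    run = 0
    for p_nt, s_nt in zip(promoter, site):
        if p_nt == s_nt:
            run += 1
        else:
            score += run * run
            run = 0
    return score + run * run  # close the final run
```

For example, four successive matches score 16, while two separated runs of two matches each score only 4 + 4 = 8.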
It is not always known which nucleotides of a binding site are essential for binding of the transcription factor. Therefore, not the original binding sites of TRANSFAC but the TRANSFAC links to EMBL are used to extract binding sites of a common length (e.g. 20 bp). Only a relatively short part of each extracted binding site (e.g. 12 bp) will be biologically active in binding the factor. So we have to determine a window within the extracted binding site. This window is denoted the carrier of the site. This yields the following basic algorithm for phase 1, starting with the promoter sequence p and the set of known binding sites S.
Algorithm Phase 1
for b_i := every site in S
  for x_p := every promoter position
    for x_b := every carrier position in b_i
      add (b_i, x_p, x_b, similarity) to H if the score exceeds minpairscore
H is the result set of 4-tuples, each describing a pairwise alignment of a binding site to the promoter. Only those pairwise alignments whose score exceeds a given threshold minpairscore are used further. They are denoted hits.
Definition 1. A hit h in H is the pairwise alignment of the binding site b_i to a regulatory sequence p whose score similarity(p, b_i) exceeds a given threshold minpairscore. The carrier starts at position x_p + x_b and has the length carrierlen.
We align each site to all promoter positions and find the strongest carrier. For this algorithm we get a complexity of O(|S| * |p| * c * l) with |S| = number of sites, |p| = length of the promoter, c = average number of carrier positions per site, and l = average carrier length. A simple optimisation can be achieved by integrating the x_b loop with the pairwise similarity computation: the similarity between site and promoter is computed not only for the carrier but for the whole sequence b_i, which reduces the complexity to O(|S| * |p| * l_b) with l_b as the site length. Fig. 1 illustrates the alignment process.
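The phase 1 scan can be sketched as below. This is our own minimal reading of the algorithm: for brevity a carrier window is scored here by its number of matching nucleotides rather than the run-based score, and all names and defaults are illustrative only:

```python
def phase1_hits(promoter, sites, carrier_len=12, minpairscore=9):
    """Slide each known site across the promoter and record hits.

    A hit is the 4-tuple (site_index, x_p, x_b, score), where the carrier
    window of length carrier_len starts at promoter position x_p + x_b.
    Per site and promoter offset only the strongest carrier is kept.
    """
    hits = []
    for i, site in enumerate(sites):
        for x_p in range(-len(site) + 1, len(promoter)):
            best = None
            for x_b in range(len(site) - carrier_len + 1):
                start = x_p + x_b
                if start < 0 or start + carrier_len > len(promoter):
                    continue  # carrier must lie fully inside the promoter
                window = promoter[start:start + carrier_len]
                score = sum(a == b for a, b in zip(window, site[x_b:x_b + carrier_len]))
                if best is None or score > best[3]:
                    best = (i, x_p, x_b, score)
            if best and best[3] > minpairscore:
                hits.append(best)
    return hits
```

The three nested loops make the O(|S| * |p| * c * l) complexity of the unoptimised algorithm directly visible.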
Figure 1:Pairwise alignment process. 
To speed up the search an index of the promoter is built. All known binding sites are scanned against this index. The index has a word width of four bp, compressed into one byte. Thus we can only find binding sites in a promoter where at least four successive nucleotides are identical in promoter and binding site. In principle this can lead to losing binding sites, but the constraint did not prove too restrictive in later experiments. Although phase 1 is in principle similar to local alignment tools like BLAST [12], we need a specially constructed algorithm because we have to take care of the very few potentially important nucleotides in binding site prediction.
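A minimal sketch of such a 4-bp word index, packing each word into one byte at 2 bits per nucleotide; the function names and the candidate-lookup helper are our own:

```python
# 2 bits per nucleotide: four bases pack into one byte (0..255).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack4(word):
    """Compress a 4-bp word into a single byte."""
    b = 0
    for nt in word:
        b = (b << 2) | CODE[nt]
    return b

def build_index(promoter):
    """Map each packed 4-mer to the promoter positions where it occurs."""
    index = {}
    for i in range(len(promoter) - 3):
        index.setdefault(pack4(promoter[i:i + 4]), []).append(i)
    return index

def candidate_positions(index, site):
    """Promoter offsets sharing at least one exact 4-mer with the site.

    Only these offsets need a full pairwise alignment; alignments without
    four identical successive nucleotides are never found, as noted above.
    """
    positions = set()
    for j in range(len(site) - 3):
        for i in index.get(pack4(site[j:j + 4]), []):
            positions.add(i - j)  # implied alignment offset of the site
    return positions
```

Looking up a site then touches only the offsets returned by candidate_positions instead of every promoter position.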
Phase 2: constructing matrices
In phase 2 multiple alignments have to be generated from the pairwise ones. As a result of phase 1 we get a large number of pairwise hits in H. The aim of phase 2 is to generate matrices for all levels of the classification hierarchy of DNA binding domains (DBD) of transcription factors provided by TRANSFAC [13], see Tab. 1. AliBaba2 supports matrix construction in levels 5 to 3. We first sort all hits in H by the classification number of the binding sites, yielding H_C.
H_{C} := H sorted by classification number
For each desired classification level H_C is traversed once. While traversing H_C, blocks B of hits belonging to the same classification node are compiled; that is, we aggregate H by classification nodes into blocks B. Each B contains hits, i.e. pairwise alignments above a given threshold, regardless of their position. Thus, in the next step we sort all hits in B by the start position of their carrier to identify possible segments in p.
B_{x} := B sorted by (g.x_{p} + g.x_{b})
A set of hits is considered a segment when it consists of at least minsupport hits overlapping by at least maxmatoverlap nucleotides.
Definition 2. The support of a set of hits H, support(H), is defined as the number of hits it contains: support(H) = |H|.
Definition 3. A segment G is a subset of H with |(a.x_p + a.x_b) - (b.x_p + b.x_b)| < maxmatoverlap for all a in G, b in G, and support(G) >= minsupport.
In a first optimisation step we can discard all segments whose support by sites is below the given threshold minsupport. For each identified segment three tasks have to be performed. First, we construct matrices from its hits. Second, we check the minimum support of each matrix. Third, we split segments when constructed matrices exceed the given distance maxsegoverlap. This yields the following algorithm.
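The segmentation described by Definitions 2 and 3 can be sketched as follows; the chaining of sorted carrier starts and the default thresholds are our own simplifications:

```python
def segments(block_hits, maxmatoverlap=5, minsupport=3):
    """Group the hits of one classification block into segments.

    Hits are (site, x_p, x_b, score) tuples. Hits whose carrier starts
    lie within maxmatoverlap of their neighbour fall into the same
    segment; segments supported by fewer than minsupport hits are
    discarded (the first optimisation step).
    """
    hits = sorted(block_hits, key=lambda h: h[1] + h[2])  # by carrier start
    result, current = [], []
    for h in hits:
        if current and (h[1] + h[2]) - (current[-1][1] + current[-1][2]) >= maxmatoverlap:
            if len(current) >= minsupport:
                result.append(current)
            current = []
        current.append(h)
    if len(current) >= minsupport:
        result.append(current)
    return result
```

Sorting first makes one linear pass over the block sufficient, matching the single traversal of H_C described above.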
Algorithm Phase2
H_C := H sorted by classification number
for each classification level l:=3 to 5
Table 1: Classification of Transcription Factors taken from [13]

Algorithm segmentation
Matrix construction is done heuristically and gap-free. To avoid building all possible combinations for the multiple alignment, it is constructed in a precomputed order. The most sensible order here is the similarity of the pairwise alignments computed in phase 1. Hence, we sort the hits of our segment by the pairwise alignment score.
If we constructed a single matrix directly from G_P, this could lead to problems, especially in the case of repeats of transcription factor binding sites. As can be seen in Fig. 2, we try to map known binding sites b_i to the promoter sequence. The common single matrix would have to contain two distinct parts (shaded). Due to the overlapping of sites, noise is introduced. To avoid this kind of complication we reverse the direction of mapping. Instead of mapping the binding sites to the promoter, we map the promoter to each binding site to construct different matrices. This approach leads to a self-aggregation of matrices. From the data mining perspective, matrices are built using hierarchical clustering, a form of unsupervised machine learning. To further relax the impact of the heuristic alignments, each hit can be used in multiple matrices.
Figure 2: Mapping overlapping sites to the promoter leads to noise in a common matrix. 
To conclude, matrix construction is done by traversing G_P and aligning each g in G_P to each of the already constructed matrices. If g cannot be used for any matrix, it induces its own matrix. The question now is how to decide whether a given g in G_P should be used for an existing matrix M. To answer this we perform several tests.
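The overall traversal, with the acceptance tests abstracted into a predicate, might look like the sketch below; this is our own schematic rendering, not the paper's implementation:

```python
def build_matrices(segment_hits, accepts):
    """Greedy matrix construction over one segment.

    Hits (site, x_p, x_b, score) are traversed by decreasing pairwise
    score. Each hit joins every existing matrix that accepts it (a hit
    may be used in multiple matrices); a hit accepted nowhere induces
    its own matrix. `accepts(matrix, hit)` stands for the overlap,
    similarity, and containment tests described in the text.
    """
    matrices = []
    for g in sorted(segment_hits, key=lambda h: -h[3]):
        used = False
        for m in matrices:
            if accepts(m, g):
                m.append(g)
                used = True
        if not used:
            matrices.append([g])
    return matrices
```

The predicate is where the tests described next (overlap, distance, is_contained) would plug in.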
First, we check whether g and M overlap sufficiently, again using the parameter maxmatoverlap. Then we calculate the distance between g and M to check whether g is similar to M at all. Different measures are available for calculating distances between a matrix and a sequence. For example, Berg and von Hippel use for the binding free energy E:

E = - sum_{l=1..s} ln( p_{lB} / max_B p_{lB} )
with s as the alignment width, where p_{lB} denotes the probability that base B occurs in position l. In TESS a log-likelihood ratio is used. Another example is the scoring function used in MatInspector. Currently, a modified MatInspector measure based on probabilities is implemented, as it seems tolerant to false matches in conserved regions, being based on a sum instead of a product. We use:

similarity(p, M) = (s'/s) * ( sum_{l=1..s'} spec(l) * p_{l,p(l)} ) / ( sum_{l=1..s'} spec(l) * max_B p_{lB} )

with s' as the number of matrix columns covered by the sequence and p(l) as the nucleotide of the sequence at column l,
where spec(l) is the specificity, the normalised negative information content of column l of the matrix M. The correction factor in front accounts for the case where the measured sequence does not cover the matrix completely.
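Such a specificity-weighted, coverage-corrected score could be implemented as below; the data layout (one probability dict per column) and function name are our own reading of the modified MatInspector-style measure, not the paper's code:

```python
def matrix_similarity(seq, matrix, spec):
    """Specificity-weighted similarity of a sequence to a matrix.

    matrix[l] maps each base to its probability p_lB in column l;
    spec[l] is the specificity of column l. The score is normalised by
    the best achievable weighted sum, and the leading factor penalises
    sequences that cover only part of the matrix.
    """
    covered = min(len(seq), len(matrix))
    num = sum(spec[l] * matrix[l].get(seq[l], 0.0) for l in range(covered))
    den = sum(spec[l] * max(matrix[l].values()) for l in range(covered))
    return (covered / len(matrix)) * num / den
```

Because the score is a sum over columns, a single mismatch in a conserved column lowers it only proportionally, instead of collapsing it as a product-based score would.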
Our next test checks whether the site we are examining is already contained in the matrix. This is done by comparing the EMBL reference of g, given as the pair (ID, epos), with all EMBL references already contained in M. Two EMBL references are considered equal if they share the same ID and are at most 40 bp apart:

is_contained(g, M) := there is an m in M with g.ID = m.ID and |g.epos - m.epos| <= 40
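This containment test is a one-liner; the sketch below uses our own names, with the 40 bp tolerance taken from the text:

```python
def same_embl_ref(a, b, max_dist=40):
    """EMBL references (ID, epos) are equal if the IDs match and the
    positions are at most max_dist bp apart."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= max_dist

def is_contained(g_ref, matrix_refs):
    """True if the site's EMBL reference already occurs in the matrix."""
    return any(same_embl_ref(g_ref, m_ref) for m_ref in matrix_refs)
```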
If similarity(g, M) > minmatsim and not is_contained(g, M), then we can construct a candidate matrix M_C.
Mc := M enhanced by g
M_C is constructed by updating the absolute nucleotide frequencies of M with the site g. From the absolute frequencies, probabilities are estimated using:
p_{lB} = (n_{lB} + 0.01) / (N + 0.04)
with N as the number of sites used for matrix construction. By doing so we apply a very careful correction to the observed probabilities. Berg and von Hippel suggest p_{lB} = (n_{lB} + 1) / (N + 4), but since we expect only a few sites to support our prediction, this correction seems too strong. Still, for N = 0 we get a probability of 0.25. From the corrected probabilities we derive the conservation or specificity profile as the negative normalised information content in percent:

spec(l) = 100 * (1 + (1/2) * sum_B p_{lB} * log2(p_{lB}))
Because we construct gap-free alignments, we do not have to define an information content for gaps here. For predefined matrices this has to be done, without a clear interpretation.
Having constructed the enhanced matrix M_C, we can measure its distance to the promoter with the above measure. This yields the first important parameter of the method, minpromsim. We conclude:
keep M_{C} if similarity (p, M_{C}) > minpromsim.
The second important parameter has to be checked now: the specificity of the matrix. We use the average specificity of the matrix, spec(M_C) = (1/s) * sum_{l=1..s} spec(l), and keep M_C only if spec(M_C) >= minspecificity.
Now we know that M_C satisfies all of our conditions and we replace M by M_C. If any of these conditions except is_contained is not fulfilled for our site g, we construct a new matrix for g. This is done by building a preliminary matrix from the site and the promoter, i.e. we use two initial sequences and the promoter is contained in the matrix. For few sequences this is reasonable, as our hypothesis is that the matrix reflects the promoter. Later on, when the matrix has grown, this poses severe problems if the promoter sequence is not a binding site, i.e. a false hit: the promoter sequence disturbs the specificity profile, which hinders the identification of mismatches at highly conserved nucleotides. Hence, if we use the promoter in initial matrix construction, we have to remove it as the matrix grows. Currently this is done at an early stage, when the matrix consists of three sites. In conclusion, we get the following algorithm for matrix construction:
Figure 3: Artificial example to illustrate the dependencies between the objects introduced in the paper ordered by the transcription factor classification hierarchy. 
In checkminsupp each matrix with less than minsupport sites is discarded. Each segment is named after the matrix containing the most hits. In splitsegments a segment is split into different ones if its matrices do not overlap by at least maxsegoverlap nucleotides. To check this, we have to sort the matrices by their position. Again, each segment is named after its matrix with the most support.
The method was implemented in C++ with the Standard Template Library (STL) on Windows NT and Solaris. For convenient usage an interactive web interface was developed. At first only the names of the segments are given as the prediction output. Each segment can then be selected to show the corresponding matrices. All sites, factors, and classes are linked to TRANSFAC and can be explored individually. The parameters of AliBaba2 are split into expert and regular ones. Except for the expert parameters, linguistic variables like strict and lazy have been assigned to each possible value.
To evaluate the approach, several tests have been performed to answer the following questions:
A fair and complete comparison of the presented method to available tools is hardly possible. Different data and program versions are available, and the presented algorithms leave room for improvement. All tests should be judged as preliminary and with care. Currently, TRANSFAC 3.5 public as the data source and MatInspector 2.2 public (both are also available in enhanced commercial versions) have been used for evaluation. In the following, just the terms "MatInspector" and "TRANSFAC" are used for simplicity. All sites of TRANSFAC which refer to a TRANSFAC matrix and whose factor carries a classification number have been extracted into the test set CM. The sites in CM have been replaced, via the corresponding EMBL links, by sequences of a common length of 20 bp. CM contains 1429 sites. As a negative test set, three exon3 sequences (AALFUL_9, ABPGKA_6, ANACUD_7) totalling 2400 bp have been used. Exon2 sequences (AF009652, AFPYRG_5, AFUOXGENE_4) have also been tested, with the same results. To obtain a true positive hit, AliBaba2 had to assign the correct classification number to each site of CM. A false positive hit was counted when a segment was predicted in an exon3 sequence. MatInspector was assigned a true positive hit when the corresponding matrix was recognised; a false positive hit was counted for a hit in the same exon3 sequences as for AliBaba2. False positives (FP) are normalised to 500 bp, as this is frequently used as the length of a promoter. True positives (TP) are normalised to |CM| = 1429.
For AliBaba2 the parameter minpromsim has been fixed to 85%, minsupport was set to three sequences, and minspecificity ranged from 60 to 90%. Different values can be selected for carrierlen, the matrix width; here 10 and 12 bp are shown. As MatInspector distinguishes two kinds of matrices, selected and all, both variants have been tested. Each time matcoresim was set to 75% and matsim ranged from 70 to 100%. The false positives have been limited to 200 per 500 bp.
Note that we did not restrict the test to those binding sites which belong to the selected matrices of MatInspector. If the selected matrices do not find a hit due to unconserved nucleotides, AliBaba2 will also not be able to construct a matrix. The results of the comparison can be seen in Fig. 4. Two graphs with different parameters are shown for each of MatInspector and AliBaba2. The specificity value of 77% has been marked for both values of carrierlen (10 and 12 bp), as 77% is the default specificity in AliBaba2. For MatInspector, the default parameter of 85% is also specially indicated. When using all matrices, MatInspector's specificity clearly drops compared to the selected matrices. Similarly, specificity obviously drops for AliBaba2 with carrierlen = 10 bp. Nearly the same graph as for carrierlen = 10 bp results from keeping carrierlen = 12 bp and lowering minpromsim to 80% (not shown).
The results indicate that AliBaba2 is able to recognise more binding sites at a fixed false positive rate, or to recognise the same number of binding sites with a higher specificity. Table 2 lists the numbers for the best graphs of each program. From the results we can answer our first question: it seems at least sensible to construct matrices automatically, and these matrices can reach the quality of manually constructed ones. It is also clear that we indeed find an improvement in specificity, which answers the second question.
Table 2: Recognition of binding sites by MatInspector and AliBaba2

            MatInspector selected mat.           AliBaba2 carrierlen=12
minscore    TP     FN     FP          minspec    TP     FN     FP
70          889    540    3552        65         1199   230    1966
75          856    573    2188        70         1094   335    850
80          805    624    1258        75         842    587    331
85          727    702    644         77         713    716    206
90          562    867    270         80         575    854    108
95          313    1116   79          85         278    1151   10
100         42     1387   3           90         117    1312   0
We can now further ask which kinds of sites are recognised by both programs. To this end, we aggregated the recognised binding sites by transcription factor. As transcription factors, the binding sites' T-numbers (Txxxxx) given by TRANSFAC were chosen. A program achieves a true positive hit when at least one binding site of a transcription factor is predicted. False positive hits are taken from Fig. 4. Surprisingly, both programs showed similar performance, with MatInspector having a slight advantage in the range from 10 to 100 FP per 500 bp. Both programs mostly achieve the same diversity of transcription factors. Figs. 4 and 5 taken together indicate that AliBaba2 has better recognition capabilities especially when many sites are available, which is what can be expected.
Figure 4: Recognition of binding sites by MatInspector and AliBaba2. 
Figure 5: Recognition of transcription factors by MatInspector and AliBaba2. 
A further drill-down of the results is given in Fig. 6. As in Fig. 5 the points for the parameters minspec=75% and matsim=90% are the closest together, we compare the corresponding recognised factors. We see that 60% of all recognised transcription factors (79) are recognised by both programs. Additionally, each program recognises some unique factors.
Figure 6: Recognised sites aggregated by transcription factors (Txxxxx) for AliBaba2 (minspec=75) and MatInspector (matsim=90). 
To illustrate the function of the presented algorithms, Fig. 7 shows how many pairwise alignments of known sites and the promoter are found in phase 1 and how many segments finally result from phase 2 when examining the three exon3 sequences with minpromsim=85% and spec=77%. Fig. 8 shows an example output for the Sp1 binding site between the TATA box and the transcription start of the chicken alpha-cardiac actin gene (M10607). For the constructed alignment, the drop in specificity caused by the increasing support is shown in Fig. 9. The sudden increase in specificity at a support of 3 is caused by taking the promoter out of the matrix, which happens automatically (see algorithm). This leads to an increased specificity, as column 8 of the matrix differs from the promoter. In the above algorithms we compensate for this by applying the distance measure to the promoter and the matrix.
Figure 7: Number of pairwise alignments vs. finally predicted segments in exon3 sequences. 
Figure 8: Example output of an alignment supporting a potential Sp1 site of an actin alpha cardiac promoter close to transcription start. Promoter and matrix differ at one conserved position. 
Figure 9: Specificity vs. support when constructing the above Sp1 alignment. At a support of three the promoter is taken out of the matrix (see algorithm). 
We have presented a new approach for the prediction of transcription factor binding sites. Our approach is the first to view the prediction of binding sites as a process starting directly at the known binding sites and producing context-specific matrices. Therefore, our approach resembles a real search engine much more than previous approaches do. From the bioinformatics perspective, AliBaba2 unifies two general approaches: general local alignment search tools and the use of matrices as expert-generated models for binding site prediction. From a theoretical point of view, our approach is supported by the central idea of Berg and von Hippel. We aim at constructing highly specific, context-dependent matrices for the sequence being analysed. Previous approaches use predefined matrices for the general case, which is why they lose important specificity. We developed the necessary algorithms implementing this idea and could show the feasibility of the general approach. Our method has been tested on a special data set created to compare it to MatInspector, a well-known program using predefined matrices. Our automatically constructed matrices not only match the quality of the manually constructed ones, they exceed them significantly when measuring the number of recognised binding sites. Future work will include more detailed comparisons and a refinement of the presented algorithms. An obvious extension of the presented approach is the iterative refinement of predictions using the constructed matrices.