In Silico Biology 6, 0030 (2006); ©2006, Bioinformation Systems e.V.  

Analysis and comparison of benchmarks for multiple sequence alignment

Gordon Blackshields*, Iain M. Wallace, Mark Larkin and Desmond G. Higgins

Conway Institute, University College Dublin
Belfield, Dublin 4, Ireland

* Corresponding author


Edited by E. Wingender; received April 03, 2006; revised May 30, 2006; accepted June 03, 2006; published June 08, 2006


The most popular way of comparing the performance of multiple sequence alignment programs is to use empirical testing on sets of test sequences. Several such test sets now exist, each with potential strengths and weaknesses. We apply several different alignment packages to 6 benchmark datasets, and compare their relative performances. HOMSTRAD, a collection of alignments of homologous proteins, is regularly used as a benchmark for sequence alignment though it is not designed as such, and lacks annotation of reliable regions within the alignment. We introduce this annotation into HOMSTRAD using protein structural superposition. Results on each database show that method performance is dependent on the input sequences. Alignment benchmarks are regularly used in combination to measure performance across a spectrum of alignment problems. Through combining benchmarks, it is possible to detect whether a program has been over-optimised for a single dataset, or alignment problem type.

Keywords: multiple sequence alignment, benchmark, BAliBASE, HOMSTRAD, IRMbase, OxBench, PREFAB, SABmark, structural superposition, RMSD, over-training, over-alignment


Over the past 10 years, more than 50 multiple sequence alignment methods and/or packages have been described. In 2005 alone there were at least 20 new publications [Wallace et al., 2006]. This reflects the importance of multiple alignments in day-to-day sequence analysis and the variety of purposes for which they are needed. This raises the question of which method to use and what criteria should be used in comparing methods. ClustalW [Chenna et al., 2003] is one of the most widely used packages and this reflects its stable and user-friendly interface. Other packages are faster and/or more accurate [Katoh et al., 2002; Edgar, 2004]. Some packages are relatively slow to run but appear to give very high accuracy alignments or to allow the mixing of heterogeneous data inputs [Notredame et al., 2000; Do et al., 2005]. Some methods are specialized for aligning whole genomes while others are specialised for local multiple alignments [Morgenstern, 1999; Van Walle et al., 2004].

One major way in which new packages have been compared is performance on sets of benchmark test cases. In the early days of sequence alignment software it was common to demonstrate performance on a small number of examples. This was clearly open to abuse (developers could carefully choose which test cases and which parameters to use) and in the late 1990s, BAliBASE [Thompson et al., 1999] was produced as a large-scale test set. It included relatively large numbers of alignments and covered different situations. BAliBASE was produced by partial manual alignment of sequences, which left it open to the accusation that it could include subjective alignment information. This was partly addressed by its use of "core" regions that could be reliably and relatively objectively annotated; test comparisons were usually restricted to these core blocks.

To get round the possibility of subjective test cases, sets of 3-D protein structural superpositions have been used to derive test cases, as in HOMSTRAD [Mizuguchi et al., 1998], OxBench [Raghava et al., 2003] and PREFAB [Edgar, 2004]. One problem with HOMSTRAD has been the lack of annotation of core and non-core regions; we remedy that in this paper by using a simple, automatic RMSD cut-off measure. A radical alternative has been the development of artificially generated test cases [Lassmann and Sonnhammer, 2002; Subramanian et al., 2005]. Although it is greatly preferred that assessment be made on real biological data, this allows the assessment of programs designed to align sequences for which few test cases exist, such as locally related sequence sets.

The question then is which benchmarks should be used, and how? It is becoming increasingly common to use combinations of test sets to allay the fear that some software developers could over-train on particular test sets, by finding the parameter combinations that make a package look best on a particular set of test cases. While this undoubtedly happens, its importance may have been exaggerated; it has nonetheless resulted in the perceived need for very extensive and laborious benchmarking of new packages. Another problem has been the tendency of some methods, especially global aligners, to over-align sequences (aligning residues and regions of the sequences outside of the conserved core that share no structural or homologous relationship), while local aligners are more prone to mis-alignment [Cline et al., 2002]. This is often not observed during benchmarking, as test cases are usually restricted to reliably alignable, annotated core regions. But if, for example, a user were to apply a global aligner to full-length sequences for which the domain boundaries are unknown, the extra regions reported as aligned could be misleading. Several benchmarks, such as BAliBASE and OxBench, now include test cases with full-length sequence data that can be used to assess this problem.

In this paper, we wish to make an exhaustive comparison of a set of alignment programs over a set of commonly used benchmarks in order to be able to make some clear recommendations as to which benchmarks to use.


Sequence Alignment Programs

Fifteen alignment programs were investigated in this project, as listed in Tab. 1. There is considerable overlap between some of them (for example, several of the programs come from the MAFFT suite).

Table 1: Programs used in this investigation.
Align_m (2.3) [Van Walle et al., 2004]: Local, specialised for highly divergent sequences.
ClustalW (1.8) [Thompson et al., 1994]: Global, progressive alignment package.
Dialign2 (2.2) [Morgenstern, 1999]: Local, aligns segments of sequences rather than individual residues.
Dialign-t (0.1.3) [Subramanian et al., 2005]: Local, progressive alignment; a recent re-implementation of Dialign2.
MAFFT (5.531) [Katoh et al., 2002]: Suite of alignment programs:
    FFTNS: Global, uses the Fast Fourier Transform to generate the tree.
    FFTNSi: As FFTNS, but with an iteration step to refine the alignment.
    NWNS: Global, uses the traditional Needleman-Wunsch algorithm.
    NWNSi: As NWNS, but with an iteration step to refine the alignment.
    FINSi: Local, iterative, uses local pairwise alignment information.
    GINSi: Global, iterative, uses global pairwise alignment information.
MUSCLE (3.6) [Edgar, 2004]: Global, iterative, progressive alignment program that uses Log Expectation as its scoring function.
ProbCons (1.09) [Do et al., 2005]: Global, uses posterior probabilities from HMMs and pairwise alignment consistency.
PCMA (2.0) [Pei et al., 2003]: Global, switches alignment strategies depending on the sequence data: ClustalW is used to align highly similar sequences and to form pre-aligned groups, and T-COFFEE is used to align the more divergent groups.
POA (v2) [Lee et al., 2002]: Local, uses Partial Order graphs.
T-COFFEE (1.37) [Notredame et al., 2000]: Combines both global and local methods; uses consistency.

Benchmark Datasets

Six alignment databases were used as benchmarks for this investigation.

BAliBASE [Thompson et al., 1999] contains eight reference sets, each dealing with a different type of alignment problem. Ref1 deals with test cases containing small numbers of equidistant sequences, and is further subdivided by percent identity. Ref2 alignments contain "orphan", or unrelated, sequences. Ref3 test cases contain a pair of divergent subfamilies, with less than 25% identity between the two groups. Ref4 is concerned with long terminal extensions, while Ref5 test cases contain large internal insertions and deletions. Test sets from References 6-8 deal with problems like transmembrane regions, inverted domains, and repeat sequences. In previous versions of BAliBASE, test cases were confined to homologous regions. In practice, the boundaries of such regions may be unknown. The current version [Thompson et al., 2005] now also provides duplicate test cases containing full-length sequences. Only the first five reference sets are used here, as they have been corrected and verified in the latest release.

OxBench [Raghava et al., 2003] comprises 3 related datasets. Test cases in the MASTER set deal with isolated domains derived exclusively from sequences of known structure. The FULL set was generated from suitable MASTER test cases, using full-length sequence data. High-scoring homologous sequences were added to each MASTER test case to generate the EXTENDED set. The results from this third set, however, are not used here: some of its test cases proved too large for some programs, which aborted due to excessive memory requirements. Of the 276 test cases selected from EXTENDED, T-COFFEE returned 235 alignments, and Align-m was only able to align 107, using a single processor with 4 GB of RAM.

PREFAB [Edgar, 2004] test cases are generated by taking a pairwise alignment of sequences of known 3D structure, and adding up to 24 high scoring homologues for each sequence. Accuracy is assessed on the structural alignment of the original pair alone.

SABmark [Van Walle et al., 2005] is divided into two subsets. Each test group in the SUPERFAMILY set represents a SCOP superfamily, whose sequences are 25-50% identical. Each test group in the TWILIGHT set represents a common SCOP fold and sequences are 0-25% identical. In addition, these two subsets are also provided with non-homologous (false positive) sequences included within each group. Instead of a single alignment acting as a reference, SABmark provides multiple pairwise references for each test, and it is the average score from each of these references that is taken here as a score for each test case.

IRMBASE [Subramanian et al., 2005] test cases contain a number of simulated motifs [Stoye et al., 1998] inserted into otherwise random (unalignable) sequences, and as such is entirely different to the other benchmarks used in this study. Test cases are designed to examine whether a method can detect isolated motifs within sequences, and so are tailored to a local alignment approach.

HOMSTRAD [Mizuguchi et al., 1998] is a database exclusively based on protein structures derived from the PDB, arranged into homologous protein families. It was not specifically designed as a benchmark database, although it is regularly employed as such.

Each benchmark was divided into subsets. For BAliBASE, IRMBASE and SABmark, this division already exists. For HOMSTRAD, PREFAB and the OxBench datasets, we categorised the test cases according to percent sequence identity of the reference alignment. The names and details of each category are described in Tab. 2. The size of each dataset may differ from published figures, as they were restricted to test cases of 4 or more sequences for this analysis.
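The identity-based categorisation used for HOMSTRAD, PREFAB and OxBench can be expressed as a simple binning rule. The sketch below is an illustrative helper (the function name is ours, not part of any benchmark's tooling), assuming the average percent identity of each reference alignment has already been computed:

```python
def identity_bin(pct_id, prefix):
    """Assign a subset label from average percent identity.

    Bins follow the 20-point ranges used for HOMSTRAD, PREFAB and
    OxBench: 0-20% -> _20, 20-40% -> _40, ..., 80-100% -> _100.
    Boundary values fall into the upper bin; 100% maps to _100.
    """
    upper = min(100, (int(pct_id) // 20 + 1) * 20)
    return "%s_%d" % (prefix, upper)
```

For example, a HOMSTRAD family with 16% average identity would be labelled HOM_20, and a PREFAB test case with 87% identity PRE_100.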

Table 2 : Benchmark Datasets used in this investigation, and their subsets.
Subset    Number of tests    Average percent identity    Average sequence number    Average sequence length
BAliBASE [Thompson et al., 2005]
Each reference set is prefixed with either "BBS" (reduced to homologous regions) or "BB" (duplicate set containing full-length sequence data).
BBS11 38 18 7 227
BBS12 44 32 9 208
BBS20 41 39 46 234
BBS30 30 31 63 246
BBS50 16 29 28 328
BB11 38 17 7 295
BB12 44 31 9 390
BB20 41 35 46 392
BB30 30 29 63 360
BB40 49 27 28 479
BB50 16 27 28 489
HOMSTRAD [Mizuguchi et al., 1998]
HOM_20 18 16 6 182
HOM_40 127 30 7 226
HOM_60 77 47 6 181
HOM_80 11 65 5 306
IRMBASE [Subramanian et al., 2005]
Subsets are categorised by number of motifs (1-3),
and are subdivided by number of sequences (4-16).
IRM_1_4 20 11 4 445
IRM_1_8 20 10 8 445
IRM_1_16 20 9 16 445
IRM_2_4 20 14 4 590
IRM_2_8 20 13 8 590
IRM_2_16 20 12 16 591
IRM_3_4 20 17 4 635
IRM_3_8 20 16 8 635
IRM_3_16 20 15 16 635
OxBench [Raghava et al., 2003]
"MASTER": reduced to homologous regions.
"FULL": duplicate set containing full-length sequence data.
MASTER_20 17 17 7 185
MASTER_40 82 33 15 138
MASTER_60 78 50 9 134
MASTER_80 73 69 9 136
MASTER_100 26 87 8 121
FULL_20 17 17 7 470
FULL_40 82 33 15 357
FULL_60 77 50 9 269
FULL_80 72 69 9 264
FULL_100 22 87 8 210
PREFAB [Edgar, 2004]
PRE_20 799 14 44 222
PRE_40 654 26 46 248
PRE_60 87 49 45 232
PRE_80 56 69 47 269
PRE_100 70 91 49 224
SABMARK [Van Walle et al., 2005]
"_FP" denotes inclusion of false-positive sequences
SUPERFAMILY 310 25-50 9 167
SUPERFAMILY_FP 424 25-50 15 231
TWILIGHT 158 0-25 10 158
TWILIGHT_FP 209 0-25 17 220
For BAliBASE, IRMBASE and SABmark, subsets are predefined. BAliBASE reference sets are prefixed with "BBS" or "BB", to denote when either homologous regions or full-length sequences are used in the test cases. IRMBASE reference sets are categorised by the number of motifs (1,2 or 3), and are further classified by the number of sequences (4, 8 or 16). For SABmark, the suffix "_FP" denotes the inclusion of false positive sequences into the test cases. For HOMSTRAD, PREFAB and the OxBench datasets (MASTER and FULL), test cases are categorised using the percent identity of the sequences within the reference alignment (0-20%, 20-40%, 40-60%, 60-80%, 80-100%), and are labelled accordingly. The size of each dataset differs from published figures, as they were restricted to test cases of 4 or more sequences for this analysis.

Core region identification in HOMSTRAD

A feature common to the other datasets but absent from HOMSTRAD is annotation of reliably alignable "core" regions (for SABmark, annotation is unnecessary as it is only the core regions that are used). Annotation allows alignment comparison programs to focus on the question of how well a method aligns what should be aligned and to disregard regions where the alignment is arbitrary. This annotation is especially important when using test cases containing, for example, full-length sequences with short, isolated motifs. To incorporate this feature into HOMSTRAD, we used structural superpositions to determine the core region for each entry. Within HOMSTRAD, a full structural superposition is available for each test. A column of residues in the reference can be considered reliably aligned if the corresponding alpha carbons for those residues are within a certain distance in 3D space. We calculated the Euclidean distance between each pair of alpha carbons within the column, and then used the RMSD (Root Mean Square Deviation) of those distances. If the RMSD of a particular column is within a threshold of 3.0 Angstroms, it is considered reliably aligned. Gaps within a column mean that the column is inherently less reliable, and so a penalty was applied relative to the proportion of gaps present. RMSD for a column c is therefore calculated as follows:

RMSD(c) = (H / R) * sqrt[ (1 / N) * sum_{i<j} ( (x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2 ) ]

where x, y and z refer to the co-ordinates of a residue's alpha carbon, N is the number of available pairwise distances between residues in the column, H is the height of the column (gaps included) and R is the number of residues in that column, so that the factor H/R inflates the RMSD of gappy columns. To illustrate this, we use the HOMSTRAD entry for globins, which at 41 sequences is the largest entry within HOMSTRAD, and whose SCOP-defined core of just 6 alpha helices is easily visualised (Fig. 1).
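A minimal sketch of this column-level test in code, assuming the gap penalty is applied by scaling the column RMSD by H/R (column height over residue count), so that gappy columns are pushed towards the 3.0 Angstrom threshold:

```python
import math

def column_rmsd(coords, height, threshold=3.0):
    """Decide whether one alignment column is reliably aligned ("core").

    coords    -- list of (x, y, z) alpha-carbon co-ordinates for the
                 residues present in the column (gaps contribute none)
    height    -- total number of rows in the column (H), gaps included
    threshold -- RMSD cut-off in Angstroms

    Returns (penalised RMSD, is_core); (None, False) if fewer than
    two residues are present, since no pairwise distance exists.
    """
    R = len(coords)                  # residues actually present (R)
    if R < 2:
        return None, False
    # squared Euclidean distances over all residue pairs in the column
    sq = [sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
          for i in range(R) for j in range(i + 1, R)]
    N = len(sq)                      # number of available pairwise distances
    rmsd = math.sqrt(sum(sq) / N)
    rmsd *= height / R               # gap penalty: scale up when gaps present
    return rmsd, rmsd <= threshold
```

A column of perfectly superposed alpha carbons gives an RMSD of 0 and is marked as core; the same geometry in a column that is half gaps has its RMSD doubled before the threshold test.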

Figure 1: Determination of core regions in HOMSTRAD. (a) ClustalX representation of glob, the HOMSTRAD entry for the alignment of globins. There are 41 sequences in total, averaging 146 residues in length, and sharing 32% sequence identity. (b) ClustalX histogram of sequence conservation. (c) ClustalX histogram of SSE conservation (from DSSP assignments). The data for (a-c) is contained within the HOMSTRAD glob.tem file. (d) RMSD based definition of the globin core. Columns for which the RMSD between corresponding alpha carbons is below the 3A threshold (blue line) are considered reliably aligned ("core").

Assessment of Alignment Accuracy

Several of the datasets provide scoring schemes to assess alignment quality. BaliScore provides two metrics, CS (Column Score) and SPS (Sum-of-Pairs Score). CS is the number of columns of residues that are identical in both test and reference alignments, divided by the length of the reference. The SPS scheme instead analyses the sequence alignment in a pairwise fashion, and is equivalent to PREFAB's Q score. The authors of IRMBASE also use these two scores to assess alignment accuracy. When the reference alignment is pairwise, as is the case for PREFAB and SABmark, the two scores are equivalent. Several more sophisticated methods are available, such as APDB [O'Sullivan et al., 2003] and the STAMP metric [Russell and Barton, 1992] (available as part of the OxBench package), which provide assessment by direct structural comparison using PDB files and so act independently of a reference alignment. This can be a very useful assessment tool, but cannot be applied when no such PDB files exist, as is the case for IRMBASE. We therefore use general scoring methods that can be applied to any test case for which a suitable reference alignment is available, making assessment across multiple datasets more feasible.
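As an illustration, CS can be computed by representing each column as the tuple of residue indices it aligns and counting how many reference columns are reproduced exactly in the test alignment. This is a minimal sketch of the definition, not the BaliScore or qscore implementation itself:

```python
def columns(aln):
    """Map each alignment column to the residues it aligns.

    aln is a list of equal-length gapped sequences; each column is
    represented as a tuple holding the running residue index per
    sequence, or None where that sequence has a gap.
    """
    counts = [0] * len(aln)
    cols = []
    for j in range(len(aln[0])):
        col = []
        for i, seq in enumerate(aln):
            if seq[j] == '-':
                col.append(None)
            else:
                col.append(counts[i])
                counts[i] += 1
        cols.append(tuple(col))
    return cols

def column_score(test, reference):
    """CS: fraction of reference columns reproduced exactly in the test."""
    test_cols = set(columns(test))
    ref_cols = columns(reference)
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

An alignment identical to the reference scores 1.0; shifting every residue of one sequence by a single position drops CS to 0, which is exactly the behaviour the Shift score is introduced to moderate.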

Although very useful in evaluating alignment quality, the Column score has the disadvantage that it does not penalise over-alignment (aligning residues whose positions in the reference are arbitrary). As a complementary measure, we therefore use the Shift score [Cline et al., 2002], which addresses these issues. Based on a measure of disagreement between two alignments, the Shift score includes penalties for both over- and under-alignment, and gives positive but reduced scores for slight misalignment; very bad alignments may achieve negative scores. A method that over-aligns sequences may thus have a high Column score but a low Shift score. Conversely, if the residues of a test alignment are slightly misaligned (shifted), the Column score will be low while the Shift score remains relatively high, since columns that are almost correct, but whose residues are shifted by even a single position with respect to the reference, receive no Column score at all. SPS is not used, as it is not very discriminating, tending to reduce the differences between methods.

We scored each test case using the qscore program, which calculates both of these scores. The Shift score is a pairwise measurement, so qscore takes the average score over all pairs of sequences within the alignment. For all measurements, qscore gives a maximum score of 1 for a perfect alignment; we present these results as percentages. For BAliBASE, HOMSTRAD, IRMBASE and OxBench, the reference is a full multiple sequence alignment. Accuracy on the PREFAB dataset is assessed using the original pairwise structural alignment. For SABmark, many such pairwise references exist for each group, and the scores presented here are the result of averaging the pairwise scores for each test case before averaging over all test cases in a particular subset; this takes into account the varying number of sequences per group. Results from a test case are taken only from the regions defined by the reference to be reliable, and are used only if all programs return an alignment.

Parameter optimisation and over-optimisation

To investigate the effects of parameter optimisation on a dataset, we selected three methods (ClustalW, MUSCLE and LINSi, from the MAFFT suite) and their main parameters, the gap-opening (GOP) and gap-extension (GEP) penalties. These methods were chosen because they are fast and share the option to change both parameters at the command line, although each program has many other parameters that could also be optimised. We then applied each method to 5 different datasets (BAliBASE, the OxBench MASTER set, HOMSTRAD, and two randomly selected, non-overlapping PREFAB subsets of 200 test cases each, denoted PRESUB1 and PRESUB2), across an exhaustive range of parameter combinations. Given the different defaults for each method, a relatively large search space was required: each method was applied using an initial GOP range of 0.0-20.0 (in 0.5 increments) and a GEP range of 0.0-10.0 (in 0.2 increments). Once roughly optimal parameters had been identified, the search was confined to the space immediately around these points, using smaller increments of 0.02 for GOP and 0.01 for GEP.
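The coarse-to-fine search can be sketched generically. Here `benchmark_score` is a hypothetical callback standing in for running the aligner with the given penalties and averaging the Shift score over a dataset; nothing below is specific to any one aligner's command line:

```python
import itertools

def frange(start, stop, step):
    """Inclusive floating-point range."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]

def grid_search(benchmark_score, gops, geps):
    """Return the (gop, gep) pair maximising the benchmark score."""
    return max(itertools.product(gops, geps),
               key=lambda p: benchmark_score(*p))

def optimise(benchmark_score):
    # coarse pass over the full search space: GOP 0-20 by 0.5, GEP 0-10 by 0.2
    gop, gep = grid_search(benchmark_score,
                           frange(0.0, 20.0, 0.5), frange(0.0, 10.0, 0.2))
    # fine pass confined to the neighbourhood of the coarse optimum,
    # with the smaller increments of 0.02 (GOP) and 0.01 (GEP)
    return grid_search(benchmark_score,
                       frange(max(0.0, gop - 0.5), gop + 0.5, 0.02),
                       frange(max(0.0, gep - 0.2), gep + 0.2, 0.01))
```

In practice `benchmark_score` would invoke the aligner once per test case and is by far the dominant cost; the grid itself is only a few thousand evaluations per pass.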


All methods were scripted using the Python programming language. Additional Python libraries (Numarray, Pylab, Matplotlib) were used to calculate the RMSD data and to create the graphics in Fig. 3. Wilcoxon rank sum tests were performed using the R statistical language, while the heatmaps in Fig. 2 were created using an additional R library, MADE4 [Culhane et al., 2005]. The parameter optimisation searches required large amounts of processor time, which was provided by the SFI/HEA Irish Centre for High-End Computing (ICHEC).


When only alignable regions (BBS) are being aligned, ProbCons achieves the highest scores in all but one BAliBASE subset (BBS12), and in this one case the difference is not statistically significant (Tab. 3). However, when full-length sequences are used in the test cases, the optimal aligner in most cases incorporates local alignment strategies (FINSi and PCMA). It is clear in both the BBS and BB subsets that all methods experience difficulties in aligning sequences of low identity: the average improvement in Column score going from BBS11 (0-20%) to BBS12 (20-40%) is 43.54%; going from BB11 to BB12 (i.e. full-length sequences), this relative improvement increases to 55.16%.

Table 3: Results from a subset of the BAliBASE dataset for test cases of at least 4 sequences in size.
          BB11             BB12             BB20             BB30             BB40             BB50             BBS11            BBS12            BBS20            BBS30            BBS50
FFTNS     26.97**/30.73**  84.25**/69.71**  39.07**/37.92**  48.47**/42.45*   65.03**/53.41**  71.22*/49.92**   42.97**/43.87**  87.68**/72.56**  55.36**/50.99**  67.18**/51.00**  68.33**/52.95**
STD DEV   7.027/2.908      4.789/1.65       7.17/3.55        10.47/3.84       8.7/2.44         7.898/2.61       11.27/4.51       4.93/1.69        9.01/4.016       11.86/3.57       7.06/2.399
Test cases contain full-length sequence data ("BB") and also sequences reduced to homologous regions only ("BBS"). Values in bold (red) denote the highest score within each subset. First score within each cell denotes the Column score, second score denotes Shift score. The significance of difference from the most accurate method is indicated by * (p < 0.05) or ** (p < 0.01) (Wilcoxon test).

For the OxBench, HOMSTRAD and PREFAB datasets (Tabs. 4-5), test cases were categorised with respect to percent sequence identity. As with BAliBASE, local methods like FINSi and PCMA achieve the highest scores on test cases containing full-length sequences of lower identity, while ProbCons soon becomes the optimal aligner for the later categories in OxBench. For HOMSTRAD, ProbCons is again the most accurate alignment method in all but the last category, where T-COFFEE and ClustalW both outperform it. PREFAB, however, shows two iterative MAFFT programs to be the most accurate, except between 80-100% identity, where Dialign-t achieves a marginally higher score. As expected, the greatest variation in performance occurs when sequences are more distantly related, while there is much more agreement between methods at higher identities. The standard deviation for Column score at 60-80% identity is just one-third of the same measure at 0-20%, and reduces further to just 20% of it at the highest identity category if HOMSTRAD (which has no test cases in the 80-100% category) is excluded. This agreement is reflected in the statistical analysis, where very few methods perform statistically worse than the optimal method.

Table 4: Results from a subset of the OxBench dataset for test cases of at least 4 sequences in size.
          FULL_20       FULL_40         FULL_60        FULL_80         FULL_100      MASTER_20       MASTER_40        MASTER_60        MASTER_80        MASTER_100
POA       2.145/17.01   28.03**/35.89   63.66**/59.3   86.31**/67.33   92.96*/75.58  3.88**/22.30**  37.58**/59.14**  71.73**/81.72**  88.34**/90.83**  97.19**/95.83
STD DEV   3.19/2.155    3.577/1.36      3.18/0.609     1.07/0.396      1.608/0.525   5.6/6.287       6.509/2.52       3.818/0.816      1.3/0.236        0.42/0.07
Test cases contain full length sequence data ("FULL") and sequences restricted to alignable regions only ("MASTER") with 0-20%, 20-40%, 40-60%, 60-80% and 80-100% sequence identity respectively. Values in bold (red) denote the highest score within each subset. The first score within each cell denotes the Column score, while the second score denotes Shift score. The significance of difference from the most accurate method is indicated by * (p < 0.05) or ** (p < 0.01) (Wilcoxon test).

Table 5: Results from subsets of (a) HOMSTRAD and (b) PREFAB dataset for test cases of at least 4 sequences in size.
          HOM_20      HOM_40     HOM_60      HOM_80      PRE_20       PRE_40      PRE_60     PRE_80      PRE_100
STD DEV   8.87/4.978  6.05/1.97  3.077/0.82  5.096/2.64  6.285/5.359  5.61/4.277  2.69/2.78  0.889/0.43  1.059/0.74
Test cases are subdivided with 0-20%, 20-40%, 40-60%, 60-80% and 80-100% sequence identity respectively. Values in bold (red) denote the highest score within each subset. First score within each cell denotes the Column score, the second value denotes the Shift score. The significance of difference from the most accurate method is indicated by * (p < 0.05) or ** (p < 0.01) (Wilcoxon test).

Looking at the differences between the results for FULL (and BB11 and BB12 in BAliBASE), we can see evidence of over-alignment and slight mis-alignment, detected by the Shift score. Below 20% sequence identity, test cases are very difficult to align, which the Column score penalizes quite severely; because the Shift score gives reduced but positive scores for residues that are only slightly mis-aligned, Shift scores are high in comparison. At higher percentages of sequence identity, however, the Column score is high and the Shift score relatively low. This is likely to happen, for example, if a method identifies a domain correctly but mistakenly extends it too far. This level of reduction is not seen in the corresponding MASTER entries, as those test cases are largely reduced to reliably alignable regions anyway.

For SABmark (Tab. 6), it is again clear that all alignment programs have problems when aligning divergent sequences: results from the TWILIGHT set show an average drop of 28.5% in accuracy compared to SUPERFAMILY. The addition of unrelated sequences to the tests, however, had varying effects on performance. The largest decrease is seen in the MAFFT suite of programs, with an average Column score drop of 5.45% in SUPERFAMILY and 5.15% in TWILIGHT. ProbCons, the best performing method across all SABmark categories, experiences drops of only 1.2% and 1.91% respectively. In several instances, both scores show that some methods actually benefit from the inclusion of unrelated sequences: ClustalW's Column score improves by approximately 0.2% for the SUPERFAMILY set, while POA's Column score shows a 1.7% increase in SUPERFAMILY and 2.2% in TWILIGHT, with similar increases in Shift score.

Table 6: Results from both subsets of SABmark, namely Superfamily and Twilight.
Each subset is also duplicated with false positive sequences added. Values in bold (red) denote the highest score within each subset. First score within each cell denotes the Column score, the second value denotes the Shift score. The significance of difference from the most accurate method is indicated by * (p < 0.05) or ** (p < 0.01) (Wilcoxon test).

On the IRMBASE dataset (Tab. 7), POA Shift score performance appears erratic with respect to the other methods. When only one motif is to be aligned, POA clearly outperforms all other methods, but its performance deteriorates rapidly as more motifs are added, levelling off only with the addition of more sequences in each reference. Otherwise, all methods improve steadily with the addition of more sequences and motifs.

Using the Column score, however, it is clear that all methods that use a local alignment strategy have a definite advantage over pure global aligners on IRMBASE. Two distinct groups appear early on, when only one motif is to be aligned. The local aligners Dialign-t, Dialign2, POA, Align-m and FINSi all score highly, along with T-COFFEE and PCMA, which incorporate local alignment strategies. Much lower down, the second group comprises all global aligners, whose performance gradually improves, albeit somewhat erratically, as more sequences and motifs are added. This improvement appears much smoother when the Shift score is used, suggesting that while the columns may not be absolutely correct relative to the reference alignment, the overall quality of the alignment is slowly increasing. MUSCLE and ClustalW score so badly that they receive negative Shift scores for all test cases with only one motif; ClustalW does not score above 0.0 until 3 motifs are to be aligned. Dialign-t, however, achieves the highest Column score in 5 subsets, and the highest total average score over the full dataset (84.0%), followed by Dialign2 (81.13%).

Table 7: Results from the IRMbase dataset.
          IRM_1_4          IRM_1_8          IRM_1_16        IRM_2_4          IRM_2_8          IRM_2_16         IRM_3_4          IRM_3_8          IRM_3_16
ClustalW  1.029**/-0.39**  1.327**/-0.98**  0.0**/-2.28**   3.35**/0.825**   0.33**/-0.34**   0.65**/-1.22**   11.74**/3.635**  2.99**/3.455**   3.068**/2.808**
MUSCLE    21.54**/8.849**  11.48**/7.72**   2.17**/5.97**   16.41**/10.84**  8.12**/9.026**   8.18**/11.66**   31.24**/19.90**  30.41**/23.53**  16.32**/21.29**
POA       87.88/91.34      74.57**/81.84    70.24/79.09     26.93**/52.7     12.80**/41.12**  5.036**/37.27**  13.49**/39.44**  10.36**/40.85**  3.10**/35.66**
Each set differs in the number of ROSE-generated motifs (1, 2 or 3) inserted into otherwise random sequences, and in the number of such sequences (4, 8 or 16) used in each alignment. Values in bold (red) denote the highest score within each subset. The first score within each cell denotes the Column score, the second value the Shift score. The significance of difference from the most accurate method is indicated by * (p < 0.05) or ** (p < 0.01) (Wilcoxon test).

Clustering programs and benchmarks

Using the above data, correlation plots were made to visualise performance consistency across multiple datasets (Fig. 2). Column scores show quite distinct groupings of local and global alignment programs, driven largely by performance on IRMBASE. The only local method outside the local group is POA, whose performance deteriorates rapidly on IRMBASE despite scoring highly on its first subsets, which explains its presence as an outlier. Within the global alignment group, ClustalW and MUSCLE group quite closely, away from the global MAFFT programs and ProbCons. Datasets, too, form clear groups. IRMBASE, unique among the datasets, clearly forms a tight group of its own. PREFAB test sets of high sequence identity group together, distinct from a large subgroup that contains almost all other test sets, except for HOM_80 and FULL_100, which appear in groups of their own.

All MAFFT programs group closely together when Shift scores are used. The performance most similar to MAFFT is that of ProbCons, and within this larger group we also find T-COFFEE and PCMA. Slightly more distant are MUSCLE and ClustalW. Outside this large subgroup, Dialign2 and Dialign-t group together, along with another local aligner, POA. Each dataset is similarly grouped according to method performance, and several subgroups are seen. IRMBASE again forms a group of its own, and is most closely linked with high-identity subsets from PREFAB and the MASTER datasets. SABmark's SUPERFAMILY set (25-50% identity) and TWILIGHT set (0-25% identity) group closely with subsets of similar identity from all benchmark datasets that are reduced to homologous regions only. The effects of adding unrelated sequences to SABmark test cases were inconsistent across methods, and this is reflected in the two false-positive datasets grouping together, close to FULL_20.

Figure 2: Clustering Programs and Datasets. Methods and datasets are grouped using correlation similarity and average linkage clustering as described by [Eisen et al., 1998], to visualise performance consistency across all datasets. Column (a) and Shift (b) scores are taken from each method across all datasets and their subdivisions as described in Tab. 2.

Parameter optimisation and over-optimisation

Optimal parameter combinations for ClustalW, LINSi and MUSCLE were identified on each of the five test sets, using the Shift score as the measure of accuracy. ClustalW generally shows the greatest difference between optimal and default parameters (Tab. 8a, Fig. 3), whereas the optimal and default parameters for LINSi and MUSCLE are clearly highly co-localised across all datasets. ClustalW is, however, arguably the most robust to changes in the gap parameters, with relatively little deterioration in accuracy even when clearly sub-optimal values are used. Overall, these effects are minimal when the overall performances of the methods are compared. For MUSCLE and LINSi this most likely reflects good initial parameter choices. For ClustalW, the initial parameter choices are not optimal but the effect on performance is minimal: the results are essentially unchanged over large areas of the plots, regardless of the initial settings for the gap penalties.

These optimal parameters were then applied to each of the other datasets in turn and the differences in performance were measured (Tab. 8b). In each case, all three methods show improved scores when optimised on the dataset (shaded), and for each dataset the improvements are statistically significant for at least two methods, according to the Wilcoxon Rank Sum test (P-value < 0.05). Applying sub-optimal parameters (i.e. parameters optimised on a different dataset) gives mixed results: 11 sub-optimal parameter combinations yield a significant improvement in alignment accuracy over the corresponding defaults, while 3 yield significantly worse scores.
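The optimisation procedure amounts to an exhaustive grid search over (GOP, GEP) pairs, with significance assessed by a rank-sum test on the per-test-case score distributions. A minimal sketch, in which `score_fn` and the toy scores below stand in for actually running an aligner over a benchmark:

```python
import itertools
import numpy as np
from scipy.stats import ranksums

def grid_search(score_fn, gop_values, gep_values):
    """Return (mean_score, gop, gep) for the best parameter pair.
    score_fn(gop, gep) -> array of per-test-case Shift scores."""
    best = None
    for gop, gep in itertools.product(gop_values, gep_values):
        mean = float(np.mean(score_fn(gop, gep)))
        if best is None or mean > best[0]:
            best = (mean, gop, gep)
    return best

def significantly_better(default_scores, optimised_scores, alpha=0.05):
    """Wilcoxon rank-sum test on the two per-case score
    distributions; True only if the optimised scores are both
    significantly different and higher on average."""
    _, p = ranksums(optimised_scores, default_scores)
    return p < alpha and np.mean(optimised_scores) > np.mean(default_scores)
```

The grid resolution and ranges (0.0-15.0 for GOP, 0.0-5.0 for GEP, per Fig. 3) are the only tuning choices; everything else is determined by the benchmark scores themselves.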

Figure 3: Parameter optimisation for three methods across different datasets. Five different datasets were used to optimise parameters of three different methods. (PRESUB1 and PRESUB2 are two non-overlapping PREFAB subsets, each containing 200 randomly selected tests). Gap Opening Penalty (GOP, horizontal axis) is shown across a range of 0.0-15.0 on each map. Gap Extension penalty (GEP, vertical axis) is shown for 0.0-5.0. Default parameter combination is marked as "O". Optimal parameter combination is marked as "X". Each map uses the same colour scale, between 40-90% alignment accuracy, as defined by the Shift score.

Table 8: Gap Opening and Extension Parameter optimisation.
(a)
Method     Default        Optimal parameters found for each of the five datasets
ClustalW   10.0, 0.20     3.50, 0.60    3.94, 0.83    7.00, 0.40    6.54, 0.32    5.00, 0.00
LINSi      1.53, 0.123    1.08, 0.12    1.04, 0.35    1.71, 0.05    1.90, 0.27    1.78, 0.49
MUSCLE     3.0, 0.275     2.80, 0.19    3.40, 0.18    4.52, 0.19    2.92, 0.24    3.00, 0.28
(a) Two parameters (GOP, GEP) were optimised for ClustalW, LINSi and MUSCLE on each analysed dataset and are shown alongside default values for each method. (b) Each method is then applied to each dataset in turn (listed in column 1), using these optimised parameters. Shift scores returned using default parameters are in column 2. Shift scores returned for a dataset using parameters optimised on that dataset are in shaded cells. Significant differences in performance are marked with respect to the default¹ and optimised² scores. Significance is calculated using the Wilcoxon Rank Sum test (p < 0.05).


There is no single alignment program currently available that consistently outperforms the rest; ranking is clearly dependent on the test dataset. When highly divergent sequences are used, all methods return alignments of low accuracy, but it is on these test sets that the largest differences between the methods are observed. As the sequences converge, so do the scores, and the differences between methods become marginal. In global alignment cases (the majority of the test databases), global aligners such as ProbCons, PCMA and MUSCLE generally outperform local methods, even when sequence identity is low. However, when trying to detect isolated motifs, or when using full-length sequences, programs like Dialign, Align-m and POA all demonstrate the advantage of a local approach. PCMA switches between the ClustalW and T-COFFEE alignment strategies depending on the sequence data, and this can clearly be highly advantageous. If one were to choose a single method, however, ProbCons is arguably the most accurate program currently available, achieving the highest score on 18 of the 43 different subsets (the MAFFT programs combined manage this on 9 subsets using the Column score, but only 5 when the Shift score is used).

An obvious goal of benchmarking is to validate the quality of sequence alignment programs, and it has recently become common practice to use increasingly extensive sets of test data for this purpose. This demonstrates performance across a wide spectrum of test cases, but also safeguards against the suspicion that a method has been over-trained on a single dataset. Optimising on two separate subsets of test cases from the PREFAB benchmark naturally yields similar (though not identical) parameter combinations, but these are quite different from those found for BAliBASE, simply because the test sets contained within each dataset are also quite different. As the datasets get smaller, the optimal parameters become more specialised, and this difference increases. There is therefore an obvious advantage to using several benchmarks for program assessment. However, this can easily become time consuming and confusing: different databases use different arrangements of test alignments and file formats, and are often distributed with proprietary scoring programs that assess alignment quality in different ways. In addition, most current benchmarks are based, either entirely or in part, on alignments of proteins with known structures. While it is clearly desirable to keep the databases as biologically "real" as possible, this results in a high level of redundancy between benchmarks, due to the limited number of available structures. From the results presented here, it is clear that the test data, rather than the algorithm, are the driving force when clustering methods and datasets; method rankings change very little across the different datasets. In other words, it is as much the nature of the dataset as the program that determines alignment quality. So which benchmark set should we choose?

Categorising the HOMSTRAD, MASTER, FULL and PREFAB test cases by percent sequence identity yielded almost identical patterns, as would be expected. Each of these datasets shows that the ability to accurately align sequences is largely dependent on the diversity of those sequences. SABmark boasts full coverage of the known fold space, but as there are multiple pairwise references for each group (and the number of references changes between groups), assessment is much more time consuming, and may become complicated depending on how the results are treated. All alignments within each of these datasets are based on structural superposition and are undoubtedly of high quality, but they do not explicitly address the different problem areas encountered in multiple sequence alignment. As IRMBASE is not based on real biological data, it is more difficult to assess program accuracy using it alone. IRMBASE is clearly designed to test local aligners, but it raises issues specific to local alignment programs that do not arise with purely global alignments. The inclusion of full-length sequence data, as in the FULL set and the simulated IRMBASE, is quite desirable, as the reduction of benchmark datasets to homologous domains may be unrealistically simplistic: test cases automatically become easier to align, especially for global algorithms. In addition, domain boundaries are often unknown, and scores achieved on such a benchmark might in fact prove a poor reflection of actual performance. The importance of core region annotation becomes increasingly clear here. HOMSTRAD is often used as a benchmark although it lacks this annotation. Through our analysis of the structural superpositions, we found that of the 233 test cases we selected from HOMSTRAD, on average only 68% of each alignment lies within the core region.
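Once each alignment column has been annotated from the structural superposition, the core-coverage figure is a simple ratio. A sketch, where the per-column RMSD values and the 3 Å cutoff are hypothetical stand-ins for the actual annotation criterion used here:

```python
def core_fraction(column_rmsd, cutoff=3.0):
    """Fraction of alignment columns annotated as core: a column is
    core if its superposed C-alpha RMSD is defined and falls under
    the cutoff. None marks columns with no structural equivalence
    (e.g. columns that are gapped in one of the structures)."""
    core = [r is not None and r <= cutoff for r in column_rmsd]
    return sum(core) / len(core)
```

For example, `core_fraction([1.2, 2.8, 4.5, None])` returns 0.5: two of four columns superpose within the cutoff.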

The BAliBASE architecture means that several distinct problem areas are explicitly addressed, more so than in any of the other datasets investigated. There has been some argument against the quality of BAliBASE as a benchmark, but the clustering of the different methods and subsets shows that, based on method performance, the BAliBASE test sets are comparable to similar subsets from other benchmarks. BAliBASE's usefulness as a benchmark has increased now that it includes full-length sequence data in addition to the original test cases. Furthermore, BAliBASE is inherently harder to over-train on because of its distinct problem sets. It therefore seems to be one of the best benchmarks available. As it stands, however, it might still be considered too small to be fully comprehensive. Of the benchmarks based on real biological data, the full BAliBASE dataset currently contains the smallest number of test cases, and of sequences with solved structures, of any of the analysed datasets: the latest release contains 217 corrected and verified alignments based on 605 structures from the PDB, whereas the full distributions of PREFAB and HOMSTRAD have 1682 and 1032 alignments, using 1731 and 2083 PDB structures, respectively (excluding different chains). Though it is not the goal of a benchmark to be as big as possible, a large range of representative examples from the known fold space is clearly desirable for evaluating relative performance. This is a major advantage of a dataset like BAliBASE and, given the rates at which new sequences and structures are being published, one that could readily be exploited further. The paper announcing the latest release of BAliBASE describes a semi-automatic update protocol that will allow more frequent releases of the dataset, which should make this quite feasible.

Parameter optimisation searches show that, for each dataset, the optimal and default parameters for LINSi and MUSCLE co-localise to a much greater extent than those for ClustalW, and that these parameters hardly change between datasets. We conclude that this indicates good initial choices of parameters, not over-optimisation. The distance of the optimal parameters from the defaults for ClustalW, combined with its relatively robust performance across different parameter combinations, suggests that other internal parameters stabilise it against such changes.

The ultimate answer to the benchmarking problem might be found in a blind test set that is unavailable to developers until they submit their program to be tested. This proposal is similar to CASP for the protein modelling community and in fact could make use of some of the CASP data. Unfortunately such a test environment would be very complicated to set up and run.


This work was funded by a PI grant from Science Foundation Ireland. We are greatly indebted to the SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support. We are also grateful to Bob Edgar and especially Kevin Karplus for discussion and help.