| In Silico Biology 2, 0046 (2002); ©2002, Bioinformation Systems e.V. |
Department of Computer Science,
University of Wales Aberystwyth, Penglais,
Aberystwyth
SY23 3DB
Fax: +44-1970-622455
*Corresponding author:
Email: ajc99@aber.ac.uk
Edited by H. Michael; received June 26, 2002; revised and accepted August 29, 2002; published September 20, 2002
We wished to quantify the state-of-the-art of our understanding of clusters in microarray data. To do this we systematically compared the clusters produced on sets of microarray data using a representative set of clustering algorithms (hierarchical, k-means, and a modified version of QT_CLUST) with the annotation schemes MIPS, GeneOntology and GenProtEC. We assumed that if a cluster reflected known biology its members would share related ontological annotations. This assumption is the basis of "guilt-by-association" and is commonly used to assign the putative function of proteins. To statistically measure the relationship between cluster and annotation we developed a new predictive discriminatory measure.
We found that the clusters found in microarray data do not in general agree with functional annotation classes. Although many statistically significant relationships can be found, the majority of clusters are not related to known biology (as described in annotation ontologies). This implies that use of guilt-by-association is not supported by annotation ontologies. Depending on the estimate of the amount of noise in the data, our results suggest that bioinformatics has only codified a small proportion of the biological knowledge required to understand microarray data.
The annotated clusters can be found at http://www.aber.ac.uk/compsci/Research/bio/dss/gba/.
Key words: Ontology, statistics, machine-learning,
transcriptome, microarray data.
The most frequent method used to analyse microarray gene expression data is unsupervised learning in the form of clustering. This has been applied in many forms, including hierarchical clustering (top down and bottom up), Bayesian models, simple networks based on mutual information or jackknife correlation, simulated annealing, k-means clustering and self-organising maps [Alon et al., 1999; Eisen et al., 1998; Barash and Friedman, 2001; Butte and Kohane, 2000; Heyer et al., 1999; Lukashin and Fuchs, 2001; Tavazoie et al., 1999; Törönen et al., 1999]. How well do we understand the clusters produced by these methods? Most papers on expression data clustering report a selection of good clusters which the authors have selected by hand and which correspond to known biology. However microarray clustering experiments also generally produce clusters which seem to correspond less well with known biology. This type of cluster has received less attention in the literature. A common rationale behind using clustering algorithms is guilt-by-association [Walker et al., 1999]: if gene expression patterns are similar then the genes are likely to share similar biological function. Is this generally true?
Two main approaches have been employed in testing the reliability of microarray clusters: self-consistency, and consistency with known biology. In self-consistency the idea is to predict some left out information in a cluster. For example Yeung et al. [2001] used self-consistency to assess the results of their clustering by leaving out one experimental condition (or time point), and using this held out condition to test the predictive ability of the clusters. They were testing whether the cluster could predict the value of the missing experimental condition, or the value at the missing time point.
The idea behind the use of existing biological knowledge is that if a cluster is consistent with known biological knowledge then it reflects a real feature in the data. This is essentially a systematic version of the informal approach generally taken to evaluate clustering. This idea was used by Tavazoie et al. [1999] who clustered the S. cerevisiae cdc28 dataset [Cho et al., 1998] with k-means clustering (k = 30), and checked the validity of their clustering by mapping the MIPS functional classes onto the clusters to show which classes were found more often than by chance in each cluster. Several clusters were significantly enriched for one or more classes.
In this paper we combine the ideas of self-consistency and using known
biological knowledge to test systematically the relationship between
microarray clusters and known biology. We examine, for a range of
clustering algorithms, the quantitative self-consistency of the
functional classifications in the clusters.
Microarray Data
To test our approach we used the classic microarray data from Spellman
et al. [1998], which included
4 different experiments measuring cell-cycle expression levels in the
S. cerevisiae genome:
-factor based synchronisation,
cdc15-based synchronisation, elutrition-based synchronisation and the
cdc28-based data from Cho et al. [1998]. To show that these results are not
specific to this dataset of S. cerevisiae we also used the data
from Khodursky et al. [2000] for E. coli. This data
measured expression levels in response to changes in tryptophan
metabolism. Then to show that the general trends are also true of more
recent yeast data sets we used the data set from Gasch et al.
[2000] which measured the expression
levels of yeast cells when subject to a wide variety of environmental
changes.
Classification Schemes
For S. cerevisiae we selected the most commonly used functional classification schemes: the Munich Information Center for Protein Sequences (MIPS) scheme (http://mips.gsf.de/proj/yeast/catalogues/funcat), and the SGD annotations of the GeneOntology (GO) consortium scheme (http://www.geneontology.org). The GO scheme is particularly interesting as it is part of a cross-species functional classification project. For E. coli we used GenProtEC's MultiFun classification scheme (http://genprotec.mbl.edu/start), arguably the most reliable of all functional classification schemes. These classification schemes are hierarchical (MIPS and GenProtEC are trees, GeneOntology a directed acyclic graph) and provide several levels of granularity in the classification, from very general classes at the top of the hierarchy to more specific classes lower down.
Clustering Methods
We chose three clustering methods to compare: agglomerative hierarchical clustering [Eisen et al., 1998], k-means clustering [Duda et al., 2000], and a modified version of the method of Heyer et al. [1999] "QT_CLUST". These were chosen not to show that any one was better than any other (this would be difficult to prove, given the variety of parameters that can be tuned in each algorithm), but rather, to give a representative sample of commonly used clustering algorithms for microarray data - to show that what we observe is approximately true for any reasonable clustering.
Hierarchical and k-means clustering are the two methods currently available on the Expression Profiler (http://ep.ebi.ac.uk/) web server. The Expression Profiler is an up to date set of tools available over the web for clustering, analysis, and visualisation of microarray data.
Hierarchical clustering is an agglomerative method, joining the closest two clusters each time, then recalculating the inter-cluster distances, and joining the next closest two clusters together. We chose the average linkage of Pearson correlation as the measure of distance between clusters, and a cut-off value of 0.3.
k-means clustering is a standard clustering technique, well known in the fields of statistics and machine learning. Correlation was used as distance measure. We used k = 100.
The QT_CLUST algorithm is described in Heyer et al. [1999]. We implemented a modified version of this algorithm. The data was normalised so that the data for each ORF had mean 0 and variance 1. We did not implement Heyer et al.'s method of removal of outliers by computing jackknife correlations. Pearson correlation was used as a similarity measure, and each ORF was used to seed a cluster. Further ORFs were added to the cluster if their similarity with all ORFs in the cluster was greater than a fixed threshold (the cluster diameter). This was taken to be our final clustering. We henceforth refer to this method as QT_CLUST_MOD. We did not, as Heyer et al. do, set aside the largest cluster and recluster, since we did not demand a unique solution, and in fact we wanted a clustering in which each ORF could be in more than one cluster. Allowing each ORF to be in more than one cluster is similar to the situation when using the functional hierarchies as ORFs can belong to more than one functional class. There can be as many clusters as there are ORFs - there is no fixed number of clusters which has to be decided beforehand or as part of the training process. We used a cluster diameter (minimum similarity) of 0.7.
All clustering parameters (hierarchical cut-off value, k means, cluster diameter) were chosen to be reasonable after experimentation with various values. Although parameters could possibly have been further refined, this was not the aim of this work.
Predictive Power
We required a quantitative measure of how coherent the ORF clusters formed from microarray expression data are with the known functions of the ORFs. We considered using cross-entropy type measures, but decided in favour of using a predictive discriminatory measure as a more direct measure of coherence.
We form this measure as follows: in the k-means and hierarchical clusterings, each ORF appears in only one cluster, so we can take each ORF in turn and test whether or not the cluster without this ORF can predict its class; for the QT_CLUST_MOD scheme, the clusters are seeded by each ORF in turn, and we test for each seed in turn whether or not the rest of the cluster predicts the class of the seed. That is, we test whether or not the majority class of the cluster is one of the classes of the heldout ORF. The majority class is the class most frequently found among the ORFs in the cluster. We call this measure the `Predictive Power' of the clustering. The measure is related to our previous work in predicting ORF function from sequence [King et al., 2001]. Tests of predictive power were only carried out on clusters which had more than 5 ORFs, and only on ORFs which were not classified as unknown.
For example, if a cluster contained ORFs belonging to the following classes:
ORF Class orf1 A orf2 B orf3 A orf4 A orf5 A orf6 A orf7 A orf8 A orf9 B orf10 A
then the majority class of the cluster without orf1 would be "A", which is a correct prediction of the actual class of orf1. However orf2 and orf9 would be incorrectly predicted by this cluster. Class "A" is correctly predicted in 8/10 cases. (Class "B" is never predicted by this cluster, since it is never the majority class).
To test the statistical significance of this measure we used a Monte Carlo type approach. ORFs were chosen at random, without replacement, and random clusters formed using the same cluster size distribution as was observed in the real clusterings. The resulting random clusters were then analysed in the same manner as the real clusters. A thousand random clusterings were made each time, and both the mean results reported, and how many times the random cluster accuracy for a functional class was equal or exceeded that of the actual clusters.
Preprocessing of data
The majority of the results here are presented with no preprocessing
of the data other than to interpolate missing values by using the
average of the two adjacent values. This is to present a consistent
approach to all datasets and algorithms we tested. Experiments were
carried out to ascertain whether simple preprocessing would make a
significant difference. It is common in microarray data processing to
normalise the data to have a mean of zero and a standard deviation of
one. It is also common to remove ORFs whose expression does not vary
significantly over the timepoints, since these ORFs might cluster
together without having a common reason. We chose to remove all ORFs
that had a standard deviation in the first quartile of the standard
deviations of the data set. This is all ORFs with a standard deviation
in the bottom 25% of the dataset. This removed 1495 ORFs. Table 9 shows a comparison of the different
preprocessing methods on result accuracy.
The quality of the clusters produced by the different programs was, we believe, consistent with those shown in previous results. Initial inspection of the clusters showed some obviously good groupings, and other clusters were produced which generally seemed to have something in common, but the signal was less strong. However, most clusters on inspection did not appear to share anything in common at all.
An example of a strong cluster in shown in Table
1. The probability of this cluster occurring by chance
is estimated to be less than 10-10 (calculated by hypergeometric
distribution). However, note the mannose related sub-cluster in this
cluster. The most likely explanation for the sub-cluster is: as the
yeast data was formed to study cell division, histones and mannose are
both required in the same time specific pattern during division. They
therefore have either share the same transcription control mechanism,
or have ones similarly controlled. This hypothetical common
transcriptional control in cell division is not reflected in the
current annotation.
Table 1: A yeast histone cluster (cdc15 data, QT_CLUST_MOD clustering algorithm, MIPS annotations, cluster id: 203)
| ORF | Description | MIPS classes |
| ybr008c | fluconazole resistance protein | 11/7/0/0 7/28/0/0 |
| ypl127c | histone H1 protein | 30/10/0/0 30/13/0/0 |
| ynl031c | histone H3 | 30/10/0/0 30/13/0/0 4/5/1/4 |
| ynl030w | histone H4 | 30/10/0/0 30/13/0/0 4/5/1/4 |
| ylr455w | weak similarity to human G/T mismatch binding protein | 99/0/0/0 |
| ygl065c | mannosyltransferase | 1/5/1/0 6/7/0/0 |
| yer003c | mannose-6-phosphate isomerase | 1/5/1/0 30/3/0/0 |
| ydr225w | histone H2A | 30/10/0/0 30/13/0/0 4/5/1/4 |
| ydr224c | histone H2B | 30/10/0/0 30/13/0/0 4/5/1/4 |
| ydl055c | mannose-1-phosphate guanyltransferase | 1/5/1/0 9/1/0/0 |
| ybr010w | histone H3 | 30/10/0/0 30/13/0/0 4/5/1/4 |
| ybr009c | histone H4 | 30/10/0/0 30/13/0/0 4/5/1/4 |
| ybl003c | histone H2A.2 | 30/10/0/0 30/13/0/0 4/5/1/4 |
| ybl002w | histone H2B.2 | 30/10/0/0 30/13/0/0 4/5/1/4 |
Clusters which appeared to be unrelated were common. There were also
clusters which did seem contain related ORFs, but less obviously than
the histone cluster mentioned above. Table
2 shows an example of a cluster which does show a DNA
processing theme, but this theme is not reflected in the variety of
classifications of the ORFs.
Table 2: A yeast DNA processing cluster. Note the variety of MIPS classes represented here. (cdc28 data, QT_CLUST_MOD clustering algorithm, MIPS annotations, cluster id: 599)
| ORF | Description | MIPS classes |
| ycl064c | L-serine/L-threonine deaminase | 1/1/10/0 |
| yol090w | DNA mismatch repair protein | 3/19/0/0 30/10/0/0 |
| yol017w | similarity to YFR013w | 99/0/0/0 |
| ynl273w | topoisomerase I interacting factor 1 | 99/0/0/0 |
| ynl262w | DNA-directed DNA polymerase epsilon, catalytic subunit A | 11/4/0/0 3/16/0/0 3/22/1/0 30/10/0/0 |
| ynl082w | DNA mismatch repair protein | 3/19/0/0 30/10/0/0 |
| ynl072w | RNase H(35), a 35 kDa ribonuclease H | 1/3/16/0 |
| ylr049c | hypothetical protein | 99/0/0/0 |
| yjl074c | required for structural maintenance of chromosomes | 3/22/0/0 9/13/0/0 |
| yhr153c | sporulation protein | 3/10/0/0 3/13/0/0 |
| yhr110w | p24 protein involved in membrane trafficking | 6/4/0/0 8/99/0/0 |
| ygr041w | budding protein | 3/4/0/0 |
| ydl227c | homothallic switching endonuclease | 3/7/0/0 30/10/0/0 |
| ydl164c | DNA ligase | 11/4/0/0 3/16/0/0 3/19/0/0 30/10/0/0 |
| ydl156w | weak similarity to Pas7p | 99/0/0/0 |
| ybr071w | hypothetical protein | 99/0/0/0 |
| yar007c | DNA replication factor A, 69 KD subunit | 3/16/0/0 3/19/0/0 3/7/0/0 30/10/0/0 |
To quantitatively test the relationship between clusters and
annotations we evaluated all the clusters formed using our measure of
predictive power (see Table 3). For
S. cerevisiae we calculated the predictive power of each
clustering method (k-means, hierarchical, and QT_CLUST_MOD) for each
functional class in levels 1 and 2 of MIPS and GO. For E. coli
we calculated this for each functional class in level 1 of the
GenProtEC hierarchy. This produced a large number of annotated
clusterings, and the complete set of these can be found at http://www.aber.ac.uk/compsci/Research/bio/dss/gba/.
The same broad conclusions regarding the relationship between clusters
and annotation were true for both species and using all clustering
methods; we have therefore chosen to present the S. cerevisiae
tables only for hierarchical clustering and for only the first levels
of the GO and MIPS annotation hierarchies, and one table for
E. coli. The predictive power of the different clustering methods
can be seen in Tables 4, 5, 6 and 7. The results are broken down according to the
majority classes of the clusters. Absence of data for a class
indicates that this class was not the majority class of any cluster.
Table 3: A summary of the average predictive power for each type of clustering for each experiment. Figures are percentage correct predictions. "random" shows mean over 1000 random clusterings. Figures in brackets show how many times out of 1000 the random clustering produced equal or greater than this percentage.
| random | | cdc15 | cdc28 | elu | E. coli | |
| MIPS - hier | 56.257 | 61.062 (0) | 62.821 (0) | 61.341 (0) | 60.270 (0) | - |
| MIPS - k | 59.304 | 58.795 (989) | 59.053 (908) | 59.677 (4) | 59.583 (17) | - |
| MIPS - QT | 58.256 | 59.714 (16) | 61.697 (0) | 62.136 (0) | 59.631 (21) | - |
| GO - hier | 52.265 | 62.526 (0) | 60.799 (0) | 61.087 (0) | 59.344 (0) | - |
| GO - k | 59.301 | 61.649 (0) | 59.990 (1) | 62.067 (0) | 60.638 (0) | - |
| GO - QT | 55.397 | 60.799 (0) | 60.236 (0) | 61.288 (0) | 60.146 (0) | - |
| E. coli - hier | 52.906 | - | - | - | - | 57.785 (0) |
| E. coli - k | 53.104 | - | - | - | - | 56.103 (0) |
| E. coli - QT | 52.751 | - | - | - | - | 55.291 (0) |
Table 4: Yeast hierarchical clustering (cut-off =0 .3) class by class breakdown at level 1 GO.
First column shows average over 1000
random clusterings.
, cdc15, cdc28 and elu are the 4 cell-cycle
synchronisation methods. Figures show percentage correct
predictions. The figure in brackets is how many times out of 1000 the
random clustering produced equal or greater than this percentage. If
less than 5, this value is highlighted.
| class | random | | cdc15 | cdc28 | elu |
| enzyme | 59.487 | 64.578 (0) | 61.753 (64) | 61.771 (61) | 60.511 (238) |
| nucleic acid binding | 16.355 | 26.531 (13) | 15.584 (574) | 25.439 (22) | 22.430 (78) |
| structural protein | 12.767 | 50.725 (0) | 47.423 (0) | 57.792 (0) | 20.290 (52) |
| transporter | 8.585 | 13.725 (154) | 32.432 (0) | 10.417 (330) | 5.357 (735) |
| ligand binding or carrier | 6.825 | 4.167 (669) | 30.769 (0) | 3.571 (706) | 21.429 (4) |
| chaperone | 3.170 | 13.636 (55) | - | 5.263 (278) | - |
| signal transducer | 2.938 | 14.815 (34) | 15.000 (34) | 3.703 (303) | 11.765 (69) |
| motor | 1.069 | - | - | - | 11.111 (41) |
Table 5: Yeast hierarchical clustering (cut-off 0.3) class by class
breakdown at level 1 MIPS.
First column shows average over 1000
random clusterings.
, cdc15, cdc28 and elu are the 4 cell-cycle
synchronisation methods. Figures show percentage correct
predictions. The figure in brackets is how many timesout of 1000 the
random clustering produced equal or greater than this percentage. If
less than 5, this value is highlighted.
| class | random | | cdc15 | cdc28 | elu |
| cellular organization | 59.344 | 61.988 (4) | 63.258 (0) | 64.442 (0) | 61.146 (42) |
| metabolism | 28.827 | 39.721 (1) | 40.136 (0) | 33.951 (82) | 37.061 (6) |
| cell growth, cell division and DNA synthesis | 22.429 | 43.421 (0) | 46.961 (0) | 35.816 (1) | 22.963 (478) |
| transcription | 20.879 | 23.333 (304) | 30.496 (26) | 33.333 (4) | 18.750 (681) |
| protein destination | 14.988 | 17.021 (350) | 18.367 (266) | 14.865 (495) | 11.842 (710) |
| cellular transport and transport mechanisms | 12.500 | 12.500 (491) | - | 2.273 (957) | 21.951 (70) |
| cell rescue, defense, cell death and ageing | 9.495 | 18.750 (60) | 18.605 (60) | 21.739 (35) | 14.706 (163) |
| transport facilitation | 8.024 | 14.815 (135) | 28.889 (2) | 2.703 (792) | - |
| protein synthesis | 8.760 | 68.000 (0) | 15.385 (175) | 56.716 (0) | - |
| energy | 5.891 | 13.636 (140) | - | 7.895 (349) | - |
| cellular biogenesis | 5.241 | 11.765 (160) | - | 6.250 (367) | - |
| ionic homeostasis | 2.805 | - | - | - | 14.286 (73) |
Table 6: E. coli class by class breakdown at level 1.
QT
= QT_CLUST_MOD (0.7), k = k-means clustering (100), hier =
hierarchical clustering (0.3). Figures show percentage correct
predictions. The figure in brackets is how many times out of 1000 the
random clustering produced equal or greater than this percentage. If
less than 5, this value is highlighted.
| class | hier | k | QT |
| metabolism | 56.579 (0) | 55.581 (0) | 58.115 (0) |
| location of gene products | 57.107 (0) | 55.037 (4) | 56.513 (0) |
| cell structure | 59.794 (0) | 34.884 (294) | 30.755 (509) |
| information transfer | 35.417 (116) | 26.315 (341) | 16.667 (699) |
| regulation | 36.364 (68) | - | 7.692 (623) |
| transport | 38.095 (69) | - | 13.333 (748) |
| cell processes | 57.692 (11) | 58.140 (14) | 63.636 (8) |
| extrachromosomal | 64.286 (0) | 39.189 (61) | 42.105 (3) |
Table 7: Gasch data set, MIPS classification as of
21/12/00.
Class by class breakdown at level 1. Hierarchical
clustering with cut-off 0.3. Figures show percentage correct
predictions.
| class | hier |
| energy | 88.462 |
| cellular organization | 65.163 |
| protein destination | 62.500 |
| metabolism | 56.738 |
| cell growth, cell division and DNA synthesis | 48.649 |
| transcription | 46.970 |
| transport facilitation | 42.105 |
| cellular transport and transport mechanisms | 37.500 |
| cellular biogenesis | 20.000 |
| cell rescue, defense, cell death and ageing | 16.667 |
All the clustering methods produced statistically significant clusters with all three functional annotation schemes. This confirms that clusters produced from microarray data reflect some known biology. However, the predictive power of even the best clusters of the clearest functional classes is low (mostly <50%). This means that if you predict function based on guilt-by-association your predictions will contradict existing annotations a large percentage of the time.
One of the clearest messages from the data is the large difference in predictive power across the different: microarray experiments, clustering methods, and annotation schemes. There is no clear best approach and quite different significance results are obtained using different combinations.
Perhaps the most interesting differences are those between different
microarray experiments:
, cdc15, cdc28, and elu. It is to be
expected that different microarray experiments will highlight
different features of cell organisation. However it is unclear how
biologically significant these differences are. Using the GO
annotation scheme:
is best for predicting classes: enzyme; nucleic acid;
chaperone; motor; and cell-adhesion.Using the MIPS annotation scheme:
is best for predicting class: protein destination.A particularly dramatic difference is that for the GO class ligand
binding or carrier using hierarchical clustering, where the cdc15 and
elu experiments produced highly significant clusters whereas the
and cdc28 experiments produced clusters with negative correlation.
The classes highlighted also differed significantly between clustering methods. Considering first the GO annotation scheme: the clustering method k-means is best for predicting enzyme class; QT_CLUST_MOD is best for the classes nucleic acid binding, chaperone, and cell-adhesion; and hierarchical clustering is best for structural protein, transporter, and ligand binding or carrier chaperone. Considering the MIPS annotation scheme: the clustering method k-means is best for predicting classes cell rescue, defence, cell death and ageing, and protein synthesis; QT_CLUST_MOD is best for the class transcription; and hierarchical clustering is best for cell organisation, metabolism, cell growth cell division and DNA synthesis.
The data also revealed some unexpected apparent negative correlations between clusters and classes. For example using the MIPS annotation scheme and hierarchical clustering the cdc28 clusters for class cellular transport and transport mechanism the random clustering produced a higher predictive power >95% of the time. Transport proteins seem particularly poorly predicted both in both S. cerevisiae and E. coli. A possible explanation for this is that their transcription control is synchronised with the specific pathways they are involved with rather than as a group.
Do the two annotation schemes of MIPS and GO agree on cluster consistency? Sometimes, with strong clusters, such as within the ribosomal clusters (see Figure 1). But on the whole, the correlation between the scores given by the two annotation schemes is approximately 0. This is partly due to the fact that GO had many fewer annotations than MIPS, so there are several clusters which show a trend under MIPS annotation that cannot be seen under GO, because too many ORFs have no annotation.
|
Figure 1: Ribosomal cluster as agreed by MIPS and GO. Semicolons
separate the levels of GO classes. They agree on all except
ykr094c. (This example is cluster ID: 390,
data, hierarchical
clustering). |
Are the annotation schemes improving over time with respect to these
clusters? Tables 7 and 8
show the accuracies of the clusters found by hierarchical clustering
under MIPS annotations. Table 7 uses the MIPS
annotations from 21st December 2000, whereas Table 8 uses the MIPS annotations from 24th April 2002,
16 months later. The accuracies are almost identical.
Table 8: Gasch data set, MIPS classification as of
24/4/02.
Class by class breakdown at level 1. Hierarchical
clustering with cut-off 0.3. Figures show percentage correct
predictions.
| class | hier |
| energy | 79.310 |
| protein fate (folding, modification, destination) | 66.667 |
| subcellular localisation | 64.158 |
| metabolism | 50.794 |
| cell cycle and DNA processing | 48.148 |
| transcription | 47.692 |
| transport facilitation | 42.105 |
| cellular transport and transport mechanisms | 35.294 |
| control of cellular organization | 25.000 |
| cell rescue, defense and virulence | 16.667 |
Does simple preprocessing help? Table 9 shows a
comparison of accuracy when normalisation or removal of ORFs with low
standard deviation was used. There is no consistent trend of
improvement or degradation and very little difference between the
results.
Table 9: The effects of preprocessing on accuracy.
Preprocessing of the data is compared. "plain" is a baseline, no
preprocessing. "norm" is normalisation where mean and standard
deviation are normalised to 0 and 1 respectively for each ORF. "rem
low" is removal of ORFs with low standard deviation (in the bottom 25%
of the data set). "rem low and norm" is removal of ORFs with low
standard deviation followed by normalisation of the remaining ORFs.
The dataset was
data, MIPS classification as of 24/4/02.
Clustering was hierarchical clustering, average linkage of
correlation, cut-off = 0.3.
| class | plain | norm | rem low | norm and rem low |
| energy | 21.739 | 15.789 | 18.750 | 18.750 |
| protein fate (folding, modification, destination) | 16.102 | 16.667 | 18.519 | 9.756 |
| subcellular localisation | 61.079 | 61.460 | 62.097 | 62.267 |
| metabolism | 39.432 | 38.387 | 37.500 | 29.208 |
| cell cycle and DNA processing | 37.013 | 36.111 | 41.026 | 43.564 |
| transcription | 20.800 | 29.134 | 16.346 | 18.447 |
| transport facilitation | 11.765 | 19.355 | - | - |
| cellular transport and transport mechanisms | 10.204 | 10.870 | - | - |
| control of cellular organization | 8.696 | 10.000 | 15.385 | 20.000 |
| cell rescue, defense and virulence | 21.622 | 21.622 | 33.333 | 22.222 |
| cell fate | 16.129 | 17.544 | 25.000 | 14.634 |
| protein synthesis | 61.765 | 67.741 | 43.478 | 52.174 |
| regulation of/interaction with cellular environment | - | 15.385 | 14.286 | - |
It has been commented that perhaps other distance measures or a
different choice of linkage could be more appropriate for the use of
hierarchical clustering on expression data. We show that use of
Euclidean distance and complete linkage does little to change the
accuracy and in fact seems worse than correlation and average
linkage for the
dataset. This can be seen in Table 10 by comparison to Table 9.
Table 10: Use of complete linkage and Euclidean distance for
hierarchical clustering.
Compare these values with the "plain
column" in Table 9. Dataset was
data, MIPS
classification as of 24/4/02. Clustering was hierarchical clustering,
complete linkage of Euclidean distance, cut-off = 1.0. (0.3 gave
clusters containing a maximum of 2 ORFs only, so was too tight).
| class | complete linkage, euclidean distance |
| energy | 18.182 |
| protein fate (folding, modification, destination) | 24.107 |
| subcellular localisation | 60.623 |
| metabolism | 34.819 |
| cell cycle and DNA processing | 19.388 |
| transcription | 35.545 |
| transport facilitation | - |
| cellular transport and transport mechanisms | 10.169 |
| control of cellular organization | 7.143 |
| cell rescue, defense and virulence | 5.405 |
| cell fate | 7.407 |
| protein synthesis | 33.766 |
| regulation of/interaction with cellular environment | 6.250 |
Given a clustering produced from microarray data and a protein functional classification scheme there are four possibilities:
This paper quantifies how often the first case occurs and illustrates the limitations of existing annotations in explaining microarray data. The consequences of these limitations are discussed in [Kell and King, 2000]. These clusters where annotation and microarray data agree are the first choice of clusters to examine to gain knowledge of control of transcription. In favour of the second explanation is the fact the functional classification schemes are still "under construction" and do not reflect all that is known about biology. There are also many possible improvements in clustering algorithms which could improve the consistency. It is to be expected that the third case will predominate. Microarrays are a fascinating technology and an industry has been based on them. It is almost inconceivable that microarrays will not reveal large amounts of new and fascinating biological knowledge.
A major challenge in microarray analysis is therefore to discriminate between the old and new biology in the data and the noise. To achieve this we require
The consequences of our observations are that there is still much to
be done to extract information from the results of microarray
experiments. Clusters produced from the data reflect some known
biology, however, the majority of clusters have no obvious common
annotations from the current ORF annotation schemes. We expect that
microarray data presents new biological knowledge and that in time the
annotation schemes will represent this. We also conclude that
unsupervised clustering is limited, and we recommend that deeper data
analysis methods need to be used in future.
We would like to thank the people at EBI behind the Expression
Profiler system (http://ep.ebi.ac.uk/) for making their
clustering software easy to use and publicly available. Amanda Clare
was supported by MRC grant G78/6609.