In Silico Biology 8, 0029 (2008); ©2008, Bioinformation Systems e.V.
1 Molecular Immunology/Bioinformatics Group, Microarray Facility, Division of BioSciences, Brunel University, Uxbridge, UB8 3PH, UK
2 Intelligent Data Analysis Group, Department of Information Systems and Computing, Brunel University, Uxbridge, UB8 3PH, UK
3 Medical Oncology Unit, Institute of Cancer, Barts and London, Queen Mary School of Medicine, London, UK
4 Immunology Group, Institute of Cell and Molecular Sciences, Barts and London, Queen Mary School of Medicine, London, UK
* Corresponding author
Email: Su-ling.li@brunel.ac.uk
Edited by E. Wingender; received December 30, 2007; revised May 15, 2008; accepted July 13, 2008; published July 22, 2008
Microarray gene expression datasets are continually being deposited in public repositories. As a result, one of the most important emerging challenges is to enable researchers to take full advantage of such previously accumulated data to discover or validate common genes in similar biological systems. In light of this, we have designed the MaXlab software not only to cross-compare available array data from different laboratories but also to extract further knowledge from gene expression patterns embedded within published data. MaXlab offers a flexible and automated solution applicable to microarray technologies including cDNA arrays and Affymetrix gene chips, generating expression profiles for common genes with biological significance. We have identified several sets of genes not previously known to be commonly expressed across studies investigating related biological questions. Among them are 17 genes involved in the dysregulation of immune tolerance, including the crucial transcription factor Egr2. In addition, we have identified 175 genes commonly expressed in basal and luminal breast tumours in response to the chemotherapeutic drug doxorubicin. The common expression and characterisation of these genes identified through MaXlab suggest that they may play a common role in the mechanism of disease, and provide an incentive for further investigation to identify potential therapeutic targets. Overall, MaXlab is an attractive application for molecular biologists: it extracts the intersection between microarray datasets together with the gene expression profiles, from which biologists are able to infer further biological insights.
The software together with file formats and additional material is freely available at http://www.immuno-software.org.
Keywords: microarray, cross-comparison, meta-analysis, multi-platform, data-fusion
Since the full potential of microarrays has been recognised, the statement "microarray technology allows one to study the mRNA expression of all the genes within a genome simultaneously" [Brown and Botstein, 1999; Lockhart and Winzeler, 2000; Schena et al., 2005] is the introductory sentence of almost every article focussing on gene expression microarrays. Combined with the use of specialised microarray data analysis tools for functional annotation [Khatri et al., 2002; Doniger et al., 2003; Zeeberg et al., 2003; Khalid et al., 2006a, 2006b] such studies expose potential cutting edge relationships between genes and disease phenotypes, which could be of paramount importance for medical advancement. Since the development and extensive use of this powerful technology a large number of gene expression studies have been performed and their results deposited in public repositories such as the Gene Expression Omnibus (GEO) and ArrayExpress. This has led to one of the most challenging tasks involving the development of a methodology to compare, integrate and extract information from multiple datasets in related biological systems. Such combinatorial studies address the hypothesis that selected sets of differential expression signatures share a significant intersection of genes, thus inferring a biological relatedness with respect to the molecular dysregulation underlying the disease.
Whilst recent meta-analysis studies have attempted to correlate Affymetrix and cDNA gene expression datasets using statistical techniques [Rhodes et al., 2002, 2004; Choi et al., 2003, 2004; Ghosh et al., 2003; Lee et al., 2004; Wang et al., 2004; Jiang et al., 2004] or labour-intensive manual literature mining methods [Wahl et al., 1997; Crow and Wohlgemuth, 2003; Crow et al., 2003; Qing and Putterman, 2004; Oertelt et al., 2005], there remains a computational limitation in terms of automating this process. Manual comparisons of gene lists or arrays from multiple experiments to identify a common gene signature are impractical and inefficient. In addition, it is not feasible for one research laboratory to perform microarray experiments of every nature relating to the biological questions that are of interest to it. To this end, we have developed the first multi-functional software called MaXlab, which provides a user-friendly automated solution for the molecular biologist to overcome the painstaking task of comparing array studies. MaXlab (microarray data comparison across laboratories) employs the meta-analytic principle but, more importantly, offers effective exploration through the combined comparison of global gene expression datasets and relative gene expression analysis for autonomous microarray studies. More specifically, MaXlab can: compare several biologically significant gene lists (pre-defined by the author); find the intersection between entire arrays based on a single user-defined expression threshold across all experiments or unique thresholds for each experiment; and compare time series array experiments, presenting genes expressed above or below user-defined thresholds across multiple time-points. The resulting commonalities in the expression of genes across related studies can increase the confidence that genes identified as having a significant role within a disease or in response to a treatment have not been identified by chance alone.
This in turn provides more reliable biological insight into the genes and pathways that may be shared in the underlying molecular dysregulation and ultimately common drug targets among related disease states.
The MaXlab software has been designed and implemented using Visual Basic.Net, MySQL and ActivePerl. Processed data from MaXlab are presented in a multiple-panel graphical user interface (GUI) that displays the results obtained from each function together with a graphical output. The GUI is user-friendly, interacting with users via menus, mouse clicks and user-input dialogs, and can be used not only by researchers actively involved in microarray research but also by those working in biological research in general.
System architecture
MaXlab offers two routes for meta-profiling - one for MySQL users and one that is more user-friendly for biologists with little programming knowledge. There are four main functions offered by MaXlab. The first is to compare biologically meaningful gene lists without any threshold selection. The second is to compare multiple arrays using a single threshold value across all experiments. The third function compares multiple arrays using multiple threshold values - one for each individual experiment, and the final function compares time series experiments in their entirety or based on a chosen threshold. Data processed using MaXlab is presented on the GUI in a format that provides the user with a flexible and intuitive view of their manipulated data that is easy to interpret (Fig. 1).
Data collection, processing, and storage
Microarray datasets used for this study (Tab. 1) were downloaded from public web sites (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/, or ArrayExpress, http://www.ebi.ac.uk/arrayexpress/) or provided by the authors upon request. The data consisted of two general types, two-channel ratio data (for cDNA arrays) and single-channel intensity data (for Affymetrix arrays), and were usually provided in a single composite file format. All available gene identifiers were included in the analysis, ensuring that the gene identifiers were of the same type prior to computing the similarity. Significantly expressed gene datasets were obtained by searching the appropriate literature and extracting pre-processed gene expression sets considered important to the researcher's biological query.
| Table 1: | Information of the microarray datasets used in this study |
| Reference | Array Name | Genes in Gene Set | Array ID | Interesting Genes | Methodology |
| Baechler et al., 2003 | Affymetrix U95A Human GeneChip | 10260 | GPL91 (GEO) | 286 | Affymetrix |
| Der et al., 1998 | Hu6800 GeneChip aka HuGene FL Genome Array | 6325 | GPL80 (GEO) | 171 | Affymetrix |
| Greenberg et al., 2005 | Affymetrix U133A Human GeneChip | 16000 | GPL96 (GEO) | 26 | Affymetrix |
| Tezak et al., 2002 | Affymetrix HuFL GeneChip | 5600 | GPL80 (GEO) | 125 | Affymetrix |
| Anderson et al., 2006 | RFCGR HGMPMouse MmSGC Av1 | 9216 | A-MEXP-165 (ArrayExpress) | 2345 | cDNA |
| Safford et al., 2005 | Affymetrix mouse GeneChip, MU74A, MU74B, MU74C | 33773 | GPL81, GPL82, GPL83 (GEO) | 60 | Affymetrix |
| Anderson et al., 2006 | RFCGR HGMPMouse MmSGC Av1 | 9216 | A-MEXP-165 (ArrayExpress) | User defined >1.0 | cDNA |
| Safford et al., 2005 | Affymetrix mouse GeneChip, MU74A | 12422 | GPL81 (GEO) | User defined >1.0 | Affymetrix |
| Troester et al., 2004 | UNC compugen oligo array for toxgenomics study | 20163 | GPL550 (GEO) Basal cell line | User defined >1.0 | cDNA |
| Troester et al., 2004 | UNC compugen oligo array for toxgenomics study | 20163 | GPL550 (GEO) Luminal cell line | User defined >1.0 | cDNA |
Implementation of the meta-profiling procedures
The output from each function provided by MaXlab generates an intersection of genes common to all datasets together with their gene expression profiles in a graphical format. Each function adopts a different methodology for identifying the common genes.
The comparison of interesting genes from microarray experiments retrieved from published literature
Most often, following a microarray experiment, the genes that are significantly expressed and of most interest to the researcher's underlying biological question will be published in the full text or supplementary information of the corresponding article. This aspect of the software is therefore important for comparing the genes that are considered biologically significant by the researcher. More importantly, this functionality aims to identify the common genes that are of increased biological relevance to each researcher's investigation. Because the most interesting gene lists are taken from the full text of an article, the data will already have been pre-processed and the gene expression levels averaged for duplicate genes. Following the import of the significantly expressed genes, the procedure below is executed to analyse the commonality between the gene lists. The algorithm continues until all the gene lists have been compared, presenting one final dataset together with the gene expression values corresponding to each gene list provided. The pseudo-code for the method is as follows:
Inputs
Int gene set 1, Int gene set 2 (Maximum of 10)
Outputs
Gene list B
Algorithm
Function: Interesting gene similarity (Int gene set 1, Int gene set 2, Int gene set 3: Gene list A, Gene list B)
    For each item (n) in Int gene set 1
        If Int gene set 1 (n) is found in Int gene set 2
            Add Int gene set 1 (n) to Gene list A
        End if
    Next Item
    For each item (n) in Gene list A
        If Gene list A (n) is found in Int gene set 3
            Add Gene list A (n) to Gene list B
        End if
    Next Item
    Continue until function completed for the total number of interesting gene sets imported
End Function
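The pairwise procedure above amounts to a running intersection over the imported gene lists. The following is a minimal Python sketch of that idea, not the MaXlab implementation (which is written in Visual Basic.Net). It assumes each gene list is a mapping from gene identifier to an expression value; the function name `common_genes` and the toy values are hypothetical.

```python
from typing import Dict, List, Sequence


def common_genes(gene_sets: Sequence[Dict[str, float]]) -> Dict[str, List[float]]:
    """Return genes present in every gene set, with one expression value per set."""
    if not gene_sets:
        return {}
    # Start from the first list and keep only identifiers found in each later list.
    shared = set(gene_sets[0])
    for gene_set in gene_sets[1:]:
        shared &= set(gene_set)
    # Collect the value reported by each study for every shared gene.
    return {gene: [gs[gene] for gs in gene_sets] for gene in sorted(shared)}


# Toy data only; the values are illustrative and not taken from any study.
study_a = {"MX1": 3.2, "OAS1": 2.5, "STAT1": 1.8}
study_b = {"MX1": 2.9, "STAT1": 2.1, "FAS": 1.4}
study_c = {"STAT1": 1.6, "MX1": 3.0}

result = common_genes([study_a, study_b, study_c])
print(result)
```

Using sets avoids the nested loops of the pseudo-code while producing the same final list, and the order in which the lists are supplied does not affect which genes are returned.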
Multiple experiment comparisons based on user defined thresholds
In addition to comparing significantly expressed genes from the literature, the software is able to compare array chips from different laboratories where the interesting genes are defined based on the user's threshold. Via the automated generation of interesting genes from the entire array gene set, potentially new genes that are common can be extracted from the microarray experiments. MaXlab prompts the user for a single threshold for all experiments or a unique threshold for each experiment above which genes are compared to generate a final set of common genes across all related biological experiments. Such threshold values can be based on those published within the corresponding literature.
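As an illustration of the threshold-based comparison, the sketch below selects the genes above a threshold in each array and intersects the selections. It is a hedged example rather than MaXlab's own code: the function names, the strict "greater than" comparison and the toy expression values are all assumptions made for illustration.

```python
from typing import Dict, Sequence, Set, Union


def genes_above(array: Dict[str, float], threshold: float) -> Set[str]:
    """Genes whose expression value is strictly above the threshold (assumed rule)."""
    return {gene for gene, value in array.items() if value > threshold}


def common_above_threshold(
    arrays: Sequence[Dict[str, float]],
    thresholds: Union[float, Sequence[float]],
) -> Set[str]:
    """Intersect the above-threshold genes of every array.

    `thresholds` is either one value applied to all arrays or one value per array.
    """
    if isinstance(thresholds, (int, float)):
        thresholds = [float(thresholds)] * len(arrays)
    if len(thresholds) != len(arrays):
        raise ValueError("provide one threshold per array, or a single threshold")
    selected = [genes_above(a, t) for a, t in zip(arrays, thresholds)]
    return set.intersection(*selected) if selected else set()


# Toy arrays; the values are illustrative only.
array_1 = {"Egr2": 2.4, "Irf4": 1.3, "Jak2": 0.8}
array_2 = {"Egr2": 1.9, "Irf4": 1.1, "Jak2": 1.5}

single = common_above_threshold([array_1, array_2], 1.0)
per_experiment = common_above_threshold([array_1, array_2], [1.0, 1.2])
print(sorted(single), sorted(per_experiment))
```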
Combining array chip enrichment with interesting gene list comparison
Through the comparison of the interesting gene sets alone, there may be very few or even a lack of common genes. Thus one may ask if this is a result of the use of entirely different array chips or due to the difference in the cut-off threshold for selecting significantly expressed genes. Therefore we provide users with the option to compare the array gene sets used by each laboratory to identify the consensus genes. Genes in the first dataset are compared with this consensus to find the matching genes (output 1). Genes in the second dataset are also compared with the consensus to identify the matching genes (output 2). The outputs 1 and 2 are matched to find an intersection that is based on the commonality between the array platforms provided.
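The consensus step described above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the MaXlab code: gene identifiers are assumed to be of the same type on both platforms, and the function name `platform_enrichment` and the single-letter identifiers are hypothetical.

```python
from typing import Set, Tuple


def platform_enrichment(
    array_1: Set[str],
    array_2: Set[str],
    interesting_1: Set[str],
    interesting_2: Set[str],
) -> Tuple[Set[str], Set[str], Set[str], Set[str]]:
    """Restrict each interesting gene list to genes present on both arrays."""
    consensus = array_1 & array_2          # genes measurable on both platforms
    output_1 = interesting_1 & consensus   # interesting genes of study 1 on both arrays
    output_2 = interesting_2 & consensus   # interesting genes of study 2 on both arrays
    return consensus, output_1, output_2, output_1 & output_2


# Toy identifiers only.
consensus, output_1, output_2, common = platform_enrichment(
    array_1={"A", "B", "C", "D"},
    array_2={"B", "C", "D", "E"},
    interesting_1={"A", "B", "C"},
    interesting_2={"C", "D", "E"},
)
print(sorted(consensus), sorted(output_1), sorted(output_2), sorted(common))
```

Reporting the sizes of the two intermediate outputs alongside the final intersection helps to distinguish a small overlap caused by differing platforms from one caused by differing selection thresholds.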
Identifying common gene expression profiles for time series experiments
The final functionality set of the software is designed for time series microarray experiments accepting time-points or conditions for two array chips used by the same or different research laboratories. Following the import of data the underlying algorithm prompts the user to enter the number of time-points corresponding to each array chip. Using this strategy the genes above the thresholds in the chosen time point for each array will be compared to identify common genes and expression patterns. Once again, the array chips can consist of duplicate genes for which the associated expression values (e.g. median of ratio) will be averaged.
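A minimal sketch of the time series comparison is given below. It assumes each array is supplied as rows of (gene identifier, list of values per time-point), averages the profiles of duplicate genes, and then compares the genes above a threshold at one chosen time-point. The function names, the zero-based time-point index and the toy values are assumptions for illustration and do not reproduce the data of any study.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, List[float]]


def average_duplicates(rows: Sequence[Row]) -> Dict[str, List[float]]:
    """Average the expression profiles of genes that appear more than once."""
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for gene, profile in rows:
        grouped[gene].append(profile)
    # zip(*profiles) walks the duplicate profiles one time-point at a time.
    return {g: [mean(col) for col in zip(*profiles)] for g, profiles in grouped.items()}


def common_at_timepoint(rows_1, rows_2, timepoint, threshold_1, threshold_2):
    """Genes above the thresholds at the chosen time-point in both arrays."""
    avg_1, avg_2 = average_duplicates(rows_1), average_duplicates(rows_2)
    above_1 = {g for g, p in avg_1.items() if p[timepoint] > threshold_1}
    above_2 = {g for g, p in avg_2.items() if p[timepoint] > threshold_2}
    # Return the full averaged profiles so expression patterns can be compared.
    return {g: (avg_1[g], avg_2[g]) for g in sorted(above_1 & above_2)}


# Toy rows with three time-points; "TP53I3" is duplicated on the first array.
array_1 = [("TP53I3", [0.5, 1.0, 2.0]), ("TP53I3", [1.5, 2.0, 3.0]),
           ("FDXR", [0.25, 1.0, 1.5]), ("CTSO", [0.25, 0.25, 0.5])]
array_2 = [("TP53I3", [0.25, 0.75, 1.5]), ("FDXR", [0.25, 0.5, 0.75]),
           ("CTSO", [0.25, 0.5, 1.25])]

common = common_at_timepoint(array_1, array_2, timepoint=2,
                             threshold_1=1.0, threshold_2=1.0)
print(common)
```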
The common genes that overlap between the experiments are displayed in the common gene expression (CoGeEx) panel together with their gene descriptions and gene expression values as provided by the user. These results are automatically exported and displayed in a graphical format together with Pearson correlation coefficient, F-test and standard deviation statistics representing the correlations between the gene expression patterns. Currently, it is essential that gene identifiers from different array chips are of the same type, or are converted to the same type using tools such as MatchMiner (http://discover.nci.nih.gov/matchminer/index.jsp) or the Synergizer (http://llama.med.harvard.edu/cgi/synergizer/translate), which facilitate the conversion of gene identifiers. However, we shall incorporate such a function within MaXlab in version 2.0 to facilitate the process of gene identifier conversion.
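To illustrate the Pearson correlation coefficient reported with the common gene expression results, the sketch below computes it for two expression vectors. It assumes the two vectors hold the expression values of the same common genes, in the same order, from two studies; how MaXlab arranges its inputs internally is not stated here, so the function name `pearson` and the toy vectors are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and the root sums of squared deviations.
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariation / (spread_x * spread_y)


# Toy vectors: the second rises with the first, the third falls as it rises.
study_1 = [1.0, 2.0, 3.0, 4.0]
study_2 = [2.0, 4.0, 6.0, 8.0]
study_3 = [4.0, 3.0, 2.0, 1.0]

r_same = pearson(study_1, study_2)
r_opposite = pearson(study_1, study_3)
print(round(r_same, 3), round(r_opposite, 3))
```

A coefficient close to 1 indicates that the common genes show similar expression patterns in the two studies, whereas a value close to -1 indicates opposite patterns.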
Generation of the interactome of the genes with common gene expression profiles
The network analysis within this study was carried out using the Ingenuity pathways knowledge base (http://www.ingenuity.com/products/pathways_knowledge.html) to further identify the interactions between the significantly differentially expressed common genes that showed similar gene expression profiles across the related biological studies.
Researchers are intrigued to further associate the genes of significant interest generated as a result of their own microarray experiment with those of other laboratories. Although this is possible for a few genes via literature mining methods it is not a practical solution for genes that are derived via microarray methods where the genes of interest can be numerous. Through the development of MaXlab, offering a solution for the comparative analysis of multiple studies, this becomes possible. Of great importance in working with this data is the realisation that different experiments are typically designed to address different questions. In general, it will only make sense to combine datasets if the questions are the same, or, if some aspects of the experiments are sufficiently similar that one can hope to make better inference from the whole than from the experiments separately. To demonstrate the functionality of our novel MaXlab software we collected data from several microarray experiments published on ArrayExpress or GEO investigating a variety of diseases (Tab. 1).
Identification of common gene expression patterns amongst immunological disorders
Genome-wide profiling has been applied to the field of immunology to examine the perturbations and decipher key cellular or molecular pathways associated with specific diseases [Davidson and Diamond, 2001; Rus et al., 2002; Bennett et al., 2003; Matos et al., 2004; Poirot et al., 2004; Adarichev et al., 2005]. Using MaXlab we have carried out several comparisons to ascertain the similarities between related studies investigating immunological diseases.
Comparison of microarray data for autoimmune disease
We compared the interesting gene expression results provided by two research groups [Der et al., 1998; Baechler et al., 2003] investigating the molecular intricacies of the interferon (IFN) pathway underlying systemic lupus erythematosus (SLE) following interferon treatments, to assess the coherency of the findings between studies and thus ultimately identify common sets of differentially expressed genes regulated by IFN. Following the comparative evaluation of these datasets using MaXlab we identified 34 genes common to both microarray investigations with highly similar gene expression patterns (Tab. 2a, Fig. 2; see also Supplemental Table). Our results display a remarkable similarity in the gene expression profile generated from the data of both research labs strongly supporting the significance of the IFN pathway and the control of the IFN-α gene in the regulation of numerous genes involved in SLE. Amongst these, 14 genes have previously been reported to be differentially expressed in SLE and agree with our findings, one of which is the interferon-induced protein (IFIT1) possessing translation regulatory activity, which was one of the first genes to be associated with SLE [Ye et al., 2003]. Importantly, amongst these are the known IFN-α regulated genes: OAS1, MXA, MXB, STAT1 and ISGF3 [Aebi et al., 1989].
| Table 2a: | Summary of the results generated using MaXlab |
| Gene Set similarity (%)* | 5162 (31.1%) | 5285 (24.5%) | 6239 (14.5%) |
| No. of interesting genes common to both gene sets from each study** | Baechler et al., 2003: 198; Der et al., 1998: 113 | Greenberg et al., 2005: 11; Tezak et al., 2002: 106 | Anderson et al., 2006: 1486; Safford et al., 2005: 60 |
| No. of significant genes commonly expressed across studies | 34 | 5 | 17 |
| * 5162 represents the number of genes that are common to both arrays. The value 31.1% represents the number of common genes shown as a percentage of the total number of genes present on both arrays: (5162/(Array 1 + Array 2)) x 100. |
| ** The interesting expression dataset consists of 198 and 113 significantly expressed genes as obtained from the literature, of which 34 are common to both arrays (5162). |
| Table 2b: | Comparing interesting gene lists generated from the microarray gene expression sets based on user-defined thresholds |
| Reference | Genes in Gene Set | Threshold | Common genes |
| Anderson et al., 2006 | 9216 | User defined >1.0 | 240 |
| Safford et al., 2005 | 12422 | User defined >1.0 | |
| Troester et al., 2004 | 20163 Basal cell line | User defined >1.0 at 36 hours | 175 |
| Troester et al., 2004 | 20163 Luminal cell line | User defined >1.0 at 36 hours | |
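The gene set similarity percentages in Table 2a follow the formula given in its footnote, (common genes / (Array 1 + Array 2)) x 100. The short Python sketch below reproduces the three reported percentages from the array sizes in Table 1 and the common gene counts in Table 2a; the function name `gene_set_similarity` is hypothetical.

```python
def gene_set_similarity(n_common: int, n_array_1: int, n_array_2: int) -> float:
    """Common genes as a percentage of the summed sizes of the two arrays."""
    return 100.0 * n_common / (n_array_1 + n_array_2)


# (common genes, genes on array 1, genes on array 2) from Tables 1 and 2a.
comparisons = {
    "Baechler et al., 2003 / Der et al., 1998": (5162, 10260, 6325),
    "Greenberg et al., 2005 / Tezak et al., 2002": (5285, 16000, 5600),
    "Anderson et al., 2006 / Safford et al., 2005": (6239, 9216, 33773),
}

similarities = {
    name: round(gene_set_similarity(*counts), 1)
    for name, counts in comparisons.items()
}
for name, value in similarities.items():
    print(f"{name}: {value}%")
```

Note that the denominator is the sum of the two array sizes, as stated in the footnote, rather than the size of their union, so genes shared by both arrays are counted twice in the total.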
Furthermore, the common genes identified through our comparative analysis were also among those reported by Bennett et al., 2003 (IFI44, MX1, MX2, PLSCR1 and TAP1) and by Crow et al., 2003 (IFI44, MX1, G1P3, PLSCR1 and G1P2), thus confirming the importance of these genes within the SLE signature and strongly suggesting that IFN-α is crucial in disease progression. Intriguingly, our results identify the genes NMI (N-MYC and STAT interactor) and SP110, whose roles have not previously been clarified in SLE, as commonly overexpressed in response to IFN-α in both datasets. This in turn strongly suggests that they are potential genes for further investigation, especially since they interact with STAT1 and IL6, respectively (Fig. 3). Similarly, MaXlab identified several other common genes between the studies, including IRF2, PML, PMAIP1 and FAS, that have not previously been associated with SLE and whose roles within SLE have therefore not been elucidated. However, the common overexpression of IRF2 and PML, which are involved in the negative regulation of transcription and cell proliferation, and of PMAIP1 and FAS, which play a functional role in the induction of apoptosis (http://www.geneontology.org), suggests that they are potential target genes for further examination to clarify their roles in the pathogenesis of SLE (Supplemental Table).
Other disorders that have been modelled as autoimmune diseases and whose pathophysiology is not fully understood are dermatomyositis (DM) and juvenile dermatomyositis (JDM). To examine the correlation between significantly expressed genes in both DM and JDM, we carried out a comparative analysis of two related expression studies [Tezak et al., 2002; Greenberg et al., 2005] to infer further biological meaning and to understand the mechanisms involved in the pathogenesis of DM and JDM. We identified a set of genes including the interferon-α (type 1) inducible genes MXA, MXB, IFI27 and IFI44 and the interferon regulatory factor gene IRF7, thus confirming their importance within the biological pathways in both DM and JDM [Der et al., 1998; Sato et al., 1998; Greenberg et al., 2005] (Fig. 4; see also Supplemental Table, part B). What is striking about the genes identified through our software is that they are analogous to those commonly identified from comparing the SLE studies (Supplemental Table). We therefore compared the SLE, DM and JDM studies [Der et al., 1998; Baechler et al., 2003; Greenberg et al., 2005], which interestingly revealed the common overexpression of G1P2, G1P3, PLSCR1 and OAS1. The common expression of these significant genes across the autoimmune diseases DM, JDM and SLE, identified using MaXlab, combined with their known involvement within the IFN pathway [Bennett et al., 2003; Crow et al., 2003; Ishii et al., 2005], suggests that these diseases share a common pathophysiology.
Comparing immuno-tolerance microarray studies
Numerous studies have been conducted in which anergic cell states have been induced to identify the mechanisms that lead to the dysregulation of tolerance [Ibrahim et al., 2001; Lock et al., 2002; Matejuk et al., 2003; Zhang et al., 2003]. To identify the genes and pathways that promote the induction of T-cell anergy, Safford et al., 2005, carried out a microarray analysis on T-cells activated in conditions that either promote or inhibit anergy induction. In addition, a similar study by Anderson et al., 2006, exploited cDNA microarray technology to demonstrate a balanced transcription program regulated by different transcription factors for T-cell activation and/or tolerance during antigen induced T-cell responses. When assessing the commonality our software was able to reveal the role for 17 genes common to both tolerant conditions, including that of zinc finger transcription factor early growth response gene 2 (Egr2), a gene required for the full induction of T-cell anergy [Harris et al., 2004], alongside the genes of transcription factors Irf4, Jarid2 and Nfatc1 as well as the chemokines or cytokines Tnfsf11, Tnfsf9 and Ccl1 (Tab. 2a, Fig. 5; see also Supplemental Table, part C). The identification of the principal Egr2 gene and its role as a negative regulator of T-cell function and thus anergy induction has been further supported by several studies using high-dimension genomic analysis to examine the genes upregulated during both T and B-cell anergy [Glynne et al., 2000; Macián et al., 2000; Lechner et al., 2001].
To demonstrate the comparability function of MaXlab for whole microarrays based on user-defined thresholds we exploited the entire microarray gene sets used by Safford et al., 2005, and Anderson et al., 2006, and a cut-off expression threshold of 1.0 fold. As a result, MaXlab revealed a total of 240 genes common to both studies (Tab. 2b; see also Supplemental Table, part D and Supplemental Figure 1). Several genes were found with a potential for further investigation, including Irf8 (a negative regulator of cell proliferation), Tgfb1 and JunB (possessing transcriptional regulatory activity), Prkca (a negative regulator of protein kinase activity) and Ptprv (an inducer of apoptosis) (http://www.geneontology.org). More importantly, Jak2 involved in the Jak-Stat cascade known as a negative regulator of cell proliferation, Casp3 that is more specifically a negative regulator of activated T cell proliferation, Cdkn1a (cyclin dependent protein kinase inhibitor activity) and Cdkn2b (regulator of transcription) have also been commonly identified. These genes have not been shown to have an involvement in tolerance. However their common expression in both studies and investigation into their functional activities suggest them to be prospective targets for further investigation to elucidate their potential involvement in initiating or maintaining T cell tolerance.
Comparing time series based microarray data for breast cancer
Microarray experiments are often based on time series. Here we have used MaXlab to further explore a gene expression microarray experiment carried out by Troester et al., 2004, investigating the response of basal and luminal breast tumours to the drugs doxorubicin and 5-fluorouracil. One aspect that may be of potential importance is identifying the genes that are commonly and significantly expressed in cells from various cancers in response to a particular drug. Alternatively, it may be equally valuable to know the common genes that are expressed in one particular type of cancer that has been treated with several drugs. To demonstrate the microarray time series functionality of MaXlab, we chose as an example to identify genes commonly expressed in both basal and luminal cell lines following treatment with doxorubicin above a gene expression threshold of 1.0 at 36 hours. This revealed 175 genes that are expressed in both cell lines derived from basal and luminal epithelium in response to doxorubicin, thus revealing potentially common targets for this drug (Tab. 2b; see also Supplemental Table, part E and Supplemental Figure 2). Amongst these were those discussed by Troester et al., 2004, including the p53 regulated gene TP53I3 (tumour protein p53 inducible protein 3) involved in the induction of apoptosis, Cdkn1a involved in the negative regulation of cell proliferation and induction of apoptosis, FDXR (ferredoxin reductase) and also glutathione-S-transferase π (GST-π), all of which were induced in both cell lines, although less dramatically in the luminal cell line. Other genes of potential interest that were commonly expressed included CTSO (cathepsin) involved in proteolysis, S100A9 involved in leukocyte chemotaxis, BBC3 involved in caspase activation and positive regulation of apoptosis (although much higher in the luminal cell line) and LRDD involved in death receptor binding (http://www.geneontology.org).
Through the cross comparison of multiple studies, MaXlab can provide researchers with an insight into the genes playing a potential common role in related diseases in an automated fashion via several flexible functionalities and view potentially important disease signatures.
In conclusion, we believe that the MaXlab software is an attractive and powerful application for the scientific community involved in microarray research, allowing researchers to gain knowledge from existing datasets, the majority of which sit stagnant and disjointed following publication. Following the systematic collection of public microarray data, we have demonstrated the explorative functionality of MaXlab for the comparative meta-profiling of biologically relevant datasets generated by independent research labs. By integrating related gene expression matrices we identified several sets of common genes from related studies that were significantly expressed and, more importantly, possessed similar expression profiles. More interestingly, our software has also been able to identify several commonly expressed genes of high significance, based on expression or gene function across related biological conditions, that have not been associated with the disease before. The common expression and characterisation of these genes suggest that they may play a common role in the mechanism of disease; they are therefore worthy of further investigation and could serve as potential therapeutic targets.
Project name: MaXlab
Project Home Page: Databases including the software executable can be accessed from http://www.immuno-software.org.
Operating system: Tested on Windows 2000 Workstation (SP4) and Windows XP (SP2)
Programming language: Microsoft Visual Basic.Net and MySQL and ActivePerl
Other requirements: Microsoft .NET Framework version 2.0 Software Development Kit (SDK) min, MySQL database server no later than 4.1, MySQL Connector/ODBC 3.51 and Microsoft Office 2000
This study was partly supported by grants from the UK Medical Research Council (MRC) (Grant number: G0300520) and the Brunel University Studentship. We thank Ingenuity Systems for allowing us to use their Ingenuity pathways knowledge base.