In Silico Biology 5, 0035 (2005); ©2005, Bioinformation Systems e.V.  

Analysis of Gene Ontology features in microarray data using the Proteome BioKnowledge® Library


Robin J. Johnson1*, Jennifer M. Williams1, Barbara M. Schreiber2, Charles D. Elfe1, Kelley L. Lennon-Hopkins1, Marek S. Skrzypek3 and Renee D. White1




1 Biobase Corporation, 100 Cummings Center, Ste. 420B, Beverly, MA 01915
2 Department of Biochemistry, Boston University School of Medicine, Boston, MA 02118
3 Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305



* Corresponding author

   Email: robin.johnson@biobase-international.com





Edited by E. Wingender; received January 06, 2005; revised May 02; accepted May 26, 2005; published June 03, 2005



Abstract

Microarray technology has resulted in an explosion of complex, valuable data. Integrating data analysis tools with a comprehensive underlying database would allow efficient identification of common properties among differentially regulated genes. In this study we sought to compare the utility of various databases in microarray analysis.

The Proteome BioKnowledge® Library (BKL), a manually curated, proteome-wide compilation of the scientific literature, was used to generate a list of Gene Ontology (GO) Biological Process (BP) terms enriched among proteins involved in cardiovascular disease. Analysis of DNA microarray data generated in a study of rat vascular smooth muscle cell responses revealed significant enrichment in a number of GO BPs that were also enriched among cardiovascular disease-related proteins. Using annotation from LocusLink and chip annotation from the Gene Expression Omnibus yielded fewer enriched cardiovascular disease-associated GO BP terms. Data sets of orthologous genes from mouse and human were generated using the BKL Retriever. Analysis of these sets, focusing on BKL Disease annotation, revealed a significant association of these genes with cardiovascular disease. These results, together with the extensive presence of experimental evidence for BKL GO and Disease features, underscore the benefits of using this database for microarray analysis.

Keywords: microarray analysis, Gene Ontology, disease, cardiovascular disease, protein databases



Introduction

The use of microarrays is becoming increasingly common in many areas of biological research. A search of PubMed for DNA microarrays yields a list of publications that ranges in topic from basic-science research across many species to studies of human disease, which include drug discovery, clinical research, and pharmacogenetics; to food science; toxicology; occupational health; and exercise physiology research. The benefit of microarray technology is that it gives scientists the ability to examine tens of thousands of genes, or entire genomes, at once. The drawback is that a single experiment generates a large amount of complex data that must then be stored, analyzed, and translated into usable biological knowledge.

For many scientists, making sense of microarray data has required manually annotating the genes, searching the scientific literature and utilizing information and tools provided by a plethora of publicly available biological databases. Such manual annotation of genes is tedious and can take hundreds of hours for a single data set [Hosack et al., 2003]. Even then, only a minute portion of the wealth of information available within the data set is garnered. The various databases that have been created to integrate biological data contain different types of information, often in very different formats. Moreover, an investigator must often reference multiple species-specific databases for cross-species comparisons. At the end of the analyses, the investigator is often left with a long list of gene identifiers, the biological significance of which is by no means obvious.

In recent years, various strategies have been employed that help to alleviate some of the problems associated with integrating information from the publicly available databases. The Gene Ontology (GO) Consortium, which started in 1998 as a collaboration among FlyBase, Saccharomyces Genome Database (SGD), and the Mouse Genome Informatics (MGI) project, has provided a structured, controlled vocabulary for describing the molecular function, biological process and cellular component characteristics of gene products [Ashburner et al., 2000]. Since its inception, GO has been adopted by many databases across plant, animal, and microbial species [Harris et al., 2004].

Tools have been developed to facilitate analysis of microarray data for GO features, allowing direct association between gene identifiers and GO terms that have been curated for the associated gene products [Adryan and Schuh, 2004; Al-Shahrour et al., 2004; Cheng et al., 2004; Feng et al., 2003; Masseroli et al., 2004; Robinson et al., 2004]. These tools allow scientists to assign biological information to lists of genes and thereby simplify the process of identifying functional patterns among gene clusters. However, these tools are limited by the depth, breadth, and quantity of the GO curation and the paucity of other types of curation in the available protein databases.

The Proteome BioKnowledge® Library (BKL) is a manually curated, proteome-wide compilation of the scientific literature organized in 6 volumes covering 24 species. As of July 2004, the BKL contained curated information on over 45,000 unique mammalian genes, including over 255,000 manually curated Gene Ontology (GO) observations associated with these genes, of which 40% have been assigned based on experimental evidence. In addition, it contains curated annotation on properties including expression (for human, mouse, rat and worm), disease involvement (human), mutant phenotype (mouse, worm, fungi), protein domains, and genetic location, all of which are searchable, alone or in combination with GO annotation, using the integrated BKL Retriever™ search tool.

In this study, we utilized the BKL and BKL Retriever to analyze data from a DNA microarray experiment in an attempt to link our experimental model (rat smooth muscle cells treated with an inflammatory protein) with human cardiovascular disease. First, a set of human proteins associated with the MeSH term Cardiovascular Diseases was generated using the BKL Retriever. This set of proteins was analyzed for overrepresentation of GO Biological Process (GO BP) terms. Next, the set of upregulated rat genes was analyzed using three information sources (the BKL, LocusLink, and platform information for the Affymetrix Rat Expression Array 230A downloaded from Gene Expression Omnibus [Edgar et al., 2002]) to determine whether any of 16 selected GO BP terms were significantly enriched among them. Finally, sets of mouse and human orthologs of the upregulated rat genes were generated and analyzed for overrepresentation of Disease annotation in the BKL.

Comparison of the GO BP annotation in each of the three information sources revealed that the BKL contained more rat proteins with GO BP annotation (and more with GO BP terms with experimental evidence) than LocusLink and the array annotation. Analysis of the microarray data using the BKL revealed that 12 of 16 GO BP terms enriched among cardiovascular disease-associated proteins were significantly enriched among the upregulated rat genes. The same analysis using LocusLink showed 13 BPs that were significantly enriched, while the array annotation showed no significantly enriched BPs. The number of proteins associated with each GO BP term was consistently higher in the BKL than in LocusLink. Finally, in sets of human and mouse orthologs of the upregulated rat proteins, annotation to the Cardiovascular Diseases MeSH term was significantly enriched. These findings illustrate the power and flexibility of the BKL in integrating microarray data into a cohesive hypothesis concerning physiological responses of smooth muscle, linking them to disease processes and allowing the integration of new data into an existing knowledge base.



Methods


Generation of target and reference data sets

For each analysis we generated a target data set and a reference data set. In general, the target data set represents the set of proteins being analyzed and is a subset of the reference data set.

First, for analysis of GO BP features associated with cardiovascular disease, the target data set, defined as the set of all human proteins with BKL annotation to the MeSH term, Cardiovascular Diseases, was obtained by browsing the BKL Disease hierarchy with the BKL Retriever. The reference data set was all human proteins in the BKL.

Second, microarray data were generated from an experiment designed to model aspects of smooth muscle cell responses in cardiovascular disease. Sets of genes from the Affymetrix Rat Expression Array 230A that were differentially expressed in the model system were generated by the Microarray Resource in the Department of Genetics and Genomics, Boston University School of Medicine, Boston, MA. Genes that were overexpressed at least two-fold and had a minimal signal value of 300 were selected. These identifiers were analyzed and matched against proteins in the BKL and LocusLink databases and the chip annotation from GEO to give final target data sets of upregulated genes with protein products. For each of the three information sources in our comparisons, the reference set consisted of the set of all rat genes with protein products rather than the entire set of rat genes. Genes or transcripts that may represent non-translated RNA or duplications were not included, since the goal was to identify proteins with GO BP terms of interest for further study. Moreover, the ratio of the number of target data set genes with protein products, n, to the number of total database genes with protein products, N, for BKL (0.053) and LocusLink (0.052) is roughly similar to that obtained using the total number of upregulated Affymetrix identifiers as n and the total number of identifiers on the Affymetrix Rat Expression Array 230A chip as N (0.047). The ratio for the chip annotation is somewhat higher (0.077).
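
The selection rule described above can be sketched in Python as follows. This is only an illustration: the record layout (`probe_id`, `control`, `treated`) is hypothetical, and the sketch assumes that fold change is the ratio of treated to control signal and that the minimal signal value of 300 applies to the treated sample. The text does not specify either detail.

```python
def select_upregulated(probes, min_fold=2.0, min_signal=300.0):
    """Return IDs of probes that pass both selection thresholds.

    Assumptions (not stated in the paper): fold change = treated / control,
    and the signal threshold is applied to the treated sample.
    """
    selected = []
    for probe in probes:
        control, treated = probe["control"], probe["treated"]
        if control <= 0:
            continue  # fold change is undefined without a positive baseline
        if treated / control >= min_fold and treated >= min_signal:
            selected.append(probe["probe_id"])
    return selected


# Made-up signal values, for illustration only.
example = [
    {"probe_id": "probe_A", "control": 200.0, "treated": 900.0},  # passes both
    {"probe_id": "probe_B", "control": 100.0, "treated": 250.0},  # signal too low
    {"probe_id": "probe_C", "control": 400.0, "treated": 600.0},  # fold too small
]
print(select_upregulated(example))
```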

Platform information for the Affymetrix Rat Expression Array 230A (accession number GPL341) was downloaded from Gene Expression Omnibus [Edgar et al., 2002] (http://www.ncbi.nlm.nih.gov/geo/) on May 4, 2004. The LocusLink data were downloaded from the LocusLink ftp site (ftp://ftp.ncbi.nlm.nih.gov/refseq/LocusLink) and extracted from the LL_tmpl file dated July 10, 2004. To be consistent with the BKL and GEO analysis, only those LocusLink records of type "gene with protein product, function known or inferred" and "gene with protein product, function unknown" were considered. The BKL is a subscription-based database containing manual GO annotation drawn largely from the scientific literature. The BKL data were also captured on July 10, 2004, to reflect a similar set of rat genes.

Finally, for analysis of disease features associated with the upregulated genes, two target data sets were generated by browsing the Species hierarchy using the BKL Retriever. Upregulated rat gene identifiers were uploaded into the Retriever and sets containing all human and mouse orthologs of these rat proteins were generated. The sets containing all mouse and all human proteins in the BKL served as the corresponding reference data sets.


Calculation of enrichment of GO BP and Disease terms in the analyzed data sets

GO BP terms or Disease features were considered enriched if the observed number of target data set proteins associated with a term or feature, F, (ka) was higher than the number of proteins that would be expected to be associated with F in a similar-sized set of randomly selected proteins (ke). The ke for a given term can be calculated using the expression:

$$k_e = \frac{n K}{N}$$

where

n = number of genes in the target data set

N = number of genes in the reference data set

K = number of genes in the reference data set associated with F

The enrichment factor, R, is calculated simply by taking the ratio of ka and ke as follows:

$$R = \frac{k_a}{k_e}$$
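
As a worked illustration, the two quantities can be computed in a few lines of Python. The sketch takes ke as n·K/N (the expected count under random sampling, which is consistent with the values in Table 1) and uses the angiogenesis row of Table 1 with the n and N given in the table footnote; the function names are ours, chosen for illustration.

```python
def expected_count(n, N, K):
    """Expected number of target proteins annotated to a term (ke)."""
    return n * K / N


def enrichment_factor(ka, n, N, K):
    """Ratio of observed (ka) to expected (ke) annotated proteins (R)."""
    return ka / expected_count(n, N, K)


# Angiogenesis (GO:0001525) in Table 1: K = 243, ka = 72,
# with n = 584 target proteins and N = 18539 reference proteins.
n, N = 584, 18539
K, ka = 243, 72

ke = expected_count(n, N, K)
R = enrichment_factor(ka, n, N, K)
print(round(ke), round(R, 1))  # 8 9.4, as reported in Table 1
```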

Computing significance

The p-values computed for Tables 1, 2 and 3 represent the likelihood of having at least k genes of a possible K that are associated with a particular term in a subset of n genes randomly drawn from a total of N genes. Since the set of n distinct genes represents a sample drawn without replacement, the p-value is calculated using the hypergeometric distribution. The formula below computes the sum of the probabilities of having (k - 1) or fewer genes associated with a term and subtracts it from 1, effectively calculating the probability of having at least k genes associated with a particular term [Draghici et al., 2003]:

$$p = 1 - \sum_{i=0}^{k-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$

where n, N and K are as defined above and k is the observed number of target data set genes associated with the term (ka).
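
A minimal Python sketch of this upper-tail hypergeometric calculation is given below. It is not the authors' implementation; the function name is ours, and exact integer arithmetic is used (via `math.comb` and `fractions.Fraction`) so that very small tail probabilities are not lost to floating-point cancellation when subtracting from 1.

```python
from fractions import Fraction
from math import comb


def hypergeom_p_at_least(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated to the term of interest."""
    if k <= 0:
        return 1.0
    total = comb(N, n)
    # Number of equally likely draws containing k - 1 or fewer annotated genes.
    # comb(K, i) is 0 when i > K, so impossible counts contribute nothing.
    below = sum(comb(K, i) * comb(N - K, n - i) for i in range(k))
    return float(1 - Fraction(below, total))


# GEO columns of Table 2, complement activation: K = 3, ka = 2,
# with n = 345 target genes and N = 4507 reference genes.
p = hypergeom_p_at_least(2, 3, 345, 4507)
print(f"{p:.4f}")
```

With these inputs the function returns approximately 0.0166, in line with the p-value listed for that row of Table 2.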

Since we tested for significance among thousands of features in the GO BP hierarchy, the q-value for determining significance in multiple hypothesis testing [Storey and Tibshirani, 2003] based on the False Discovery Rate (FDR) was computed from the resulting p-values using QVALUE software.
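
The q-values in this study were computed with the QVALUE software, which implements the method of Storey and Tibshirani. For readers who want a feel for FDR-based adjustment, the sketch below shows the related but simpler Benjamini-Hochberg procedure. It is explicitly not the QVALUE algorithm (it omits the estimate of the proportion of true null hypotheses), so it will generally not reproduce the q-values reported here. The function name is ours.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, for illustration only.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(a, 3) for a in adjusted])
```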


Analysis of GO BP terms associated with cardiovascular disease

A set of human proteins associated with cardiovascular disease (n = 584) was obtained from the Human PSD database of the BKL using the Retriever tool. This set, the target data set, was analyzed for GO BP terms that were significantly (q < 0.05) enriched at least two-fold in comparison with all human proteins in the BKL, the reference data set (N = 18,539). Terms with a K value less than 50 or a ka value less than 15 were eliminated. Because analysis using the array annotation downloaded from GEO was done manually, 16 terms of particular interest to the laboratory were selected for further analysis. The selected terms, their GO ID numbers, and the values of K, ka, ke, R, p-value, and q-value are shown in Table 1.

Table 1: Selected GO BP terms associated with Cardiovascular Disease.
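| Biological Process (BP) | GO ID | K | ke | ka | R | p-value | q-value |
|---|---|---|---|---|---|---|---|
| Cellular morphogenesis | GO:0000902 | 717 | 23 | 77 | 3.4 | 1.0E-15 | 2.4E-15 |
| Angiogenesis | GO:0001525 | 243 | 8 | 72 | 9.4 | 2.1E-15 | 3.2E-15 |
| Lipid metabolism | GO:0006629 | 888 | 28 | 106 | 3.8 | 0 | 0 |
| Phospholipid metabolism | GO:0006644 | 159 | 5 | 18 | 3.6 | 2.8E-06 | 1.5E-06 |
| Lipid transport | GO:0006869 | 97 | 3 | 19 | 6.2 | 1.6E-10 | 1.2E-10 |
| Inflammatory response | GO:0006954 | 540 | 17 | 134 | 7.9 | 2.4E-15 | 3.2E-15 |
| Complement activation | GO:0006956 | 59 | 2 | 17 | 9.1 | 1.9E-12 | 1.8E-12 |
| Response to oxidative stress | GO:0006979 | 189 | 6 | 39 | 6.6 | 1.0E-15 | 2.4E-15 |
| Cell cycle | GO:0007049 | 1,173 | 37 | 73 | 2.0 | 1.7E-08 | 1.2E-08 |
| Protein kinase cascade | GO:0007243 | 881 | 28 | 128 | 4.6 | 1.0E-15 | 2.4E-15 |
| Steroid metabolism | GO:0008202 | 218 | 7 | 38 | 5.5 | 1.1E-15 | 2.4E-15 |
| Regulation of blood pressure | GO:0008217 | 85 | 3 | 49 | 18.3 | 6.7E-16 | 2.0E-15 |
| + regulation of cell proliferation | GO:0008284 | 620 | 20 | 106 | 5.4 | 0 | 0 |
| - regulation of cell proliferation | GO:0008285 | 658 | 21 | 74 | 3.6 | 1.1E-16 | 5.3E-16 |
| Cell migration | GO:0016477 | 672 | 21 | 116 | 5.5 | 2.4E-15 | 3.2E-15 |
| Regulation of apoptosis | GO:0042981 | 931 | 29 | 139 | 4.7 | 1.4E-15 | 2.8E-15 |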
Values for expected frequency (ke), enrichment (R) and significance were calculated as described in "Methods". Values for ke are rounded to the nearest integer and R to the nearest tenth. The number of proteins associated with a given BP (K) was obtained from the BKL using the Retriever tool. The number of human proteins related to cardiovascular disease (n) is 584. The total number of human proteins in the Human PSD (N) is 18539.


Analysis of annotation quantity and evidence mapping

Multiple GO BP terms were considered for individual proteins, and each GO term was tallied separately. To facilitate comparison between Proteome's experimental evidence code (E) and the experimentally based GO evidence codes used by LocusLink (IDA, IMP, IPI and IGI; http://www.geneontology.org/GO.evidence.html), it was necessary to map the GO evidence codes to Proteome evidence codes based upon the GO Consortium's definitions and use cases for Biological Processes. GO's IDA (inferred from direct assay) and IMP (inferred from mutant phenotype) evidence codes were mapped to Proteome's E (information was demonstrated experimentally) evidence code. GO's IPI (inferred from physical interaction) and IGI (inferred from genetic interaction) evidence codes were mapped to Proteome's P (predicted in the literature by means other than sequence similarity; for example, by association with a complex) evidence code. Only BKL GO BP terms having E evidence codes and LocusLink GO BP terms having IDA or IMP evidence codes were considered to be experimentally determined. For GEO, annotations were provided by the manufacturer of the DNA microarray chip, and those with evidence codes of 'E', 'IDA', or 'IMP' were considered to be experimentally determined.
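
The mapping and the per-source definition of "experimentally determined" described above can be written down compactly. The Python sketch below is an illustration only; the dictionary and function names are ours, and the source labels ("BKL", "LocusLink", "GEO") are simply convenient keys.

```python
# GO evidence code -> Proteome evidence code, as described in the text.
GO_TO_PROTEOME = {"IDA": "E", "IMP": "E", "IPI": "P", "IGI": "P"}

# Evidence codes counted as experimentally determined in each source.
EXPERIMENTAL_CODES = {
    "BKL": {"E"},
    "LocusLink": {"IDA", "IMP"},
    "GEO": {"E", "IDA", "IMP"},
}


def to_proteome_code(go_code):
    """Map a GO evidence code to its Proteome equivalent (None if unmapped)."""
    return GO_TO_PROTEOME.get(go_code)


def is_experimental(source, code):
    """True if an annotation with this evidence code counts as experimental."""
    return code in EXPERIMENTAL_CODES[source]


print(to_proteome_code("IMP"), is_experimental("LocusLink", "IPI"))
```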


Analysis of differentially regulated rat genes

For BKL-based analysis, gene identifiers from the cluster of upregulated rat genes were uploaded into the BKL Retriever and the number of unique proteins retrieved, n, was noted. A number of SQL queries were created to extract the relevant data and calculate the basic counts: ka, N, and the K values for all GO BP terms. These queries were embedded in Perl scripts and then run.

Analysis of the LocusLink data required a number of steps. The LocusLink LL_tmpl file was parsed and loaded into tables in a relational database. The LocusLink identifiers from the microarray dataset were also loaded into the database. Using a previously loaded instance of GO, SQL queries were created as above to extract the relevant data and calculate n, ka, N and K. Perl scripts containing these queries were then run.
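
The kind of query used to obtain the basic counts can be sketched as follows. The actual BKL and LocusLink schemas are not described here, so the three tables below (`reference_gene`, `target_gene`, `go_bp_annotation`) are hypothetical, the data are made up, and Python's built-in `sqlite3` module stands in for the Perl scripts and relational database used in the study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reference_gene (gene_id TEXT PRIMARY KEY);
    CREATE TABLE target_gene (gene_id TEXT PRIMARY KEY);
    CREATE TABLE go_bp_annotation (gene_id TEXT, go_id TEXT);
""")

# Toy data: 10 reference genes, 3 of them in the target (upregulated) set.
conn.executemany("INSERT INTO reference_gene VALUES (?)",
                 [(f"g{i}",) for i in range(1, 11)])
conn.executemany("INSERT INTO target_gene VALUES (?)",
                 [("g1",), ("g2",), ("g3",)])
conn.executemany("INSERT INTO go_bp_annotation VALUES (?, ?)", [
    ("g1", "GO:0006954"), ("g2", "GO:0006954"), ("g5", "GO:0006954"),
    ("g3", "GO:0007049"), ("g7", "GO:0007049"),
    ("g8", "GO:0007049"), ("g9", "GO:0007049"),
])

N = conn.execute("SELECT COUNT(*) FROM reference_gene").fetchone()[0]
n = conn.execute("SELECT COUNT(*) FROM target_gene").fetchone()[0]

# K counts annotated reference genes per term; ka counts those that are
# also in the target set (COUNT ignores the NULLs from the LEFT JOIN).
counts = conn.execute("""
    SELECT a.go_id,
           COUNT(DISTINCT a.gene_id) AS K,
           COUNT(DISTINCT t.gene_id) AS ka
    FROM go_bp_annotation AS a
    JOIN reference_gene AS r ON r.gene_id = a.gene_id
    LEFT JOIN target_gene AS t ON t.gene_id = a.gene_id
    GROUP BY a.go_id
    ORDER BY a.go_id
""").fetchall()

print(N, n, counts)
```

Counting distinct gene identifiers guards against a gene being tallied twice for the same term when it carries several annotations with different evidence codes.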

For the chip annotation data downloaded from GEO, data in a Microsoft Excel® file was manually sorted by LocusLink identifiers, GO BP terms, and GO evidence codes. Total number, n, N, K and ka values for the 16 selected GO BP terms were determined manually.


Analysis of human and mouse orthologs

The 378 LocusLink identifiers for the upregulated rat genes were uploaded into the BKL Retriever via the user interface, and sets of human and mouse orthologs were obtained. Human orthologs and mouse orthologs with annotation associating them with the term Cardiovascular Diseases were analyzed to obtain ka values for each of these sets. The entire human and mouse databases were then searched to obtain K values for Cardiovascular Diseases for each. As this analysis represented testing of a single hypothesis, the results were considered significant if p < 0.01.



Results


GO BP annotation quantity and evidence mapping

To evaluate the GO curation in each database, the BKL, NCBI's LocusLink, and the chip manufacturer's annotation data on NCBI's GEO website were analyzed with respect to the quantity and evidence mapping of their GO curation for Biological Process (GO BP). LocusLink and GEO are both publicly available resources that derive GO annotation from several sources.

The total set of expressed rat genes with protein products present in the three information sources (as of July 10, 2004), N, was 6914 for the BKL, 7160 for LocusLink, and 4507 for the Affymetrix Rat Expression Array 230A annotation from GEO (Figure 1a). The subset of rat genes that had GO BP annotation in the three databases was 6454 (93%) for BKL, 2876 (40%) for LocusLink, and 1006 (22%) for GEO. The subset of rat genes that had GO BP annotation with experimentally derived (rather than predicted) evidence in the three databases was 2749 (40%), 596 (8%), and 53 (1%), respectively. Thus, the BKL offers a strong foundation of curated data for analysis of a microarray data set.

In addition, the three information sources were used to identify and analyze the GO BP curation for the target data set of distinct rat genes with protein products that showed at least two-fold upregulation in the model system (Figure 1b). The sizes of the target data sets, n, differed slightly when analyzed with BKL (367), LocusLink (371) or GEO (345) information. The number of genes with GO BP annotation for this data set was 354 (96%) for the BKL, 225 (60%) for LocusLink, and 102 (30%) for the GEO information. The number of genes with GO BP annotation with experimentally derived (rather than predicted) evidence for this data set was 229 (62%) for BKL, 60 (16%) for LocusLink, and 3 (<1%) for GEO. Thus, similar to the data for the entire rat proteome, the BKL offers the strongest foundation of curated data for the genes upregulated in this microarray.

Figure 1: Comparison of GO Biological Process annotation in the BKL, LocusLink, and GEO chip information. The number of genes with GO BP terms was determined for each of the three information sources. White areas of bars represent the number of proteins with GO BP terms based on experimental data, light gray areas represent the number of proteins with predicted GO BP terms, and dark gray areas represent proteins without any associated GO BP terms (see Methods for details).
A) GO BP annotation in the rat gene reference data set. Counts are based on total rat proteins in the BKL database (6,914), LocusLink database (7,160), and Chip Annotation (4,507).
B) GO BP annotation in the rat gene target data set. Counts are based on upregulated rat (target data set) proteins in the BKL database (367), LocusLink database (371), and Chip Annotation (345).



GO BP enrichment in cardiovascular disease

The BKL curation was mined to identify GO BP properties that were significantly enriched among human proteins implicated in cardiovascular disease (see Methods; Table 1). The target data set of human genes associated with the MeSH term Cardiovascular Diseases consisted of 584 proteins. Analysis for enrichment of GO BP terms in this set yielded 1442 GO BP terms associated with the target data set. Of these, 1175 were enriched at least 2-fold, and all of these were significantly enriched (q < 0.05). The number of terms with a K value greater than 50 and a ka value greater than 15 was 199.


GO BP enrichment in analyzed data sets

Atherosclerosis is an inflammatory disease with lesions characterized by infiltrating monocytes/macrophages, lymphocytes and smooth muscle cells [Ross, 1999]. Smooth muscle cells migrate to the vascular intima, proliferate and deposit extracellular matrix. Smooth muscle cells are not terminally differentiated, and atherosclerosis involves a return to a proliferative phenotype. The set of rat genes upregulated in our microarray experiment was analyzed for GO BP enrichment. Analysis of the upregulated gene set with BKL curation showed 973 GO BP terms associated with the set (ka > 0). Of these, 839 terms were enriched (R > 1), and there was significant enrichment (q < 0.05) in 111 terms. R values for significantly enriched terms ranged from 1.2 to 18.8. Assessment of the 16 pre-selected GO BP terms revealed that 12 were significantly enriched (Table 2). The same analysis using LocusLink curation showed 516 GO BP terms represented, with 467 enriched and 357 significantly enriched (q < 0.05). R values ranged from 1.2 to 19.3. Thirteen of the 16 selected GO BP terms were significantly enriched. Because analysis of the chip annotation data was done manually, a full list of represented GO BP terms was not generated. Among the 16 selected GO BP terms, 3 showed enrichment (1 ≤ R ≤ 8.7), though none was significantly enriched using a Bonferroni-corrected p-value threshold of 0.003125 (0.05/16). The inability to show significance for enrichment of any of these GO BPs is likely due to the low number of genes with curation in GEO.
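
The Bonferroni check applied to the chip annotation can be reproduced in a few lines of Python. The p-values are taken from the GEO columns of Table 2 (terms with no annotated genes, listed as n/d, are stored as `None`); the variable names are ours, and an overall significance level of 0.05 is assumed, which matches the stated threshold of 0.003125 for 16 tests.

```python
# GEO p-values for the 16 selected GO BP terms (Table 2); None = n/d.
geo_p_values = {
    "GO:0000902": 0.076548, "GO:0001525": None,     "GO:0006629": 0.586884,
    "GO:0006644": 0.076548, "GO:0006869": 0.272873, "GO:0006954": 0.046235,
    "GO:0006956": 0.016642, "GO:0006979": 0.147251, "GO:0007049": 0.101791,
    "GO:0007243": 0.549406, "GO:0008202": 0.615898, "GO:0008217": 0.076548,
    "GO:0008284": 0.147251, "GO:0008285": 0.076548, "GO:0016477": None,
    "GO:0042981": 0.272873,
}

alpha = 0.05
threshold = alpha / len(geo_p_values)  # 0.05 / 16 tests

significant = [go_id for go_id, p in geo_p_values.items()
               if p is not None and p < threshold]
smallest = min(p for p in geo_p_values.values() if p is not None)

print(threshold, smallest, significant)
```

The smallest GEO p-value (0.016642, complement activation) is well above the corrected threshold, so no term passes.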


Table 2: Occurrence of GO BPs connected to cardiovascular disease in the target rat data set. Comparison of curation of three databases for significantly enriched GO BP terms.
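| GO ID | GO BP term | BKL K | BKL ke | BKL ka | BKL R | BKL p-value | BKL q-value | LocusLink K | LocusLink ke | LocusLink ka | LocusLink R | LocusLink p-value | LocusLink q-value | GEO K | GEO ke | GEO ka | GEO R | GEO p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GO:0000902 | cellular morphogenesis | 273 | 14.49 | 28 | 1.93 | 0.000562 | 0.003904 | 54 | 2.80 | 4 | 1.43 | 0.306139 | 0.052185 | 1 | 0.08 | 0 | 0 | 0.076548 |
| GO:0001525 | angiogenesis | 62 | 3.29 | 13 | 3.95 | 0.000017 | 0.000300 | 22 | 1.14 | 5 | 4.39 | 0.004605 | 0.011135 | 0 | 0.00 | 0 | n/d | n/d |
| GO:0006629 | lipid metabolism | 474 | 25.16 | 41 | 1.63 | 0.001148 | 0.007174 | 167 | 8.65 | 20 | 2.31 | 0.000371 | 0.179255 | 53 | 4.06 | 4 | 0.99 | 0.586884 |
| GO:0006644 | phospholipid metabolism | 67 | 3.56 | 9 | 2.53 | 0.008448 | 0.030991 | 21 | 1.09 | 2 | 1.84 | 0.297433 | 0.052181 | 1 | 0.08 | 0 | 0 | 0.076548 |
| GO:0006869 | lipid transport | 35 | 1.86 | 6 | 3.23 | 0.009412 | 0.032708 | 25 | 1.30 | 4 | 3.09 | 0.037854 | 0.022690 | 4 | 0.31 | 0 | 0 | 0.272873 |
| GO:0006954 | inflammatory response | 242 | 12.85 | 28 | 2.18 | 0.000073 | 0.000751 | 44 | 2.28 | 11 | 4.82 | 0.000010 | 0.000113 | 11 | 0.84 | 3 | 3.56 | 0.046235 |
| GO:0006956 | complement activation | 27 | 1.43 | 6 | 4.19 | 0.002455 | 0.013747 | 16 | 0.83 | 2 | 2.41 | 0.199822 | 0.041312 | 3 | 0.23 | 2 | 8.71 | 0.016642 |
| GO:0006979 | response to oxidative stress | 127 | 6.74 | 19 | 2.82 | 0.000035 | 0.000518 | 28 | 1.45 | 6 | 4.14 | 0.002648 | 0.007960 | 2 | 0.15 | 0 | 0 | 0.147251 |
| GO:0007049 | cell cycle | 293 | 15.55 | 20 | 1.29 | 0.146936 | 0.100907 | 148 | 7.67 | 17 | 2.22 | 0.001604 | 0.005403 | 15 | 1.15 | 3 | 2.61 | 0.101791 |
| GO:0007243 | protein kinase cascade | 314 | 16.67 | 30 | 1.80 | 0.001164 | 0.007174 | 38 | 1.97 | 3 | 1.52 | 0.314579 | 0.052695 | 10 | 0.77 | 0 | 0 | 0.549406 |
| GO:0008202 | steroid metabolism | 140 | 7.43 | 9 | 1.21 | 0.325537 | 0.143923 | 46 | 2.38 | 7 | 2.94 | 0.008818 | 0.018505 | 12 | 0.92 | 1 | 1.09 | 0.615898 |
| GO:0008217 | regulation of blood pressure | 105 | 5.57 | 12 | 2.15 | 0.009378 | 0.032708 | 14 | 0.73 | 5 | 6.89 | 0.000493 | 0.002195 | 1 | 0.08 | 0 | 0 | 0.076548 |
| GO:0008284 | + regulation of cell proliferation | 256 | 13.59 | 27 | 1.99 | 0.000455 | 0.003416 | 17 | 0.88 | 2 | 2.27 | 0.219177 | 0.043750 | 2 | 0.15 | 0 | 0 | 0.147251 |
| GO:0008285 | - regulation of cell proliferation | 195 | 10.35 | 17 | 1.64 | 0.029518 | 0.062302 | 11 | 0.57 | 4 | 7.02 | 0.001750 | 0.005561 | 1 | 0.08 | 0 | 0 | 0.076548 |
| GO:0016477 | cell migration | 703 | 37.32 | 33 | 0.88 | 0.802135 | 0.251709 | 39 | 2.02 | 7 | 3.46 | 0.003456 | 0.009505 | 0 | 0.00 | 0 | n/d | n/d |
| GO:0042981 | regulation of apoptosis | 360 | 19.11 | 47 | 2.46 | 0.000000 | 0.000000 | 62 | 3.21 | 7 | 2.18 | 0.040165 | 0.022690 | 4 | 0.31 | 0 | 0 | 0.272873 |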


Analysis of each GO BP term reveals that the BKL yields the greatest number of proteins (ka) for further analysis. Numbers range from 6 - 47 for the BKL, 2 - 20 for LocusLink, and 2 - 4 for GEO. A comparison of the K values (the number of proteins with curation to a given GO BP term) in the three databases shows that the BKL has a range of K of 27 - 703 vs. LocusLink (11 ≤ K ≤ 167) and GEO (0 ≤ K ≤ 53). These data provide specific examples of the data in Figure 1 and further illustrate the greater breadth of curation in the BKL, which allows for a more complete analysis and interpretation of the microarray results.


Ortholog analysis

The rat genes that were upregulated in the current DNA microarray study were used to identify human and mouse orthologs with disease relevance. Orthologs were identified in the BKL by inputting rat gene identifiers into the BKL Retriever search engine, which "translates" the gene set into a list of orthologous proteins in the species of interest, according to BKL curation. The 367 rat genes of the target data set were "translated" by the BKL Retriever to yield new target data sets with n values of 329 for human orthologous genes and 340 for mouse orthologous genes. The reference data sets represent the entire human and mouse protein sets present in the BKL and have N values of 18,482 and 19,692 respectively. As shown in Table 3, 55 human orthologs and 27 mouse orthologs have literature-based curation associated with Cardiovascular Diseases. In each case, the target data set is significantly enriched (p < 0.01) in annotation to the MeSH term, Cardiovascular Diseases as compared to human and mouse proteomes as a whole. Thus, ortholog analysis of disease associations adds another layer of utility to the BKL as a tool for microarray data analysis.

Table 3: Analysis of orthologs of upregulated gene set in human and mouse. The target gene sets consist of human and mouse orthologs of the rat genes in our original target dataset of upregulated genes from our microarray experiment.
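| Species | N | n | K | ke | ka | R |
|---|---|---|---|---|---|---|
| Human | 18,482 | 329 | 564 | 10.0 | 55 | 5.48* |
| Mouse | 19,692 | 340 | 268 | 4.63 | 27 | 5.83* |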
* indicates p < 0.01.



Discussion

High-throughput experimental analyses, including DNA microarrays, have become an important part of systems biology research, the goal of which is the understanding of how all the elements in a system are related. However, such analyses generate large amounts of data that must then be stored, analyzed, and finally, translated into new biological knowledge. Much effort has been put into devising methods for statistical analyses and clustering of DNA microarray data, with the purpose of identifying specific groups or clusters of genes with expression levels that are significantly altered by the experimental conditions [Quackenbush, 2002; Slonim, 2002]. The result of such analyses, however, is a long list of gene identifiers, of which biological sense must be made. To actually gain significant insights into biological processes, the new data must be integrated with prior knowledge of the system.

The BKL comprises a collection of six protein databases, or volumes, which organize publicly available sequence information with functional annotation derived from comprehensive, manual curation of the scientific literature [Costanzo et al., 2001; Hodges et al., 2002]. As of July 2004, the BKL contained more than 85,000 Protein Reports describing the proteomes of human, mouse, rat, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, and 18 major fungal pathogens of humans, including Candida albicans. Each BKL Protein Report includes a descriptive title line that summarizes key characteristics of the protein, as well as functional properties extracted from over 190,000 published research articles. Each piece of data is linked to the original publication that reported the information. The volumes of the BKL are interlinked to allow cross-species analysis such as that illustrated in Table 3.

While the major part of BKL protein annotation consists of the thousands of GO biological process (BP), molecular function (MF) and cellular component (CC) terms, the BKL also contains curated annotation on properties including expression (for human, mouse, rat and worm), disease involvement (human), mutant phenotype (mouse, worm, fungi), protein domains, and genetic location, which are all searchable, alone or in combination with GO annotation. In the BKL, 93% of all the proteins in the rat proteome have GO BP annotation. More importantly, for 40% of rat proteins, the annotation is based on direct experimental evidence.

This study has taken advantage of the integration of GO curation with information on disease association and cross-species accessibility in the BKL to quickly and comprehensively analyze a cluster of genes upregulated in an in vitro model designed to test vascular smooth muscle cell responses in cardiovascular disease.

The utility of the BKL and the BKL Retriever search tool is illustrated at several levels. In addition to analyzing data generated by the microarray experiment as the target data set, user-defined target data sets can be generated using the BKL Retriever. This is illustrated by the cardiovascular disease-related and the human and mouse ortholog target data sets that are discussed here. Once a target data set has been uploaded into the BKL Retriever, all curated data for that specific data set (GO properties, expression data, disease association, mutant phenotype, protein domain structure, genetic location, and calculated properties such as molecular weight and pI) can be shown in a hierarchical display. The set-specific data can then be browsed to look for salient features, as is shown here for GO BP and disease data. Complex query capability allows users to search on each property alone or in combination. In addition, the recently introduced Bioknowledge Workspace, which is accessible from the BKL Retriever and from the Protein Reports from all the BKL volumes, provides a graphic interface for protein interaction information found in the BKL. All this contrasts with the laborious, time-intensive manual sorting and extraction of data required with the chip annotation.

It should be noted that at present, the BKL does not provide a tool for exhaustive analysis of the significantly enriched GO features in a target data set, such as is the case with dedicated GO mining tools. However, as noted by Masseroli and co-workers [Masseroli et al., 2004], the accuracy of the results from these tools is fully dependent on the accuracy of the annotations in the databases analyzed. The BKL is a unique resource in that it combines a large, integrated database of expertly curated experimental data on 24 species, with a comprehensive search tool that allows querying and browsing of curated features. The BKL is a widely applicable systems biology tool that allows researchers to maximally leverage their own experimental data, find relationships among co-regulated genes, uncover pathways in which proteins participate, and make novel connections between existing and new biological information.



Acknowledgements

This work was supported by the American Heart Association grant #0455846T (B.M.S.). We would like to thank Mike Tillberg and the curators who maintain the BKL.




References