In Silico Biology 5, 0002 (2004); ©2004, Bioinformation Systems e.V.
Ontology Workshop Göttingen 2004
EMBL Outstation - European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
* Corresponding author
Email: goa@ebi.ac.uk
Phone: +44-1223 494465; Fax: +44-1223 494468
Edited by E. Wingender; received September 15, 2004; revised and accepted November 26, 2004; published December 07, 2004
The number of large-scale experimental datasets generated by high-throughput technologies has grown rapidly. Biological knowledge resources such as the Gene Ontology Annotation (GOA) database, which provides high-quality functional annotation of proteins within the UniProt Knowledgebase, can play an important role in the analysis of such data. The integration of GOA with analytical tools has been shown to aid the clustering, annotation and biological interpretation of large expression datasets. GOA is also useful in the development and validation of automated annotation tools, in particular text-mining systems. The increasing interest in GOA highlights the great potential of this freely available resource to assist both the biological research and bioinformatics communities.
Keywords: Gene Ontology, annotation, data analysis
High-throughput technologies that generate massive amounts of functional genomics data have revolutionised biological research. For instance, microarray technology and high-throughput screening (HTS) proteomics, which provide large expression datasets across time and multiple experimental conditions, have advanced research in the life science, biotechnology and healthcare sectors. A major challenge lies in the rational handling and interpretation of these rapidly accumulating data. Integrating biological knowledge resources, such as Gene Ontology (GO) annotation, and molecular information resources (e.g. the DNA and protein databases) with analytical tools (using computational, bioinformatic and mathematical methods) can facilitate the clustering and functional annotation of the expression data, and thereby translate these data into a form that represents the underlying biology more clearly (Figure 1).
Figure 1: Generation, analysis and utility of large-scale experimental data in biological research. Datasets generated from high-throughput technologies, systems biology approaches or cytogenetic studies are stored and analysed using platforms that integrate molecular information (e.g. the DNA and protein databases) and biological knowledge (e.g. GO annotations) with bioinformatic, computational and mathematical methods. This strategy has been used to gain a better biological understanding and interpretation of large experimental datasets in biological studies [13, 14, 15, 16], pathological research [6, 7, 17, 18] and pharmaceutical screening [3, 4, 5].
The Gene Ontology Annotation (GOA) project (http://www.ebi.ac.uk/GOA) at the European Bioinformatics Institute (EBI) uses the dynamic controlled vocabulary of GO, which describes the biological process, molecular function and cellular component of a generic cell [1], to characterise the gene products within the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR) [2]. Currently (as of the GOA UniProt 20.0 release), the GOA database provides over 4.8 million GO annotations across 1 million protein entries (approximately 75% coverage of most well-known proteins within UniProt) for >76,000 species, making it the largest contributor to the GO Consortium's public repository of annotations. The biological knowledge in the GOA data is very useful for biomedical research and pharmaceutical discovery when combined with results from '-omics' experiments and systems biology approaches. In cancer research, integrating GO annotations with microarray data is one of the key strategies to select candidate biomarkers, identify novel therapeutics and evaluate effects of drug treatments [3, 4]. In addition, positional (cytogenetic) annotation of the genes that reside in the regions of tumour-related chromosomal abnormalities using the GOA data has identified tumour suppressor candidate sets and has characterised the molecular basis of tumorigenesis [5]. Apart from cancers, the GOA knowledge also allows a systematic investigation and functional classification of the etiologies of multi-factorial diseases, such as schizophrenia and obesity [6, 7], which will significantly improve the treatment planning of these diseases. These examples highlight the great potential of the GOA database in knowledge discovery and data mining of large-scale datasets derived from high-throughput experiments and disease-oriented cytogenetic studies (Figure 1). 
A number of analytical tools, such as MAPPFinder [8], GoMiner [9], FatiGO [10], Onto-Express [11] and EASE [12], provide visualisation, hierarchical clustering or functional annotation of genome-scale datasets based on the GO annotation. A list of GO-related tools can be found at the GO Consortium website (http://www.geneontology.org/GO.tools.html). Use of these tools in the analysis of microarray and proteomics data has provided new insights into the complicated mechanisms and regulatory networks in developmental biology [13], immunology [14, 15], evolutionary biology [16] and disease pathogenesis [17, 18].
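As an illustration of the kind of calculation such tools perform, the following minimal sketch tests whether a GO term is over-represented in a gene cluster using a one-sided hypergeometric test. It is not the implementation of any tool named above: the function names, the gene identifiers and the annotation table are hypothetical, and real tools differ in their choice of statistical test, in multiple-testing correction and in how they use the GO hierarchy.

```python
from math import comb  # requires Python 3.8+


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated with the GO term of interest."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def term_enrichment(cluster, annotations):
    """Return {GO id: p-value} for every term seen in the cluster.

    annotations maps each background gene to its set of GO ids
    (a toy stand-in for a GO association file)."""
    background = set(annotations)
    cluster = set(cluster) & background  # ignore genes without annotation
    N, n = len(background), len(cluster)
    terms = set()
    for gene in cluster:
        terms |= annotations[gene]
    results = {}
    for term in terms:
        K = sum(1 for g in background if term in annotations[g])
        k = sum(1 for g in cluster if term in annotations[g])
        results[term] = hypergeom_upper_tail(k, n, K, N)
    return results


# Hypothetical toy data: ten genes, all annotated with GO:0008152,
# three of them additionally annotated with GO:0006915.
TOY_ANNOTATIONS = {f"g{i}": {"GO:0008152"} for i in range(1, 11)}
for gene in ("g1", "g2", "g3"):
    TOY_ANNOTATIONS[gene].add("GO:0006915")

if __name__ == "__main__":
    for term, p in sorted(term_enrichment(["g1", "g2", "g3"], TOY_ANNOTATIONS).items()):
        print(term, p)
```

In this toy case the term carried by every background gene cannot be enriched, whereas the term carried only by the three cluster genes receives a small p-value. With many terms tested at once, a correction for multiple testing would be needed before drawing conclusions.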
The GO annotation data also serve as a gold-standard resource for the development and assessment of automated annotation tools, which are based on sequence homology with GO-annotated proteins, protein domain analysis or text mining of the literature [19, 20, 21, 22]. Given that these automated tools have produced an enormous number of predictions of gene function [20, 23, 24], phenotypes [25], protein-protein interactions [26] and protein subcellular locations [27], validating these tools would accelerate the generation of high-quality biological annotations. Some comparative analyses, in terms of agreement and coverage of the resulting annotations, have been performed between these automated annotation approaches and the GO association file releases to evaluate the tools [19, 20, 22, 23]. The evaluation results vary because of the different technical approaches used. In general, the automatically generated annotations tend to map to higher-level nodes (parents in the same lineage) or, in some cases, to close but different lineages in comparison with the manual GOA data (non-electronic GO annotations [2]). The GOA group at EBI has contributed to text-mining annotation activities through its participation in the BioCreative challenge (an assessment of automatic text mining and information extraction techniques in biology), in which GOA provided the training and test sets as well as a manual evaluation of the test results [28]. It is believed that these efforts will improve the performance and accuracy of text-mining systems in future BioCreative challenges.
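The lineage comparison described above can be sketched as follows. This is only an assumed, simplified scheme and not the procedure used in the cited evaluations: the term names, the toy parent table and the category labels are hypothetical, and the real GO graph would be loaded from the ontology files.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term in a DAG given as {child: [parents]}."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare_annotation(predicted, manual, parents):
    """Classify a predicted term relative to the manually curated term."""
    if predicted == manual:
        return "exact"
    if predicted in ancestors(manual, parents):
        return "more general (same lineage)"
    if manual in ancestors(predicted, parents):
        return "more specific (same lineage)"
    return "different lineage"


# Hypothetical toy fragment of a term hierarchy (child -> parents).
TOY_PARENTS = {
    "binding": ["molecular_function"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "protein binding": ["binding"],
}

if __name__ == "__main__":
    for predicted in ("DNA binding", "binding", "protein binding"):
        print(predicted, "->", compare_annotation(predicted, "DNA binding", TOY_PARENTS))
```

Counting how many predictions fall into each category, over all proteins shared by the automated and manual sets, gives one simple measure of agreement.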
It is hoped that the comprehensive and high quality annotations, either provided by the GOA database [29] or validated by comparative analyses with the GOA releases, will assist the biological research community in the biological interpretation of large-scale data and in turn will promote the maximum utilisation of these data.
We thank Dr. Midori Harris for her helpful discussions and comments on the manuscript. The GOA project is supported by grants QRLT-2001-00015 and QLRI-2000-00981 of the European Commission and a supplementary grant, HG-O2273, from the National Institutes of Health (NIH).