In Silico Biology 5, 0007 (2004); ©2005, Bioinformation Systems e.V.
Ontology Workshop Göttingen 2004
1 Department of Bioinformatics, Faculty of Medicine, Georg August University Göttingen, Goldschmidtstr. 1, D-37077 Göttingen, Germany
2 The National Laboratory of Protein Engineering and Plant Genetic Engineering, College of Life Sciences, Peking University, Beijing 100871, PR China
3 BIOBASE GmbH, Halchtersche Str. 33, D-38304 Wolfenbüttel, Germany
* Corresponding author
Phone: +49-551-39 14918; Fax: +49-551-39 14914; Email: hom@bioinf.med.uni-goettingen.de
Edited by T. Takai-Igarashi; received January 28, 2005; revised February 06, 2005; accepted February 09, 2005; published February 13, 2005
CYTOMER®,a is a relational database of organs/tissues, cell types, physiological systems and developmental stages that currently focuses on the human system. From this database, we have derived an ontology for anatomical and morphological structures for the human organism which includes all embryonic stages and the cell types constituting these structures. The ontology has been transferred to the OWL format and is freely available for download at http://cytomer.bioinf.med.uni-goettingen.de.
Keywords: ontologies, human developmental stages, gene expression sources, relational database system, OWL, Internet resource
An important part of a gene's function is its expression pattern. Until now, most tools developed for genome annotation have emphasized identification of open reading frames, deduction of their potential products and elucidation of their biochemical function by sophisticated sequence or structure homology searches. However, an accurate assessment of the biological and physiological function of a gene (e.g. one encoding a certain type of enzyme) is impossible without knowing when, where and under what conditions that gene will be expressed. To overcome this lack of knowledge, advanced technology such as microarrays is increasingly being applied to collect mass data on gene expression. However, efficient use of these data has been hampered by the lack of standards for their representation, which makes it difficult to compare the data generated at one site, and impossible to comprehensively evaluate data generated by different laboratories.
These problems have been tackled by international consortia and efforts such as those coordinated by the European Bioinformatics Institute (EBI) in Hinxton, UK [Brazma et al., 2003], and the MGED society [Stoeckert et al., 2002]. While these initiatives have been successful in setting up standards for the expression patterns themselves, they have also revealed the urgent need for ontologies, including expression ontologies, in addition to several other requirements [Brazma et al., 2000]. A number of international working groups have been established, one of them focusing on the development of ontologies for sample description, including information about tissues and cell lines. The Jackson Laboratory, in cooperation with Edinburgh University, has already done pioneering work in this area for the mouse system [Bard, 2003; Davidson et al., 2001].
Previously, we presented an ontology for human anatomical and morphological structures, including the cell types constituting these structures [Chen et al., 1999; Wingender, 2003]. We generated underlying trees for all human embryonic stages, as well as for the adult organism. Here we describe the principles of organizing these data in Protégé and OWL.
CYTOMER has been constructed and is presently maintained as a relational database system aimed at providing a comprehensive overview of all gene expression sources, focusing thus far on human entities [Chen et al., 1999]. In addition, CYTOMER currently includes all developmental stages of Caenorhabditis elegans. The gene expression sources included are organs, tissues and cell types at the different developmental stages of an organism. CYTOMER is thus a database of physiological systems (table system), developmental stages (tables stage and period), anatomical structures and substructures (table organ) and the constituting cell types (table cell) in different organisms or species (table species). The entities from the organ table are linked to specific stages by the organtree table, which comprises the columns organ_parent_no., organ_no. and stage_no. The organtree table itself is connected to the HUB table. HUB is the central table of CYTOMER; it combines entries in the organtree table with specific entries from the system, cell, and other tables. The HUB table thus incorporates anatomical/histological knowledge about which cells occur, with what kind of function, in which organs, at which stages and in which species (Fig. 1).
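The link between the organ and organtree tables can be illustrated with a minimal sketch using Python's built-in sqlite3 module. Only the table names and the three organtree columns are taken from the description above; the column types, the key constraints, the `name` column, the sample rows and the stage identifier 99 are illustrative assumptions, not the actual CYTOMER schema.

```python
import sqlite3

# Hypothetical, simplified schema: only the names organ, organtree,
# organ_parent_no, organ_no and stage_no come from the text; everything
# else (types, keys, sample rows) is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organ (
        organ_no INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    CREATE TABLE organtree (
        organ_parent_no INTEGER REFERENCES organ(organ_no),
        organ_no        INTEGER NOT NULL REFERENCES organ(organ_no),
        stage_no        INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO organ VALUES (?, ?)", [
    (1, "human body"), (2, "respiratory tract"), (3, "lung"), (4, "trachea"),
])
# stage_no 99 is a made-up identifier standing in for a single stage.
conn.executemany("INSERT INTO organtree VALUES (?, ?, ?)", [
    (None, 1, 99), (1, 2, 99), (2, 3, 99), (2, 4, 99),
])


def children(conn, parent_no, stage_no):
    """Return the names of the direct substructures of an organ at one stage."""
    rows = conn.execute(
        """SELECT o.name
             FROM organtree t
             JOIN organ o ON o.organ_no = t.organ_no
            WHERE t.organ_parent_no = ? AND t.stage_no = ?
            ORDER BY o.name""",
        (parent_no, stage_no),
    )
    return [name for (name,) in rows]


print(children(conn, 2, 99))  # prints ['lung', 'trachea']
```

Storing the stage in the linking table rather than in the organ table is what allows the same organ entry to appear in the hierarchies of several developmental stages.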
The most extensive tables of CYTOMER are organ and cell. The organ table represents an ontology of anatomical structures and morphological substructures. It is hierarchically organized as a directed acyclic graph (DAG), starting with the entry "human body of developmental stage n" as the root concept (level 0). The adult organism tree, which is the most complex, proceeds through 80 nodes (or concepts) of level 1 (the "primary organs") and 2091 inner nodes, ending up in 6281 end nodes ("leaves"). The hierarchy depth varies greatly between the branches, ranging from 1 to 11 levels underneath the root of the adult organism (Tab. 1).
Table 1: Numbers of entries in the CYTOMER ontology for the different Carnegie stages of the embryo and for the adult human.
| Stage name | Entries | Inner nodes | Leaves | Depth | Links to cells |
| Carnegie Stage 1 | 6 | 2 | 4 | 1 | 0 |
| Carnegie Stage 2 | 6 | 2 | 4 | 1 | 0 |
| Carnegie Stage 3 | 13 | 5 | 8 | 3 | 5 |
| Carnegie Stage 4 | 11 | 6 | 5 | 4 | 4 |
| Carnegie Stage 5a | 15 | 7 | 8 | 4 | 5 |
| Carnegie Stage 5b | 21 | 9 | 12 | 5 | 11 |
| Carnegie Stage 5c | 19 | 9 | 10 | 5 | 8 |
| Carnegie Stage 6a | 31 | 12 | 19 | 4 | 16 |
| Carnegie Stage 6b | 40 | 14 | 26 | 4 | 19 |
| Carnegie Stage 7 | 46 | 16 | 30 | 4 | 14 |
| Carnegie Stage 8 | 53 | 18 | 35 | 6 | 11 |
| Carnegie Stage 9 | 175 | 72 | 103 | 8 | 24 |
| Carnegie Stage 10 | 293 | 107 | 186 | 7 | 90 |
| Carnegie Stage 11 | 357 | 127 | 230 | 7 | 114 |
| Carnegie Stage 12 | 404 | 143 | 261 | 7 | 133 |
| Carnegie Stage 13 | 527 | 179 | 347 | 9 | 168 |
| Carnegie Stage 14 | 635 | 218 | 415 | 9 | 197 |
| Carnegie Stage 15 | 800 | 279 | 521 | 9 | 245 |
| Carnegie Stage 16 | 835 | 306 | 529 | 9 | 219 |
| Carnegie Stage 17 | 965 | 348 | 617 | 9 | 287 |
| Carnegie Stage 18 | 1066 | 372 | 694 | 9 | 292 |
| Carnegie Stage 19 | 1133 | 392 | 741 | 9 | 299 |
| Carnegie Stage 20 | 1174 | 394 | 780 | 9 | 299 |
| adult | 8372 | 2091 | 6281 | 11 | 4961 |
Inner nodes represent those concepts which have other concepts as children, whereas leaves are the end-nodes of the hierarchy. Depth denotes the maximal hierarchical depth.
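The quantities reported in Table 1 can be derived from parent/child links in a few lines. The following is a minimal sketch under stated assumptions: the function name `tree_stats`, the convention that the root counts as an inner node, and the sample edges are all hypothetical, so the sketch is not guaranteed to reproduce the counting conventions behind the published figures.

```python
from collections import defaultdict


def tree_stats(edges, root):
    """Summarize a hierarchy given as (parent, child) pairs.

    Returns the number of reachable entries, inner nodes and leaves,
    and the maximal depth below the root (the root is level 0).
    A node with several parents is counted only once.
    """
    kids = defaultdict(set)
    for parent, child in edges:
        kids[parent].add(child)

    height = {}  # memo: longest downward path starting at each node

    def longest(node):
        if node not in height:
            children = kids.get(node, ())
            if children:
                height[node] = 1 + max(longest(c) for c in children)
            else:
                height[node] = 0
        return height[node]

    depth = longest(root)
    inner = sum(1 for node in height if kids.get(node))
    return {
        "entries": len(height),
        "inner": inner,
        "leaves": len(height) - inner,
        "depth": depth,
    }


# Made-up example hierarchy, not CYTOMER data.
edges = [
    ("body", "thorax"), ("body", "head"),
    ("thorax", "lung"), ("thorax", "heart"),
    ("lung", "alveolus"),
]
stats = tree_stats(edges, "body")
print(stats)  # {'entries': 6, 'inner': 3, 'leaves': 3, 'depth': 3}
```

Because the depth is computed as a memoized longest path, the same function also works when a concept has more than one parent, as long as the graph contains no cycles.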
CYTOMER has been compiled from a number of standard text books as well as from several Internet-based sources. For the organs and anatomical and morphological structures, we used the Terminologia Anatomica, systematically comparing it with the nomenclature used by the Edinburgh "Atlas & Database of Human Developmental Anatomy" for embryonic Carnegie Stages 1-20 [Hunter et al., 2003].
For all entities, slots are provided for English and German names, synonyms, and the relevant medical terminology. Definitions of anatomical structures are given in German and English. The cell table includes an international, an English and a German cell name as well as synonyms in both languages and the cell parents. Short descriptions of location and cell function are also included. Structures are assigned to physiological systems so that, for instance, the lung together with the nose, larynx, trachea and bronchial tree is represented as part of the respiratory system. The respiratory system itself belongs to the physiological system table.
Thus far, CYTOMER has mainly been used to annotate expression patterns of transcription factors within the TRANSFAC database [Matys et al., 2003]. However, the database structure can be used to connect any kind of expression pattern with CYTOMER. Therefore, CYTOMER will be an essential part of our Expression Analysis Pipeline (XAP) system that is now under construction and aims at establishing an integrated workflow system that can provide a complete analysis platform for gene expression data [Haubrock, pers. commun.].
As a first attempt to create an ontology for human organs and tissues, we chose a tree-like structure as explained above [see also Chen et al., 1999]. However, as is the case with nearly any attempt to systematize complex biological knowledge, this approach has its limits, since living systems arising from a long evolutionary history have not developed in accordance with our intellectual constructs. Thus, even when strictly applying the criterion of localized structures to the items in our organs list, there may be objects that cannot clearly and unambiguously be assigned to a given superstructure. As a result, the curators of the CYTOMER database have inevitably had to make arbitrary assignment decisions so as to keep the tree-like structure consistent.
In order to go beyond a simple tree-like structure and to allow for multiple parent assignments, we decided to base further development of CYTOMER upon a standard ontology development system. A number of such systems are available, each with its own strengths and weaknesses [Lambrix et al., 2003]. After evaluating a number of different clients we chose Protégé [Noy et al., 2003] for two reasons:
The second point proved especially valuable in migrating the data content from the relational system to a format readable and editable by Protégé.
Therefore, a tool based on the Jena framework (http://jena.sourceforge.net) has been implemented in Java. This tool maps the CYTOMER data content to the OWL format (Fig. 2) by building a specific organtree and creating a Jena Ontology Model (OntModel). This OntModel includes a so-called Individual for each organ; each Individual contains all relevant descriptions, including the organ_no., as DatatypeProperties. The organ_no. associated with each Individual can serve as a target for external linking. Finally, the OntModel is serialized and written to an OWL file.
Figure 2: CYTOMER loaded as an OWL file in the Protégé (Beta release - version 3.0) ontology editor.
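The mapping of organ entries to OWL individuals can be sketched as follows. This is not the Jena-based Java tool described above: it swaps in Python's standard xml.etree.ElementTree module to write RDF/XML directly, and the namespace `http://example.org/cytomer#`, the class name `Organ`, the use of `rdfs:label` for the organ name and the sample rows are hypothetical choices made for illustration. Only the idea of one individual per organ carrying its organ_no. as a datatype property is taken from the text.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/cytomer#"  # hypothetical namespace

for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("owl", OWL), ("cyt", BASE)):
    ET.register_namespace(prefix, uri)


def organs_to_owl(organs):
    """Serialize (organ_no, name) pairs as OWL individuals in RDF/XML."""
    root = ET.Element(f"{{{RDF}}}RDF")
    ET.SubElement(root, f"{{{OWL}}}Ontology", {f"{{{RDF}}}about": ""})
    # Declare the (assumed) class and the datatype property holding organ_no.
    ET.SubElement(root, f"{{{OWL}}}Class", {f"{{{RDF}}}about": BASE + "Organ"})
    ET.SubElement(root, f"{{{OWL}}}DatatypeProperty",
                  {f"{{{RDF}}}about": BASE + "organ_no"})
    for organ_no, name in organs:
        # One typed individual per organ, addressable by its organ_no.
        ind = ET.SubElement(root, f"{{{BASE}}}Organ",
                            {f"{{{RDF}}}about": f"{BASE}organ_{organ_no}"})
        number = ET.SubElement(ind, f"{{{BASE}}}organ_no",
                               {f"{{{RDF}}}datatype": XSD + "int"})
        number.text = str(organ_no)
        ET.SubElement(ind, f"{{{RDFS}}}label").text = name
    return ET.tostring(root, encoding="unicode")


owl_xml = organs_to_owl([(3, "lung"), (4, "trachea")])
print(owl_xml)
```

Deriving each individual's URI from its organ_no. gives external resources a stable target to link to, which is the role the text assigns to that property.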
OWL, the Web Ontology Language, is an emerging ontology language standard that has been optimized for data exchange and knowledge sharing (http://www.w3.org/2004/OWL/). In addition, OWL provides formal semantics and has built-in reasoning support, e. g. for the Racer classifier [Haarslev and Moeller, 2000].
Thus, CYTOMER in OWL will provide us with a flexible format to tackle two main tasks:
We are grateful to the German Federal Ministry of Education and Research (BMBF) for financial support of the initial stages of CYTOMER development (CHN-305-97). We would also like to express our thanks to T. Groß for his help in developing an efficient input client for database curation, to D. Karas, S. Land and S. Rotert for expert annotation work, and to Amy E. Hodge and Alison Gagnon for critically reading the manuscript.
Footnote:
a CYTOMER is a registered trademark of BIOBASE GmbH, Wolfenbüttel, Germany.