Deriving an ontology for human gene expression sources from the CYTOMER® database on human organs and cell types
In Silico Biology 5, 0007 (2004); ©2005, Bioinformation Systems e.V.  
Ontology Workshop Göttingen 2004



Holger Michael1,*, Xin Chen2, Ellen Fricke3, Martin Haubrock1, Remko Ricanek1 and Edgar Wingender1,3




1 Department of Bioinformatics, Faculty of Medicine, Georg August University Göttingen, Goldschmidtstr. 1, D-37077 Göttingen, Germany
2 The National Laboratory of Protein Engineering and Plant Genetic Engineering, College of Life Sciences, Peking University, Beijing 100871, PR China
3 BIOBASE GmbH, Halchtersche Str. 33, D-38304 Wolfenbüttel, Germany



* Corresponding author
   Phone: +49-551-39 14918; Fax: +49-551-39 14914; Email: hom@bioinf.med.uni-goettingen.de



Edited by T. Takai-Igarashi; received January 28, 2005; revised February 06, 2005; accepted February 09, 2005; published February 13, 2005



Abstract

CYTOMER® (see footnote a) is a relational database of organs/tissues, cell types, physiological systems and developmental stages that currently focuses on the human system. From this database, we have derived an ontology of anatomical and morphological structures of the human organism that includes all embryonic stages and the cell types constituting these structures. The ontology has been converted to the OWL format and is freely available for download at http://cytomer.bioinf.med.uni-goettingen.de.

Keywords: ontologies, human developmental stages, gene expression sources, relational database system, OWL, Internet resource



Introduction

An important part of a gene's function is its expression pattern. Until now, most tools developed for genome annotation have emphasized the identification of open reading frames, the deduction of their potential products and the elucidation of their biochemical function by sophisticated sequence or structure homology searches. However, an accurate assessment of the biological and physiological function of a gene (e.g. one encoding a certain type of enzyme) is impossible without knowing when, where and under what conditions that gene is expressed. To close this gap, advanced technologies such as microarrays are increasingly being applied to collect mass data on gene expression. However, efficient use of these data has been hampered by the lack of standards for their representation, which makes it difficult to compare data generated at a single site and impossible to comprehensively evaluate data generated by different laboratories.

These problems have been tackled by international consortia and efforts such as those coordinated by the European Bioinformatics Institute (EBI) in Hinxton, UK [Brazma et al., 2003], and the MGED society [Stoeckert et al., 2002]. While these initiatives have been successful in setting up standards for the expression patterns themselves, they have also revealed the urgent need for ontologies, including expression ontologies, in addition to several other requirements [Brazma et al., 2000]. A number of international working groups have been established, one of them focusing on the development of ontologies for sample description, including information about tissues and cell lines. The Jackson Laboratory, in cooperation with Edinburgh University, has already done pioneering work in this area for the mouse system [Bard, 2003; Davidson et al., 2001].

Previously, we presented an ontology for human anatomical and morphological structures, including the cell types constituting these structures [Chen et al., 1999; Wingender, 2003]. We generated underlying trees for all human embryonic stages, as well as for the adult organism. Here we describe the principles of organizing these data in Protégé and OWL.



CYTOMER structure

CYTOMER has been constructed and is presently maintained as a relational database system aimed at providing a comprehensive overview of all gene expression sources, focusing thus far on human entities [Chen et al., 1999]. In addition, CYTOMER currently includes all developmental stages of Caenorhabditis elegans. The gene expression sources included are organs, tissues and cell types at the different developmental stages of an organism. CYTOMER is thus a database of physiological systems (table system), developmental stages (tables stage and period), anatomical structures and substructures (table organ) and their constituent cell types (table cell) in different organisms or species (table species). The entities from the organ table are linked to specific stages by the organtree table, which comprises the columns organ_parent_no., organ_no. and stage_no. The organtree table itself is connected to the HUB table. HUB is the central table of CYTOMER; it combines entries in the organtree table with specific entries from the system, cell and other tables. The HUB table thus captures the anatomical/histological knowledge of which cells occur in which organs, with what kind of function, at what stages and in which species (Fig. 1).
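The table layout described above can be illustrated with a heavily reduced sketch. It is not the actual CYTOMER schema: only the three organtree columns are named in the text (written here without the trailing period), while all other column names, the reduced hub table and the toy rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical, heavily reduced schema. Only organ_parent_no, organ_no and
# stage_no come from the text; every other name and all rows are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stage (stage_no INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE organ (organ_no INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cell  (cell_no  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE organtree (
    organ_parent_no INTEGER REFERENCES organ(organ_no),
    organ_no        INTEGER REFERENCES organ(organ_no),
    stage_no        INTEGER REFERENCES stage(stage_no)
);
-- reduced stand-in for HUB: ties an organ at a stage to a cell type
CREATE TABLE hub (
    organ_no INTEGER REFERENCES organ(organ_no),
    stage_no INTEGER REFERENCES stage(stage_no),
    cell_no  INTEGER REFERENCES cell(cell_no)
);
""")
conn.executemany("INSERT INTO stage VALUES (?, ?)", [(1, "adult")])
conn.executemany("INSERT INTO organ VALUES (?, ?)",
                 [(1, "human body"), (2, "lung"), (3, "pulmonary alveolus")])
conn.executemany("INSERT INTO cell VALUES (?, ?)", [(1, "type I pneumocyte")])
conn.executemany("INSERT INTO organtree VALUES (?, ?, ?)",
                 [(None, 1, 1), (1, 2, 1), (2, 3, 1)])
conn.executemany("INSERT INTO hub VALUES (?, ?, ?)", [(3, 1, 1)])


def children_of(organ_name, stage_name):
    """Return the direct substructures of an organ at a given stage."""
    rows = conn.execute("""
        SELECT child.name
        FROM organtree t
        JOIN organ parent ON parent.organ_no = t.organ_parent_no
        JOIN organ child  ON child.organ_no  = t.organ_no
        JOIN stage s      ON s.stage_no      = t.stage_no
        WHERE parent.name = ? AND s.name = ?
        ORDER BY child.name""", (organ_name, stage_name)).fetchall()
    return [name for (name,) in rows]


def cells_in(organ_name, stage_name):
    """Return the cell types linked to an organ at a given stage via hub."""
    rows = conn.execute("""
        SELECT c.name
        FROM hub h
        JOIN organ o ON o.organ_no = h.organ_no
        JOIN stage s ON s.stage_no = h.stage_no
        JOIN cell  c ON c.cell_no  = h.cell_no
        WHERE o.name = ? AND s.name = ?
        ORDER BY c.name""", (organ_name, stage_name)).fetchall()
    return [name for (name,) in rows]
```

In this sketch the stage-specific hierarchy lives entirely in organtree, so the same organ entry can be placed under different parents at different stages.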



Figure 1: Simplified schema of the CYTOMER relational database structure.


The most extensive tables of CYTOMER are organ and cell. The organ table represents an ontology of anatomical structures and morphological substructures. It is hierarchically organized as a directed acyclic graph (DAG), starting with the entry "human body of developmental stage n" as the root concept (level 0). The adult organism tree, which is the most complex, proceeds through 80 nodes (or concepts) of level 1 (the "primary organs") and 2091 inner nodes, ending in 6281 end nodes ("leaves"). The hierarchy depth varies greatly between the branches, ranging from 1 to 11 levels beneath the root of the adult organism (Tab. 1).

Table 1: Numbers of entries in the CYTOMER ontology for the different Carnegie stages of the embryo and for the adult human.
Stage               Entries  Inner nodes  Leaves  Depth  Links to cells
Carnegie Stage 1          6            2       4      1               0
Carnegie Stage 2          6            2       4      1               0
Carnegie Stage 3         13            5       8      3               5
Carnegie Stage 4         11            6       5      4               4
Carnegie Stage 5a        15            7       8      4               5
Carnegie Stage 5b        21            9      12      5              11
Carnegie Stage 5c        19            9      10      5               8
Carnegie Stage 6a        31           12      19      4              16
Carnegie Stage 6b        40           14      26      4              19
Carnegie Stage 7         46           16      30      4              14
Carnegie Stage 8         53           18      35      6              11
Carnegie Stage 9        175           72     103      8              24
Carnegie Stage 10       293          107     186      7              90
Carnegie Stage 11       357          127     230      7             114
Carnegie Stage 12       404          143     261      7             133
Carnegie Stage 13       527          179     347      9             168
Carnegie Stage 14       635          218     415      9             197
Carnegie Stage 15       800          279     521      9             245
Carnegie Stage 16       835          306     529      9             219
Carnegie Stage 17       965          348     617      9             287
Carnegie Stage 18      1066          372     694      9             292
Carnegie Stage 19      1133          392     741      9             299
Carnegie Stage 20      1174          394     780      9             299
adult                  8372         2091    6281     11            4961
Inner nodes represent those concepts which have other concepts as children, whereas leaves are the end-nodes of the hierarchy. Depth denotes the maximal hierarchical depth.
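As a rough illustration of how such figures can be obtained, the following sketch counts entries, inner nodes and leaves and determines the maximal depth from a list of parent-child pairs. The function name and the toy hierarchy are hypothetical, and the sketch assumes that the root is counted as an entry and that depth is measured in levels beneath the root; the exact counting convention behind Table 1 is not stated in the text.

```python
from collections import defaultdict


def tree_stats(edges, root):
    """Summarize a hierarchy given as (parent, child) pairs for one stage.

    Assumptions: the root counts as an entry, depth is the number of
    levels beneath the root, and the graph is acyclic.
    """
    children = defaultdict(list)
    nodes = {root}
    for parent, child in edges:
        children[parent].append(child)
        nodes.update((parent, child))

    # Inner nodes have at least one child; all remaining nodes are leaves.
    inner = {node for node in nodes if children.get(node)}
    leaves = nodes - inner

    # Iterative depth-first walk that tracks the deepest level reached.
    depth = 0
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        depth = max(depth, level)
        stack.extend((child, level + 1) for child in children.get(node, ()))

    return {
        "entries": len(nodes),
        "inner nodes": len(inner),
        "leaves": len(leaves),
        "depth": depth,
    }


# Toy hierarchy, not taken from CYTOMER.
toy_edges = [
    ("human body", "thorax"),
    ("thorax", "lung"),
    ("thorax", "heart"),
    ("human body", "head"),
]
toy_stats = tree_stats(toy_edges, "human body")
```

Because the walk follows every parent-child link, a node with several parents in a DAG is visited once per path, which does not affect the counts or the maximal depth.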



CYTOMER contents

CYTOMER has been compiled from a number of standard textbooks as well as from several Internet-based sources. For the organs and the anatomical and morphological structures, we used the Terminologia Anatomica, systematically comparing it with the nomenclature used by the Edinburgh "Atlas & Database of Human Developmental Anatomy" for embryonic Carnegie Stages 1-20 [Hunter et al., 2003].

For all entities, slots are provided for English and German names, synonyms, and the relevant medical terminology. Definitions of anatomical structures are given in German and English. The cell table includes an international, an English and a German cell name, synonyms in both languages, and the cell parents. Short descriptions of location and cell function are also included. Organs are furthermore assigned to physiological systems so that, for instance, the lung, together with the nose, larynx, trachea and bronchial tree, is represented as part of the respiratory system. The respiratory system itself is an entry in the physiological system table.

Thus far, CYTOMER has mainly been used to annotate expression patterns of transcription factors within the TRANSFAC database [Matys et al., 2003]. However, the database structure can be used to connect any kind of expression pattern with CYTOMER. Therefore, CYTOMER will be an essential part of our Expression Analysis Pipeline (XAP) system that is now under construction and aims at establishing an integrated workflow system that can provide a complete analysis platform for gene expression data [Haubrock, pers. commun.].



CYTOMER in Protégé and OWL

As a first attempt to create an ontology for human organs and tissues, we chose a tree-like structure as explained above [see also Chen et al., 1999]. However, as is the case with nearly any attempt to systematize complex biological knowledge, this approach has its limits, since living systems arising from a long evolutionary history have not developed in accordance with our intellectual constructs. Thus, even when strictly applying the criterion of localized structures to the items in our organs list, there may be objects that cannot clearly and unambiguously be assigned to a given superstructure. As a result, the curators of the CYTOMER database have inevitably had to make arbitrary assignment decisions so as to keep the tree-like structure consistent.

In order to go beyond a simple tree-like structure and to allow for multiple parent assignments, we decided to base further development of CYTOMER upon a standard ontology development system. A number of such systems are available, each with its own strengths and weaknesses [Lambrix et al., 2003]. After evaluating a number of different clients we chose Protégé [Noy et al., 2003] for two reasons:

  1. The user interface provides an environment that can easily be learned and remembered by the domain experts, who are medical scientists employed as annotators.
  2. The ever-growing number of plug-ins provides Protégé with considerable versatility.

The second point proved especially valuable in migrating the data content from the relational system to a format readable and editable by Protégé.

For this migration, a tool based on the Jena framework (http://jena.sourceforge.net) has been implemented in Java. This tool maps the CYTOMER data content to the OWL format (Fig. 2) by building a specific organtree and creating a Jena Ontology Model (OntModel). This OntModel includes a so-called Individual for each organ; each Individual carries all relevant descriptions, including the organ_no., as DatatypeProperties. The organ_no. associated with each Individual can serve as a target for external linking. Finally, the OntModel is serialized and written to an OWL file.
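The shape of such an export can be sketched as follows. This is not the Jena-based Java tool described above; it is a minimal stand-in that uses only the Python standard library to write one individual per organ, with its descriptions as datatype properties, in RDF/XML. The namespace http://example.org/cytomer#, the class name Organ, the property names and the record layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD = "http://www.w3.org/2001/XMLSchema#"
CYT = "http://example.org/cytomer#"  # hypothetical namespace, not the real one

for prefix, uri in (("rdf", RDF), ("owl", OWL), ("cyt", CYT)):
    ET.register_namespace(prefix, uri)


def organs_to_owl(organs):
    """Serialize organ records as OWL individuals in RDF/XML.

    organs: list of dicts with the keys 'organ_no' and 'name'
    (an assumed record layout, not the CYTOMER table definition).
    """
    root = ET.Element(f"{{{RDF}}}RDF")

    # Declare one class and the datatype properties used below.
    ET.SubElement(root, f"{{{OWL}}}Class", {f"{{{RDF}}}about": CYT + "Organ"})
    for prop in ("organ_no", "name"):
        ET.SubElement(root, f"{{{OWL}}}DatatypeProperty",
                      {f"{{{RDF}}}about": CYT + prop})

    # One individual per organ; its URI is derived from the organ number
    # so that external resources can link to it.
    for organ in organs:
        individual = ET.SubElement(
            root, f"{{{CYT}}}Organ",
            {f"{{{RDF}}}about": f"{CYT}organ_{organ['organ_no']}"})
        number = ET.SubElement(individual, f"{{{CYT}}}organ_no",
                               {f"{{{RDF}}}datatype": XSD + "integer"})
        number.text = str(organ["organ_no"])
        ET.SubElement(individual, f"{{{CYT}}}name").text = organ["name"]

    return ET.tostring(root, encoding="unicode")


# Toy record with an invented organ number.
owl_text = organs_to_owl([{"organ_no": 42, "name": "lung"}])
```

Deriving each individual's identifier from the organ number mirrors the idea that this number can serve as a stable target for external links.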



Figure 2: CYTOMER loaded as an OWL file in the Protégé (Beta release - version 3.0) ontology editor.


OWL, the Web Ontology Language, is an emerging ontology language standard that has been optimized for data exchange and knowledge sharing (http://www.w3.org/2004/OWL/). In addition, OWL provides formal semantics and is supported by reasoners such as the Racer classifier [Haarslev and Moeller, 2000].

Thus, CYTOMER in OWL will provide us with a flexible format to tackle two main tasks:

  1. As outlined above, the current structure of CYTOMER requires differentiation of its present tree-like structure to allow for multiple parent assignments.
  2. Another challenge is the modeling of developmental processes. Time lines are currently represented by the different developmental stages of the whole organism. However, developmental and differentiation processes also take place within a defined stage, e.g. within the adult organism. Within the relational model, the precursor-descendant relation is modeled by a second parent-child attribute that is not yet systematically assigned across the whole table. Moreover, if corresponding items in the distinct structural DAGs of the individual developmental stages are properly linked to each other, a set of trees can be constructed that is orthogonal to the time snapshots, yielding differentiation trees for individual structures along the developmental axis.
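The idea of differentiation trees that run orthogonally to the stage-wise hierarchies can be sketched as follows. The function name, the representation of a node as a (stage, structure) pair and the toy links are assumptions for illustration; the stage labels A, B and C are placeholders rather than actual Carnegie stages.

```python
from collections import defaultdict


def differentiation_tree(precursor_links, start):
    """Build a nested dict of descendants along the developmental axis.

    precursor_links: iterable of (precursor, descendant) pairs, where each
    node is a (stage, structure) tuple. The links are assumed to be acyclic.
    """
    successors = defaultdict(list)
    for precursor, descendant in precursor_links:
        successors[precursor].append(descendant)

    def build(node):
        # Recurse into every structure that develops from this node.
        return {child: build(child) for child in successors.get(node, ())}

    return {start: build(start)}


# Toy precursor-descendant links between structures of successive stages.
toy_links = [
    (("A", "neural tube"), ("B", "brain")),
    (("A", "neural tube"), ("B", "spinal cord")),
    (("B", "brain"), ("C", "forebrain")),
]
toy_tree = differentiation_tree(toy_links, ("A", "neural tube"))
```

Each stage keeps its own structural hierarchy; the links used here only connect corresponding structures across stages, which is what makes the resulting tree orthogonal to the per-stage snapshots.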



Acknowledgements

We are grateful to the German Federal Ministry of Education and Research (BMBF) for financial support of the initial stages of CYTOMER development (CHN-305-97). We would also like to thank T. Groß for his help in developing an efficient input client for database curation, D. Karas, S. Land and S. Rotert for expert annotation work, and Amy E. Hodge and Alison Gagnon for critically reading the manuscript.




References




Footnote:

a CYTOMER is a registered trademark of BIOBASE GmbH, Wolfenbüttel, Germany.