In Silico Biology 4, 0006 (2003); ©2003, Bioinformation Systems e.V.  
Ontology Workshop Tokyo 2003



TRANSFAC®, TRANSPATH® and CYTOMER® as starting points for an ontology of regulatory networks

Edgar Wingender




Dept. of Bioinformatics, UKG, University of Göttingen, Goldschmidtstr. 1, D-37077 Göttingen, Germany
BIOBASE GmbH, Halchtersche Str. 33, D-38304 Wolfenbüttel, Germany
  Email: e.wingender@med.uni-goettingen.de





Edited by T. Takai-Igarashi; received March 02, 2004; revised and accepted March 16; published March 16, 2004



Abstract

Building an ontology of a defined knowledge domain can help to model an appropriate database structure for the relevant contents. Conversely, a comprehensive overview of the knowledge of a certain domain, as it may be provided by corresponding databases, facilitates building an appropriate ontology, or ontologies with different granularities, which may then provide many additional benefits in handling the stored data and in retrieving additional information from heterogeneous sources. This communication reports the first steps towards deriving an ontology for the domain of "molecular regulation" from our databases TRANSFAC® (transcriptional regulation), TRANSPATH® (signal transduction) and CYTOMER® (cellular locations).

Key words: regulatory networks, signal transduction pathways, transcriptional regulation, databases, ontological modeling



Introduction

The "TRANSFAC System" is a collection of databases which deal with information about gene expression. TRANSFAC and TRANSCompel are databases about transcriptional regulation: the former provides information about transcription factors, their DNA-binding sites and DNA-binding properties, the latter about the combinatorics of transcriptional control through composite elements in promoters or enhancers [Kel-Margoulis et al., 2002a; Matys et al., 2003]. TRANSPATH is a database on signal transduction pathways, providing information about the components of these pathways (molecules and complexes) and the reactions between them [Krull et al., 2003]. It primarily aims at those pathways that regulate transcriptional events. To model cell-specific regulatory processes and pathways, a database (CYTOMER) has been developed to model "expression sources", i. e. cells, tissues, organs at different developmental stages, with the focus on human sources [Chen et al., 1999; Wingender et al., 2001]. At different stages in the development of these databases, the need for controlled vocabularies became evident. Moreover, the huge body of data made it more and more obvious that the time was ripe to organize (some of) the contents of these databases in a more systematic way. At the end of these efforts, we hope to achieve a comprehensive systematics of regulatory processes and networks within and, later on, between the cells of a complex organism such as the human body. A formal representation of these systematics could be an ontological model of the involved concepts and their relations. In this contribution, we discuss the first steps towards such a model for some kinds of objects that play key roles in gene expression processes.



Methods

The work described refers to TRANSFAC version 7.4, TRANSPATH version 3.4, and CYTOMER version 1.0. The ontology derived from the CYTOMER database is freely available at http://www.biobase.de:8080/index.html. All these databases have been constructed as relational models and have been made available as flat file systems.



Results and discussion

Top-level ontology

Ontologies of regulatory biological networks can be useful in many respects, for instance in establishing controlled vocabularies extended by properly modeled relations between the terms, in developing natural language processing tools for automatic retrieval of relevant data from original research papers, or in designing the principal structure of a database storing corresponding data. In addition, they may also help us to gain new insights. Building an ontology requires a thorough analysis of the respective knowledge domain, since we first have to understand its intrinsic structure, i. e. the nature of and the relations between the objects of interest. During the process of building the ontology, however, weaknesses of the object definitions and fuzziness in our understanding of the relations become obvious and can trigger more thorough considerations or even active research work on them. Moreover, new relations might be detected or, at least, hypothesized during this process as well. Furthermore, an ontology may help to cope with the problem of incomplete knowledge in a domain by facilitating reusability and transparency of data between subdomains. A conceivable example of reusing quantitative data for simulating signaling processes will be discussed below (see section "Signal transduction").

The strategy of our work has been to start with systematically collecting the relevant data, to identify the underlying concepts, to try to find the intrinsic relations among them and to use them for building appropriate classifications. The next step is to transform these classifications into ontologies by consistently curating the relations between the concepts covered and, wherever appropriate or required, by adding definitions to them.

The domain we are primarily interested in is "Regulation of molecular processes". For this domain, we made an attempt to develop a top-level ontology with regulatory components and regulatory processes as main entities (Fig. 1). Regulatory components comprise concepts of biological objects that can act as regulating as well as regulated entities. Figure 1 also indicates which parts of this knowledge domain are covered by which of our databases. The databases mentioned in the following sections comprise data about instances, but also definitions of concepts.


Figure 1: Top-level ontology of "molecular regulation" in biological systems. The different entities and their "is-a" relations are shown along with those databases that provide detailed information about the corresponding objects (instances). Thus, the TRANSFAC database provides information not only about (regulated) genes and the (regulating) promoters, enhancers etc., but also about molecular components involved in transcriptional control. As such, they appear both as "molecular structural components" (e. g., polypeptides, protein complexes, etc.) and as "molecular functional components" (transcription factors). For other types of molecules, "molecular components" is the main domain of the TRANSPATH database. Further databases mentioned are: TRANSCompel (transcriptional regulatory composite elements), S/MARt DB (genomic scaffold/matrix attached regions), ReAlSplice (regulated alternative splice events), and EndoNet (endocrine networks). The latter is presently under construction, whereas the others are accessible as "Services" at http://www.bioinf.med.uni-goettingen.de/.
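The relation between "is-a" links and database coverage described for Fig. 1 can be illustrated with a small sketch. It is only an assumption-laden toy: the node layout below uses terms named in the text and caption but does not claim to reproduce the actual figure, and the coverage table is a rough reading of the caption, not an authoritative mapping.

```python
# Hypothetical fragment of an "is-a" hierarchy; the real layout of Fig. 1
# may differ. Each term maps to its single parent (None marks a top entity).
IS_A = {
    "regulatory component": None,
    "regulatory process": None,
    "regulatory genomic component": "regulatory component",
    "molecular component": "regulatory component",
    "molecular structural component": "molecular component",
    "molecular functional component": "molecular component",
    "transcription factor": "molecular functional component",
    "promoter": "regulatory genomic component",
    "enhancer": "regulatory genomic component",
}

# Assumed coverage, loosely following the figure caption.
COVERED_BY = {
    "regulatory genomic component": {"TRANSFAC"},
    "transcription factor": {"TRANSFAC"},
    "molecular component": {"TRANSPATH"},
}


def lineage(term):
    """Return the term followed by all of its "is-a" ancestors."""
    chain = []
    while term is not None:
        chain.append(term)
        term = IS_A.get(term)
    return chain


def databases_for(term):
    """Collect databases attached to the term or to any of its ancestors."""
    found = set()
    for concept in lineage(term):
        found |= COVERED_BY.get(concept, set())
    return found


print(lineage("transcription factor"))
print(sorted(databases_for("transcription factor")))
```

In this sketch a term inherits the coverage of its more general concepts, which is one possible way to read "which parts of the domain are covered by which database".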


Transcription regulation

One of the most fundamental regulatory mechanisms in living cells is transcriptional regulation, since transcription is the step which activates the information that is statically encoded in the genomic DNA. This is presently subject to large-scale analyses by transcriptomics approaches. The basic components, promoters and their constituting regulatory elements on one side (to be put under "regulatory genomic component" of the top-level ontology of Fig. 1), and the binding transcription factors as subnodes of "molecular functional components" on the other, would therefore be first candidates for ontological modeling.

A comprehensive and exhaustive classification of regulatory genomic components (mainly promoters and enhancers) and their constituents would allow a systematic description of "regulons" in eukaryotic genomes, and assigning new members to them would amount to predicting the regulation of the respective gene. These efforts are at a very early stage; some first attempts, along with the principal approach, have been published earlier [Kel-Margoulis et al., 2002b].

A complement to such a promoter classification would be a classification of the individual binding sites within these regions and of the transcription factors interacting with them. Transcription factors (TFs) are modularly composed, one particularly important region being the DNA-binding domain (DBD). A classification of DBDs could be correlated with a corresponding classification of the cognate sites to deduce correlations between the features of both of them and to derive rules for DNA recognition by the individual DBD classes. Such rules could then be used to predict the DNA-binding specificities of newly disclosed and classified transcription factors or, conversely, to predict the nature of a factor that may interact with a certain DNA element that is assumed to play a regulatory role, e. g. because of the results of phylogenetic footprinting studies.

Altogether, 1381 factors out of a total of 5401 TF entries in TRANSFAC (26%) could be classified thus far. Based on this training set, a library of Hidden Markov Models has been constructed and is used by an automatic tool for classifying further factors, including those that have not yet been experimentally characterized as transcription factors and therefore have not yet been taken up in TRANSFAC (Stegmeier et al., in preparation). Though the percentage of "classifiable" TFs in TRANSFAC will increase in this way, it will never comprise all factors because (1) there may be many TFs which do not fit into any generalized picture, and (2) many TRANSFAC FACTOR entries represent complexes rather than individual polypeptides. The present status of the transcription factor classification is available at http://www.gene-regulation.com/pub/databases/transfac/cl.html.

A TF classification based solely on sequence (or structure) similarities of the DNA-binding domains cannot predict function. For this, we have to involve additional expert knowledge about structure-function relationships. This is what we are aiming at by formalizing the knowledge about the functions of TFs, TF families and their genomic targets.



Signal transduction

Upstream of transcription factors in the regulatory path, more or less complex signal transduction cascades control gene expression processes by regulating the activity of TFs. They mediate triggers set by extracellular messenger molecules such as hormones or growth factors, but also the responses to pathogenic agents and other biological, chemical or physical stimuli.

To investigate the general structure of signaling pathways, it is important to characterize the function of the individual components, e. g. as ligands, receptors, adaptors, protein kinases, transcription factors, etc., as they are described in the TRANSPATH database. In TRANSPATH, there are two classes of entries, Molecules and Reactions; both are organized in a hierarchical way. The existing classification of the signaling molecules in TRANSPATH contains the levels "basic molecules" (i. e., individual molecules to which a sequence or chemical formula and, thus, a molecular weight can be assigned), "orthologous groups" (summarizing the properties of orthologous polypeptides and proteins from different species), and "families". "Basic molecules" should be considered as instances of "orthologous groups", e. g., taking "human c-Jun" as an instance of "c-Jun", which is considered as the lowest level of a molecular and functional concept. "Orthologous groups", on the other hand, are the end-nodes of a more comprehensive functional classification which proceeds through a hierarchical tree of "families" to the root of "signaling molecule" (http://www.gene-regulation.com/cgi-bin/pub/databases/transpath/search.cgi). Presently, this classification is a tree-structured classification of pure "is-a" relationships, although it has already become clear that this structure has to be extended to allow multiple inheritance for molecules with hybrid functions. It differs from Gene Ontology (GO) in that it specifically aims at systematically assigning functional roles to signal transduction components along the signaling pathways.
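The difference between the present tree and the envisaged extension to multiple inheritance can be sketched as follows. This is a minimal illustration, not the TRANSPATH schema: the family names are taken from the functional roles listed above, while the EGFR entries and their assignment to two families are an assumption added only to show a molecule with a hybrid function.

```python
# Each node lists its parents. A basic molecule points to its orthologous
# group (an instance relation), groups and families point upward via "is-a".
PARENTS = {
    "human c-Jun": ["c-Jun"],
    "c-Jun": ["transcription factor"],
    "human EGFR": ["EGFR"],                  # hypothetical example entry
    "EGFR": ["receptor", "protein kinase"],  # hybrid function: two parents
    "transcription factor": ["signaling molecule"],
    "receptor": ["signaling molecule"],
    "protein kinase": ["signaling molecule"],
}
ROOT = "signaling molecule"


def paths_to_root(node, parents=PARENTS, root=ROOT):
    """Enumerate every upward path from a node to the root."""
    if node == root:
        return [[root]]
    paths = []
    for parent in parents.get(node, []):
        for tail in paths_to_root(parent, parents, root):
            paths.append([node] + tail)
    return paths


def is_tree(parents=PARENTS):
    """True only if no node has more than one parent."""
    return all(len(p) <= 1 for p in parents.values())


for path in paths_to_root("human EGFR"):
    print(" -> ".join(path))
print("pure tree:", is_tree())
```

With a single parent per node every entry has exactly one path to the root; as soon as one group is assigned to two families, the structure becomes a directed acyclic graph and queries have to follow all paths.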

Reactions are presently subclassified according to the way they are represented (mechanistically or semantically) [Krull et al., 2003]. A classification of the reactions according to their nature is under way. The major aim of classifying both the molecular components of signaling pathways and their reactions is to systematize the function of distinct steps of a signaling cascade, e. g. forwarding a signal to another location, multiplying it in a catalytic step, or making it more specific by combining it at a certain check point with that of another pathway. As a first step towards this goal, a systematic abstraction has recently been introduced into the database. It summarizes the numerous individual signaling reactions that are evidenced by a vast amount of published experimental results, that are represented as mechanistic reactions in the database on the "evidence level", and that are the basis for an extended data quality assessment system [Choi et al., 2004]. This abstraction allows non-redundant and consistent reconstruction of signaling pathways and is therefore called the "pathway level". In contrast to these mechanistic reactions, which are designed to give the details of the chemical reactions involved in a signaling pathway, the semantic "projection" reduces the pathway to the key molecules along which the signal is passed.
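One way to picture the three levels is the following sketch. It rests on several assumptions that are not taken from the database: the record layout, the reference labels, and the rule that evidence-level reactions are merged when their participants belong to the same orthologous groups are all invented for illustration.

```python
# Evidence level: one record per reported mechanistic reaction.
# Field names and "ref-N" labels are placeholders, not real entries.
EVIDENCE = [
    {"enzyme": "MEK1(h)", "substrate": "ERK2(h)", "cofactors": ["ATP"], "ref": "ref-1"},
    {"enzyme": "MEK1(m)", "substrate": "ERK2(m)", "cofactors": ["ATP"], "ref": "ref-2"},
    {"enzyme": "ERK2(h)", "substrate": "Elk-1(h)", "cofactors": ["ATP"], "ref": "ref-3"},
]

# Species-specific molecule -> orthologous group (assumed grouping key).
ORTHO_GROUP = {
    "MEK1(h)": "MEK1", "MEK1(m)": "MEK1",
    "ERK2(h)": "ERK2", "ERK2(m)": "ERK2",
    "Elk-1(h)": "Elk-1",
}


def pathway_level(evidence):
    """Merge evidence-level reactions into non-redundant pathway steps."""
    steps = {}
    for record in evidence:
        key = (ORTHO_GROUP[record["enzyme"]], ORTHO_GROUP[record["substrate"]])
        steps.setdefault(key, []).append(record["ref"])
    return steps


def semantic_projection(steps):
    """Keep only the key molecules: who passes the signal on to whom."""
    return list(steps)


steps = pathway_level(EVIDENCE)
for upstream, downstream in semantic_projection(steps):
    print(upstream, "->", downstream, "supported by", steps[(upstream, downstream)])
```

The cofactors disappear in the projection, and the supporting references stay attached to the merged step, so that a quality assessment could still count the evidence behind it.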

One of the perspectives of the TRANSPATH database is to provide the architecture of the signaling network of "a cell" for simulating the dynamics of this network. In addition to the network structure, information about (semi-)quantitative reaction parameters would be required as well. This kind of data, however, is largely missing for eukaryotic signal transduction reactions. At this point, an ontology of the relevant reaction types may help by pinpointing related reactions for which the required data are known and which could reasonably be used as substitutes, or by assigning coarse time courses to the distinct types of reactions. In this way, approximate predictions can be made about the behavior of newly unravelled components and the subnet they are involved in.
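The idea of borrowing parameters from related reactions can be sketched as a search up a reaction-type hierarchy. Everything specific here is assumed for illustration: the type names, the hierarchy, the rate values (which are made up and unitless), and the choice of the median as the substitute value.

```python
from statistics import median

# Hypothetical reaction-type hierarchy: type -> parent type.
TYPE_PARENT = {
    "serine phosphorylation": "phosphorylation",
    "tyrosine phosphorylation": "phosphorylation",
    "phosphorylation": "covalent modification",
    "ubiquitination": "covalent modification",
    "covalent modification": None,
}

# Made-up rate constants for the types where "measurements" exist.
KNOWN_RATES = {
    "tyrosine phosphorylation": [0.5, 1.5],
    "ubiquitination": [0.05],
}


def type_chain(rtype):
    """Return the reaction type followed by its increasingly general parents."""
    chain = []
    while rtype is not None:
        chain.append(rtype)
        rtype = TYPE_PARENT.get(rtype)
    return chain


def substitute_rate(rtype):
    """Find the most specific concept covering rtype that has known rates.

    Returns (median rate, concept used, generalization steps) or None.
    """
    for steps, concept in enumerate(type_chain(rtype)):
        values = [
            value
            for known_type, rates in KNOWN_RATES.items()
            if concept in type_chain(known_type)
            for value in rates
        ]
        if values:
            return median(values), concept, steps
    return None


print(substitute_rate("serine phosphorylation"))
```

The number of generalization steps gives a crude indication of how far the substitute is removed from the reaction of interest and could be used to weight or flag the borrowed value.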



Cellular locations

When using the integrated contents of TRANSFAC and TRANSPATH for modeling and predicting regulatory networks, a lot of (presumably) "false positive" predictions will be obtained. One way to avoid this is to include filters for the physiological environment, i. e. cells / tissues / organs, where a certain signal may become effective. To properly assign the location of molecules and reactions, an ontology of organs, morphological structures, tissues and cell types is required. This has been initiated using the CYTOMER database [Chen et al., 1999; Wingender et al., 2001].
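Such a location filter could look like the sketch below. The molecule names, cell types and predictions are placeholders, and the rule used (keep a predicted interaction only if both partners share at least one expression source) is one simple assumption among several possible filter criteria.

```python
# Placeholder expression sources per molecule (not database content).
EXPRESSED_IN = {
    "factor A": {"hepatocyte", "T lymphocyte"},
    "factor B": {"hepatocyte"},
    "factor C": {"neuron"},
}

# Placeholder predictions, e.g. from combining pathway and binding-site data.
PREDICTED = [("factor A", "factor B"), ("factor A", "factor C")]


def filter_by_location(predicted, expressed_in):
    """Keep predictions whose partners co-occur in at least one cell type."""
    kept = []
    for source, target in predicted:
        shared = expressed_in.get(source, set()) & expressed_in.get(target, set())
        if shared:
            kept.append((source, target, sorted(shared)))
    return kept


for source, target, where in filter_by_location(PREDICTED, EXPRESSED_IN):
    print(source, "->", target, "possible in", where)
```

With a "part-of" ontology of organs and tissues, the set intersection could be relaxed so that a cell type also matches the structures that contain it.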

CYTOMER was originally built as a relational database system comprising the tables Organ, Cell, System, and Stage. The Organ table provides morphological structures in a hierarchy of the finest possible granularity. Its terminology largely follows the Terminologia Anatomica as a standard vocabulary in anatomy [FCAT Staff, 1998]. The relations between the entries of these four principal tables were modeled using a "Hub" table which lists all Organ (sub)structures, the Cell types found in them (a cell type may occur in other tissues as well and, in turn, a structure may comprise more than one cell type), the physiological functional System these combinations may contribute to, and the developmental Stage at which this happens. Thus, this list represents all structural entities of the human organism and their cellular composition that exert a certain function at a given time period. From the hierarchically structured Organ table, we extracted our first Ontology, which is a pure tree-like "part-of" ontology (http://www.biobase.de:8080/servlet/de.biobase.cytomer.web.OrganBrowser?species_no=86).
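The table layout described above can be mimicked in a few lines with an in-memory SQLite database. This is a sketch under stated assumptions: only the table names Organ, Cell, System, Stage and Hub come from the text, whereas all column names and the example rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE organ  (id INTEGER PRIMARY KEY, name TEXT,
                     part_of INTEGER REFERENCES organ(id));
CREATE TABLE cell   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE system (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stage  (id INTEGER PRIMARY KEY, name TEXT);
-- The hub row links one structure, one cell type, one system and one stage.
CREATE TABLE hub    (organ_id INTEGER, cell_id INTEGER,
                     system_id INTEGER, stage_id INTEGER);
INSERT INTO organ  VALUES (1, 'liver', NULL), (2, 'liver lobule', 1);
INSERT INTO cell   VALUES (1, 'hepatocyte'), (2, 'Kupffer cell');
INSERT INTO system VALUES (1, 'digestive system'), (2, 'immune system');
INSERT INTO stage  VALUES (1, 'adult');
INSERT INTO hub    VALUES (2, 1, 1, 1), (2, 2, 2, 1);
""")


def part_of_path(organ_name):
    """Follow the part_of column from a structure up to the top of the tree."""
    path = []
    row = con.execute(
        "SELECT name, part_of FROM organ WHERE name = ?", (organ_name,)
    ).fetchone()
    while row is not None:
        path.append(row[0])
        row = con.execute(
            "SELECT name, part_of FROM organ WHERE id = ?", (row[1],)
        ).fetchone()
    return path


def cells_in(organ_name):
    """List (cell type, system, stage) combinations recorded for a structure."""
    return con.execute(
        """SELECT cell.name, system.name, stage.name
           FROM hub
           JOIN organ  ON organ.id  = hub.organ_id
           JOIN cell   ON cell.id   = hub.cell_id
           JOIN system ON system.id = hub.system_id
           JOIN stage  ON stage.id  = hub.stage_id
           WHERE organ.name = ?
           ORDER BY cell.name""",
        (organ_name,),
    ).fetchall()


print(part_of_path("liver lobule"))
print(cells_in("liver lobule"))
```

The self-referencing `part_of` column alone already yields the tree-like "part-of" ontology, while the hub rows carry the cell, system and stage context.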

More recently, a similar approach to building an anatomy ontology has been reported, the Foundational Model of Anatomy (FMA) [Rosse and Mejino, 2003]. FMA focuses more on the structure of the adult human organism, whereas CYTOMER also comprises embryonal stages. These stages, in turn, are the focus of the Human Developmental Anatomy ontology [Hunter et al., 2003]; compared with the latter, CYTOMER may aim at a more stringent separation between functional systems, organs and localizable morphological structures with a finer granularity. Very much oriented along physiological systems is the tissue database (TissueDB; Tokyo University; http://tissuedb.ontology.ims.u-tokyo.ac.jp/tissuedb/index.html), which primarily categorizes tissues in an "is-a" relation tree.

Closer inspection of the vocabulary of the Terminologia Anatomica revealed that it comprises a mixture of purely localizable structural items and more functionally defined, system-related entities that should be better separated in our structure. We therefore started to rework our Organ Ontology by separating strictly localizable anatomical Structures from Organs, where we put all terms that would conventionally be called "organs" and their substructures, and by keeping the separate tables for Systems and Cell types. In many cases, localizable structures and organs overlap, but for organs that are barely localizable (e. g., skin or blood), this separation helped a lot. The new, ontologically re-modeled CYTOMER will be made available in the near future.



Concluding remarks

Altogether, the described classifications may be a useful step towards building a more comprehensive ontology of the knowledge domain "Regulation of molecular processes". The next steps will include, besides continuously revising the existing classifications,



Acknowledgment

Parts of this work have been supported by grants of the German Ministry of Education and Research (BMBF), namely a BioChance project to BIOBASE (no. 0312432), and as part of the Intergenomics Bioinformatics Competence Center (to both BIOBASE and the University of Göttingen; grant no. 031U210B). The author is particularly indebted to Dr. Takai-Igarashi for many helpful suggestions during numerous discussions. TRANSFAC, TRANSPATH and CYTOMER are registered trademarks of BIOBASE GmbH, Wolfenbüttel, Germany.



References