In Silico Biology 7, 0055 (2007); ©2007, Bioinformation Systems e.V.  


Cell System Ontology: Representation for modeling, visualizing, and simulating biological pathways


Euna Jeong#, Masao Nagasaki#*, Ayumu Saito and Satoru Miyano




Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo 108-8639, Japan

# Both authors contributed equally to this research

* Corresponding author

   Email: masao@ims.u-tokyo.ac.jp





Edited by E. Wingender; received July 04, 2007; revised September 05, 2007; accepted October 07, 2007; published October 28, 2007



Abstract

With the rapidly accumulating knowledge of biological entities and networks, there is a growing need for a general framework to understand this information at a system level. In order to understand life as a system, a formal description of system dynamics with semantic validation will be necessary. Within the context of biological pathways, several formats have been proposed, e. g., SBML, CellML, and BioPAX. Unfortunately, these formats lack formal definitions of each term or fail to capture the dynamic behavior of the system. Thus, we have developed a new system dynamics centered ontology called Cell System Ontology (CSO). As an exchange format, the ontology is implemented in the Web Ontology Language (OWL), which enables semantic validation and automatic reasoning to check the consistency of biological pathway models. The features of CSO are as follows: (1) manipulation of different levels of granularity and abstraction of pathways, e. g., metabolic pathways, regulatory pathways, signal transduction pathways, and cell-cell interactions; (2) capture of both quantitative and qualitative aspects of biological function by using hybrid functional Petri net with extension (HFPNe); and (3) encoding of biological pathway data related to visualization and simulation, as well as modeling. The new ontology also predefines a mature core vocabulary, which will be necessary for creating models with system dynamics. In addition, each of the core terms has at least one standard icon for easy modeling and accelerating the exchangeability among applications. In order to demonstrate the potential of CSO-based pathway modeling, visualization, and simulation, we present an HFPNe model for the ASEL and ASER regulatory networks in Caenorhabditis elegans.

Keywords: cell system ontology, CSO, biological pathway, data exchange format, HFPNe, dynamic simulation, visualization



Introduction

In the current post-genomic era, the interactions among biological entities and networks are being uncovered by molecular biologists at an accelerating pace. The huge accumulation of data is very heterogeneous, containing information on genomes, mRNA, protein structures, cells, protein-protein interaction, and metabolic pathways. Understanding individual biological entities and networks is insufficient to answer the question of how a cell works, i. e., how biological processes at multiple scales give rise to cell- and organ-level behavior. There is a growing need to develop environments that enable us to describe complex and dynamic biological pathways at a system level.

Our aim is to establish a general framework for understanding the behavior of cell systems in an integrated way. In attempting to deal with this problem, we have developed a comprehensive representation for modeling, visualizing, and simulating biological pathways. Biological pathways are an integration of diverse information, including biological entities, networks, and other information such as literature citations and experimental data. There are several types of biological pathways revealing different features, including metabolic pathways, gene regulatory pathways, signal transduction, and cell-cell interactions. Metabolic pathways are a series of chemical reactions among enzymes, substrates, and products in a cell. The regulation of gene expression by transcriptional factors is represented as a gene regulatory network. Signal transduction pathways are signal cascades resulting from the conversion of one (initial) stimulus into another. The living cells of an organism communicate with one another via cell surface protein-mediated cell-cell interactions. It is essential that a modeling language represents the various types of biological pathways within a unified framework.

Based on analyzing the underlying biological processes, biomedical researchers need to simulate and predict biological functions in order to discover drug targets and diagnostics. The simulation of a model will be useful in analyzing the cellular behavior that arises from the interactions among huge numbers of molecules. The need for modeling the quantitative interactions of molecules has been recognized as an important aspect of modeling biological pathways. Quantitative models can be used to numerically compute the biological functions and behavior of networks. Although several modeling languages have attempted to encode quantitative information, visualizing pathways is not generally considered a part of the modeling language; it is mainly handled by applications. For example, there is an approach to standardize graphical notations for biological pathways [Kohn et al., 2006]. In order to facilitate the analysis and exploration of complex pathways, the visualization of pathways is indispensable. The development of pathway visualization tools needs to define data structures for network entities and interactions, the physical locations of pathway entities in cells, and time-related properties for dynamic visualization. With the exception of drawing algorithms, the requirements and procedures are almost the same as those for modeling pathways. We consider that it is important for the modeling language to encode both textual and graphical descriptions, as well as a dynamic simulation of biological pathways. This approach leads to shorter application development times by reducing the need to rewrite models and to link data to drawing. Furthermore, it improves the integration and interoperability of data among applications by sharing a common format for parameters and properties related to visualization and simulation.

Existing languages and semantic models are inadequate to satisfy the requirements for representing biological pathways. One approach to describing biological processes is to represent these processes as annotation-based languages, such as GO [Harris et al., 2004], INOH ontology [Kushida et al., 2006], and PSI MI controlled vocabulary (http://www.psidev.info). The curated terms from the literature are compiled in a controlled vocabulary or taxonomy. The annotations are often based on natural language descriptions, so as to be comprehensible to humans. Although controlled vocabularies have a hierarchical structure, indexed terms lack any order in the representation, and relationships for these terms are very limited. Another approach is to develop model descriptive languages using the eXtensible Markup Language (XML) [Bray et al., 2006]. XML describes structured data and permits developers to define their own set of tags to suit their purpose, including KEGG markup language for metabolic pathways [Kanehisa et al., 2002], EcoCyc database for Escherichia coli metabolic pathways [Keseler et al., 2005], PSI MI for molecular interactions [Hermjakob et al., 2004], and CellML [Lloyd et al., 2004] and SBML [Hucka et al., 2007] for mathematical models of biochemical networks. Since the intended meaning of the different elements is entirely implicit in the XML document [Decker et al., 2000], many formats accommodate controlled vocabularies to clearly annotate the semantics. However, use of controlled vocabularies for annotation in the XML schema does not fully resolve semantic consistency. An alternative to model description languages is a formal ontology-based approach, such as an ontology for biological function [Karp, 2000], Cell Signaling Networks Ontology for cell signaling pathways [Takai-Igarashi and Mizoguchi, 2003], and BioPAX as a data exchange format for biological pathways [Bader and Cary, 2005]. 
These representations can adequately describe diverse biological pathways because they encode considerably more detailed information. Unfortunately, there is no formal ontology that both covers a wide range of biological pathways and encodes quantitative models.

Recent approaches demonstrate the limitations to representing dynamic and complex biological pathways. These limitations are caused by the chosen format itself or the covered scope. In order to resolve these limitations, we have developed the Cell System Ontology (CSO), a representation that models, visualizes, and simulates biological pathways. CSO is based on the hybrid functional Petri net with extension (HFPNe) [Nagasaki et al., 2004], which enables us to describe dynamic models for biological networks. However, the representation capability of CSO is not restricted to HFPNe models. More specifically, the mathematical simulation includes discrete events at an instant time with a time interval, continuous events performed by differential equations, and more complicated events by using object-like programming language. Furthermore, CSO can represent the visualization of pathway components with graphical properties such as graphical shape and geometrical position. CSO aims to support diverse pathway types, including metabolic pathways, gene regulatory pathways, signal transduction, and cell-cell interactions in an integrated manner. As an exchange format, the ontology is implemented in the Web Ontology Language (OWL) [Smith et al., 2004], which enables reasoning about reactions such as the processing speed and reaction rates of each process, and the functionality of the pathway such as participants and their roles in the processes.

We describe our approach in the following sections. Firstly, we survey various representations of biological ontologies and discuss their functionalities and limitations. Secondly, we demonstrate the CSO design concept based on the survey outlined in the previous section and describe its implementation. Thirdly, in order to reveal the potential of CSO-based pathway modeling and simulation, an HFPNe model for the ASER/L regulatory model is demonstrated. Lastly, we conclude by describing our efforts to integrate biological pathway data and mention the tools of CSO.



Representations for biological ontologies

Recently, ontologies for knowledge representations have also appeared in the biological domain (for survey, see Stevens et al., 2000; Strömbäck and Lambrix, 2005; Bodenreider and Stevens, 2006). Biological ontologies are often used for annotation data, model description schemata, and data exchange formats. We present a review of some representations broadly accepted in the biological community: OBO format-based controlled vocabularies, XML-based model descriptions, and ontology-based knowledge representations. These representations use different format languages and reveal different advantages and drawbacks.


OBO format-based controlled vocabularies

The Gene Ontology (GO) [Harris et al., 2004] and the related ontologies in the Open Biomedical Ontologies (OBO) (http://obo.sourceforge.net/) are increasingly being utilized for annotation purposes. The OBO format [Day-Richter, 2004] originated from GO and is primarily used for other OBO ontologies. The focus of OBO ontologies has ranged widely from genotype to phenotype [Bodenreider and Stevens, 2006], although the major aim is to provide a shared vocabulary for describing biological concepts. For example, GO describes the principal attributes of genes and gene products across many databases.

The OBO format consists of tag-value pairs. Each term is defined as an item that requires 2 tags: the unique id and the term name. Other optional tags provide additional information about the given term; for example, definition, comment, synonym, and hierarchical relationships.
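The tag-value structure described above can be sketched in a few lines of Python. This is only an illustration: the stanza text, the helper name parse_stanza, and the assumption that the child term is linked to its parent by an is_a tag are ours, and only the two identifiers are taken from the INOH example discussed in this section.

```python
from collections import defaultdict

# Illustrative stanza written for this example (an assumption, not copied
# from an actual OBO file). Each line is one "tag: value" pair.
STANZA = """\
[Term]
id: IEV:0000412
name: translocation from the cytosol to the mitochondrial membrane
is_a: IEV:0000009
"""

REQUIRED_TAGS = ("id", "name")  # the two tags every term must carry


def parse_stanza(text):
    """Return a dict mapping each tag to the list of its values."""
    tags = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the stanza header
        # Split on the first colon only, so identifiers such as
        # "IEV:0000412" stay intact in the value.
        tag, _, value = line.partition(":")
        tags[tag.strip()].append(value.strip())
    missing = [t for t in REQUIRED_TAGS if t not in tags]
    if missing:
        raise ValueError(f"term is missing required tags: {missing}")
    return dict(tags)


term = parse_stanza(STANZA)
print(term["id"][0], "-", term["name"][0])
```

Note that the parser recovers the identifier and the name, but the participant and the two cellular locations remain buried in the free-text name.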

The OBO format is simple enough to assign one controlled term to a given concept. The problem is that one curated term is often a textual description holding compound concepts. For example, in INOH event ontology [Kushida et al., 2006] (one of the OBO ontologies), "translocation" (IEV:0000009) is a molecular event meaning the directed movement of an entity. Hence, all directed movements from one cellular location to another location could be a child term of "translocation," e. g., "translocation from the cytosol to the mitochondrial membrane" (IEV:0000412). In turn, any entity moving from the cytosol to the mitochondrial membrane appearing in the literature becomes a child term of IEV:0000412, e. g., "translocation of Bad from the cytosol to the mitochondrial membrane." This type of description enforces a natural language processing for extracting data such as which entity is involved in translocation and which cellular compartments are related to translocation.

A further problem derives from the definition and usage of relationships between terms. The relation "is-a" can mean "a subclass of" or "an instance of" in the same ontology. Other relationships, such as "part-of," "derives-from," "related-to," and "develops-from," are also used ambiguously with no clear definitions. The drawbacks of the OBO ontologies are well described in [Schulze-Kremer, 2002]. These problems are also related to multiple inheritance, particularly when a term is inherited from multiple parents with different types of relationship. This type of multiple inheritance may cause ambiguity in meaning and make logical reasoning considerably more difficult.

The efforts to overcome these shortcomings have been described recently by [Smith et al., 2005]. However, the OBO format is still less expressive in capturing complex and dynamic biological pathways as ontological representations.


XML-based model description

There are several model exchange formats based on XML, including CellML [Lloyd et al., 2004], INOH [Fukuda and Takagi, 2004], PSI MI [Hermjakob et al., 2004], and SBML [Hucka et al., 2007]. In SBML and PSI MI, each model contains a process-centered data structure. All participants in a model are listed and then referenced in the corresponding processes with additional information. For example, SBML represents quantitative models for the storage of kinetic parameters and initial conditions for simulation purposes, while PSI MI describes protein-protein interactions with additional annotations such as experiment description, publications, and the role of proteins. The purpose of CellML is similar to that of SBML for mathematical models, while CellML has a slightly different structure to others, so as to represent model structure and mathematics. The INOH format has a relatively simple structure. A pathway model consists of nodes and edges. Any molecular entity and its compounds and any process and its compound processes can be nodes. The nodes are connected by edges whose type is tagged as either "in" or "out." Therefore, the INOH pathway model can be viewed at multiple levels of abstraction.

Because XML is a convenient language for creating user-defined tags and organizing them along with the requirements, it has been broadly adopted as a data exchange format for biological pathway models. Although XML provides syntax for structured documents through the use of a Document Type Definition, XML does not contain semantic constraints on the meaning of those tags. In order to overcome this problem, the model components are associated with controlled vocabularies that clearly define their meaning. Each XML format uses external sources in which the necessary term is defined. If no pre-existing term is available, then there is a need to separately develop and maintain internal terms tailored to the specific requirements. Some approaches are shown in SBO for SBML, MI for PSI MI, and the INOH pathway ontology for INOH.

However, the association with controlled vocabularies still does not resolve the underlying problems, i. e., semantic validation. These problems may occur when the controlled terms are incorrectly applied. For example, the PSI MI format defines interactors, each of which can be given an interactor type and a biological role. For each interactor participating in an interaction, a specific experimental role is designated. These values are all derived from MI. Because the correctness of references is not guaranteed by the PSI MI format, any inconsistency between references renders a model invalid. In order to ensure the semantic needs, additional validation checks as a part of the application are required; this entails significant time and effort.
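The kind of application-level check mentioned above can be sketched as follows. The vocabulary, its identifiers (CV:0001 and so on), and the record layout are hypothetical placeholders invented for this example; they are not actual MI terms or the PSI MI schema.

```python
# Hypothetical controlled vocabulary: term id -> (name, branch).
# The ids and names are placeholders, not real MI accessions.
VOCAB = {
    "CV:0001": ("protein", "interactor type"),
    "CV:0002": ("enzyme", "biological role"),
    "CV:0003": ("bait", "experimental role"),
}


def check_references(record):
    """Return a list of problems found in one interactor record.

    Each key of `record` names a slot (e.g. "interactor_type") and each
    value is the id of the vocabulary term assigned to that slot.
    """
    problems = []
    for slot, term_id in record.items():
        expected_branch = slot.replace("_", " ")
        if term_id not in VOCAB:
            problems.append(f"{slot}: unknown term {term_id}")
        elif VOCAB[term_id][1] != expected_branch:
            problems.append(
                f"{slot}: {term_id} belongs to '{VOCAB[term_id][1]}'"
            )
    return problems


good = {"interactor_type": "CV:0001",
        "biological_role": "CV:0002",
        "experimental_role": "CV:0003"}
# Well-formed as a document, but one term comes from the wrong branch
# and another does not exist in the vocabulary.
bad = {"interactor_type": "CV:0003",
       "biological_role": "CV:0002",
       "experimental_role": "CV:0009"}

print(check_references(good))
print(check_references(bad))
```

Both records would pass a purely syntactic check; only the additional lookup reveals that the second one is semantically inconsistent.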


Formal ontology-based knowledge representation

We need to distinguish between ontology and controlled vocabularies. Ontology has several subtypes, including controlled vocabularies or taxonomies. In this paper, we refer to knowledge representation formalisms with a logic base as formal ontology. A formal ontology defines classes and relationships between classes as key components. Classes are organized into taxonomies and associated with a set of slots as their properties. A slot is a binary relation, which relates one class to another. The values of the slots are an instance of another class or a primitive data type. Each class may have instances as members of the class. Further, constraints are established to define allowable values and connections within an ontology. To summarize, an ontology contains classes, generalized hierarchies of classes, relationships defined for classes, and constraints.

BioPAX (http://www.biopax.org/) is a data exchange format intended to facilitate sharing and integration of pathway data from multiple sources. The main ontology has a top class Entity that has 3 subclasses: physicalEntity, interaction, and pathway. BioPAX separately defines a utility class for an organizational purpose to describe additional information of an entity or to increase compatibility with other ontologies. The recent version "level 2" represents only metabolic pathways and molecular interactions. Additional types of pathway data, such as signal transduction pathways and genetic regulatory networks, have yet to be captured.

Compared to XML-based ontologies, the formal ontology-based approach is relatively new in the biological community. It is desirable to extend this ontology to support dynamic models and diverse types of pathway.



Cell System Ontology

Biological pathways are very complex, heterogeneous, and autonomous in nature; therefore, a correct representation is essential for describing the semantics of the data, facilitating knowledge inference, and specifying annotations. As a language for describing cell system ontology, we adopt OWL [Smith et al., 2004]. Protégé [Noy et al., 2003] is used for the development of CSO.


Concept and technology of the Cell System Ontology

We first discuss several considerations pertaining to the development of CSO. Firstly, the representation is sufficiently expressive to handle complex and dynamic cell systems. Secondly, it resolves the relevant issues at the syntactic and semantic levels described in the previous section. Lastly, it can reason across data to check internal conflict and missing steps in pathways. These abilities become more important when it is necessary to provide a complete and consistent biological pathway database. From this viewpoint, we consider the formal ontology to be suitable for addressing problems in XML-based formats and controlled vocabularies.

A controlled vocabulary itself resembles an ontology in that it has a generalized hierarchy of terms. The OBO viewers, such as OBO-Edit (http://oboedit.org), present different semantic relationships as parent-child links with the same tree structure. For example, association relationships (e. g., "related-to") that link terms having similar meanings are represented in a hierarchical manner. This can mislead users into regarding such a link as a valid hierarchical relationship even when neither term subsumes the other [Wroe et al., 2003].

As described in the previous section, the curated terms are text-based and are formulated from a combination of several concepts. The term "translocation of Bad from the cytosol to the mitochondrial membrane" describes a process (translocation), a participant (Bad) in the process, and 2 cellular locations (cytosol and mitochondrial membrane). A text-based description, however, needs time-consuming parsing to retrieve information.
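To illustrate the difference, the same information can be held in separate fields, so that no text parsing is needed to retrieve it. The class name Translocation and its field names are assumptions made for this sketch and are not CSO classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translocation:
    """One translocation event with its concepts stored separately."""
    participant: str  # the entity that moves
    source: str       # cellular location it leaves
    target: str       # cellular location it reaches

    def label(self):
        # The free-text term can always be regenerated from the fields;
        # the reverse direction would require natural language processing.
        return (f"translocation of {self.participant} from the "
                f"{self.source} to the {self.target}")


event = Translocation("Bad", "cytosol", "mitochondrial membrane")
print(event.label())
print(event.participant, "|", event.source, "|", event.target)
```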

In order to implement CSO, we make a clear distinction between class and instance. Because the controlled vocabulary has no definition for instances, a term may be an instance or a subclass of its parent depending on relationships in the same structure. We also distinguish hierarchical relationships from associative ones. Two types of hierarchical relationships are used: "is-a" to define generic relationships in a generalization hierarchy and "instance-of" to define instances of a class. Many types of association between classes are specified as slots of classes. This will also solve multiple inheritance problems in controlled vocabularies. Multiple inheritance is helpful in making the representation more compact. However, multiple inheritance from different types of relationship may provide implicit and incomplete information [Stevens et al., 2000]. We only permit multiple inheritance with the same relationship, i. e., "is-a," when indispensable.
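The distinction between the two hierarchical relationships has a close analogue in object-oriented languages, which the following Python sketch uses purely as an illustration. The variable name bad_protein is hypothetical, and Python's class model is of course not OWL.

```python
class Entity:
    """Stand-in for a general class in the hierarchy."""


class Protein(Entity):
    """Protein is-a Entity: a link between two classes."""


# bad_protein is an instance-of Protein: a link between an individual
# and a class.
bad_protein = Protein()

print(issubclass(Protein, Entity))       # "is-a" holds between classes
print(isinstance(bad_protein, Protein))  # "instance-of" holds for an individual
print(isinstance(bad_protein, Entity))   # membership is inherited along "is-a"

# Asking the "is-a" question about an individual is a category error,
# which Python reports instead of silently answering.
try:
    issubclass(bad_protein, Entity)
except TypeError as err:
    print("not a class:", type(err).__name__)
```

Keeping the two questions separate is what prevents a single ambiguous link from meaning "subclass of" in one place and "instance of" in another.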

Referencing terms in external controlled vocabularies is used in many approaches to provide semantics or to maintain consistency. However, to independently parse and reason with external sources is time-consuming. We chose to include controlled vocabularies directly in CSO as instances similar to some of the selected GO terms included in Reactome [Joshi-Tope et al., 2005]. It is also important not to lose the pre-existing valuable semantics in the hierarchy of the external vocabulary. Thus, in CSO, we retain the information concerning the relationships between selected terms from external sources.

As an ontology representation language, we use OWL, which is an emerging standard for the Semantic Web of the World Wide Web Consortium. OWL is constructed on Resource Description Framework (RDF) (http://www.w3.org/RDF) and RDF Schema (RDFS) (http://www.w3.org/TR/rdf-schema/). RDF is intended to specify semantics for data based on XML and RDFS is an extension of RDF that can declare classes and properties and structure them in a hierarchy. OWL is a sophisticated language as an extension of RDFS [Horrocks et al., 2003]: classes can be stated as logical combinations of other classes; classes can be defined as enumeration of specified objects; slots (properties) can be declared and restricted in a variety of ways; slots have values as classes (or datatype). Furthermore, OWL is influenced by Description Logics, a family of class-based knowledge representation formalisms. Formalizing the meaning of the language enables automated reasoning to check the consistency of classes and ontologies, and to check entailment relationships. The ability to monitor internal conflict and inconsistency in pathways is essential for the development of a reliable pathway knowledge base. OWL can provide such reasoning support.


Ontology based on hybrid functional Petri nets with extension

In order to fully understand CSO, one must first understand the basic model for CSO. CSO is based on a mathematical model called HFPNe [Nagasaki et al., 2004], the hybrid functional Petri nets with extension. Petri nets have graph-like structures consisting of places, transitions, arcs, and tokens. It is easy to represent all kinds of interconnections of biological entities using Petri net components. In addition, Petri nets are readily available for systems to store, edit, visualize, analyze, and simulate [Küffner et al., 2000]. However, the conventional Petri net is limited to modeling only discrete features in biological pathways, e. g., logical regulatory relationships between genes.

We developed HFPNe in order to model and simulate more complicated biological pathways, e. g., the activities of enzymes for multi-modification proteins, alternative splicing, and frameshifting [Nagasaki et al., 2004]. We term the Petri net elements entity, process, and connector instead of place, transition, and arc, respectively. An entity represents a biological molecule or object and holds some values, e. g., concentration of protein or copy numbers of mRNA, as its content. A process defines interaction among entities and is linked to entities by connectors that are incoming from an entity and outgoing to an entity. A process has a speed that depends on the concentration of the incoming entity. In addition to the discrete elements of the traditional Petri net, HFPNe has 2 further element types – continuous and generic entities and processes. A continuous entity can hold a real number as its content. A continuous process fires continuously at the speed of the parameter assigned to it. The continuous features can be used for enzyme reactions represented by differential equations. A generic entity can hold any object, e. g., an mRNA sequence, or the phosphorylation state of a protein. Moreover, a generic process handles complex reactions by updating the state of connected entities, e. g., degradation, translation, and phosphorylation, for complex pathway modeling. HFPNe supports 3 types of connector – a process connector, an association connector, and an inhibitory connector – that define the role of a specific entity that participates in a specific process. Process connectors activate all types of processes by consuming a certain number of tokens, which are transferred to the process only if the evaluated result of the threshold script is true. 
The activity rule of an association connector is identical to that of a process connector in terms of the threshold, except that association connectors do not consume tokens of the input entity. An inhibitory connector with a threshold script enables the process to remain active only if the evaluated result of the threshold script is false.
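The three connector rules can be illustrated with a small discrete-only sketch in Python. It is a simplification under stated assumptions and not the HFPNe implementation: the threshold script is replaced by a plain comparison (tokens >= threshold), continuous and generic elements are omitted, and all class, function, and entity names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    tokens: int


@dataclass
class Input:
    entity: Entity
    kind: str           # "process", "association", or "inhibitory"
    threshold: int = 1  # stand-in for a threshold script
    weight: int = 1     # tokens consumed on firing (process connectors only)

    def threshold_met(self):
        return self.entity.tokens >= self.threshold


def enabled(inputs):
    """A process is enabled if every process/association threshold is
    true and every inhibitory threshold is false."""
    for connector in inputs:
        met = connector.threshold_met()
        if connector.kind == "inhibitory":
            if met:
                return False
        elif not met:
            return False
    return True


def fire(inputs, outputs):
    """Fire the process once if it is enabled; return whether it fired.

    `outputs` is a list of (entity, weight) pairs.
    """
    if not enabled(inputs):
        return False
    for connector in inputs:
        if connector.kind == "process":  # only process connectors consume
            connector.entity.tokens -= connector.weight
    for entity, weight in outputs:
        entity.tokens += weight
    return True


substrate = Entity("substrate", 2)
enzyme = Entity("enzyme", 1)
repressor = Entity("repressor", 0)
product = Entity("product", 0)

inputs = [
    Input(substrate, "process"),
    Input(enzyme, "association"),
    Input(repressor, "inhibitory"),
]
outputs = [(product, 1)]

first_fired = fire(inputs, outputs)   # repressor absent: process can fire
repressor.tokens = 1
second_fired = fire(inputs, outputs)  # inhibitory threshold now true
print(first_fired, second_fired,
      substrate.tokens, enzyme.tokens, product.tokens)
```

In the first firing the substrate loses one token while the enzyme, attached by an association connector, keeps its token; once the repressor is present, the inhibitory connector blocks the process.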


Ontology for cell systems

Based on HFPNe, we have built CSO, which consists of 195 classes, for a representation of cell systems (see Supplementary A for all classes). One of the major design considerations is that all classes have their own explicit definition; at the same time, the ontology can fully and unambiguously represent the diverse and complex nature of biological information. We have concentrated on the construction of an "is-a" hierarchy without multiple inheritance. If multiple inheritance is inevitable, we consider using it in CSO at a later stage. As the volume of biological pathway data increases, it becomes more important to identify incomplete or previously unknown pathways via verification of simulations. CSO enables us to build biological pathways with dynamic functions, rarely achievable with other currently available biological ontologies. In the description below, we use the following notation. The name of a CSO class starts with an uppercase letter typeset in bold font. If the class name is a compound word, then the second word in the compound is capitalized without a space. For slots of classes, all capital letters are used.

The main classes of CSO are shown in Fig. 1. CSMLBase is the root class for all classes in CSO. All data in CSO is structured around Project, which has slots to represent the comprehensive environment of a pathway model. Project is required to have only one Model, which describes pathways via a set of processes. SubModels can be defined as a subset of a given model. Each SubModel contains some selected elements of a model, which may be grouped to convey any meaning. Project has further slots to save the results of model simulation as 2D plots and graphical representation of a model as instances of ChartBase and ViewBase, respectively. CSO allows users to create user-defined properties related to simulation, view, and any biological information in Project.
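The containment structure described above can be sketched with plain Python data classes. The class names Project, Model, and SubModel follow the text, but the field names, the add_submodel helper, and the element identifiers are assumptions made for this example and do not reproduce the actual CSO slots.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    elements: set  # identifiers of the elements in the pathway model


@dataclass
class SubModel:
    name: str
    elements: set  # must be a subset of the parent model's elements


@dataclass
class Project:
    model: Model  # a Project holds exactly one Model
    submodels: list = field(default_factory=list)
    charts: list = field(default_factory=list)  # stand-ins for ChartBase
    views: list = field(default_factory=list)   # stand-ins for ViewBase

    def add_submodel(self, sub):
        """Attach a SubModel after checking the subset constraint."""
        extra = sub.elements - self.model.elements
        if extra:
            raise ValueError(f"elements not in the model: {sorted(extra)}")
        self.submodels.append(sub)


# Hypothetical element identifiers.
project = Project(Model({"e1", "e2", "p1", "c1", "c2"}))
project.add_submodel(SubModel("core", {"e1", "p1", "c1"}))

try:
    project.add_submodel(SubModel("broken", {"e1", "e9"}))
except ValueError as err:
    print(err)

print(len(project.submodels))
```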



Figure 1: The main classes of CSO.


A model comprises a set of biological processes connected to entities via connectors, and facts to provide more information related to biological processes. The ElementBase class contains fundamental concepts for the Model, which has 4 subclasses: Entity, Process, Connector, and Fact. All subclasses of ElementBase have common slots to define a reference to an object in an external data source, including literature citations. The subclasses for Petri net elements, i. e., entity, process, and connector, are all related to dynamic simulation, whereas Fact represents other properties that cannot be described with HFPNe.

First, we provide an explanation of the subclasses for 3 Petri net elements, called basic elements in CSO. Each subclass is further classified as biological or non-biological. Non-biological elements are defined for each element to represent biologically unrelated ones, i. e., the traditional Petri net-based model. In this paper, we focus on describing the biological elements. Each biological element has slots to define animation, simulation, and view-related properties. CSO supports different simulation properties for the discrete and continuous models of each element. Some of the Petri net-related properties for discrete models are as follows:

The basic elements all have common view properties defined in ViewBase for visualization, including geometric position and position type; graphical shape, including background color, overlap depth, visibility, and size; image file-related properties such as file format, file type, and width and height of the image that link each element to graphical representation. The ViewBase class connects graphical properties in text with the corresponding image files for visualization. A detailed explanation of the properties is provided in [Nagasaki et al., 2004]. Animation-related properties are defined to save simulation results and then used to create a visualization of the simulation for further investigations, e. g., testing or creating hypotheses.

In CSO, Process, accounting for transition in HFPNe, represents any interaction between physical entities in a biological organism. The type of process is already defined as instances of BiologicalEvent (a subclass of BiologicalBase), described in detail later. For each process, one correct term can be assigned to annotate the type of process from previously defined controlled vocabularies. If a suitable term is not found, a new term can be added and defined in Project as a new user-defined biological property. In CSO, a process means any interaction, not including any information related to participants of the process, or cellular location of the participants. A biological process can be described using the location of the process, several thermodynamic properties, and experimental evidence.

Importantly, Process has a slot for connectors that link the process to the entities involved in it. The entity's role in a process has been addressed in previous articles [Mizoguchi and Kitamura, 2000; Burek et al., 2005]. In CSO, we distinguish an entity type and an entity's native role from an entity's role in a process. The entity type and its native role do not change from process to process. However, the entity may have a different role in each process. Consider the situation where a protein complex generated by a binding process activates one chemical reaction and inhibits the phosphorylation process in another. The same protein could be the product of one process, the activator of another process, or the inhibitor of a third. This type of role changes along with the process in which the entity is involved. This concept is very important in the simulation of pathway models. In the HFPNe architecture, the entity's role in a process decides whether the connector transfers a certain number of tokens from the input entity to the process (or from the process to the output entity). For this reason, the Connector class is divided into Input and Output, standing for arc heading for transition and arc heading for place in HFPNe, respectively. In turn, Input consists of 3 subclasses – InputAssociation for activators, InputInhibitor for inhibitors, and InputProcess for reactants – while Output has OutputProcess for the products of a process. Each subclass of Connector has slots to store information for the connected entity and other properties related to view, animation, and connector-specific simulation.
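The connector hierarchy and the process-dependent role can be sketched as follows. The class names mirror those given in the text; the constructor arguments, the consumes_tokens attribute, the role-to-class table, and the example process names are assumptions introduced for illustration.

```python
class Connector:
    def __init__(self, entity, process):
        self.entity = entity    # name of the connected entity
        self.process = process  # name of the connected process


class Input(Connector):
    """Arc heading for a transition (entity -> process)."""
    consumes_tokens = False


class Output(Connector):
    """Arc heading for a place (process -> entity)."""


class InputProcess(Input):      # reactants
    consumes_tokens = True


class InputAssociation(Input):  # activators
    pass


class InputInhibitor(Input):    # inhibitors
    pass


class OutputProcess(Output):    # products
    pass


# Assumed lookup table from an entity's role in a process to the
# connector class that represents it.
ROLE_TO_CONNECTOR = {
    "reactant": InputProcess,
    "activator": InputAssociation,
    "inhibitor": InputInhibitor,
    "product": OutputProcess,
}


def connect(entity, role, process):
    """Create the connector that matches the entity's role in a process."""
    return ROLE_TO_CONNECTOR[role](entity, process)


# One complex, three process-dependent roles (hypothetical process names).
links = [
    connect("complex", "product", "binding"),
    connect("complex", "activator", "reaction"),
    connect("complex", "inhibitor", "phosphorylation"),
]
for link in links:
    print(link.process, type(link).__name__)
```

The entity itself is unchanged across the three links; only the connector class, and with it the token behavior during simulation, differs.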

The Entity class, accounting for place in HFPNe, is defined to describe biological entities, cellular compartments, and the biological environment (e. g., UV, temperature, and pH). The biological entity is any physical entity such as a cell and other molecules, including Protein, Complex, DNA, SmallMolecule, and RNA. Among these, RNA is further divided into the subclasses mRNA, rRNA, and miRNA. Each biological entity may be annotated with BiologicalRole, which means certain characteristics of an entity. For example, a protein may serve as a ligand, a receptor, or a cofactor molecule in processes. This type of concept is useful in understanding a biological process, though not directly related to simulation. These terms are defined as instances of ProteinRole to describe a protein's role in CSO. Similarly, RNA could be a catalyst or a signal recognition particle in a process. We also defined these terms as instances of the RnaRole class. This entity's native role has to be distinguished from the role changed along with a specific process to select a proper connector. The details of controlled vocabularies defined in CSO will be explained later. Each subclass of Entity also has common slots such as view, animation, and simulation properties. There are also class-specific slots. For instance, Protein has additional slots for describing cellular location, biological role, molecular weight, organism, sequence, and sequence feature. Since a protein is annotated with several properties, changes of its subcellular location or its state after modification lead to 2 different proteins in CSO.

Fact is defined as another subclass of ElementBase. The Fact class is designed for understanding the pathway functionalities and evaluating the status of dynamic simulation; it is not supported by HFPNe for modeling processes in the pathway. Fact is used to describe restrictions that should be satisfied among variables in a model during simulation. Any view and biological elements that do not affect the simulation steps are also described in the Fact class. For example, the effect of a drug's efficacy binding to plasma protein measured by capillary permeability and a pathway consisting of several subpathways may be described as Fact; these provide insights into the underlying biological pathways.

As described in this section, CSO defines a class for annotating biological properties, called BiologicalBase. This class is also required to accommodate other representations, such as BioPAX. As a subclass of BiologicalBase, ControlledVocabulary (CV) is defined to investigate the reuse of existing structured information from other sources. This class provides a predefined common vocabulary to describe several categories of biological information. This design strategy will reduce the time required to query and parse the external sources. For distinctive usage and rapid parsing, CV is divided into several subclasses: BiologicalEvent for biological processes, BiologicalRole for the entity's native role, CellComponent for cellular components, CellType for cell types, DBBase for database names, EvidenceCode for experimental and other evidence for determining the interaction, and FeatureType for a sequence property relevant to an interaction. The BiologicalEvent class is classified into 4 events to reflect different interactions at different levels: cell, molecule, organism, and physiology. The BiologicalRole class is also divided into the same 4 subclasses as BiologicalEvent. The subclass MoleculeRole is further classified into 3 subclasses in order to define the roles of protein, RNA, and small molecules. In the hierarchical structure of CSO, it is easy to constrain the allowable values of biological events and roles for annotation, thus avoiding the incorrect assignment of terms.
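As a rough illustration of how a class hierarchy can constrain annotation values, the following Python sketch rejects a role that does not belong to the vocabulary of the entity class. The dictionary ALLOWED_ROLES and the function annotate are hypothetical names; the term lists contain only the example roles mentioned in the text, not the full CSO vocabulary, and the actual constraint in CSO is expressed in OWL rather than in code.

```python
from typing import Tuple

# Hypothetical mirror of part of the MoleculeRole hierarchy. Only the example
# terms named in the text are listed; this is not the full CSO vocabulary.
ALLOWED_ROLES = {
    "Protein": {"ligand", "receptor", "cofactor"},        # ProteinRole
    "RNA": {"catalyst", "signal recognition particle"},   # RnaRole
}


def annotate(entity_class: str, role: str) -> Tuple[str, str]:
    """Return the annotation if the role is allowed for the entity class."""
    if role not in ALLOWED_ROLES.get(entity_class, set()):
        raise ValueError(f"{role!r} is not a valid role for {entity_class}")
    return (entity_class, role)


# A protein may be annotated as a receptor.
ok = annotate("Protein", "receptor")

# Annotating an RNA as a receptor is rejected in this sketch.
try:
    annotate("RNA", "receptor")
    rejected = False
except ValueError:
    rejected = True
```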

In order to build a CV, terms as instances are selected from freely available sources including BRENDA tissue [Schomburg et al., 2000], GO [Harris et al., 2004], INOH [Kushida et al., 2006], NCBI (http://www.ncbi.nlm.nih.gov/), OBO (http://obo.sourceforge.net/), and PSI MI [Hermjakob et al., 2004]. The selected terms are reorganized and redefined as instances in CSO from a system dynamics centered view. For example, terms for cellular location are selected from GO cellular component. GO defines a macromolecular complex as a cellular component. Therefore, intracellularly located protein complexes are also defined as a cellular component. However, in CSO, the entity itself and its location are considered separately, not as a combined term. The CellComponent class defines the contents of a cell or its extracellular environment as its instances, not including involved entities and processes. In addition, if there are no available terms from external sources, new terms are introduced to suit the CSO purpose.

In the development of CSO, we provide standard icons for the core terms in CV, which enhance user understanding of pathway models, ease GUI modeling, and improve exchangeability among applications. In CSO, all instances of BiologicalEvent (274 in total) and CellComponent (47 in total) have standard icons. Fig. 2 shows part of a hierarchy of instances in BiologicalEvent on the right and the icons on the left (not in order of the hierarchy of instances). For example, the top left icon in the figure depicts phosphorylation. CSO currently supports both non-scalable and scalable image formats. In particular, PNG (Portable Network Graphics: http://www.libpng.org/pub/png/) and SVG (Scalable Vector Graphics: http://www.w3.org/TR/SVG11/) are recommended.



Figure 2: Some of the icons depicting instances of BiologicalEvent and the hierarchy of instances defined in CSO.


When terms are treated as instances in CSO, the relationships already defined in external sources are likely to be lost. In order to avoid this problem, we retain this information in the RelationOf class. The relationship between a term and each of its child terms is described with one of 4 relation types: "is-a," "part-of," "develops-from," and "related-to."
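A minimal sketch of how such relations might be stored and queried is given below. The 4 relation types come from the text; the field names parent, child, and relation, the helper children_of, and the example terms are illustrative assumptions and are not taken from the CSO specification.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    """The 4 relation types retained from external sources."""
    IS_A = "is-a"
    PART_OF = "part-of"
    DEVELOPS_FROM = "develops-from"
    RELATED_TO = "related-to"


@dataclass(frozen=True)
class RelationOf:
    """Hypothetical record linking a term to one of its child terms."""
    parent: str
    child: str
    relation: RelationType


# Illustrative terms only; they are not claimed to be CSO instances.
relations = [
    RelationOf("cell", "nucleus", RelationType.PART_OF),
    RelationOf("cell", "cytoplasm", RelationType.PART_OF),
    RelationOf("modification", "phosphorylation", RelationType.IS_A),
]


def children_of(parent: str, relation: RelationType) -> list:
    """List the child terms of a parent term for one relation type."""
    return [
        r.child
        for r in relations
        if r.parent == parent and r.relation is relation
    ]


parts_of_cell = children_of("cell", RelationType.PART_OF)
```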

Other information related to a model is stored in AnimationBase for a simulation animation, ExternalReferenceBase for pointing to an external object, LogBase for logging properties of dynamic simulation, and SimulationBase for simulation-related parameters.


HFPNe model in CSO

Here, we demonstrate how the HFPNe model is represented in CSO. HFPNe can model regulatory networks that involve microRNA (miRNA), a key regulator of gene expression. As an example, we selected the cell fate determination model of 2 gustatory neurons (ASEL and ASER) of Caenorhabditis elegans reported in [Saito et al., 2006]. The ASEL/R neurons are bilaterally symmetric on a morphological level but exhibit asymmetric functions. This model is based on the fact that the ASEL/R cell fate is determined by a double-negative feedback loop involving the lsy-6 and mir-273 miRNAs. The new upstream regulator lsy-2 of lsy-6 is also integrated into this model for the mechanism of switching between ASEL and ASER without any contradictions. The main double-negative feedback loop in [Saito et al., 2006] is shown in Fig. 3. This simplified model reveals that (1) the activation of die-1 leads to the activation of lsy-6 and the suppression of cog-1 and mir-273 and that (2) the activation of cog-1 leads to the activation of mir-273 and the suppression of die-1 and lsy-6.
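The regulatory statements (1) and (2) above can be written down as a signed graph, which makes the double-negative character of the loop easy to check. The Python sketch below is only a qualitative illustration and is not the HFPNe model itself; the names EFFECTS and loop_sign are hypothetical, and the edges encode the net effects as summarized in the text rather than individual molecular steps.

```python
# Net regulatory effects as summarized in the text:
# +1 for activation, -1 for suppression.
EFFECTS = {
    ("die-1", "lsy-6"): +1,
    ("die-1", "cog-1"): -1,
    ("die-1", "mir-273"): -1,
    ("cog-1", "mir-273"): +1,
    ("cog-1", "die-1"): -1,
    ("cog-1", "lsy-6"): -1,
}


def loop_sign(path):
    """Multiply the edge signs along a closed path of node names."""
    sign = 1
    for source, target in zip(path, path[1:]):
        sign *= EFFECTS[(source, target)]
    return sign


# die-1 suppresses cog-1 and cog-1 suppresses die-1.
loop = ["die-1", "cog-1", "die-1"]

# Count the suppressing edges in the loop.
negative_edges = sum(1 for edge in zip(loop, loop[1:]) if EFFECTS[edge] < 0)

# Two negative edges give an overall positive loop sign.
overall = loop_sign(loop)
```

A loop with two suppressing edges and a positive overall sign is what the text calls a double-negative feedback loop: each of die-1 and cog-1 reinforces its own activity by suppressing the other.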



Figure 3: A graphical model of the ASEL/ASER pathway with simulation result generated by Cell Illustrator (http://www.cellillustrator.com/).


Figure 4: A graphical model diagram in CSO. Notation: blue circles for classes, several objects in classes for instances, arrows between instances for slots to represent relationship, and characters preceded by a colon for primitive data type values.



As an example, we select a part of the model (for the whole model encoded in CSO, see Supplementary B). The box located in the bottom left of Fig. 3 depicts the translation of die-1 mRNA suppressed by mir-273 miRNA. From this description, we can identify one type of biological process (translation), participating entities (die-1 mRNA, mir-273, and die-1 protein), and connections between the process and the entities (2 input connectors and 1 output connector). Fig. 4 shows the CSO representation in abstract form. For brevity, not all classes and slots are depicted in Fig. 4. The diagram shows several classes, e. g., Project and Model, in blue circles. Some classes, such as Connector, Entity, and SimulationBase, have several subclasses. For example, SimulationBase has 3 subclasses to define the simulation properties specific to each of the elements Entity, Connector, and Process. Instances in classes are depicted as graphical objects, whose names are given arbitrarily. Slots connecting 2 instances are represented as arrows with names. The value of a primitive data type is preceded by a colon character.

The class Project has an instance named "Cell fate" in the top right of the diagram. The "Cell fate" project has 2 slots, MODEL and CHART, whose values are "ASEL/ASER pathway," an instance of Model, and "mir-273 chart," an instance of Chart. We assume that there are many processes in the model. The focused process "Translation" is an instance of ProcessBiological, a subclass of Process. The process "Translation" has 3 connectors, "c1," "c2," and "c3," which are instances of Connector subclasses. The instance "c3" of OutputProcess has an output entity "die-1." The other connectors, "c1" and "c2," are connected to "mir-273" and "die-1 mRNA," respectively, though this is not shown in the diagram. The ProcessBiological, OutputProcess, and Protein classes are related to their own simulation properties ProcessSimulation, ConnectorSimulation, and EntitySimulation, respectively. Each element has different simulation-related properties such as kinetics, thresholds, and variables. This information is used for dynamic simulation of the model.
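The instance structure described above can be mirrored with plain Python dictionaries, as in the sketch below. The instance names and the class of "c3" follow the text; the classes assigned to "c1" (InputInhibitor) and "c2" (InputProcess) are assumptions inferred from the description of translation being suppressed by mir-273, and the dictionary keys other than MODEL and CHART are hypothetical. The actual CSO encoding is in OWL (see Supplementary B).

```python
# Project instance with its 2 slots, as described in the text.
project = {
    "name": "Cell fate",
    "MODEL": "ASEL/ASER pathway",  # instance of Model
    "CHART": "mir-273 chart",      # instance of Chart
}

# The focused process and its 3 connectors.
translation = {
    "name": "Translation",
    "class": "ProcessBiological",
    "connectors": [
        # Classes of c1 and c2 are assumptions; the text names only c3's class.
        {"id": "c1", "class": "InputInhibitor", "entity": "mir-273"},
        {"id": "c2", "class": "InputProcess", "entity": "die-1 mRNA"},
        {"id": "c3", "class": "OutputProcess", "entity": "die-1"},
    ],
}

# Separate the participants by connector direction.
inputs = [
    c["entity"]
    for c in translation["connectors"]
    if c["class"].startswith("Input")
]
outputs = [
    c["entity"]
    for c in translation["connectors"]
    if c["class"].startswith("Output")
]
```

Here the split into inputs and outputs reproduces the "2 input connectors and 1 output connector" identified from the box in Fig. 3.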



Conclusion

Massive amounts of biological data represented as biological pathways present novel challenges for developing a well-defined representation of these data. For a general framework to understand the behavior of cell systems in an integrated way, we have developed the Cell System Ontology (CSO), a new system dynamics centered ontology. The 3 main features of CSO are as follows:

First, CSO allows manipulation of different levels of granularity and abstraction of pathways, e. g., metabolic pathways, gene regulatory pathways, signal transduction pathways, and cell-cell interactions. In these pathways, interaction events take place in myriad ways among molecules and cells. In order to cope with this diversity, we designed CSO, which has a hierarchical structure to explicitly give definition to classes and the relationship among these classes. In particular, the classes are demarcated by disjointness between classes. In a hierarchical structure, the slots representing the attributes of each class define constraints on slot values. This is important because it ensures that the relationship between classes is treated in a correct and consistent manner.

Secondly, CSO can capture both quantitative and qualitative models by using the hybrid functional Petri net with extension (HFPNe). Given the knowledge that a biochemical reaction is catalyzed by an enzyme, CSO can capture not only qualitative aspects of a model, such as the enzyme active site, the catalyzed reaction, and the involved reactants, but also quantitative attributes such as concentration, the behavior of genes, and the number of molecules synthesized per unit of time during transcriptional activity.

Thirdly, CSO is an ontology that encodes information related to visualization and simulation of biological pathways. Given a well-designed representation, the development time of special applications will be reduced because there is no need to redefine data structures. This strategy also improves the communication between software tools, such as that required to exchange or query data. The new ontology also predefines mature core vocabularies, which will be necessary for creating models with system dynamics. In addition, each term for cellular component and biological event has a corresponding standard icon for easy modeling and for improving exchangeability among applications.

As an exchange format, the ontology was implemented in the Web Ontology Language (OWL), which enables semantic validation and provides complete and consistent biological pathway models. Recently, considerable biological pathway data has been generated and is available in several formats, including BioPAX, SBML, and PSI-MI. In order to facilitate data integration, we have made an effort to convert existing pathway representations to CSO, particularly from BioPAX to CSO [Jeong et al., 2007]. BioPAX is based on a formal ontology, and many pathway databases export their data to BioPAX format. The conversion from BioPAX to CSO enables data integration from other databases such as BioCyc, Reactome, INOH, and BioModels. In future work, the practical simulation and inference of models should be elaborated.

CSO is compatible with the Cell System Markup Language (CSML), the native model description language of Cell Illustrator (http://www.cellillustrator.com/) [Nagasaki et al., 2003]. CSO can provide a framework to integrate ontology-based representations and enables other representations to benefit from CSML-supported platforms, including BioGraphLayout for automatic layout [Kojima et al., 2007] and Cell Illustrator for visualization and simulation.

The CSO specifications, the on-line ontology viewer developed in Perl, rules for mapping between CSO and CSML, and other related information are available at http://www.csml.org/.




References