| In Silico Biology 4, 0007 (2003); ©2003, Bioinformation Systems e.V. |
| Ontology Workshop Tokyo 2003 |
1Computational Biology Research Center (CBRC), National Institute of Advanced Industrial
Science and Technology (AIST),
2-43 Aomi, Koutou-ku, Tokyo 135-0064, Japan
Email: fukuda-cbrc@aist.go.jp
2Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology
Agency (JST),
3-14-4 Shirokane-dai, Minato-ku, Tokyo 108-0071, Japan
Email: snowfox@hgc.jp
3Graduate School of Frontier Sciences, University of Tokyo,
5-1-5 Kashiwanoha, Kashiwa-shi,
Chiba 277-8562, Japan
Email: tt@k.u-tokyo.ac.jp
* corresponding author
Edited by E. Wingender; received August 31, 2003; revised and accepted January 23, 2004; published February 22, 2004
An intelligent system for signal transduction pathways and other higher-order functional knowledge is presented. Molecular mechanisms of biological processes are typically represented as diagrams ("pathways") that have a graph-like network structure. However, because pathways cover a wide range of topics, their constituent biological entities are highly diverse, ranging from metal ions to proteins to biological processes in general. The kinds of interactions that connect these entities are likewise diverse. Consequently, current knowledge about pathways is highly heterogeneous in both the types of constituents and the granularity of descriptions. To cope with this problem, the proposed system adopts a recursive and hierarchical representation model that enables the annotation and query of pathways or sub-pathways of arbitrary granularity. By combining this hierarchical structure with biological ontologies, literature-based information regarding biological mechanisms becomes accessible by computer.
Key words: pathway database, ontology, signal transduction, textual knowledge
In the post-genomic era, the target of biological knowledge acquisition has shifted from elucidating the features of bio-molecules to discovering the combinations of bio-molecules as well as the combinations of their interactions that constitute a biological function. Many studies have attempted to decipher molecular interactions or functional relations by computational analyses of high-throughput functional-genomics data by utilizing machine-learning techniques. However, the resultant experimental data are noisy and lack significance measurements. Besides, they do not contain contextual information such as "when" and "where" a protein exists. As a consequence, the biological significance of the results cannot be determined. Therefore, as background knowledge that places the data in a biological context is necessary for the proper assessment of the analytical results, experts with the appropriate knowledge must be consulted for validation of the results. Not surprisingly, this consulting phase is becoming the rate-determining step in the overall process of identifying functional relations from large amounts of data and emphasizes the importance of computational methods able to process biological knowledge.
As biology is a knowledge-based rather than axiom-based discipline, biologists use knowledge about already known cases to make decisions about a current case. Traditionally, this information has been stored in databases as sets of gene and protein sequences of known function. The rapid spread of computational methods as an indispensable technology was supported largely by the existence of backbone databases that store comprehensive sequence data.
In the post-genomic era, the required background knowledge pertains to the underlying mechanisms of biological phenomena, an ensemble of collaborating genes and proteins. However, as such knowledge is shared via the scientific literature in the form of illustrations or natural-language narrative, the required information is buried in journals. The blossoming of Natural Language Processing (NLP) research in biology is the result of the high demand for this type of knowledge. However, the fundamental problem to be solved is the use, in the scientific literature, of implicit semantics to address objects or interactions. This hampers not only the dissemination of precise information but also the development of computational methods to process this knowledge.
These problems are intensified in the signal transduction pathway domain, an area that attempts to describe the mechanisms of various life phenomena in terms of interactions of proteins and other bio-molecules. Inevitably, current knowledge remains as a highly divergent and fragmented archive of small pathways.
Therefore, there is an ever-increasing demand for the development of technologies that represent and process heterogeneous knowledge about functional relations of bio-molecules, biological processes, and for shared semantics in the biological literature.
In the following sections, we first give an overview of the features of the different kinds of pathways. Then we introduce our data model, explain the query mechanisms, and describe the system architecture, the query interface, and the data. Lastly, we discuss related work and conclude with suggestions on future directions.
Cells perform their required functions through networks (biological pathways) involving gene interplay and proteins that regulate each other. Since pathways explain how the information coded in DNA is decompressed into a phenotype or biological process, e. g. a particular developmental stage, it can be said that they are the blueprint for the biological mechanisms of a cell. In this section, we provide an overview of several categories of pathways and illustrate the type of knowledge that a signal transduction pathway database has to handle.
Metabolic pathways
Historically, the computational analysis of biological pathways first focused on the study of metabolic pathways because biologists already possessed mature consensus knowledge in this area.
Metabolism is the sum of anabolism and catabolism. It consists of a pathway of chemical reactions that are catalyzed by enzymes. These pathways are largely conserved among species and are well understood. Pathway maps that serve as references are categorized into a taxonomy, and the role players (enzymes) are categorized by their function according to the EC classification system [1].
Genetic networks
Research regarding "genetic networks" has been strongly motivated by the emerging availability of micro-array data, which provide large-scale profiles of an organism's gene-expression pattern. The model is defined mathematically by means of a graph structure, where nodes represent genes and edges their regulation. The genetic network model is based on the conviction that much (or all) of the information for constructing and maintaining a living organism is encoded in its gene sequences. The product of an activated gene interacts with a series of other bio-molecules, most of which are also products of genes, to form a complex signaling cascade that then regulates the activation of another gene. It may therefore be said that the pattern of gene expression determines the functional state of the cell system [2]. Based on this insight, the genetic network model focuses only on the signaling loop of gene regulation and neglects the intervening cascades.
Consequently, we derive a simple and clear mathematical model of a biological system, but neglect the actual underlying mechanism by which the gene is activated.
Protein interaction maps
Protein interaction maps are networks of protein-protein interactions. While a genetic network focuses only on the regulation of genes and omits the physical mechanisms of the intervening cascades, a protein interaction map focuses instead on the physical bindings of proteins, which constitute the regulations. There are several methods to detect interactions of proteins. These include computational methods that predict from structure or sequence whether two proteins will, or will not, interact [3, 4, 5, 6], molecular biological methods such as two-hybrid screening [7, 8, 9], and biochemical methods.
By combining the results derived from these methods, a network of protein interactions (a protein interaction map) can be constructed. Such a map would be expected to reveal interactions that map functionally unclassified proteins in a biological context, and to identify interactions between proteins involved in the same biological function or interactions that connect biological functions to form a larger cellular process. However, as the map itself does not provide information about biological contexts, it has to import additional knowledge from other sources, e. g. signal transduction pathway databases. Even worse, these interactions result in a graph with a high degree of connectivity, including artifacts and false-positive predictions, and often yield only a single huge connected component. Thus, network interpretation is difficult.
Signal transduction pathways
The area of signal transduction pathways (STPs), which is now vigorously researched, attempts to describe the molecular mechanisms of life phenomena. It can be said that signal transduction pathways cover the broadest range of concepts from concrete to abstract levels. Current knowledge in this domain remains incomplete and uncertain. Table 1 compares the features of each biological pathway.
| Table 1: | Features of biological pathways |
| | Metabolic pathways | Genetic networks | Protein interaction maps | STPs |
| biological objects | | | | |
| relations of objects | | | | |
| pathway | | | | |
| # of objects per pathway | ~ × 10^2 | ~ × 10^3 | ~ × 10^3 | ~ × 10 |
| knowledge representation model | | | | |
Compared to other biological networks such as metabolic pathways, the kinds of biological entities that constitute signal transduction pathways are highly diverse. The domain ranges from metal ions, DNA sequences, and proteins to phenotypes or biological concepts such as cellular responses to external stimuli. The nodes in a metabolic pathway, on the other hand, denote simply enzymes, substrates, or products. The semantics of interactions that connect the biological entities in a signal transduction pathway are likewise diverse, in contrast to metabolic pathways where the relations simply denote enzymatic reactions. For example, in a signal transduction pathway, relations of different types such as causality relations of different phenomena, phenotypes, physico-chemical interactions, translocations, and secretions of molecules may appear in a single context (Figure 1).
Due to this heterogeneity, entities of different types and different granularities are related to each other in an intuitive manner. A simple graph-theoretical analogy is therefore not sufficient to handle the knowledge contained in the scientific literature, and a more ontology-conscious representation must be developed. For example, a statement such as "gene A activates phenotype B" mixes different types of semantics in a single context without explicit explanation. The meaning of "phenotype activation" is unclear because a phenotype is an observation while a gene is a biological object. What is required here is a hierarchical representation of pathways. Biological processes that result in a phenotype are an ensemble of molecular interactions and other processes. Therefore, they may be decomposed into arbitrary levels that refer to phenomena of different granularities. As a signal transduction pathway is the actual biological implementation of each specific process, the representation model for pathways requires a recursively decomposable structure.
Compound graph representation
A recursive and hierarchical representation structure is required for biological pathways, especially signal transduction pathways.
In the proposed system, we use a data structure called compound graph [10]. A compound graph is an extension of a graph definition in which each node can contain a graph inside itself. A compound graph $CG = (G, T)$ is defined as the pairing of a graph $G = (V, E_G)$ and a rooted tree $T = (V, E_T, r)$ that share the same set of nodes. We refer to graph $G$ of $CG$ as an interaction graph and to tree $T$ of $CG$ as a decomposition tree. An edge $e_{IG} \in E_G$ is called an interaction edge and an edge $e_{IT} \in E_T$ a decomposition edge. A fragment $\mathrm{Frag}(a)$ of $CG$ is defined as a compound subgraph derived from the nodes of the subtree $T'$ of $T$, rooted by the internal node $a$ of $T$.
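The definition above can be made concrete with a minimal Python sketch. This is an illustration only, not the FREX implementation: the class name `CompoundGraph`, its methods, the choice of directed interaction edges, and the node names (`pathway`, `process_A`, `p1`, ...) are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundGraph:
    """CG = (G, T): an interaction graph and a decomposition tree on one node set."""
    root: str
    # Decomposition tree T, stored as child -> parent (the root maps to None).
    parent: dict = field(default_factory=dict)
    # Interaction graph G, stored as (source, target) pairs (direction is an assumption).
    interactions: set = field(default_factory=set)

    def __post_init__(self):
        self.parent.setdefault(self.root, None)

    def add_node(self, node, parent):
        """Add a decomposition edge parent -> node."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        self.parent[node] = parent

    def add_interaction(self, source, target):
        """Add an interaction edge between any two nodes, internal or leaf."""
        for node in (source, target):
            if node not in self.parent:
                raise KeyError(f"unknown node: {node}")
        self.interactions.add((source, target))

    def subtree_nodes(self, a):
        """Nodes of the subtree T' of T rooted at a."""
        nodes, stack = set(), [a]
        while stack:
            current = stack.pop()
            nodes.add(current)
            stack.extend(n for n, p in self.parent.items() if p == current)
        return nodes

    def fragment(self, a):
        """Frag(a): the compound subgraph derived from the subtree rooted at a."""
        nodes = self.subtree_nodes(a)
        tree = {n: (None if n == a else self.parent[n]) for n in nodes}
        edges = {(s, t) for s, t in self.interactions if s in nodes and t in nodes}
        return CompoundGraph(root=a, parent=tree, interactions=edges)


# Hypothetical pathway: one sub-process with two leaves, plus one extra leaf.
cg = CompoundGraph(root="pathway")
cg.add_node("process_A", "pathway")
for leaf in ("p1", "p2"):
    cg.add_node(leaf, "process_A")
cg.add_node("p3", "pathway")
cg.add_interaction("p1", "p2")
cg.add_interaction("p3", "process_A")  # a leaf interacting with an internal node

frag = cg.fragment("process_A")
print(sorted(frag.parent), sorted(frag.interactions))
```

Because `fragment` returns another `CompoundGraph`, the same operations can be applied again to the result, which mirrors the recursive structure of the definition.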
Figure 2 is an example of a compound graph Footnote 1. Readers should note that in a compound graph, an internal node of the decomposition tree can directly interact with any other node in the graph. This renders compound graphs very suitable for representing biological knowledge of heterogeneous granularities (Figure 3). Readers should also note that any sub-structure of a compound graph (e. g. a node) is itself a compound graph. This feature facilitates recursive querying of pathways in a pathway database.
Figure 3: A process diagram. Subprocesses are annotated by rectangles with bioprocess ontologies. Each rectangle represents a process of different granularity.
One compound graph CG represents one signal transduction pathway. Then, a knowledge base of signal transduction pathways KB is defined by (S, R) , where S is a set of CGs and R is a set of rules to manipulate the data. Every node and edge of a compound graph has an object ID, including its decomposition edges. Each node and edge has a type and each type has a set of attributes that specifies biological information on the object such as localization, modification sites, cell-line, etc.
Query examples
As a result of this ontology-conscious representation of pathways, the user can query any pathway or biological component of a pathway, including subpathways, by specifying attributes of nodes, edges, and pathways together with their values (Figure 4a). For example, the query "find subpathways of allergy responses that have both IL-13 and IL-4" can be translated into "return diagrams that have a Protein-Node named IL-13 AND a Protein-Node named IL-4 AND a Bio-process-Node with the ontological definition of allergy response" (Figure 4b).
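The translated query can be read as a conjunction of node conditions, each of which must be satisfied by at least one node of a diagram. The following Python sketch illustrates that reading under stated assumptions: the dictionary layout, the `ontology` key, the function names, and the two toy diagrams are hypothetical and do not reflect the FREX storage format. Only the node type names and the `disp-name` attribute are taken from the text.

```python
def node_matches(node, node_type, attribute, value):
    """True if the node has the given type and the given attribute value."""
    return node.get("type") == node_type and node.get(attribute) == value


def find_diagrams(diagrams, conditions):
    """Return ids of diagrams in which every condition is met by some node (logical AND)."""
    hits = []
    for diagram in diagrams:
        if all(any(node_matches(n, *cond) for n in diagram["nodes"])
               for cond in conditions):
            hits.append(diagram["id"])
    return hits


# Toy data, invented for illustration only.
diagrams = [
    {"id": "d1", "nodes": [
        {"type": "Protein-Node", "disp-name": "IL-13"},
        {"type": "Protein-Node", "disp-name": "IL-4"},
        {"type": "Bio-process-Node", "ontology": "allergy response"},
    ]},
    {"id": "d2", "nodes": [
        {"type": "Protein-Node", "disp-name": "IL-4"},
        {"type": "Bio-process-Node", "ontology": "allergy response"},
    ]},
]

query = [
    ("Protein-Node", "disp-name", "IL-13"),
    ("Protein-Node", "disp-name", "IL-4"),
    ("Bio-process-Node", "ontology", "allergy response"),
]
print(find_diagrams(diagrams, query))
```

Because each condition names a node type as well as a value, a protein called "allergy response" would not satisfy the Bio-process-Node condition, which is what distinguishes this kind of query from a plain keyword match.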
System architecture
The system, called FREX (Functional Relation EXplorer), is a query system for higher-order functional knowledge. It focuses on data of functional relations such as signal transduction pathways. By accessing the database Footnote 2 one can query molecules, interactions, and pathways or subpathways that are involved in pathways already known in the scientific literature.
The system has a three-tier architecture that consists of a backbone database, an XML middle-ware, and a query processor (Figure 5). PostgreSQL 7.3.4 serves as the backbone database, in which several databases such as pathway databases and bio-molecule databases are stored. A Java-based middle-ware provides APIs to the web-query processor and wraps the RDBMS as an XML database. Both the query processor and the client software for data registration communicate with the backbone database through this middle-ware. The FREX query processor is built on Apache (version 1.3.29) and Tomcat (version 4.1.24). It provides the client with the query interface. The interface is implemented as a Java applet using yFiles (version 2.1.0.4) Footnote 3. Queries are submitted from a Web browser through the internet. Data are sent between the query processor and the middle-ware by RMI (Remote Method Invocation). FREX runs on Solaris or Linux with Java 1.4.2.
FREX interface
On the top page of the FREX web query interface, one can choose among three search modes, i. e. Diagram Search, Node Search, and Edge Search. In the current version, the user can specify five attributes and their corresponding values. These are combined as a logical "AND" query.
Diagram search
The example (Figure 6) shows that a diagram search can be conducted by selecting appropriate attributes from pull-down menus and specifying the corresponding values. The result is a list of matched diagrams from which the user can select one or several diagrams for display. Figure 7a is the result of Figure 6 and Figure 7b is an example of displaying multiple pathways. The thumb-nail at bottom-right provides a bird's-eye view of the entire pathway.
Using the area at bottom-left, the user can specify nodes to search k-shortest paths [11]. At the top of the right frame is a toolbar through which the user can zoom in/out and automatically fit the size of the diagram to the window. Two buttons are provided to toggle between diagram edit- and browse-mode. In the left frame, one can learn the attribute values of a node or an edge. Selection is via the mouse pointer.
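To illustrate what a k-shortest-path search returns, the following Python sketch enumerates loop-free paths between two nodes in order of increasing length. It is not the algorithm of [11] and not the FREX implementation: it uses a simple best-first enumeration with unit edge weights, which is adequate for small diagrams but can be slow on large, densely connected graphs. The function name and the adjacency-dictionary input are assumptions.

```python
import heapq
from itertools import count


def k_shortest_paths(adjacency, source, target, k):
    """Return up to k loop-free paths from source to target, shortest first.

    adjacency maps each node to an iterable of its successors;
    every edge is assumed to have length 1.
    """
    tie = count()  # tie-breaker so the heap never compares path lists
    heap = [(0, next(tie), [source])]
    found = []
    while heap and len(found) < k:
        length, _, path = heapq.heappop(heap)
        last = path[-1]
        if last == target:
            found.append(path)
            continue
        for successor in adjacency.get(last, ()):
            if successor not in path:  # keep paths loop-free
                heapq.heappush(heap, (length + 1, next(tie), path + [successor]))
    return found


# Hypothetical interaction graph.
adjacency = {"A": ["B", "C", "D"], "B": ["D"], "C": ["D"]}
for path in k_shortest_paths(adjacency, "A", "D", k=3):
    print(" -> ".join(path))
```

Partial paths leave the priority queue in order of non-decreasing length, so the first k complete paths that are popped are the k shortest ones.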
Binary interaction search
High-throughput binary relation (HTBR) data such as protein-protein interaction data can be displayed in the same way as diagrams. By selecting data sets from the initial diagram query result, the user can display data from several data sources.
The interface is similar to that of the diagram mode described above, but in the binary relation mode, the layout of nodes is computed automatically by a cluster layout- or a hierarchical layout algorithm (Figure 8). The layout program calculates the coordinates of nodes according to their attribute values (e. g., localization) to help the user understand the data. The color of nodes can also be specified according to attribute values. The k-shortest path function is also available in this search mode.
Node search
In the node search mode, the user specifies attribute values for nodes. If the node is an internal node of the decomposition tree, the node itself is a compound graph and therefore a diagram. This recursive structure renders our knowledge representation model suitable for biological pathway databases. From the list of query results, the user can choose to display (1) the attribute values for a selected node or (2) the entire diagram or (3) only the child structure of a selected node (Figure 9).
Super-imposing data from different sources
While diagram data in FREX are "authentic" knowledge extracted manually from the scientific literature, HTBR-data such as two-hybrid screening or automatic text-mining results are too massive to interpret; they also contain many false-positives. Because of this complementarity, their comparison is an interesting task. We expect that super-imposition of HTBR-data onto authentic diagrams will bridge the gap between fragmented diagrams and that super-imposition of authentic diagrams onto HTBR-data will add contextual knowledge. Figure 10 shows the result of super-imposing a protein-protein binary relation data set onto a set of diagrams.
Ontology-based relaxed search
Other interesting functionalities that are not shown here include ontology-based query relaxation. In the case of zero hits, the database can relax the specified query by traversing the ontologies from each specified concept towards its root according to a pre-defined default- or a user-specified parameter set.
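The relaxation step can be sketched as follows in Python. This is one possible reading of the behaviour described above, not the FREX implementation: the ontology is given as a child-to-parents mapping, `max_steps` stands in for the parameter set, an item matches when its annotation equals the query concept or lies below it, and the concept names form a toy ontology invented for the example.

```python
def ancestors_or_self(ontology, concept):
    """All concepts reachable from concept by following parent links, including itself."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ontology.get(current, ()))
    return seen


def relaxed_search(ontology, annotations, concept, max_steps=2):
    """Search for concept; on zero hits, move one level towards the root and retry.

    ontology:    child concept -> list of parent concepts (a DAG)
    annotations: item id -> concept the item is annotated with
    Returns (number of relaxation steps used, sorted list of matching item ids),
    or (None, []) if nothing matches within max_steps.
    """
    frontier = {concept}
    for step in range(max_steps + 1):
        hits = sorted(item for item, annotated in annotations.items()
                      if frontier & ancestors_or_self(ontology, annotated))
        if hits:
            return step, hits
        frontier = {p for c in frontier for p in ontology.get(c, ())}
        if not frontier:
            break
    return None, []


# Toy ontology and annotations, invented for illustration.
ontology = {
    "allergy response": ["immune response"],
    "inflammatory response": ["immune response"],
    "immune response": ["biological process"],
}
annotations = {"diagram-1": "inflammatory response"}
print(relaxed_search(ontology, annotations, "allergy response"))
```

In this toy case the exact query finds nothing, so the search is retried with the parent concept, and the diagram annotated with a sibling concept is returned after one relaxation step.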
Each diagram is stored as a single XML file. One file represents an independent piece of pathway knowledge reported in the scientific literature or a set of high-throughput experimental data such as protein-protein interaction data. Typically, pathway knowledge is extracted from review articles that compile pathway information from a set of different sources. Figure 11 shows the diagram data stored in the August 2003 version of FREX. Each curated diagram contains around 70 nodes, with a decomposition depth of about 5. If the original review does not specify required attributes such as species, the curator investigates the underlying primary articles, subject to a citation-depth restriction. Since a "complete" review pathway could be a mosaic of fragments from different species, it is possible for a human pathway to contain mammal or mouse proteins. High-throughput data are provided by several laboratories on their websites [7, 8, 9] and each data set is represented as one XML file.
The types of nodes and edges are shown in Figure 12. Each node has a slot to fill in an ontology ID. For example, a Protein-Node has the following attributes: disp-name, localization, tissue, molecule, subunit, modification (phosphorylation, acetylation, glycosylation, ubiquitination, methylation). The edge types in Figure 12 roughly define the meaning of edges. Each edge has an ontology slot "reaction" so that the user can further specify the precise meaning of each edge. For example, a Metabolysis edge can be specified further by the reaction ontology, in which "metabolysis" has a child concept "protein metabolysis" that has a child "protein modification process" that has a child "state-change phosphorylation". Figure 13 is a list of the ontologies utilized to annotate the data model; the ontologies have links to other databases or ontologies (Table 2) Footnote 4. In FREX, the ontologies are DAG-structured categories of biological concepts. All ontologies required to annotate pathway components are stored in the backbone database.
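As a rough illustration of how typed nodes and ontology-annotated edges fit together, the Python sketch below models a Protein-Node and an edge with a "reaction" slot, and checks whether an edge's reaction falls under a broader concept. The class layout, field names, and the `is_a` helper are assumptions made for the example. The parent links reproduce only the single chain quoted above, whereas the FREX ontologies are DAGs in which a concept may have several parents.

```python
from dataclasses import dataclass, field

# Child -> parent links of the reaction ontology chain quoted in the text.
REACTION_PARENT = {
    "state-change phosphorylation": "protein modification process",
    "protein modification process": "protein metabolysis",
    "protein metabolysis": "metabolysis",
}


@dataclass
class ProteinNode:
    """A Protein-Node with a subset of the attributes listed in the text."""
    disp_name: str
    ontology_id: str = ""
    localization: str = ""
    tissue: str = ""
    modification: list = field(default_factory=list)


@dataclass
class Edge:
    """An edge with a coarse type and a finer-grained 'reaction' ontology slot."""
    source: str
    target: str
    edge_type: str
    reaction: str = ""


def is_a(concept, ancestor, parent=REACTION_PARENT):
    """True if concept equals ancestor or lies below it in the parent chain."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = parent.get(concept)
    return False


# Hypothetical edge between two placeholder proteins.
kinase = ProteinNode(disp_name="kinase X", modification=["phosphorylation"])
edge = Edge("kinase X", "substrate Y", "Metabolysis", "state-change phosphorylation")
print(is_a(edge.reaction, "protein metabolysis"))
```

A query phrased at the level of "protein metabolysis" would therefore also retrieve edges annotated with the more specific "state-change phosphorylation".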
| Table 2: | Links to other ontologies |
Pathway databases such as KEGG [12], EcoCyc [13] and aMAZE [14] have a strong focus on metabolism and micro-organisms. For example, EcoCyc (or BioCyc Footnote 5) is a well-known pathway database with ontology-based, well-defined semantics. Metabolism knowledge has been accumulated and is standardized in textbooks. The proposed database system is designed to handle signal transduction pathway data which are more problematic. This is because most of the knowledge about human diseases such as cancer is provided and shared in the vast scientific literature.
Conventional databases of signal transduction can be classified into two categories. One is based on hand-curated interaction data that constitute a set of binary relations, like TRANSPATH and CSNDB [15, 16]. The other is based on hand-drawn clickable illustrations, like SPAD, STKE, and BIOCARTA [17, 18, 19].
In binary relation set-based databases, a pathway is literally a graph. The upstream/downstream pathway of a protein is comprised of the paths from that protein to receptor proteins or transcription factors. There are several drawbacks to this model. First, it is difficult to define subpathways or processes. Second, it is unclear how to treat nodes that refer to concepts of different granularities. In other words, it is difficult to annotate contextual information. Even a restricted upstream/downstream search of 3 steps can involve over 400 nodes; this renders interpretation difficult. Since there is no negation information, the connectivity of pathways continues to increase by adding further knowledge. The ultimate result may be a single graph where everything is connected to everything. In illustration-based databases, each picture represents its contextual information. Since curation is performed with drawing tools, heterogeneous concepts appear in the way they do in the literature, resulting in very limited computability. Another drawback is the lack of inconsistency-checking mechanisms. It is difficult to decompose pictures into subpictures and therefore it is difficult to ascertain whether different pictures describe the same subprocess with the same structure.
The proposed system solves these problems by adopting a hierarchical and recursive structure that is annotated with a set of ontologies.
Gene Ontology (GO) [20] is an indispensable knowledge foundation that defines controlled vocabulary for functions and biological processes. This vocabulary can be applied to all organisms and provides semantics for biological databases. However, it does not yield knowledge regarding the underlying mechanisms of biological processes or provide the connecting diagrams of bio-molecules that constitute those processes.
The system proposed here aims at providing access to information to fill this gap.
A query system for a higher-order knowledge-base was presented. Currently, the primary target is signal transduction pathway data in the scientific literature. By adopting a hierarchical and recursive representation model and annotating each biological entity in the model with ontologies, the FREX system provides a highly flexible and powerful querying facility that is hard to realize in traditional clickable-map based diagram databases.
It is important to understand the difference between a keyword search and an ontology-based search. In an ontology-based search, the system knows to which ontology each term belongs. In a keyword search, on the other hand, although a user can specify a set of terms connected by logical ANDs, the user cannot specify what is meant by each term.
Due to the inherent characteristics of natural science, knowledge about the underlying mechanisms of biological processes is highly heterogeneous in its granularity. The hierarchical specification of an artificial object can be defined in a bottom-up manner. In biology, on the other hand, the first recognizable thing is the observation of experiments, i. e., the behavior of a genetically modified cell, etc. And very typically, some sub-processes of the behavior are well known while others are not. The proposed hierarchical representation model based on a compound graph makes it easy to curate data from literature and to query these curated data.
We plan to curate more data. In the first phase we will curate more diagrams based on published reviews. In the second phase we will curate small fragments of pathways that appear regularly in the scientific literature and connect them back to review-based diagrams. Induction or inference of new pathways, conflict detection and knowledge update remain to be implemented. As pathway objects are strongly annotated with ontologies, we believe these functionalities should be provided by implementing a logic based inference module [21]. The current ontologies are based on a format designed in-house. Their conversion into a more standardized format is desirable to facilitate the use of different software developed by other communities. Benefits could be derived from data exchange formats to import and export diagram data among other interaction databases. Other future plans include improving the performance of the binary relation browser. As the amount of accumulating data threatens to become overwhelming, an alternative choice is to distribute a Java application that stores data inside itself and communicates with web services on our web server for additional data. Traditionally, higher-order biological knowledge (functional knowledge) was available only through the scientific literature. By making this knowledge accessible and "understandable" to computers, computation will come to play another indispensable role in biology.
This work was supported in part by BIRD of Japan Science and Technology Agency (JST), and Grant-in-Aid for Scientific Research on Priority Areas "Genome Information Science" from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
Footnote 1: There are other extensions of the graph definition that yield a hierarchical graph structure. A clustered graph is a compound graph where edges in the interaction graph are allowed only between leaves of the decomposition tree. Another graph model related to compound graphs is a nested graph in which edges exist only between children of the same parent. In this sense, "compound graph" is the most general definition.
Footnote 2: Currently, http://www.ontology.jp/FREX/jsp/
Footnote 3: yFiles is a class library that provides algorithms and components for analyzing, viewing, and laying out graphs, diagrams, and networks. http://www.yworks.com/index.htm
Footnote 4: TissueDB is served at http://tissuedb.ontology.ims.u-tokyo.ac.jp/
Footnote 5: http://www.biocyc.org