Edited by E. Wingender; received May 25, 1998; revised September 18, 1998; accepted September 24, 1998
We report on a knowledge-based pathway-finding system that builds on the cell-signaling networks database, CSNDB, which we developed previously. This new system, PaF-CSNDB, uses a general inference engine to apply rules for finding and coupling pathways between or around specific biomolecules from the CSNDB database. We show how PaF-CSNDB finds relationships in a large but fragmented collection of cell-signaling knowledge by filtering out and composing together those sections of pathways specified from an extensive and complex set of binary or pair-wise cell-signaling reactions. The system can be accessed over the World Wide Web.
Keywords: cellular signal transduction, knowledge-based expert system shell, pathway representation
Cellular signal transduction regulates many important biological responses in multicellular organisms (Alberts et al., 1994). Recent advances in biotechnology have revealed that cellular signal transduction involves an elaborate network of biological reactions. This network can be illustrated as a complex graph that includes many nodes, connections, and loops.
The problem of modeling such complex biological networks on a computer has been addressed recently. The metabolic pathways of some bacteria and plants have been represented as directed graphs with appropriate labels, yielding networks that appear similar to those found in textbooks (EcoCyc [Karp et al., 1998], SoyBase [Letovsky], and KEGG [Goto et al., 1997]). These methods all represent pathways by referring to their component reactions, which are encoded as binary relationships between reactants, products, and enzymes that catalyze the reactions. Karp et al. and Letovsky developed automated drawing software for pathways in their systems, while Goto et al. used manually drawn pathways, although they developed a path computation based on a deductive database method. The systems of both Karp et al. and Letovsky also encoded interconnections between reactions. In this sense, they automated the generation of graphs based on pre-specified interconnections between nodes. The algorithms of both systems appear to be quite similar [Karp and Paley, 1994].
The modeling of regulatory pathways has also been studied (GeneNet [Kolpakov et al., 1998], SPAD, and BRITE). Kolpakov et al. developed automated drawing software in Java for pathways that describe gene regulation networks. However, they used pre-encoded interconnections between biomolecules. In SPAD and BRITE, pathways are drawn manually.
We report a system that infers the interconnectivity between nodes using a knowledge-based expert system shell that does not require pre-encoded connections. The system, called Pathway Finding for the Cell Signaling Networks Database (PaF-CSNDB), is based on our database for cell signaling networks (CSNDB) in human cells [Takai-Igarashi et al., 1999]. CSNDB includes information on signaling pathways and molecules. We used the same algorithm as Letovsky to draw the pathways automatically [Letovsky]. Modeling cell signaling networks with Letovsky's algorithm was successful initially, when only well known pathways described in textbooks or reviews were modeled. However, when newly reported data was added, problems arose in modeling pathways due to the (1) fragmentary nature, (2) complexity, and (3) variability of pathways. This motivated us to develop PaF-CSNDB.
(1) Pathway Fragmentation. Most information stored in a database is not chained but fragmented. Most scientific articles report only one or a few steps rather than entire cascades. For example, a report that Grb2 binds the epidermal growth factor (EGF) receptor and mediates a signal to activate mitogen-activated protein (MAP)-kinases [Yamauchi et al., 1997] focuses on the reaction between Grb2 and the EGF receptor and suggests that succeeding cascades lead to MAP-kinase. When a biologist reads this article, a pathway from Grb2 to Sos to Ras to c-Raf-1 to MAP-kinase-kinase to MAP-kinase is likely to come to mind, although no such description is included in the article. We believe that this sort of knowledge needs to be available from databases. As we describe below, this pathway can be retrieved by PaF-CSNDB. Our knowledge-based expert system shell infers the pathway to MAP-kinase much as a human expert in biology would. A detailed explanation of the system configuration is given in subsequent sections.
(2) Pathway Complexity. In humans, pathways are much more complicated than in simpler organisms such as bacteria. In mammals in particular, many signaling pathways are interconnected. For example, interconnections have been reported between the peptide- and steroid-hormone signaling pathways [Grazzini et al., 1998]. The steroid hormone progesterone binds to the oxytocin receptor, which is a member of the G-protein-coupled receptor family, and inhibits the binding of the peptide hormone oxytocin in the uterus. These findings provided evidence for an interaction between a steroid hormone and a G-protein-coupled receptor and neighboring pathways. In multicellular animals, interconnections between pathways are considered to be important for the evolution of elaborate mechanisms that enable individual cells to communicate with one another to coordinate their behavior for the benefit of the whole organism. In metabolic pathways, consensus sectioning is accepted based on their function. Examples are chlorophyll synthesis, fatty acid desaturation, and tyrosine synthesis [Grant et al.]. However, in cell signaling networks of higher animal phyla, pathway interconnections make it difficult to section them into functional units, and consensus sectioning is not generally accepted. In the case of signaling pathways, this means that it is difficult to encode connections into databases. This has prompted us to develop an interactive pathway drawing system. Assuming that there is no consensus sectioning, we believe that users will need the system to infer possible connections between pathways and to section them according to their requests. PaF-CSNDB produces pathways inferred for a given target molecule based on restrictions specified by the user. The detailed algorithm is discussed below. We also discuss the representation of cyclic interconnections, the feedback loops, in subsequent sections.
(3) Pathway Variability. All the reactions are initially described as pairs of molecules with an explicit direction. One molecule must be the signal transmitter and the other must be the receiver. The direction is explicitly defined from the transmitter to the receiver. In the second stage of processing, however, we often encounter reactions that have no explicit information on direction, but only on polymerization. For example, hetero-dimerization has been reported between the glutamate receptor interacting protein (GRIP) and the AMPA receptor in the brain [Dong et al., 1997], without information on succeeding or preceding pathways. We cannot assign a direction to these kinds of reactions. However, molecular complexes are essential elements of cell signaling pathways where many signaling reactions occur. In this case, Dong et al. (1997) suggest that this binding may be critical to anchor AMPA receptors at excitatory synapses in the brain. They also suggest that succeeding reactions involving the molecular complex lead to biological responses. This example shows that pairing information is valuable and must be represented. Our approach to representing polymerization in pathway drawings is given below. In addition, we also discuss the representation of metabolic reactions in regulatory pathways.
Knowledge-based expert system
Our expert system consists of an inference engine, also called the expert system shell, and a knowledge base. The inference engine is a general-purpose program that can infer consequent assertions from antecedent assertions without any additional programming. The inference engine performs forward chaining of the rules and assertions contained in the knowledge base. The rules are in antecedent-consequent form, and are also called if-then rules or production rules [Winston and Horn, 1989].
In PaF-CSNDB, the assertions are reactions, which are the individual steps in the pathways. The rules find interconnections between the reactions. There are two procedures for connecting reactions: (1) Finding pathways around a target molecule, and (2) Finding pathways between two target molecules. An example of assertions and rules for Procedure (1) follows:
[A1] Start Pathway (starting at "Grb2" with a maximum of "6" connecting steps)
[A2] Define Reaction (from "EGF" to "EGF receptor") (in sequence "EGF -> EGF receptor")
[A3] Define Reaction (from "EGF receptor" to "Grb2") (in sequence "EGF receptor -> Grb2")
[A4] Define Reaction (from "Grb2" to "Sos") (in sequence "Grb2 -> Sos")
[A5] Define Reaction (from "Sos" to "Ras") (in sequence "Sos -> Ras")
[A6] Define Reaction (from "Ras" to "c-Raf-1") (in sequence "Ras -> c-Raf-1")
[A7] Assertion Pathway (sequence "Grb2 -> Sos" "Sos -> Ras" "Ras -> c-Raf-1")
(with "3" connection steps)
[A8] Assertion Pathway (sequence "EGF -> EGF receptor" "EGF receptor -> Grb2")
(with "2" connection steps)
[A9] Assertion Pathway (sequence "EGF -> EGF receptor" "EGF receptor -> Grb2" "Grb2 -> Sos" "Sos -> Ras" "Ras -> c-Raf-1")
(with "5" connection steps)
[R1] Initiate Pathway
IF
(Start Pathway with Target Molecule "T")
(Find Reaction whose Transmitter is "A" and Receiver is "B",
where "A" and "B" are multiple slots)
(in Sequence "P")
(test whether "T" is a member of "A")
THEN
(Assert Initial Pathway) (From "A" To "B") (in Sequence "P")
(with "1" Connection Step)
[R2] Produce Succeeding Pathway
IF
(Maximum Number of Connection Steps is "M")
(Find Pathway From "A" To "B") (in Sequence "P1")
(with "N" Steps) (No Cycle)
(Find Reaction From "C" To "D") (in Sequence "P2")
(test whether "N" is less than "M")
(test Exact-string-pattern-match between "B" and "C")
THEN
SWITCH ("P2" is a member of "P1")
TRUE (Assert Pathway (From "A" To "D") (in Sequence "P1" "P2")
(with "N+1" Steps) (Cycle Included))
FALSE (Assert Pathway (From "A" To "D") (in Sequence "P1" "P2")
(with "N+1" Steps) (No Cycle))
[R3] Produce Preceding Pathway
[R4] Combine Preceding and Succeeding Pathways
[R5] Prune Pathway, if it is longer than Maximum Length after [R4]
[R6] Remove disused Assertions
A1 to A6 are antecedent assertions stored in the database and retrieved as needed by the inference system. A reaction is a structured type instance in which the arguments in the slots are constants. R1 to R6 are rules for producing pathways around the target molecules. The rules for producing pathways connecting two target molecules (Procedure (2)) are as follows:
[R1] Initiate Pathway with Target Molecule "S" and a Maximum of "M" connections
[R2] Produce Succeeding Pathway
[R3] Stop the inference if the Receiver molecule of the connected Reaction is equal to the End Target "E", or if the Maximum number of Connections is equal to "M"
[R4] Remove disused Assertions
The rules consist of antecedent components before the "THEN", which are referred to as the Left-Hand Side (LHS), and consequent components after the "THEN" which are referred to as the Right-Hand Side (RHS). R2 is a recursive procedure because it references the same procedure in both the LHS and RHS.
The inference system can infer new facts that are not stored explicitly in the database. By matching the LHS of R1 and R2 with given facts such as A1 to A6, new facts like A7 can be derived. By matching the LHS of R1 and R3 to the given facts, the new assertion A8 can be derived. Matching the LHS of R4 with assertions A7 and A8 can derive A9. The inference engine stops when no new facts can be derived. As a final step, the facts that match the rules are returned to the user as answers. In this example, A7, A8, and A9 are returned.
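The derivation of A7, A8, and A9 can be mimicked outside CLIPS. The following Python sketch is not the PaF-CSNDB rule file. It is a minimal illustration that assumes every reaction is a pair of single molecules, and the function names (succeeding, preceding, around, show) are hypothetical. Unlike a forward-chaining engine, which asserts every intermediate pathway and later removes disused ones, the sketch returns only the longest chains, and it omits the pruning rule R5.

```python
# Reactions A2-A6 as (transmitter, receiver) pairs; single-molecule slots assumed.
REACTIONS = [
    ("EGF", "EGF receptor"),
    ("EGF receptor", "Grb2"),
    ("Grb2", "Sos"),
    ("Sos", "Ras"),
    ("Ras", "c-Raf-1"),
]

def succeeding(target, reactions, max_steps):
    """Longest chains of reactions leading away from target (cf. R1, R2)."""
    results = []
    def extend(path, tail):
        nexts = [r for r in reactions if r[0] == tail and r not in path]
        if len(path) == max_steps or not nexts:
            if path:
                results.append(path)
            return
        for r in nexts:
            extend(path + [r], r[1])
    extend([], target)
    return results

def preceding(target, reactions, max_steps):
    """Longest chains of reactions leading into target (cf. R3)."""
    results = []
    def extend(path, head):
        prevs = [r for r in reactions if r[1] == head and r not in path]
        if len(path) == max_steps or not prevs:
            if path:
                results.append(path)
            return
        for r in prevs:
            extend([r] + path, r[0])
    extend([], target)
    return results

def around(target, reactions, max_steps):
    """Join every preceding chain to every succeeding chain (cf. R4)."""
    pres = preceding(target, reactions, max_steps) or [[]]
    sucs = succeeding(target, reactions, max_steps) or [[]]
    return [p + s for p in pres for s in sucs if p or s]

def show(path):
    """Render a chain in the "A -> B" notation used for sequences."""
    return ["%s -> %s" % r for r in path]

for path in around("Grb2", REACTIONS, 6):
    print(show(path))
```

With the five reactions above and "Grb2" as the target, the succeeding chain corresponds to A7, the preceding chain to A8, and the joined chain of five reactions to A9.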
Cycles are allowed in preceding or succeeding pathways. When a cycle is produced, the inference stops, in order to prevent it from producing the same cycle repeatedly. In biology, a cycle in a cell signaling pathway is generally called "feedback regulation". Feedback regulation finely balances a whole system in which any temporary changes would be disastrous. For example, the interaction of p53 and MDM2 has been reported [Haupt et al., 1997]. The MDM2 oncoprotein is a potent inhibitor of p53, which binds to the transcriptional activation domain of p53 and blocks its ability to regulate target genes and to exert antiproliferative effects. On the other hand, p53 activates the expression of the MDM2 gene in an autoregulatory feedback loop. In this case, the inference engine produces a pathway consisting of "MDM2 -> p53", "p53 -> MDM2", and "MDM2 -> p53". This sequence of reactions is converted into a cycle in the graphical representation. The inference stops at this step. If there are other pathways connected to either "p53" or "MDM2", they will be produced. If pathways share the same reaction, the Graph Drawer combines them. The Graph Drawer is explained below.
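The stopping behavior for cycles can be sketched in a few lines of Python. This is an illustration of the SWITCH in [R2], not the CLIPS rule itself: the function name extend_with_cycle_check is hypothetical, and reactions are assumed to be pairs of single molecules. A sequence is extended until the reaction about to be appended is already a member of the sequence; that reaction is appended once, the sequence is flagged as containing a cycle, and it is not extended further.

```python
def extend_with_cycle_check(reactions, start, max_steps):
    """Return (sequence, has_cycle) pairs that are not extended any further."""
    finished = []
    # Each work item is (sequence so far, receiver of its last reaction).
    stack = [([r], r[1]) for r in reactions if r[0] == start]
    while stack:
        seq, tail = stack.pop()
        nexts = [r for r in reactions if r[0] == tail]
        if len(seq) >= max_steps or not nexts:
            finished.append((seq, False))
            continue
        for r in nexts:
            if r in seq:
                # The same reaction is met again: record the cycle and stop.
                finished.append((seq + [r], True))
            else:
                stack.append((seq + [r], r[1]))
    return finished

FEEDBACK = [("MDM2", "p53"), ("p53", "MDM2")]
for seq, has_cycle in extend_with_cycle_check(FEEDBACK, "MDM2", 3):
    print(seq, has_cycle)
```

For the two p53 and MDM2 reactions, the sketch yields the single flagged sequence "MDM2 -> p53", "p53 -> MDM2", "MDM2 -> p53" described above.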
The rules can be implemented in a procedural language such as C. However, the advantages of an expert system shell are its declarative notation and superior built-in pattern matching ability for lists of strings. In particular, in this declarative style, recursive queries can be executed easily by writing rules recursively, as illustrated for [R2].
System configuration
Figure 1 shows the system configuration. This system consists of two types of files (Database and Rules) and four main modules (ACEDB, Extractor, Inference Engine, and Graph Drawer).
Figure 1: PaF-CSNDB system configuration.
[Database] The database is CSNDB. It includes information on signaling reactions and molecules [Takai-Igarashi et al., 1999]. All the attributes stored in the database are listed in the appendix. Data in CSNDB are managed by ACEDB. The data in the "Signal_Reaction" class are used to produce assertions for the Inference Engine. In this sense, facts are stored in the database.
[Rules] The rule file is the text file that contains the rules used by the Inference Engine.
[ACEDB] ACEDB is an object-oriented database management system specifically designed for biological systems [Thierry-Mieg and Durbin]. Data are organized into objects and classes, where each object belongs to exactly one class and has an object identifier. A model is represented as a directed tree. The root of the tree is an object identifier, and the nodes are rooted subtrees of the model or an attribute specification consisting of an attribute name and the corresponding attribute domain. Data are updated or removed by the user through an X-window interface. When it receives commands from the Extractor, ACEDB searches the database, retrieves objects, and returns them to the Extractor. ACEDB is written in the C programming language, and all the source code is available to researchers.
[Extractor] The Extractor accepts queries from the user interface, reads the data from the Database via ACEDB, checks for synonyms, sorts the data, adds restrictions, and transfers them to the Inference Engine. First, query strings specified by the user are checked to determine whether they are precise object names or synonyms, which the Extractor converts into object names. This conversion is based on the synonym table included in CSNDB. Individual sets of arguments are stored in the "From_Molecule", "To_Molecule", "Component", or "Enzyme" attributes of the "Signal_Reaction" class and are sorted after retrieval, because the pattern matching in the Inference Engine is performed between concatenated strings produced by individual sets of arguments. Restrictions are added in three ways: (1) by specifying the maximum number of connecting steps, (2) by narrowing the domain to be covered, and (3) by eliminating subtrees. (1) The total number of connecting steps is specified by the user and embedded in the Rules, as described in the previous section. (2) Preceding or succeeding pathways can be selectively disregarded. The Extractor regulates this by eliminating the corresponding rules from the Rule File. (3) Subtrees whose roots are molecules specified by the user are removed from the graph. The Extractor regulates this by filtering out corresponding assertions in the process of transferring the assertions to the Inference Engine.
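Three of the Extractor's tasks can be sketched in Python. This is an illustration under stated assumptions rather than the actual Extractor: the function names are hypothetical, the synonym table holds only the "MAPK" entry mentioned in this paper, the " + " separator in slot_key is an arbitrary choice, and subtree elimination is assumed to drop every assertion that mentions a root molecule in either slot, which is one possible reading of "filtering out corresponding assertions".

```python
# Hypothetical synonym table; only the "MAPK" entry is taken from the text.
SYNONYMS = {"MAPK": "MAP-kinase"}

def resolve(name, object_names, synonyms=SYNONYMS):
    """Return the object name for a query string, or None if it is unknown."""
    if name in object_names:
        return name
    return synonyms.get(name)

def slot_key(molecules):
    """Sort a multi-valued slot and concatenate it, so that the same set of
    molecules always gives the same string for exact pattern matching."""
    return " + ".join(sorted(molecules))

def drop_subtree_roots(reactions, roots):
    """Assumed filter: discard (from, to) pairs that mention a root molecule,
    so that no pathway can be chained through that molecule."""
    return [r for r in reactions if r[0] not in roots and r[1] not in roots]

names = {"MAP-kinase", "EGF receptor"}
print(resolve("MAPK", names))
print(slot_key(["GRIP", "AMPA receptor"]))
```

Sorting before concatenation matters because the Inference Engine compares strings exactly: without it, the same pair of molecules retrieved in a different order would fail to match.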
[Inference Engine] The Inference Engine is CLIPS [Riley et al., 1988], a production system shell with a syntax similar to that of LISP. CLIPS is written in the C programming language, and has the advantage that it can be embedded into other applications written in C, which is a major reason that we used CLIPS instead of LISP.
[Graph Drawer] The Graph Drawer converts consequent assertions evaluated by the Inference Engine into a pathway graph. Pathways are represented as labeled, directed graphs. In a pathway graph, nodes indicate biological molecules, arrows indicate signaling reactions, and both nodes and arrows are labeled. Drawing pathways is a graph layout problem, for which we use an algorithm developed by Letovsky [Letovsky]. The algorithm consists of two steps:
(1) Determine whether the topology of the input pathway is a cyclic or an acyclic graph.
(2) Apply the graph layout algorithm that is appropriate to the topology of the pathway, and to the nodes of the pathway, thus assigning positions to the nodes.
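Step (1) can be illustrated with a standard depth-first search. The sketch below is not Letovsky's algorithm, whose details are not given here; it is a textbook cycle test in Python, with the hypothetical function name is_cyclic, applied to reactions written as (transmitter, receiver) pairs.

```python
def is_cyclic(edges):
    """Return True if the directed graph given by (from, to) edges has a cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                return True  # back edge to a node on the current path
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)

print(is_cyclic([("MDM2", "p53"), ("p53", "MDM2")]))
print(is_cyclic([("Grb2", "Sos"), ("Sos", "Ras")]))
```

The test also works for disconnected graphs, because the search is restarted from every node that has not yet been visited.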
The graph can be a connected or disconnected, cyclic or acyclic directed graph. As we described in the previous section, the Inference Engine independently produces pathways that connect to a cycle. The Graph Drawer combines them if the pathways share a reaction. The display is generated by a HyperObject approach that combines both text and graphics, based on the information stored in the database. When a user clicks on an object, the system opens a new window for the object. In the pathway graph, clicking on a node calls up information on a signaling molecule, and clicking on an arrow calls up information on a signaling reaction from the Database. There are two types of arrow: solid and broken. Solid arrows represent reactions stored in CSNDB, and broken arrows represent connections inferred by the Inference Engine. In most cases, the graph produced is a portion of the extended network. Therefore, information on adjacent reactions needs to be available from the system. The Graph Drawer searches for reactions adjacent to root and leaf nodes, and indicates the number of such reactions in the graph.
A typical session between a user and PaF-CSNDB might proceed in the following way:
Relationships stored in the Database
Reactions are stored in the Signal_Reaction class of CSNDB. Arguments for the molecules that constitute a reaction are translated into an assertion about an object in the class. The record specifications distinguish three types of reaction: (1) standard reactions, (2) polymerization reactions, and (3) metabolic reactions.
(1) Standard reactions:
Most signaling reactions fall into this group. This includes signaling reactions that consist of two elements, transmitters and receivers. It excludes metabolic reactions that consist of three elements, such as reactants, products, and enzymes. The signal transfer direction is explicit in the standard reactions, from transmitters to receivers. In CSNDB, transmitters and receivers are stored in the attributes named "From_Molecule" and "To_Molecule", respectively. Records for either attribute can be a value or a set of values if the signal is transferred between more than one molecular complex. An example of the objects stored in CSNDB is:
Signal_Reaction: "EGF receptor -> Grb2"
From_Molecule "EGF receptor"
To_Molecule "Grb2"
Tissue "liver"
Effect "activation"
Interaction "SH2+phosphorylated Tyr" "Tyr1068 of EGF receptor"
Activity "growth-hormone-induced activation of MAP-kinase"
Reference "[Yamauchi_1997]"
These attributes mean that a reaction from EGF receptor to Grb2 is observed in the liver. This is an active signal that can induce the growth-hormone-induced activation of MAP-kinase. The interaction that occurs between molecules is of the "SH2+phosphorylated Tyr" type at "Tyr1068 of the EGF receptor". We abstracted this knowledge from the reference, Yamauchi et al. (1997). A corresponding assertion is:
(reaction (from "EGF receptor") (to "Grb2") (sequence "EGF receptor -> Grb2") (step 0))
As this example shows, the assertion is produced by referring to attributes of the objects in the Database.
(2) Polymerization reactions:
As mentioned in the introduction, there are reactions with no directionality consisting only of pairing information. Consider the hetero-dimerization of GRIP and the AMPA receptor [Dong et al., 1997] mentioned in the introduction. Dong et al. suggest that the molecular complex leads to succeeding biological responses. For the predecessor reactions, both GRIP and the AMPA receptor can receive signals independently. For example, the AMPA receptor can receive signals from L-glutamate [Barria et al., 1997] and CaM-kinaseII [Mulle et al., 1998], while GRIP can make a different hetero-dimer with the thyroid hormone receptor [Feng et al., 1998]. These reports suggest that this type of reaction has two kinds of adjacent predecessor reactions that point to the component molecules individually, and at least one kind of adjacent successor reaction that is caused by the molecular complex. Based on these considerations, we convert this reaction into two assertions:
(reaction (from "AMPA receptor")(to "AMPA receptor" "GRIP")(sequence "-> AMPA receptor + GRIP")(step 0))
(reaction (from "GRIP")(to "AMPA receptor" "GRIP")(sequence "-> AMPA receptor + GRIP")(step 0))
For the corresponding object in the database, we designated the object name as starting with the "->" symbol, to distinguish it from other kinds of reactions. We call this an "association type" reaction, and the corresponding entry in CSNDB is as follows:
Signal_Reaction: "-> AMPA receptor + GRIP"
Component "AMPA receptor" "GRIP"
Tissue "brain"
Effect "association"
Interaction "PDZ domain" "GRIP binds with C-termini of AMPA receptor"
Activity "It is critical for clustering AMPA receptors at excitatory synapses."
Reference "[Dong_1997]"
On the other hand, there are quite different polymerization reactions. The complex of the aryl hydrocarbon receptor (Ah receptor) and heat-shock protein 90 (HSP90) is an example [Powell-Coffman et al., 1998]. The Ah receptor in the complex is inactive, and the molecular complex retains the components in a pre-activation state [Alberts et al., 1994]. When activated, the complex dissociates and the individual component molecules function independently. The Ah receptor binds to dioxin, a carcinogenic and teratogenic chemical [Powell-Coffman et al., 1998], and HSP90 associates with various molecules, which it activates or inactivates. For example, HSP90 binds to endothelial nitric oxide synthase (eNOS) [Garcia-Cardena et al., 1998], c-Raf-1 [Tzivion et al., 1998], and various nuclear receptors, such as estrogen and progesterone receptors [Alberts et al., 1994]. These reports suggest that this type of reaction has one kind of adjacent predecessor reaction that points to the complex, and two kinds of adjacent successor reactions that are caused by the individual components independently. Taking this into account, we convert this reaction into the two following assertions:
(reaction (from "Ah receptor" "HSP90")(to "Ah receptor")(sequence "Ah receptor + HSP90 ->")(step 0))
(reaction (from "Ah receptor" "HSP90")(to "HSP90")(sequence "Ah receptor + HSP90 ->")(step 0))
For the corresponding object in the database, the object name is designated as ending with the "->" symbol to distinguish it from other types of reactions. We call this a "dissociation type" reaction, and the corresponding entry in CSNDB is:
Signal_Reaction: "Ah receptor + HSP90 ->"
Component "Ah receptor" "HSP90"
Effect "dissociation"
Interaction "PAS domain" "of Ah receptor"
Activity "inactivation of Ah receptor"
Reference "[Powell-Coffman_1998]"
(3) Metabolic reactions:
This group includes reactions consisting of three elements: reactants, products, and enzymes. A record in CSNDB is as follows:
Signal_Reaction: "phospholipase C-beta + PIP2 -> IP3"
From_Molecule "PIP2"
To_Molecule "IP3"
Effect "metabolism"
Enzyme "phospholipase C-beta"
Reference "[Dove_1997]" "[Alberts_1994]"
The object name is designated in the style "enzyme + reactant -> product". In this case, the enzyme is stored in a separate attribute, "Enzyme". In order to convert metabolic reactions into the same type of assertion as used with standard reactions, three elements need to be reduced to two elements. We divided the three elements into two pairs, "reactants and products" and "enzyme and products". This division is based on the consideration that adjacent predecessor reactions can point to either the reactants or the enzymes separately. For example, the enzyme phospholipase C-beta is activated by G-alpha-q ("G-alpha-q -> phospholipase C-beta") [Offermanns et al., 1997], and the reactant PIP2 is produced by another metabolic reaction, "PIP + PIP5-kinase -> PIP2" [Hall, 1998]. As this example shows, the enzyme and the reactant can be processed independently before the reaction. For the successor reactions, only the product, IP3, gives rise to a successor reaction, as indicated by the example "IP3 -> IP3 receptor" [Alberts et al., 1994]. We therefore convert the reaction into the two following assertions:
(reaction (from "phospholipase C-beta") (to "IP3") (sequence "phospholipase C-beta + PIP2 -> IP3") (step 0))
(reaction (from "PIP2") (to "IP3") (sequence "phospholipase C-beta + PIP2 -> IP3") (step 0))
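The conversion of the three record types into binary assertions can be summarized in a short Python sketch. It is an illustration, not the Extractor's code: the function name to_assertions and the dictionary layout of a record are assumptions, the reaction type is inferred from the object name and from the presence of an "Enzyme" attribute, and the "step" slot is left out.

```python
def to_assertions(name, record):
    """Turn one Signal_Reaction record into (from, to, sequence) tuples.

    `record` is assumed to map attribute names to lists of molecule names.
    """
    if name.startswith("->"):
        # Association type: each component points to the complex.
        complex_ = tuple(sorted(record["Component"]))
        return [((c,), complex_, name) for c in complex_]
    if name.endswith("->"):
        # Dissociation type: the complex points to each component.
        complex_ = tuple(sorted(record["Component"]))
        return [(complex_, (c,), name) for c in complex_]
    product = tuple(record["To_Molecule"])
    if "Enzyme" in record:
        # Metabolic reaction: enzyme and reactant each point to the product.
        return [(tuple(record["Enzyme"]), product, name),
                (tuple(record["From_Molecule"]), product, name)]
    # Standard reaction: transmitter points to receiver.
    return [(tuple(record["From_Molecule"]), product, name)]

RECORDS = {
    "EGF receptor -> Grb2": {"From_Molecule": ["EGF receptor"], "To_Molecule": ["Grb2"]},
    "-> AMPA receptor + GRIP": {"Component": ["AMPA receptor", "GRIP"]},
    "Ah receptor + HSP90 ->": {"Component": ["Ah receptor", "HSP90"]},
    "phospholipase C-beta + PIP2 -> IP3": {
        "From_Molecule": ["PIP2"], "To_Molecule": ["IP3"],
        "Enzyme": ["phospholipase C-beta"]},
}
for name, record in RECORDS.items():
    for assertion in to_assertions(name, record):
        print(assertion)
```

Every record except a standard reaction yields two assertions, which is what lets a single pairwise matching rule chain through complexes and metabolic steps.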
We use four examples to illustrate pathway discovery by PaF-CSNDB.
Graphical representation:
A graph of a pathway is illustrated in Figure 2. The graph consists of nodes, arrows, and labels. The nodes represent molecules. Clicking on the label of a node retrieves information from CSNDB, and displays it in a new window. Clicking on a solid arrow or its label will display information about the reaction, in the same way as clicking on a node label. Broken arrows represent connections inferred by the Inference Engine. A boxed label indicates that the node is a target molecule, which is a specified molecule to start the inference of connections. A number circled at a node indicates reactions adjacent to the graph. The number denotes extended pathways that are eliminated from the graph by restricting its domain. A green label under a reaction-label represents the tissue where the reaction is observed.
MAP-kinase cascade:
As described in the Introduction, the pathway between the EGF receptor and MAP-kinase was searched. Figure 2 shows the graph produced for the conditions: find pathways between the two target molecules, starting molecule is "EGF receptor" and ending molecule is "MAP-kinase", and the maximum number of steps is "6". This graph indicates that two pathways connect "EGF receptor" to "MAP-kinase"; one pathway consists of "EGF receptor->Grb2", "Grb2->Sos", "Sos->Ras", "Ras->c-Raf-1", "c-Raf-1-> MAP-kinase-kinase", and "MAP-kinase-kinase-> MAP-kinase" and the other consists of "EGF receptor->Grb2", "Grb2->Sos", "Sos->Ras", "Ras->Raf", "Raf-> MAP-kinase-kinase", and "MAP-kinase-kinase-> MAP-kinase". Two pathways were found because Raf and c-Raf-1 are stored as independent objects in the database. "Raf" stands for "Raf family", and the Raf-object includes biological attributes common to the Raf family. c-Raf-1 is a member of the Raf family, and the c-Raf-1-object includes biological attributes specific to the subtype. These two pathways indicate routes from Grb2 to MAP-kinase that are omitted in Yamauchi et al. (1997). In this example, PaF-CSNDB draws together knowledge omitted in the original article. In this case, all the individual reactions come from different references. In the graph, the numbers "2" in the node labeled "EGF receptor" and "4" in the node labeled "MAP-kinase" indicate that two and four reactions follow the respective nodes, although they are omitted from the graph.
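The search between two target molecules can be reproduced with a small Python sketch. It is a plain depth-first enumeration rather than the CLIPS rules of Procedure (2), the function name pathways_between is hypothetical, and the reaction list contains only the eight reactions named in this example, written as pairs of single molecules.

```python
CASCADE = [
    ("EGF receptor", "Grb2"),
    ("Grb2", "Sos"),
    ("Sos", "Ras"),
    ("Ras", "c-Raf-1"),
    ("Ras", "Raf"),
    ("c-Raf-1", "MAP-kinase-kinase"),
    ("Raf", "MAP-kinase-kinase"),
    ("MAP-kinase-kinase", "MAP-kinase"),
]

def pathways_between(start, end, reactions, max_steps):
    """All chains of at most max_steps reactions leading from start to end."""
    found = []
    def extend(path, tail):
        if path and tail == end:
            found.append(path)  # the end target is reached: stop here
            return
        if len(path) == max_steps:
            return              # step limit reached without finding the target
        for r in reactions:
            if r[0] == tail and r not in path:
                extend(path + [r], r[1])
    extend([], start)
    return found

for path in pathways_between("EGF receptor", "MAP-kinase", CASCADE, 6):
    print(" ; ".join("%s -> %s" % r for r in path))
```

With a maximum of six steps the sketch finds the two routes described above, one through c-Raf-1 and one through Raf; with a maximum of five it finds none, because both routes need six reactions.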
Metabolic reactions:
We next consider "phospholipase C-beta + PIP2 -> IP3" mentioned in the last section as an example of a metabolic reaction. Figure 3 shows the graph produced for the conditions: find pathways around the target molecules, target molecule is "IP3", the maximum number of preceding connecting steps is "3", disregard successor reactions to the target molecule, and remove the subtree with the root LAT (linker for activation of T cells). In this graph, one pathway pointing to phospholipase C-beta and three pathways pointing to PIP2 are produced. The former pathway starts at Gq and ends at the reaction of "G-alpha-q -> phospholipase C-beta". The latter pathways start at Rac, Cdc42, and PI kinase and PI and end at the reaction of "PIP + PIP5-kinase -> PIP2". All these pathways converge on the pathway "phospholipase C-beta + PIP2 -> IP3". These are examples of independent pathways that activate either the reactant or enzyme of a metabolic reaction. In this graph, the "LAT"-subtree is removed, because a predecessor tree that connects to PLC-gamma1 through the reaction of "LAT -> PLC-gamma1" made the graph complex.
Polymerization reactions:
The hetero-dimerization of HSP90 and the Ah receptor is an example of a polymerization reaction. Figure 4 shows the graph produced for the conditions: find pathways around the target molecules, target molecule is "Ah receptor", and maximum number of steps is "2". Three arrows are derived from the reaction "Ah receptor + HSP90 ->" that is shown as the upper-right arrow in the graph. The first arrow connects to "-> Ah receptor + Arnt". This connection is mediated by the Ah receptor, while HSP90 mediates the second and third connections. The second arrow is "-> eNOS + HSP90" and the third one is "HSP90 -> c-Raf-1". These examples show that the components of a dissociation type polymerization reaction lead independently to succeeding reactions. In addition, "Ah receptor + HSP90 ->" and "-> Ah receptor + Arnt" are a pair of dissociation and association type polymerization reactions, respectively. As this example shows, if the same molecule plays a role in dissociation and association type reactions, the reactions can be connected. The occurrence of such a connection has been reported [Powell-Coffman et al., 1998], demonstrating that our representation works properly.
Feedback regulation:
Finally, we consider the interaction of "p53" and "MDM2" [Haupt et al., 1997] as an example of feedback regulation. Figure 5 shows the graph produced for the conditions: find pathways around the target molecules, target molecule is "p53", the maximum number of steps is "3", disregard preceding pathways, and remove two subtrees whose roots are "Bax (Bcl-2-associated X protein)" and "hTAFII31 (human TFIID TATA box-binding protein-associated factor)". The feedback loop of "p53" and "MDM2" is represented by the square consisting of two solid and two broken arrows. In a cycle, half of the arrows are reversed or upside down. Five pathways succeeding p53, "p53 -> Bax", "p53 -> p85", "->HIF-1alpha + p53", "->E1B + p53", and "p53 -> hTAFII31", are connected to the cycle. The pathways consisting of and succeeding the cycle are produced independently by the Inference Engine, and are combined by the Graph Drawer, as this example shows.
The user interface:
Figure 6 illustrates our user interface. In the window, the user can select the following conditions:
(1) find pathways around a target molecule or between two target molecules,
(2) target molecule,
(3) maximum number of connecting steps,
(4) domain range, to remove preceding or succeeding reactions around the target molecule, and
(5) subtree elimination, to delete a subtree whose root is specified by the user.
Options (3), (4), and (5) simplify the graph. The user interface allows the range of pathways to be widened, narrowed, increased or decreased easily.
In this example, "Find the pathway between two molecules" is selected, and two molecules, "EGF receptor" and "MAPK", are specified as the initial data. This query produces the graph illustrated in Figure 2. "MAPK" is a synonym rather than a correct object name. The Extractor accepts this name, and converts it into the correct name "MAP-kinase", since "MAPK" is included in the CSNDB synonym list.
Figure 6: The PaF-CSNDB user interface. This example (A) shows the query that produced the graph shown in Figure 2. Two molecules, "EGF receptor" and "MAPK", are selected as the initial data. The maximum number of steps is specified as "6". Because "MAPK" is a synonym, it is converted into the precise object name "MAP-kinase". This query produces pathways starting at "EGF receptor" and ending at "MAP-kinase". The user can search molecules by using "CSNDB query" and "Alphabetical lists of molecules". In CSNDB query, one can execute partial pattern matching. This example shows the retrieval for the query "*receptor*", and the retrieved information is displayed in a new window (B).
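The synonym conversion and the wildcard search can be illustrated with a short Python sketch. The functions `resolve_name` and `search_names` are hypothetical and do not reproduce the Extractor or the CSNDB query code. The sketch assumes that a synonym maps to exactly one object name and that matching is case-sensitive, neither of which is stated above.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List


def resolve_name(name: str, object_names: Iterable[str],
                 synonyms: Dict[str, str]) -> str:
    """Return the object name for `name`, following the synonym list if needed."""
    if name in object_names:
        return name
    if name in synonyms:
        return synonyms[name]
    raise KeyError(f"unknown molecule name: {name}")


def search_names(pattern: str, object_names: Iterable[str]) -> List[str]:
    """Return object names that match a shell-style wildcard pattern."""
    return sorted(n for n in object_names if fnmatchcase(n, pattern))


# A tiny stand-in for the database, using names that appear in the text.
object_names = {"EGF receptor", "MAP-kinase", "Ah receptor"}
synonyms = {"MAPK": "MAP-kinase"}

print(resolve_name("MAPK", object_names, synonyms))
print(search_names("*receptor*", object_names))
```

Resolving names before the search starts means that the rest of the pipeline only ever sees object names, so the connection rules do not need to know about synonyms.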
Data Content:
As of August 21, 1998, CSNDB contains 827 Signal_Reaction objects. Of these, 699 are standard reactions, 94 are polymerization reactions, and 34 are metabolic reactions. It contains 1514 Signal_Molecule objects and 685 Synonyms. All the data were collected from the original articles. We searched Nature from 1996 onwards and Science from 1997. The MEDLINE ID [Benson et al., 1993] is attached to every entry, so that the user can refer to the original articles.
We developed a system that infers pathways using a knowledge-based expert system shell without any explicitly encoded connections. A user can obtain possible connections starting and ending at, or around, specified molecules. Because newly reported signaling reactions are frequently fragmentary, the inferences provided by PaF-CSNDB will help users gather knowledge about neighboring reactions.
In order to represent signaling reactions as binary relationships, we considered the representation of three types of reactions: standard, metabolic, and polymerization. In this paper, we showed that our representations work consistently within our system. To cope with the complexity of pathways, we developed three methods of simplifying the graphs: setting limits on the number of connecting steps, restricting the calculation domain, and eliminating subtrees from the graph. Our methods enable a user to filter the pathways and find appropriate connections. However, our representations and methods are still limited. Cell-signaling reactions essentially have multiple axes, involving tissues, developmental stages, cell-cycle stages, external stimuli, and mutations. In addition, interactions between pathways make the connections much more complex. Since our method reduces multi-dimensional reactions into two dimensions, the graphs are naturally complicated. Our sectioning of pathways depends on arguments in the inference rule. Because it considers only one of the multiple axes, external stimuli, our sectioning is still primitive. We are investigating the nature of pathways further, in order to include the remaining axes (tissues, development, cell cycle, and mutations) in the inference system and the graph representation.
Currently, PaF-CSNDB has three major limitations. (1) Retrieval takes too long if there are many answers, because of the limited memory (64 MB) available on the machine used for our implementation. Even worse, the calculation process is sometimes aborted by the operating system because of memory overflow problems. The current system is therefore restricted to no more than two target molecules. We are trying to optimize the algorithm to use much less memory. (2) The representations of signaling reactions are still unsatisfactory, as discussed above. (3) The data are still limited. We started collecting data two years ago, and are continuing to update them. We regularly check two journals, Science and Nature. Since these two journals cannot contain all available knowledge, the pathways produced by PaF-CSNDB are not guaranteed to be complete at present. However, as we illustrated in some examples, PaF-CSNDB can draw together knowledge omitted in the articles. We believe that PaF-CSNDB provides a useful method for finding relationships within the amassed knowledge.
The authors thank Dr. J. Thierry-Mieg and Dr. R. Durbin, who helped us to use ACEDB. We are grateful to Dr. C. Kulikowski for critically editing the manuscript. This work was partly supported by the Japanese Human Science Foundation Research Aid Program and by the Science Research Promotion Fund of the Science and Technology Agency.
Lists of attributes contained in CSNDB
| Class name | Attribute group | Attribute name | Definition |
| Signal_Molecule | Category | Synonym | Synonyms |
| | | Family | Family |
| | | Homolog | Homologue of other species |
| | | Type | Type: Hormone, Cytokine, Neurotransmitter, Receptor, Ion Channel, Enzyme, Effector, Messenger, or Transcription Factor |
| | | Superfamily | Biological groups that represent common functions or structures |
| | | Tissue | Tissue |
| | | Cellular location | Cellular location |
| | | Activity | Miscellaneous information concerning functions |
| | Function | Subunit | Subunit |
| | | Domain | Domain |
| | | Molecular weight | Molecular weight and amino acid length |
| | | SwissProt | SwissProt ID |
| | | GenBank | GenBank ID |
| | | 3D structure | References and PDB IDs |
| | | 2D structure | GIF images of the skeletal structure |
| | | Chromosome | Location on human chromosome |
| | | Pathway | Cross-reference to Class: Signal_Reaction |
| | Disease | Disease | Disease name |
| | | Mutation | Related mutation sites |
| | Chemical | Formula | Formula |
| | | CAS | CAS ID |
| | | Binding constant | IC-50, EC-50, or Ki |
| | | Type | Activator, Inhibitor, Agonist, or Antagonist |
| | | Source | Animals that produce the chemical or Industrial Products |
| | Reference | | Cross-reference to Class: Paper |
| Class name | Attribute name | Definition |
| Signal_Reaction | From_Molecule | Transmitter in a standard reaction, or Reactant in a metabolic reaction |
| | To_Molecule | Receiver in a standard reaction, or Product in a metabolic reaction |
| | Component | Component in a polymerization reaction |
| | Enzyme | Enzyme in a metabolic reaction |
| | Tissue | Tissue |
| | Effect | Activation, Suppression, Metabolism, Association, Dissociation, or Inducement |
| | Activity | Miscellaneous information concerning functions |
| | Interaction | Molecular interaction |
| | Reference | Cross-reference to Class: Paper |
| Class name | Attribute name | Definition |
| Paper | Journal | Journal name |
| | Year | Year |
| | Volume | Volume |
| | Page | Page |
| | Author | Author |
| | Type | Article or Review |
| | MEDLINE | PubMed ID |
| | Signal_Molecule | Cross-reference to Class: Signal_Molecule |
| | Signal_Reaction | Cross-reference to Class: Signal_Reaction |
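As a reading aid, the Signal_Reaction attributes listed above can be pictured as a small record type. The Python sketch below is hypothetical: CSNDB is built on ACEDB rather than Python, the `SignalReaction` class covers only some of the attributes, and the rule used in `kind` (a reaction with components is a polymerization, one with an enzyme is metabolic, anything else is standard) is an assumption drawn from the attribute definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SignalReaction:
    """Hypothetical in-memory mirror of part of the Signal_Reaction class."""
    from_molecule: Optional[str] = None    # transmitter or reactant
    to_molecule: Optional[str] = None      # receiver or product
    components: List[str] = field(default_factory=list)  # polymerization only
    enzyme: Optional[str] = None           # metabolic only
    effect: Optional[str] = None           # e.g. "Association", "Dissociation"
    references: List[str] = field(default_factory=list)  # links to Paper objects

    def kind(self) -> str:
        """Guess the reaction type from the attributes that are filled in."""
        if self.components:
            return "polymerization"
        if self.enzyme is not None:
            return "metabolic"
        return "standard"


# Two reactions named in the Figure 4 example; other fields are left empty.
dissociation = SignalReaction(components=["Ah receptor", "HSP90"],
                              effect="Dissociation")
standard = SignalReaction(from_molecule="HSP90", to_molecule="c-Raf-1")

print(dissociation.kind())
print(standard.kind())
```

Keeping all three reaction types in one record, with unused fields left empty, follows the layout of the attribute list, where From_Molecule, To_Molecule, Component, and Enzyme sit side by side in a single class.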