Pathway Analysis in Metabolic Databases via Differential Metabolic Display (DMD)

Robert Küffner, Ralf Zimmer and Thomas Lengauer




GMD -- German National Research Center for Information Technology
Institute for Algorithms and Scientific Computing (SCAI)
Schloß Birlinghoven
D­53754 Sankt Augustin, Germany
E-mail: {Robert.Kueffner,Ralf.Zimmer,Thomas.Lengauer}@gmd.de






SUMMARY

A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks. We present a unifying, systematic approach based on PETRI nets for storing, displaying, comparing, searching and simulating such nets from a number of different sources. Information from each source is extracted and compiled into a PETRI net. Such PETRI nets then allow to investigate the (differential) content in metabolic databases, to map and integrate genomic information and functional annotations into such nets, to compare sequence and metabolic databases with respect to their functional annotations, to define, generate and search pathes and pathways in nets. Finally, via so called differential metabolic displays (DMDs) specific differences between systems i.e. different developmental or disease states, or different organisms, can be exhibited. DMDs could be of potential use for target finding and function prediction, especially in the context of the interpretation of expression data.


INTRODUCTION

In addition to the large and quickly increasing amounts of genomic and protein sequence and structure data, expression data becomes available in electronic form. This includes both expression levels for either mRNA or proteins. In order to systematically interpret such expression data knowledge about actual or potential relations between the proteins, RNA, and DNA could be helpful. Knowledge about proteins and their relation is provided in a number of metabolic databases containing the biochemical facts acquired over the last decades. Unfortunately, in these databases, regulatory relations such as inhibition and activation of proteins by other proteins, proteins repressing or activating the transcription of other proteins, and signaling cascades are currently much less covered.

In order to facilitate a system (cell, tissue, organism, species)-wide holistic interpretation of sequence and expression data we compiled the available data of metabolic databases into PETRI nets, graph like structures, which lend themself naturally to representing all kinds of relations and interconnections of distributed interacting entities (substrates, proteins) in a metabolic/ regulatory network. In addition, systems for storing, editing, visualizing, analysing, and simulating PETRI nets are readily available for a broad range of computer platforms. They also are accompanied by a quite mature theory developed over the last 30 years, resulting in established definitions and concepts, and methods and algorithms for classifying certain subtypes of nets as well as for demonstrating properties and invariants of specific nets. The PETRI nets derived from metabolic databases serve several demands:

PETRI nets are particularly well­suited to allow for an overview of the information contained in metabolic databases in order to investigate whether the coverage, detail and accuracy of the relationships could already efficiently be used for the massive and holistic interpretation of genomic, proteomic, expression, and DNA chip data as a first step towards target finding. In particular, the system should also be useful to identify and exhibit what is missing towards these goals, and to integrate future data in a systematic way.

PETRI net have been proposed for representing metabolic knowledge [5, 13] and for simulating the dynamic behaviour of reaction networks [5] as an alternative to the use of systems based on differential equations [10]. It has been demonstrated [13] that PETRI nets and their invariance properties could be applied to model small biological systems. In contrast, the focus of our work is not on detailed simulation of PETRI nets as discrete event systems, but on the exploitation of the knowledge available in current metabolic databases for functional predictions and the interpretation of expression data on the level of complete genomes.


METHODS AND RESULTS

Sources of Pathway Information The main sources of information about metabolic pathways are databases like BRENDA [4], ENZYME[2], MPW[14], WIT[12], EcoCyc[8], HincCyc[7], and KEGG/LENZYME[11] containing textual descriptions of reactions (see for example Figure 1).


Figure 1: Example entry from KEGG/LENZYME.


Compilation into Petri Nets

In this paper we describe the compilation of BRENDA, ENZYME, and KEGG into individual PETRI nets (Figure 2) and into a unified PETRI net. The individual nets are employed to compare the databases with each other on the level of reactions, substrates, and edges. An edge is defined between two transitions (EC numbers [6], enzymes) if they occur subsequently in a path, i.e. the first enzyme produces a product which is consumed as educt by the second. All comparisons are based on string matches of the substrates contained in the reaction equation subsections of the databases. The key problem here is the unification of the substrate names. For this purpose, we used the LCOMPOUND section of the KEGG database, which contains aliases for about 5000 of the 15000 substrates covered in the three databases. The alias list has been augmented manually by some 700 alias names. The compilation process involved the detection of typos, the removal of comments and references and unification of compound names using the augmented alias list. The chemical reaction equations of the database entries (Figure 1) can be used, first, to define transitions of a PETRI net and, second to define, via matching and identifying the educts and products of reactions a network of enzyme­substrate­enzyme interconnections (enzyme­enzyme edges).
Figure 2: Differences between the examined databases: PETRI net representations of two equivalent entries form KEGG and BRENDA. In addition to the main reaction contained in KEGG (A), BRENDA contains a generalization (B), a number of special cases (C) and (D), and additional side reactions (E).

Comparison of Metabolic Databases

PETRI nets for single databases are the basis for a thorough comparison of the content of individual databases. Therefore, difference PETRI nets are defined representing the difference and/or the mutually shared part of such databases in a uniform way via notions of substracting nets from one another. As can be seen from the respective intersection and difference sets (Figure 3), the main body of reactions (2.621), substrates (3.018), and edges (64.311) is contained in all three databases, whereas BRENDA provides about three times more additional reactions and substrates resulting in 89.460 more enzyme­ enzyme edges in the PETRI net.

These differences almost vanish if we restrict the PETRI net to BRENDA's main reactions, which are comprised in a special section in the database. These comparisons show that the database contain the same type and ­ more or less ­ the same amount of information concerning metabolic pathway relationships (see "main reactions" in Figure 3).


Figure 3: Results of the comparison of the KEGG, BRENDA and ENZYME databases.


Coverage of genome sequences by metabolic databases

In order to utilize metabolic/regulatory data for drug target finding and interpretation of expression data the unified PETRI net is used, which after eliminating inconsistencies, should represent the relations contained in all the metabolic databases.

This net can be analysed with respect to its coverage of sequence information, i.e. for which reactions are genomic and/or sequence information available for what organisms, and, conversely, for which sequences can a detailed function in the network be assigned.

Unfortunately, the current metabolic databases contain only a very limited subset of the functionally characterized proteins, especially very few regulatory relationships. The currently covered part of the protein functions used by an organism and encoded in its genome is shown in Figure 4 for yeast by plotting the fraction of proteins contained in the unified PETRI net for each category of the yeast functional catalog (http://www.mips.biochem.mpg.de/). From 3464 annotated yeast ORFs only 962 ORFs (grey) have an associated EC number. Out of these, only the categories metabolism and energy are covered more than half. Conversely, only 50% of the EC entries in metabolic databases have associated sequence information. This can be increased to 64% (dark grey) by screening sequences databases for EC number annotations. The rest is not covered (white bars).


Figure 4: Fraction of yeast proteins contained in current metabolic databases for the categories of the yeast functional catalog.


Enrichment by regulatory relationships from sequence and genomic databases

The information in available current databases can be enriched by functional annotations derived from sequence and genome databases. This exploits a type of functional information quite different from the function (i.e. biochemical reactions) stored in metabolic databases (Figure 5) and tries to make them available in a uniform PETRI net setup (Figure 6). We hope to extend the metabolic relationships significantly with regulatory relations or broader functional specifications. For the use of function predictions for proteins sequences, say drawn from some expression experiment, this should narrow the gap between broad function predictions and the quite detailed knowledge contained in the current databases biased for metabolism, energy, and cellular organization in order to get a clue of possible causes of certain expression levels in the context of the network knowledge.


Figure 5: Swissprot entry P39113 describing an activator protein for the transcription of two of the key enzymes in the glucogenesis pathway.


For example, the Swissprot entry shown in Figure 5 can be exploited to extend the nets derived from metabolic databases by additional regulatory transitions to construct the enriched PETRI net shown in Figure 6.


Figure 6: PETRI net extending the PETRI net of the glucogenesis (black part) by the regulatory transition (grey part, left side) defined by Swissprot entry P39113.


Systematic Generation and Analysis of Pathes and Pathways

The main purpose of the compiled PETRI nets for pathway databases is the systematic generation of pathes and pathways in such nets to facilitate the analysis of differences between certain environmental states and between different organisms (genomes) an the basis of gene annotations and/or gene expression data.

Naive graph constructs are not immediately usable to to encode complex situations in biochemical networks in a natural way. In contrast, PETRI nets allow for representing and distinguishing different situations, e.g. a branching reaction from conflicting reactions in pathways, i.e. one reaction producing several products from a set of educts and several reactions using the same educts and producing their set of products independently and concurrently (see Figure 7). PETRI nets and their associated underlying semantics [9 ] (the so called "firing rule") have specifically been developed and are well­suited for representing the differences of such situations.


Figure 7: PETRI net representations of a branching reaction with educt A and products B and C compared to two reactions being in conflict with respect to educt A producing B and C, respectively. According to the PETRI net firing rule for one instance of educt A only one of the transitions T_2 or T_3 could fire resulting in only one instance of either B or C but not both as in the first case of firing transition T_1.


Additionally, the firing rule and associated restrictions together with the use of additional user defined and biologically motivated restrictions [9 ] enable us to drastically reduce the number of valid pathes leading from a set of educts to a set of products.

Based on such restricted valid pathes we define a concept of pathways: Given a PETRI net , a pathway associated with a path is a partial net, which contains the path and is closed. Closed pathes account for the availability of educts and take care of the consumption of intermediate products. User definable, ubiquitious sets of educts and products then allow to determine pathways of different extent. Additionally, we introduce notions of length, diameter, area, and width of pathways, to enable the generation and analysis of pathways with specific, pre-defined properties (Figure 8). The length of a pathway is the length of the longest path from source to sink contained in the pathway, its diameteris defined as the length of its longest path, its area as the number of transitions in the pathway, and its width as the size of its maximal ST-cut.


Figure 8: Pathes and Pathways: The figure shows two pathes from source to sink of length 4 and 5, respectively. The pathway for a path contains the path and is closed, i.e. contains all non ubiquitious places connected to the path. In this case, the pathways (the subnet indicated by the shaded region) of both pathes from source to sink are the same.


In this paper, we have developed search and generation algorithms for pathes and pathways of PETRI nets, which allow for user defined constraints using these parameters in order to restrict the valid pathes for further analyses. The search process [ ] is subdivided into four steps:

  1. a Dijkstra like search [3 ] starting from the source to determine the minimum number of steps enabling a transition (using the PETRI net firing rule)
  2. a Dijkstra like search starting from the sink to determine the minimum number of steps from a certain transition to reach the sink The results from these two searches are used to delete all transitions from the PETRI net, which could not be member of a valid path from source to sink for the given length. This results in the subnet containing all possible candidates of valid pathways.
  3. a Branch&Bound enumeration of all pathes of the subnet satisfying the additional constraints required for valid pathways. This search eventually constructs a path p from source to sink. The enumeration has to take into account the two types of situations in PETRI nets (Figure 7): for branching reactions at least one path has to lead to the sink within the given constraints and all other pathes have to be closed with respect to the former path (note that in this case several alternatives could result in different valid pathways, all of which are valid solutions). For conflicts any path could result in a valid pathway solution (independent of conflicting pathes).
  4. for any such path p, all open nodes have to be closed with respect to the current path and the given constraints to produce the associated pathway of path p. Store valid pathways.

As an example, we consider the pathes and pathways of the glycolysis:
Usually, textbooks show simplified and clean pictures of metabolic pathways (see Figure 9).
Figure 9: The general layout of the glycolysis as described in standard biochemical textbooks implemented with PETRI nets using the CPN software system [1 ].


In contrast, given the connectivity of the metabolic databases, the number of potential, unrestricted pathes connecting two proteins or metabolites in the network, appears to be very large (some 500.000 pathes for pathes of length at most nine from glucose to pyruvate, not shown). This prohibits the systematic analysis of all potential pathes. The application of the PETRI net firing rule (step 1 and 2) in the Dijkstra search reduces the number of pathes to (still) about 80.000 pathes involving some 800 enzymes (not shown). Exploiting additional pathway constraints (restricting the cut--width to 2 and 1) in steps 3 and 4 results in 541 and 170 pathways, respectively (Figure 10).
Figure 10: Pathways of the glycolysis as computed from the unified PETRI net derived for the metabolic databases KEGG, ENZYME, and BRENDA containing all pathways with with a maximum width of 1 and 2, respectively. The width has to be at least 2 in order to include the textbook--glycolysis (thick lines) in the set of valid pathways.


Differential Metabolic Display

On the basis of such restricted pathways we can systematically compare different networks, i.e. different developmental or disease states of different organisms. For this purpose so called Differential Metabolic Displays (DMDs) are introduced: system specific subnets are extracted from the unified PETRI net and the respective intersection and difference sets for different systems are determined. A DMD can be represented as a PETRI net containing pathways, colored according to the above system­specific sets to simultaneously exhibit shared, missing, and specific pathways for each system under consideration.

DMDs allow to display significant differences, to identify gaps in specific pathways, and to enable the interpretation of expression data by making predictions for proteins of unknown function and to propose the existence and/or absence of specific proteins or protein functions in certain systems. Nets restricted to specific genomes (see Figures 10 and 11), after having mapped the sequence data, can be used to find and exhibit detours and gaps in organism-specific metabolic pathways and to propose protein functions to be searched for in genomic data to complete apparently disrupted pathways. E.g., for the yeast and Mycoplasma Genitalium (MG) genomes in the current metabolic databases we find 550 pathways with 225 reactions. Out of these, we could assign sequence information to 185 of the 225 enzymes. The light grey lines indicate all edges contained in the remaining 140 pathways consisting of enzymes with known sequences represented in the current sequence databases. The thick edges indicate pathes contained in pathways present in both organisms (MG and yeast), thin black lines are found only in yeast pathways, but not in MG, dotted lines would have indicated pathes in MG not present in yeast (none in this example).
Figure 11: Differential Metabolic Display (DMD) of the glycolysis for yeast and MG (Mycoplasma Genitalium) genomes containing all pathways for all pathes from starting educt glucose to ending product pyruvate.


CONCLUSION AND FUTURE WORK

The study of networks of simple textbook models like the glycolysis shows how complex the connectivity in biological networks presumably is. In this work, we provide a framework to study metabolism and regulation on a global, genome­wide level.

In the future we will pursue 2 strategies: From annotated sequence databases it is possible to extract many more enzymatic or regulatory reactions not yet contained in metabolic databases as an EC number has not yet been assigned by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). This should complement the metabolic information in the current databases to get a more complete coverage of the functions coded in a genome and should allow to make better use of networks in holistic function predictions.

The second strategy is to exploit proteomics, gene expression (DNA­Chips) and quantitative biochemistry data (kinetic enzyme values). Furthermore, we envision a theory/calculus which allows to compute conditional probabilities of pathways given the known metabolic/regulatory relationships and given the data from (a set of) expression experiment(s) using a bayesian inference framework.


ACKNOWLEDGEMENTS

Part of this research has been funded by the BMBF under contract TargId (0311615). We thank our colleague Klaus Voss for providing Figure 9.


REFERENCES


  1. http://www.daimi.au.dk/cpnets/.
  2. Bairoch A. The ENZYME data bank in 1999. Nucleic Acids Res, 27(1):310--311, 1999.
  3. T.H. Cormen, C.E. Leiserson and R.L. Rivest. Introduction to Algorithms. McGraw­Hill, New York, 1990.
  4. Schomburg D., Salzmann D. and Stephan D. Enzyme Handbook, Classes 1­6. Springer, 1990­1995.
  5. R. Hofestädt and S. Thelen. Quantitative modelling of biochemical networks. In Silico Biol­ ogy, 1, 1998.
  6. IUBMB. Enzyme Nomenclature 1992. IUBMB, San Diego, California, 1992. with Supplement 1 (1993), Supplement 2 (1994), Supplement 3 (1995) and Supplement 4 (1997) (in Eur. J. Biochem. 223, 1­5, 1994; Eur. J. Biochem. 232, 1­6, 1995, Eur. J. Biochem. 237, 1­5, 1996, and Eur. J. Biochem. 250, 1­6, 1997, respectively.
  7. P.D. Karp, C. Ouzounis, and S. Paley. Hinccyc: A knowledge base of the complete genome and metabolic pathways of h. influenzae. In Proceedings of the ISMB'96 conference, pages 116--129, 1996.
  8. Peter D. Karp, Monica Riley, Suzanne M. Paley, Alida Pellegrini­Toole and Markus Krummenacker. Eco Cyc: Encyclopedia of escherichia coli genes and metabolism. NAR, 27(1):55--58, 1999.
  9. R. Küffner, R. Zimmer and T. Lengauer. Pathway analysis in metabolic databases via differential metabolic display (DMD). Full paper in preparation, 1999.
  10. P. Mendes and D. Kell. Non­linear optimization of biochemical pathways: applications to metabolic engineering and parameter estima­ tion. Bioinformatics, 14(10):869--83, 1998.
  11. H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 27:29--34, 1999.
  12. R. Overbeek, N. Larsen, W. Smith, N. Maltsev and E. Selkov. Representation of function: the next step. Gene, 191(1):1--9, 1997.
  13. V.N. Reddy, M.N. Liebman and M.L. Mavrovouniotis. Qualitative analysis of biochemical reaction systems. Comput. Biol. Med., 26(1):9--24, 1996.
  14. E.J. Selkov, Y. Grechkin, N. Mikhailova and E. Selkov. MPW: The metabolics pathway database. Nucleic Acids Res, 26(1):43--45, 1998.