In Silico Biology 2, 0006 (2002); ©2002, Bioinformation Systems e.V.  
Dagstuhl Seminar "Functional Genomics"

High quality visualization of
biochemical pathways in BioPath

Falk Schreiber




Universität Passau, Fakultät für Mathematik und Informatik,
Innstraße 33, D-94030 Passau, Germany
schreibe@informatik.uni-passau.de

present address:
Basser Department of Computer Science, University of Sydney,
Madsen Building (F09), Sydney, NSW 2006, Australia
falk@cs.usyd.edu.au





Edited by E. Wingender; received November 1, 2001; revised December 17, 2001, accepted January 28, 2002; published February 5, 2002


Abstract

Biochemical reactions form large and complex networks. Comprehensible visual representations of these networks help biochemists understand the relationships between the chemical components. Typically pathway diagrams are manually produced drawings. Because of the steady progress of knowledge and the complex relationships in these networks, automatic visualizations are necessary. BioPath is a system for the exploration and automatic visualization of biochemical pathways. It has been developed to obtain an electronic version of the well-known Boehringer Biochemical Pathways poster and offers new possibilities to find information and to navigate through pathways. BioPath has a specific database containing reactions and a hierarchical clustering of reactions and reaction networks. One feature is the automatic generation of pathways from the database and their high quality visualization. This paper states the requirements for the visualization of biochemical pathways, presents a layout algorithm and shows how BioPath can be used to explore biochemical reaction networks.

Keywords: biochemical pathways, metabolic pathways, visualization, graph drawing



Introduction

Biochemical reactions are fundamental to life processes. Substances, co-substances and enzymes form networks with multiple interconnections representing reactions and their regulation. These networks are very large and grow rapidly with the steady progress of knowledge. Biochemists are familiar with visual representations of reactions and reaction networks. These representations help them to understand the complex relationships between the components of the networks. A survey and global view are presented by [Michal, 1999] and the Biochemical Pathways poster [Michal, 1993]; see Figure 1 for a cutting of the poster.


Figure 1: A cutting of the Biochemical Pathways poster [Michal, 1993] showing the complexity of the network.

A pathway diagram means a visual representation of a reaction network. Pathway diagrams, as in the poster, in textbooks on biochemistry [e.g. Stryer, 1995] and in information systems such as KEGG  [Kanehisa and Goto, 2000] and ExPASy   [Appel et al., 1994] are made manually. ExPASy was the first electronic version of the Biochemical Pathways poster. It has scanned pictures of the poster and provides links between the occurrences of enzymes in the pathway diagrams and additional information. Notice that the drawings in textbooks and information systems are created once. They are used very often and they are usually hardcopies. These diagrams represent the knowledge at the time of generation. This type of pathway visualization is called static visualization [Brandenburg et al., 1998].

Although static visualization is often used for the visualization of reaction networks, it has some severe drawbacks. First, static pathway diagrams cannot be updated with a reasonable amount of work. For example, Michal [Michal, 2000] reported that he and his team spent about one year for the design of the third edition of the poster [Michal, 1993]. However, the poster represents only a small fraction of the current knowledge about biochemical reactions. On the other hand, static visualizations are often overloaded with information. A mass of information is given by different text colours and by the style and colour of arrows as illustrated in Figure 1. Such visualizations are hard to read, particularly for beginners, and it is difficult to find specific information such as all pathways between two substances. Furthermore, there is no way to specify the amount of detail of each reaction to be displayed. For example, the user cannot choose between the use of enzyme names or enzyme classification (EC) numbers in the pathway diagram. Finally, static visualization is restricted to only one level of representation providing either an overview or detailed information. However, there is more knowledge than this aspect and different views are needed. In the extreme, one needs an overview diagram with a detailed view of some parts such as an overview diagram of all catabolic pathways with the citrate acid cycle in a detailed view. Using static visualization the user is not able to interactively change the depth of information in the diagram and browsing through pathways from abstract overview diagrams to detailed pictures is not possible.

From a formal point of view a biochemical pathway is a directed graph consisting of nodes and edges. The nodes represent the substances, co-substances and enzymes within a pathway. A substance can be a reactant or product; a co-substance can be a co-reactant or a co-product. A compound means all elements of a reaction (substances, co-substances and enzymes). The edges of the graph represent all sorts of relations such as reactions, regulation or "is part of" relations. A reaction can be represented by a directed hyper-edge (an n-ary relation), which connects all compounds of the reaction. More common is the modelling of such relations by directed bipartite graphs, which is also used in Petri-net representations of pathways [Reddy et al., 1993; Hofestädt and Thelen, 1998]. Here the reactions themselves are nodes and edges are binary relations connecting compounds of reactions with reaction nodes. Without loss of generality, reactions can be represented by directed bipartite graphs. This notation has the advantage that graph algorithms and standard drawing techniques can be used.

Our focus is on dynamic visualization in general and on pathway visualization in particular. Dynamic visualization is the generation of a diagram on demand at the time the drawing is needed. The placement of objects (e.g. substances) and the routing of their connections is a typical graph drawing problem. Aesthetic criteria and quality measures for good drawings have been established and algorithms for their automatic construction have been developed [Di Battista et al., 1999]. Relevant aesthetic criteria include:

The algorithms attempt to achieve these goals and typically consist of at least two phases: The placement of the objects with the computation of their coordinates and the routing of the edges. Both phases interfere with each other.

In graph drawing, several major techniques have been elaborated, such as layered layout (hierarchical layout), force directed layout and planar layout for the placement of the objects. Examples for edge routings are straight-line and orthogonal drawings. There are general algorithmic concepts and problem-specific algorithms [Di Battista et al., 1999]. Recent progress in the field is reported in the proceedings of the annual Symposium on Graph Drawing. However, the known graph drawing algorithms are inadequate for visualizing biochemical reaction networks according to the established conventions of biology and chemistry; see Figure 2 for examples. Even combinations of standard graph drawing algorithms [Karp and Paley, 1994; Becker and Rojas, 2001] are insufficient. Some solutions focus on the placement of the main reactants and products only [Becker and Rojas, 2001], others place co-substances and enzymes in an unusual way like in PFBP. Up to now the best results have been obtained by using labelling techniques, where the enzymes and co-substances are considered as edge labels and placed separately [Karp and Paley, 1994; Mendes, 2000]. Nevertheless, these algorithms can produce crossings between co-substances and edges. They also tend to cluster all co-reactants into one node and all co-products into another node. This is unacceptable for some reactions, especially for more complex reaction mechanism as shown in Figure 3. Furthermore, these drawings of reaction networks often contain many unnecessary edge crossings [Karp and Paley, 1994] thus reducing their readability.


(a)
(b)
Figure 2: Two visualizations of the same biochemical pathway (a) using a layered layout technique with poly-line edge routing [Sugiyama et al., 1981] and (b) using a force directed layout technique with straight-line edge routing [Fruchterman and Reingold, 1991]. These drawings are very different from typical drawings in textbooks.

Interactive navigation through biochemical pathways is a particular advantage of electronic information systems over textbooks and posters. The concept of a hierarchy of reactions is a crucial feature when browsing through reaction networks on different levels of detail. Reactions should be separable into sub-reactions and able to be combined to pathways. Thus sub-reactions, reactions and pathways should also be considered as reactions. In some systems this hierarchical organization is realized by linked web pages. The information in KEGG  [Kanehisa and Goto, 2000] is structured by web-links from pathway classes (e.g. carbohydrate metabolism, lipid metabolism) to single pathway diagrams. UM-BBD  [Ellis et al., 2000] provides a clickable overview picture and a list of pathways organized by functional groups. But it is also necessary to represent this hierarchy in the database. EcoCyc and MetaCyc [Karp et al., 2000] represent a hierarchy of reactions by linking pathway objects to objects representing the reactions.


Figure 3: For the understanding of the reaction mechanism the order of co-reactants and co-products is crucial and should be clearly visualized.


Existing visualization techniques do not meet the conventions of biochemistry; they are far from the typical drawings in textbooks. Moreover, existing databases do not include all relevant information such as the hierarchy of reactions and layout constraints for specific pathways. These constraints are necessary to distinguish pathways like the citrate acid cycle from arbitrary cycles in the network. Therefore, a new layout algorithm and an adapted database scheme are necessary.



Methods


Database

The underlying database has been developed by Kanne [Kanne, 2000]. His object-oriented database scheme contains 115 classes. Different attributes and relations for molecules and reactions are stored in the database: chemical structures on the atomic level, compound names in German and English, classifications of the compounds (e.g. carbohydrate or amino acids), reactions with the involved compounds and their current rules (e.g. co-reactant), specifications of the conditions for reactions (enzymes and regulation), information about the reaction mechanism by specifying a mapping of reactant atoms to product atoms and an order of co-substances, hierarchy information and layout constraints for specific pathways (e.g. citrate acid cycle). The database has different search capabilities such as searching for substances, enzymes, pathways, named reactions and sub-structures (e.g. which molecule contains a pyridine ring) [Gasteiger and Trümbach, 2000]. The data is taken from the Boehringer Biochemical Pathways poster [Michal, 1993] and the atlas [Michal, 1999] is used as a source of supplementary information.

Reactions are represented as objects consisting of possibly empty sets of sub-reactions. A biochemical pathway or reaction network can be seen as a single reaction. This allows the representation of the common textbook pathways in the database along with others that may have never been explicitly represented elsewhere. The model supports the definition of different hierarchies such as the classical hierarchy of reactions and pathways as on the poster or hierarchies computed by flux analysis [Schilling and Palsson, 1998]. The depth of these hierarchies is not restricted. Reactions can consist of reaction mechanisms and pathways can be clustered into pathway classes.


Visualization requirements

How shall we draw a biochemical pathway? What criteria must be met? What are the requirements for a visualization algorithm? The main criteria are:

Compounds.  The visualization of compounds of a reaction is user-specific. According to existing drawings in the literature, substances should be drawn with their name or their structural formula or both, co-substances should be displayed using their name or abbreviation and enzymes should be represented by their name or their EC-number. In BioPath the conventions of the poster [Michal, 1993] are used. It displays the names and the structural formulas for substances and the names or abbreviations (e.g. ATP, NADP) for co-substances and enzymes.

The data for the compounds is retrieved from the database. It is transformed into nodes of the graph using varying node sizes. The size of the node provides the space needed to display the information of the object such as the name or the structural formula. As a consequence, the visualization algorithm must support different node sizes.

Reactions.  A reaction is visualized by a reaction arrow from the reactants to the products. Enzymes and co-substances are placed on different sides beside this arrow. For co-substances their temporal order, which depends on the reaction mechanism, is important. Therefore they are placed according to this order. Overview diagrams often disregard enzymes and co-substances. Thus the visualization algorithm has to deal with reactions with and without enzymes and co-substances.

Reaction networks.  As the temporal order of reactions in the networks is crucial, the main direction of reactions should be clearly visible. In BioPath the main direction is from top to bottom. This is better than from left to right since it yields more compact representations and a better placement of objects with long textual information.

There is an exception from the top to bottom direction, which is used for the visualization of specific pathways such as the citrate acid cycle or the fatty acid synthesis. The structure of these cyclic reaction chains should be emphasized. These pathways are characterized by the continuous repetition of a reaction sequence in which the product of the sequence re-enters in the next loop as a reactant. There are two mechanisms. First, the reactant and the product of the reaction sequence are identical from loop to loop (e.g. citrate acid cycle). This is called a closed cycle. Second, the reactant of the reaction sequence varies slightly from the product (e.g. fatty acid cycle). This is called an open cycle.

Repetitions in cyclic structures must be clearly visible. In textbooks they are displayed as a circle (closed cycles) or as a spiral (open cycles). In a spiral equal reaction steps and corresponding substances are placed side by side to emphasis the cyclic structure. This drawing style has some drawbacks. It needs much space and it is difficult for a user to trace the reaction sequence, particularly in the outer part of the spiral. Furthermore, if only a part of the diagram can be shown on a restricted viewing area such as a computer screen, extensive scrolling in all directions is necessary; see Figure 4 (a). Figure 4 (b) shows the same pathway in a more compact visualization. The spiral has been unravelled and related reactions and substances have been horizontally aligned. This new drawing style avoids the above-mentioned disadvantages.


(a)
(b)
Figure 4: Two drawings of an open cycle (degradation of fatty acids). For simplicity only the reactants and products are shown. In (a) the typical visualization of this pathway as a spiral. In (b) the same pathway is shown in a more compact visualization that also emphasizes the corresponding reaction steps and substances.

Sequences of reaction networks.  Browsing through pathways can be very useful for the study of reaction networks. One example is to start with an overview diagram and to refine a specific part of this diagram. Another mechanism of browsing is to add reactions to an existing reaction network interactively. In this case a sequence of pathway diagrams is generated. When a user knows a diagram of this sequence the next diagram is easier to understand if small changes between the successive reaction networks imply only small changes of the corresponding pathway diagrams. The drawing of the unchanged part of the network should be preserved, however, this is not always possible. For example, there may not be enough space for the insertion of a new reaction. In this case the relative positions of the old objects should be preserved. In graph drawing this technique is called preserving the mental map [Eades et al., 1991]; see Figure 5 for an example.


(a)
(b)
(c)
Figure 5: As an example a user wishes to extend the reaction network in (a) by the reaction from ß-D-glucose to D-sorbitol. The diagram in (b) shows the new reaction network, but this drawing is not mental map preserving. The left and right reactions as well as the substances -D-glucose and -D-glucose-1-phosphate are changed. A user who knows the drawing in (a) will need some time to understand the new diagram in (b). In contrast, the pathway diagram in (c) is a proper and mental map preserving drawing of the new reaction network preserving as much as possible of the former diagram in (a).



Layout algorithm

We model reaction networks as ordered bipartite graphs, where substances, co-substances, enzymes and reactions are represented as nodes and their connections are represented as directed edges. The visualization algorithm is based on the graph drawing algorithm by Sugiyama et al. [Sugiyama et al., 1981] for the computation of layered layouts of directed acyclic graphs. The main extensions of the algorithm are:

Clustering.  Clustering means the grouping of disjoint subgraphs into new nodes. Each grouped subgraph can be drawn with its own layout algorithm. The size of the corresponding new node is determined by the space of the drawn subgraph. In this context the routing of edges from nodes of the subgraph to nodes outside the subgraph is a difficult problem.

Node sizes.  The Sugiyama algorithm has only a coarse view of the size of nodes. It uses a global layering mechanism by placing all nodes in horizontal layers depending on their topological order. The distance between two layers is defined by the highest node of the above layer. Our algorithm is adapted to varying node sizes and is extended to a local layering mechanism where all nodes are placed in layers depending only on the placement of all predecessors; see Figure 6.


(a)
(b)
Figure 6: Layering mechanisms: (a) the typical global layering; (b) the local layering. The local layering mechanism considers the node sizes correctly and leads to compact placements.

Layout constraints.  Layout constraints are additional application-specific requirements on the placement of objects. The layout algorithm considers the following types of layout constraints: top-bottom constraints to place one node below another one; left-right constraints for the horizontal order of two nodes; horizontal constraints to force the same layer for two nodes; and vertical constraints to force the same x-coordinate for two nodes. Restricted by layout constraints, distinguished paths such as open and closed cycles are drawn according to the specified requirements and the mental map in related drawings is preserved.


The main steps of the layout algorithm are:

Given: A directed graph (representing a reaction network) with node sizes, node types (substance, co-substance, enzyme, reaction) and an indication of open and closed cycles.

Result: The drawing of the graph.

  1. Compute local layouts for co-substances and enzymes of each reaction and cluster these nodes into large nodes; see Figure 7. These large nodes are called reaction nodes.
  2. Insert layout constraints and temporary nodes for open and closed cycles; see Figure 8.
  3. Reverse some edges to remove all cycles from the graph under the restriction of the top-bottom constraints. For each reversed edge: if possible (no new cycles occur by the following operations) turn the local layout of the adjacent reaction node by 180 degree (instead from top to bottom this reaction goes now from bottom to top) and change the direction of all edges adjacent to this reaction node. Otherwise insert a node on each edge adjacent to the corresponding reaction node to allow additional bends for these edges.
  4. Assign nodes to horizontal layers under the restriction of horizontal constraints. All edges should be directed from top to bottom. For each node the distance to its predecessors should be minimized. This step computes the y-coordinates of nodes.
  5. Compute a proper layering by inserting temporary nodes for long spanning edges and long spanning nodes; see Figure 9.
  6. Permute the order of nodes within each layer to reduce the number of edge crossings in the layered graph such that the left-right and the vertical constraints are fulfilled.
  7. Compute x-coordinates for the nodes without changing the pre-computed order in the layers and under consideration of the vertical constraints.
  8. De-cluster large nodes, compute an edge routing by using the temporary nodes as additional base points for edges.
  9. (Additional) Insert constraints for preserving the mental map in the next drawing.


Figure 7:Clustering of co-substances and enzymes of each reaction into a large node. The positions of the co-substances and enzymes are computed separately. Enzymes are placed to the left of the reaction arrow, co-substances to the right according to their order. This determines the size of the new node. In the subsequent steps of the layout algorithm the new large node is used instead of the clustered nodes. The order of the co-substances and their current role (co-reactant or co-product) are retrieved from the database.


(a)
(b)
Figure 8: (a) The visualization of a part of an open cycle (two loops of the pathway).  (b) Layout constraints to receive this layout. 1: Horizontal constraints between corresponding substances in different loops of the pathway. 2: Top-bottom constraints between consecutive nodes of each loop of the pathway (the node at the end of the arrow should be placed below the node at the begin of the arrow). 3: Left-right constraints between nodes in the some layer in neighbouring loops. 4: Vertical constraints between consecutive nodes of each loop to force a vertical placement of nodes. For technical reasons some additional nodes are necessary. For closed cycles similar constraints are used.


(a)
(b)
Figure 9: (a) Assignment of nodes to horizontal layers. Long spanning edges are edges that cross at least one layer, long spanning nodes belong to at least two layers. (b) These edges and nodes are replaced by chains of temporary nodes.

It is known that the steps 3, 6 and 7 are NP-hard [Garey and Johnson, 1979]. Therefore we use heuristics for these steps. The time critical part of the complete algorithm is step 3 to step 7. Let g be a connected graph, n be the number of nodes and e be the number of edges of g. For the steps 3, 6 and 7 different heuristics with complexities from linear time to O(n4) (for step 6) can be chosen, where a higher running time usually gives better results. However, most important are steps 4 and 5, because during these steps up to n*e temporary nodes and edges will be inserted in the graph. This influences the running time of the subsequent steps of the algorithm. The layout algorithm has been tested with more than 200 graphs. These graphs represent reaction networks with up to 50 reactions (or correspondingly up to 300 nodes). Using a standard PC we were able to layout these graphs within a few seconds using the heuristics with the higher complexities.



Results

BioPath  [Kanne et al., 1999; Brandenburg et al., 2001] has been developed to obtain an electronic version of the well-known Boehringer Biochemical Pathways poster. It demonstrates the capabilities of the layout algorithm. The web-based system displays the results of user queries as HTML pages and clickable images. BioPath is based on Graphlet  [Himsolt, 2000], a universal graph editor and graph layout system. Access to BioPath and the layout algorithm is available upon request from the author.

BioPath offers several ways to access the data and to browse through biochemical pathways [Forster et al., 2002]. Upon a request to a pathway BioPath computes and returns a graphical representation. Reactions are visualized according to the established conventions and open and closed cycles are displayed as in Figure 10. All elementary objects of the diagram such as substances, enzymes or reaction arrows are interactive. For more information the compounds and reactions displayed on the image can be clicked to show detailed information about the compounds or the sub-pathway on the lower level of the hierarchy. The result also gives access to all pathways on the higher level of the hierarchy that the reaction belongs to; see Figure 11 for an example. Given two substances, BioPath computes the reaction network between them showing all pathways up to a specified length as shown in Figure 12.


(a)
(b)
Figure 10: Visualization of (a) an open cycle (degradation of fatty acids) and (b) a closed cycle (urea cycle) computed by the layout algorithm.


Figure 11: The information shown for a pathway or a reaction. The entry Parents provides links to all reactions on the higher level of the hierarchy that are related to the shown reaction.


Figure 12: The reaction network between maltose and D-fructose. For the colour of the reaction arrows the convention of the poster is used. In particular general pathways are black, reactions occurring in animals are blue, reactions occurring in unicellular organisms (prokarya) are red and reactions occurring in higher plants are green. If a reaction is located in more than one of the previous organism groups, mixed colours are used, for example reactions occurring in higher plants and unicellular organisms are shown in brown.



Discussion

Typically pathway diagrams are manually produced drawings. Because of the steady progress of knowledge and the complex relationships in the reaction networks, dynamic visualizations are necessary. The requirements for these visualizations are the basis for the layout algorithm we have developed. BioPath is an electronic version of the Biochemical Pathways poster offering easy access to the information and progressive navigation through pathways. The database contains information about compounds and reactions, especially a hierarchy of reactions. A new layout algorithm for the automatic visualization of pathways visualizes reaction networks according to the conventions of biochemistry. Despite the NP-hardness of several algorithmic sub-problems the generated pathway diagrams show the high quality of the layout heuristic. The layout algorithm and the BioPath system can be very useful for the analysis of biochemical pathways.



Acknowledgement

BioPath has been developed by a group of researchers from the universities of Mannheim, Erlangen and Passau and the Spektrum Akademischer Verlag. The database was designed and built by C.-C. Kanne and G. Moerkotte (Mannheim). D. Trümbach and J. Gasteiger (Erlangen) provided the data. The visualization component and the design and implementation of BioPath were done together with M. Forster, A. Pick, M. Raitner and F.J. Brandenburg (Passau) in cooperation with BRAINSYS Informatiksysteme. The BioPath Project has been supported by the German Ministry of Education and Sciences (BMBF).



References