| In Silico Biology 5, 0013 (2004); ©2005, Bioinformation Systems e.V. |
| Dagstuhl Seminar "Integrative Bioinformatics" |
Bioinformatics / Medical Informatics, Technische Fakultät, Universität Bielefeld,
Postfach 10 01 31
D-33501 Bielefeld, Germany
* Corresponding author
Email: mchen@techfak.uni-bielefeld.de
Edited by E. Wingender; received September 30, 2004; revised and accepted December 09, 2004; published February 05, 2005
Metabolic pathway alignment represents one of the most powerful tools for comparative analysis of metabolism. It involves recognition of metabolites common to a set of functionally-related metabolic pathways, interpretation of biological evolution processes and determination of alternative metabolic pathways. Moreover, it is of assistance in function prediction and metabolism modeling. Although research on genomic sequence alignment is extensive, the problem of aligning metabolic pathways has received less attention. We are motivated to develop an algorithm of metabolic pathway alignment to reveal the similarities between metabolic pathways. A new definition of the metabolic pathway is introduced. The algorithm has been implemented into the PathAligner system; its web-based interface is available at http://bibiserv.techfak.uni-bielefeld.de/pathaligner/.
Keywords: metabolic pathways, alignment, algorithm, PathAligner
During the past decade bioinformatics has undergone rapid development. A vast amount of information on different organisms, both on genetic and metabolic levels, has been accumulated and systematically stored in specific databases that are available via the Internet [Collado-Vides and Hofestädt, 2002]. This rapid accumulation of biological data provides the possibility of studying metabolic pathways systematically. Analysis of metabolic pathways is an essential topic in understanding the relationship between genotype to phenotype. Metabolic pathway alignment is of importance to study biology evolution, pharmacological targets and other biotechnological applications [Dandekar et al., 1999].
Research on genomic sequence alignment has, so far, been intensively conducted [Needleman and Wunsch, 1970; Smith and Waterman, 1981]. Applications and tools, such as FASTA [Pearson and Lipman, 1988] (http://www.ebi.ac.uk/fasta3) and BLAST [Altschul et al., 1990; Altschul et al., 1997] (http://www.ncbi.nlm.nih.gov/BLAST) have been developed to further understand the biological homology and estimate evolutionary distance. Although the information provided by sequenced genomes can yield insights into their evolution and cellular metabolism, knowledge of the genome sequence alone is really only the starting point of the real work. The true functional units of metabolic systems are metabolic pathways. In fact, the emphasis of research efforts begins to turn back from gene sequences to cell functions as the completion of a huge number of genomes and the accumulation of knowledge about metabolism have made the comparison of metabolic pathways possible.
Several approaches to metabolic pathway alignment have been made in the past years. Forst and Schulten, 1999, 2001, extended the DNA sequence alignment methods to define distances between metabolic pathways by combining sequence information of involved genes. They combined sequence information of involved genes with information of the corresponding pathway. For each functional role of the pathway, all genes in the genomes that code for this functional role were used. Dandekar et al., 1999, compared glycolysis, Entner-Doudoroff pathway and pyruvate processing in 17 organisms based on the genomic and metabolic pathway data by aligning specific pathway-related enzyme-encoding genes on the genomes. Liao et al., 2002, developed a computational method to compare organisms based on whole metabolic pathway analysis. The presence and absence of metabolic pathways in organisms was presented as a Boolean vector. Based on this methodology and by using some specific distance measures on these profiles, pairwise comparison of a set of completed genomes were performed. Tohsato et al., 2000, proposed a multiple (local) alignment algorithm by utilizing information content that was extended to symbols having hierarchically structured EC numbers. They considered that reaction similarities can be expressed by the similarities between EC numbers of the respective enzymes and applied their method to pathway analyses of sugar, DNA and amino acid metabolisms. Heymans and Singh, 2003, presented a technique for the phylogenetic analysis of metabolic pathways based on the topology of the underlying graphs. A distance measure between graphs was defined using the similarity between nodes (enzymes) of the graphs and the structural relationship between them. The distance measure was applied to enzyme-enzyme relational graphs (two enzymes are related if they activate reactions which share at least one chemical compound) derived from metabolic pathways. Using this approach, pathways and groups of pathways of different organisms were compared to each other and the resulting distance matrix was used to obtain a phylogenetic tree. Kelley et al., 2003, devised a method called PathBLAST to align two protein interaction networks. The alignments were scored by the degree of protein sequence similarity at each pathway position and by the quality of the protein interactions they contain.
We are going to present a new pathway alignment strategy to analyze and fully characterize metabolic pathways in the cell. The alignment algorithm differs from classical string alignments as well as previous approaches. It considers functional similarities of enzymatic reactions involved in two metabolic pathways. This paper is structured as follows: the next section explains definitions of metabolic pathways. In the main section "Metabolic Pathway Alignment" section, firstly, theory basics are described. Then, similarity function, strip and index function are introduced. Finally, the alignment algorithm, its properties and its implementation are presented. A discussion about the algorithm is addressed in the form of conclusions.
Cellular metabolism is most often described and interpreted in terms of the enzyme-catalyzed biochemical reactions that make up the metabolic network. Traditionally, metabolic pathways have been explained in the context of their historical discovery, often named after key molecules (e. g. "glycolysis", "urea cycle", "pentose phosphate pathway", "citric acid cycle" and so on). The classification of molecules into pathways has historical reasons and is not explicitly based on qualitative differences of the interactions. One reason for this is that it is easier to think about the network as pathways which are densely connected.
So far, metabolic pathways are defined in literature [Mavrovouniotis, 1995; Voet and Voet, 1995; Hartwell, 1997; Selkov et al., 1998; Forst and Schulten, 1999; Schuster et al., 2000] in various ways with varying degrees of formality. Mavrovouniotis, 1995, defined a biochemical pathway as an abstraction of a subset of intricate networks in the soup of interacting biomolecules. A prevailing definition is that a metabolic pathway is a special case of a metabolic network with distinct start and end points, initial and terminal vertices, respectively, and a unique path between them, i. e. a directed reaction graph with substrates as vertices and arcs denoting enzymatic reactions [Forst and Schulten, 1999]. Typical metabolic pathways are given by the wall chart of Boehringer Mannheim [Michal, 1999] and KEGG [Kanehisa and Goto, 2000] (http://www.genome.ad.jp/kegg/metabolism.html), which have been verified with a number of printed and on-line sources. Some databases such as KEGG or WIT [Overbeek et al., 2000] (http://wit.mcs.anl.gov/WIT2/) represent metabolic pathway graphs with labeled arcs indicating the involved enzymes. Schuster et al., 2000, provided a general definition of metabolic pathways based on the concept of elementary flux modes (or minimal T-invariants in Petri nets [Voss et al., 2003]). It allows to test whether sets of enzymes form a consistent pathway allowing mass balancing for each intermediate and complying with the directionality of reactions (irreversibility). However, it represents modes under idealized situations without regulation and feedback. We should emphasize that the definition of a metabolic pathway is not exact, there are always interactions among pathways.
Moreover, the basic strategy to represent and compute pathways is the reactant-product binary relation. Properties of the pathway that rely upon the integration of two or more input molecules and unrelated output molecules and feedback effects are ignored (Figure 1). In that context, pathway refers to a more or less linear path from substrate to product, even though usually some of the molecules involved are also found in other pathways.
As a result, the traditional well-known pathways such as glycolysis or TCA will not be considered as a well-defined pathway, because they often contain some branches or alternative pathways. In fact there are several pathways inside them. We emphasize that a metabolic pathway is a subset of reactions that describe the biochemical conversion of a given reactant to its desired end product. In other words, several biochemical reactions act together in a pathway to transform a set of initial substrates into products with very different structures.
Let M = {m1...mn} be a set of metabolites in cells. Let ei : M
M be a function for enzymatic reactions taking place in the cells. For all m1, m2, m3
M, the following property holds:
e2(e1(m1)) = m3Let e1(m1) = m2, e2(m2) = m3,..., ek(mk) = mk+1, we define e1e2...ek (m1) = ek(ek-1...e1(m1)) = mk+1. A new proposed definition of a metabolic pathway is presented and discussed in the following paragraphs.
M, a metabolic pathway is defined as a subset of successive enzymatic reactions P = e1e2...ek.For each ei (1
i
k), there exist a pair of metabolites (mi, mi+1), mi
M, mi+1
M, involved in the enzymatic reaction as substrate and product. P is a composition function of the reactions ei, and we have P(m) = ekek-1...e1(m). Given an initial substrate m1, the ending product mk+1 is predictable. If mk+1 = mi (1
i
k), i. e. the product of ek is one of the metabolites involved in the previous steps, then the pathway is called as a cyclic pathway. Obviously, if mk+1
mi (1
i
k), then it is a linear (non-cyclic) pathway.
Each enzymatic reaction ei (1
i
k) is catalyzed by a certain enzyme that is denoted as a unique EC number. The EC number is expressed with a 4-level hierarchical scheme that has been developed by the International Union of Biochemistry and Molecular Biology (IUBMB) [Webb, 1992]. The 4-digit EC number, d1.d2.d3.d4 represents a sub-sub-subclass indication of biochemical reaction. For instance, arginase is numbered by EC 3.5.3.1, which indicates that the enzyme is a hydrolase (EC 3.*.*.*), and acts on the "carbon-nitrogen bonds, other than peptide bonds" (sub-class EC 3.5.*.*) in linear amidines (sub-sub-class EC 3.5.3.*). These enzymes are normally separated-enzymes. Those enzymes may form a multienzyme complex (noncovalent aggregates of enzymes) or a membrane-bound system; a representative enzyme can be chosen unless there is a unique term for it. Thus, we can adapt the EC number as a unique name for the responding enzyme catalyzed reaction.
In brief, pathways are abstractions of sets of enzymatic reactions. They are substructures that are partitions of the metabolism.
Example 1
e3.5.3.1 means the biochemical reaction that is catalyzed by the enzyme 3.5.3.1., which catalyzes arginine into urea.
and
indicates that a metabolic pathway e2.1.3.3e6.3.4.5e4.3.2.1e3.5.3.1 starts from the enzymatic reaction 2.1.3.3 with carbamoyl-P as reactant and after a series of reactions (2.1.3.3, 6.3.4.5, 3.5.3.1) results in urea as the product.
Theory basics
In order to score the similarity (identity percentage) between two metabolic pathways, we define the similarity function that is adapted from standard alignment definitions on strings.
(E
{ε})
(E
{ε})\{(ε, ε)}.α and β denote 4-digit EC strings of enzymatic reaction function, e. g. α = e1.1.1.1 β = e2.3.4.5, ε denotes the empty string for null function. However, if α
ε and β
ε, then the edit operation (α, β) is identified with a pair of enzymatic reaction function.
An edit operation (α, β) is written as α
β (we simply write α, β as EC numbers). There are three kinds of edit operations:
ε denotes the deletion of the enzymatic reaction function α
β denotes the insertion of the enzymatic reaction function β
β denotes the replacement of the enzymatic reaction function α by the enzymatic reaction function β
ε never happens.
β1,..., αh
βh)Note that the unique alignment of ε and ε is the empty alignment, that is, the empty sequence of edit operations. Empty element ε can be inserted at any position, i. e. at the beginning or end. An alignment is usually written by placing the EC numbers of the two aligned pathways on different lines.
Example 2
The alignment A = (2.4.2.3
2.4.2.4, 3.5.4.5
ε, 3.1.3.5
3.1.3.5, ε
2.7.4.9) of the pathways e2.4.2.3e3.5.4.5e3.1.3.5 and e2.4.2.4e3.1.3.5e2.7.4.9 is written as follows, one above the other:

Example 3
Five alignments of E1 = e2.7.4.14e3.1.3.5e3.5.4.5e2.4.2.3 and E2 = e2.7.4.14e3.2.2.10e3.5.4.1e
2.4.2.3e3.5.4.5

β1,..., αh
βh) be an alignment of E1 = e1e2... em and E2 = e'1e'2... e'n. Then m + n
h
max{m, n}.
Proof
1.) The alignment

of E1 and E2 is of maximal length. Its length is m + n, hence m + n
h.
2.) Let m
n. Then

3.) The case m < n is similar to case 2. Hence h
max{m, n}.
To estimate the number of pathway alignments, we define that Aligns(m,n) is the number of alignments of one pathway E1 of m EC numbers with another pathway E2 of n EC numbers.
0, Aligns(0,0) = 1, Aligns(m,0) =1 and Aligns(0,n) = 1, then
Proof
The idea is to focus on the end of the alignment. If em is deleted, then there exist Aligns(m-1, n) alignments of the earlier part of the pathway. If e'n is deleted, then Aligns(m, n-1) alignments result. If em and e'n are aligned, Aligns(m-1, n-1) alignments result. Therefore,
If not to count
and
as distinct, the new way of counting alignments is to identify aligned pairs
and to ignore permutations of
. The notation of index alignment is introduced.
i1 < i2 < ... < ir
m and 1
j1 < j2 < ... < jr
n.For 1
h
r, the index pair (ih, jh) stands for the replacement eih
e'jh. We say that eih is matched or aligned with e'jh. All EC numbers in E1 and E2 not occurring in an index alignment are considered to be deleted in E1 or E2. In a graphical representation, the index pairs of the index alignment appear as lines connecting the EC numbers (Example 4).
Example 4
The following index alignment of E1 = e2.7.4.14e3.1.3.5e3.5.4.5e2.4.2.3 and E2 = e2.7.4.14e3.2.2.10e3.5.4.1e
2.4.2.3e3.5.4.5 represents the alignments of Example 3. In particular, P1 represents A1, A2 and A3, while P2 represents A4, and P3 represents A5.


Proof
1.) For each r
[0, min{m, n}] we have: for the ordered selection of the indices i1,...,ir there are
possibilities; for the ordered selection of the indices j1,...,jr there are
possibilities.
All these possibilities have to be combined:
.
2.) The key to the last equality is to consider the binomial expansion (x + y)n. For details see Appendix 1.
Obviously, Indexaligns(m, n) is a special case of Aligns(m, n), and it is possible to further reduce the number of alignments by requiring conditional matches. There are at least
different alignments between E1 and E2.
Similarity function
The notion of alignment requires some scoring or optimization criterion. A scoring scheme must account for replacements, insertions and deletions. Scores are measures of sequence similarity (similar sequences have high scores), which is given by a similarity function. The similarity function is measured by the following definition:
β) a nonnegative real number. The similarity σ (α
ε) and σ (ε
β) of the deletion operation (α
ε) and insertion operation (ε
β) is 0. For all replacement operations (α
β) α
ε, β
ε, say, α = d1.d2.d3.d4 and β = d1'.d2'.d3'.d4', then the similarity function σ (α
β) is defined by:

The definition is a heuristic device to measure the difference of enzymatic reactions that are catalyzed by known (classified) enzymes. It does not exclude the possibility that d4, d3.d4, and d2.d3.d4 can be respectively expressed as wide card symbols *, *.* and *.*.* which means no clear classification of the enzyme.
However, single pair of EC string comparison is simply measuring the difference between EC strings. Often it is additionally of interest to analyze the total difference between two strings into a collection of individual elementary differences. The most important mode of such analyses is an alignment of the pathways. The function σ can be extended to alignments in a straightforward way: the similarity σ (A) of an alignment A = (α1
β1,..., αh
βh) is the sum of the similarities of the edit operations A consists of.

Example 5
The similarity of the alignment A5 in Example 3 is:
| σ (A5) = | σ (2.7.4.14 2.7.4.14) + σ (3.1.3.5 3.2.2.10) + σ (3.5.4.5 3.5.4.1)
+ σ (2.4.2.3 2.4.2.3) + σ (ε 3.5.4.5) |
| = | 1 + 0.25 + 0.75 + 1 + 0 |
| = | 3.0 |
When considering the lengths of pathways, an alignment scoring scheme is given.

Score(E1, E2)
1. Score(E1, E2) = 1 if and only if E1 = E2.
Proof
The worst case is that no index pair exists in the alignment, i. e. all edit operations are deletions and/or insertions. Hence similarity σ (A) is zero. The best case is that the similarities of all edit operations are 1, i. e. α = β, there is neither deletion nor insertion. Hence similarity σ (A) is n, and E1 and E2 share the same length, m = n. Therefore

The scoring scheme defined by Definition 6 is generic. Algorithms for optimal alignment can seek either to minimize a dissimilarity measure or maximize a scoring function. In order to achieve a maximum possible score of the alignment, the edit distance could be adapted to measure the similarity between two pathways by calculating the minimal cost of the edit operations [Levenshtein, 1966; Wagner and Fisher, 1974; Sellers, 1980]. However, our scheme is based on the index pair matching with strip functions (Definition 7) under the following consideration.
A classical dynamic alignment algorithm is to achieve the optimal matching by minimizing the edit distance. However, when taking the biological aspects of metabolic pathways into account, especially when considering that two evolutionarily related pathways are diverged for some certain biological purpose, the alignment with the edit distance is arbitrary. For example, the same metabolic pathway from two organisms may have diverged since the organisms evolved from a common ancestor, and individual metabolites and enzymes may have been changed or added or lost in one pathway. Two theories exist. The "retrograde evolution" theory [Horowitz, 1945] states that sequential disappearance of key intermediary metabolites induces the recruitment of similar available substrates via new enzymes. The "substrate ambiguity" theory [Jensen, 1976] indicates enzyme recruitment from a pool of ancestral enzymes with basic functions and substrate ambiguity. Therefore, metabolic pathways exist for certain cellular functions that transform substrates into (intermediate) products. Based on these considerations, we restrict the alignment algorithm and define a new function to perform the removal of unmatched elements from both ends of the pathway.
| where: | 1 i k m and 1 j l n,ei, ek E2,e1, e2,...,ei-1, e'k+1,...,e'm E2,e'j is the first element matching from left to right such that ei = e'j, and e'l is the first element matching from right to left such that ek = e'l. |
Also set
when ei
e'j (1
i
m and 1
j
n) or e1e2...em = Ψ or e'1e'2...e'n = Ψ or e1e2...em = e'1e'2...e'n = Ψ.
The result of δ (E1, E2) can be further performed with δ until δ (E1, E2) = (E1, E2) (Figure 2).
The set of all matched points will be denoted as λ* (E1, E2). Such that

where λ (δ 0 (E1, E2))= λ (E1, E2); 1
i < i' <...< k' < k
m, 1
j < j' < ... < l' < l
n.
According to Definition 4, we know that (i, j),(i', j'),...(k', l'),(k, l) is an index alignment of E1 and E2.
Example 6
Let two pathways E1 = e2.7.7.6e3.6.1.5e2.7.4.14e3.1.3.5e3.5.4.5e2.4.2.3e1.3.1.1e3.5.2.2e3.5.1.6 and E2 = e2.7.7.6e2.7.4.6e2.7.4.14e3.2.2.10e3.5.4.1e1.3.1.2e3.5.2.2e3.5.1.6, then
Proof
See Appendix 2.
Observation
Let E1 and E2 be two pathways of length m and n, there exists a minimum r that enables ![]()
, then
.
Algorithms and implementation
Before describing our algorithms, we introduce four different subscript notations of the Strip function δ and the Index function λ. From its definition we know that the Strip function δ is able to strip two pathways if there are two pairs of elements that are matched, e. g. ei = e'j and ek = e'l. Here, ei, e'j, ek and e'l are 4-hierarchical numbers. We count all four numbers by default. Now if we count only the first three numbers for matching, then the Strip function δ can be written as δ3 in order to distinguish the default setting δ that can also be written as δ4 in this case. The same goes for δ2 and δ1. Accordingly, λ4 is defined as the default 4-number indexing, and λ3, and so on.
Pairwise alignment
In general, we align metabolic pathways one above the other. The alignment algorithm is based on likelihood calculations of index pairs. Given two metabolic pathways P1 and P2 (Figure 3), the implemented algorithm is given by Figure 4 and the following pseudo-code:
Line 1-3: Initialize the set of unaligned EC number sequences, their lengths and score value.
Line 4-29: Starting from both ends towards the middle, align one sequence to another and attempt to find all EC numbers with same 4-level hierarchical numbers. Score the similarities. Recall the alignment positions where EC numbers are identical and cut the sequences into more sub-sequences by removing the identical EC numbers.
Line 7-27: Each pair of sub-sequences is initialized to begin a new round of 3-level hierarchical EC number matching until all pairs of sub-sequences are aligned. A similarity score is calculated afterwards.
Line 12-25: Applying the same rule again, find the similarities of rest unaligned sub-sub-sequences based on 2-level hierarchical EC number matching.
Line 17-23: Then sub-sub-sub-sequences on 1-level matching if any.
Properties
i
m and 1
j
n such that ei :::: e'j. The notation "::::" denotes that ei and ej' are compared to be identical in turn according to their 4-digit hierarchical patterns. A mapping is maximal if no other pair (l, k) exists such that el :::: e'k.
Obviously, each mapping site has two characteristics, site position (location) i and site feature name ei. The map E1 = e1e2...em is defined as a sequence of pairs ei = (ai, ri), where ai = the location of the i-th site in number of pathways and ri = the feature name at the i-th site. Similarly, E2 = e'1e'2...e'n is a map where e'j = (bj, sj). For further analysis, let the symbol
denote a natural order sequence of ai, and
denote the sequence of corresponding position bj. The number of E1, E2 maps is defined as
and
. Clearly,
=
m, n.
Example 7
The maximal mapping of two pathways E1= e3.6.1.5e2.7.4.14e3.1.3.5e2.7.1.48 and E2 = e3.6.1.8e3.1.3.5e2.7.1.48e2.7.4.14 is {(1,1),(2,4),(3,2),(4,3)}.
= 1, 2, 3, 4 and
= 1, 4, 2, 3. The mapping can be represented as line ![]()
over line
Suppose a mapping
(Figure 5), the following properties hold:
1.
=
= d,
2. ai1 < ai2 <...< aid-1 < aid
3. bj1 < bj2 <...< bjd-1 < bjd is not always true.
As emphasized above, locations of
is not necessary in natural order. We define a map alignment as a mapping where
is a sequence in natural order.
= ai1ai2...aid-1aid and
= bj1bj2...bjd-1bjd be two maps of M. Then we have:
in natural order with the longest length is the maximal map alignment of E1 and E2.
in natural order between bj1 and bjd is the index alignment.
Proof
1.) Suppose bjk0bjk1...bjkt is a sub-sequence of
in natural order with the longest length, we can obtain the responding positions of this sub-sequence: aik0aik1...aikt, so that they are a map alignment of E1 and E2. For contradiction, let us assume that there exists another mapping pair (a'ik, b'jk) excluded from the maximal map alignment, then bjk0bjk1...bjkt is not the longest subsequence, which is a contradiction.
2.) According to the definition of index alignment, the first step is to map from both ends of E1, bj1 and bjd are found. Next step is to map ai2 and aid-1, bj2 is one map of
only when bj2 is greater than bj1, and bjd-1 is one map only when bjd-1 is less than bjd and greater than bj2. Repeat the mapping till no more bjk is satisfied.
Example 8
The map of the two pathways in Example 6 is:

The maximal map alignment

The index alignment is

Proof
Given two pathways E1 = e3.6.1.5e2.7.4.14e3.1.3.5e2.7.1.48 and E2 = e3.6.1.8e3.1.3.5e2.7.1.48e2.7.4.14, then
2.7.4.14) + σ(3.6.1.5
3.6.1.8) = ¼*(1 + 0.75) = 0.43While the maximum alignment will be
The maximal matching is
Time complexity analysis
The best case to strip two pathways and to obtain their index pairs is that they are identical, which costs O(m) computing time. While the worst case is that they are not matchable, it needs O(mn) to cover all elements. In general, we define O(m
n) to show the "average" time complexity. Therefore, to create λ4*(E1, E2), it takes O(m
n). λ3*(E1', E2') takes
, λ2*(E1", E2") takes
and λ1*(E1"', E2"') takes
, where K denotes numbers of index alignment of (E1, E2), Rl denotes each stripped sub-pathway of (E1', E2'), Si denotes each stripped sub-sub-pathway of (E1", E2"), and jmil and jnil represent the length of each stripped sub-sub-sub-pathway (E1"', E2"').
Therefore, the total time complexity is
.
The best-case complexity of the algorithm is the minimum number of steps taken on any instance of size m and n. It represents the alignment of two identical pathways, which takes O(n).
The worst-case complexity of the algorithm is the maximum number of steps taken on any instance of size m and n. It represents the alignment of two pathways with no first 2-number is same, which takes at least O(3mn).
Multiple alignment
The multiple metabolic pathway alignment allows us to extract and represent biologically important but faintly / widely dispersed pathway similarities which, for instance, makes it possible to identify pathways preserved by evolution that play an important role in the cellular function and can give us hints about the evolutionary history of certain pathways.
By allowing the alignment of more than two metabolic pathways, the pairwise alignment algorithm can be extended. A multiple alignment of metabolic pathways E1, E2, ..., Ek, can be seen as a generalization of pairwise metabolic pathway alignment - instead of aligning two pathways, k pathways are aligned simultaneously, where k is any number greater than two. A heuristic algorithm is used to perform the multiple metabolic pathway alignment. The general idea of the method is to construct a succession of pairwise pathway alignments:
Step 1: Choose one pathway E1 from E = {E1, E2,...,Ek} and align with {E2,...,Ek} one after another to find the most similar pathway Ei (2
i
k).
Step 2: Choose the pathway Ei and align with E\{E1, Ei} to get the most similar pathway E'i.
Step 3: Iterate step 2 until all pathways are aligned.
The time complexity of multiple alignment of k pathways is
, where p
(1, k-1), the lengths of pathway p and p + 1 are pm and p+1n.
Implementation
The algorithm has been implemented in the PathAligner system [Chen and Hofestädt, 2004] (http://bibiserv.techfak.uni-bielefeld.de/pathaligner). Three web-based alignment interfaces are implemented: "E-E Alignment", "M-E-M Alignment" and "Multiple Alignment". "E-E Alignment" uses the basic algorithm to align two linear metabolic pathways (represented as EC number sequences). The user can also align any such a metabolic pathway against our pool database to find a list of hits. "M-E-M Alignment" considers the differences of metabolites in two pathways, which are presented as "Metabolite-EC number-Metabolite" patterns of sequence. It is possible to pick up two such pathways and align them to identify whether they are alternative pathways. "Multiple Alignment" allows the alignment of more than two metabolic pathways.
Identification and analysis of metabolic networks is a complex task due to the complexity of the metabolic system. Abstract pathway defined as a linear reaction sequence is practical for our alignment algorithm. We have presented an algorithm to study the problem of metabolic pathway alignment. Our algorithm calculates the hierarchical similarities of EC numbers mapping from both ends of the sequences. The entire processing for pairwise alignment takes time
, where m, n are the lengths of two (stripped) metabolic pathways. The algorithm described here has been successfully implemented into the PathAligner system.
Our algorithm calculates the similarity of reactions; it can be adopted to comparatively analyze other kinds of biopathways, such as signaling pathways. Although there is no nomenclature system for signal transductions at the moment, Chen et al., 2004, presented a classification of signal transduction events which can be used for this purpose. Since the ST number (each signal transduction is assigned a unique ST number) is also a 4-level hierarchical structure, it makes it possible to exploit our metabolic pathway alignment algorithm to perform signaling pathway alignment.
Currently a metabolic pathway is practically defined as a linear molecule sequence. However, when the topology of network is concerned, more information related to the components of the pathways like their length, size, number of feedback cycles, number of crosstalks between pathways and area reachable from any point in the network should be considered as much as possible. In this case, the metabolic pathway comparison will rather be the comparison of a network instead of a single linear pathway alignment. As a result, this leads to another type of pathway computation, which can be categorized as the comparison of biological networks. By comparing such types of networks to different biological systems, it is possible to identify similarities and variations among different species.
The work was supported by the German Research Foundation (DFG) graduate program "Bioinformatics" in the Uni-Bielefeld.
Proof of Lemma 3
The key to the last equality is to consider the binomial expansion (x + y)n. We have
Appendix 2
Proof of Lemma 5
With no loss of generality assume that the r matched point (i,j),(i',j'),...(k',l'),(k,l) (Figure 2), then
Since (x + y)m + n = (x + y)i + j (x + y)i' + j' - i - j ...(x + y)k + l - k' - l' (x + y)m + n - k - l, and
, we have
Compare the coefficient of xmyn, obtain
The least number is the case there is no more sub-path to align, so that