Fast Stochastic Simulation of Metabolic Networks

Markus Schwehm




Universität Tübingen
ZBIT - Zentrum für BioInformatik Tübingen
Sand 1,
72076 Tübingen, Germany
Email: schwehm@informatik.uni-tuebingen.de






INTRODUCTION

Many approaches exist for the simulation of biochemical cellular processes using deterministic and stochastic modelling approaches. Just recently the goal has been set for whole cell simulation. For example TOMITA et al. (1999) have announced E-Cell, a deterministic simulator based on ordinary differential equations. Stochastic simulators consume more computing power and are thus generally used for smaller models. Nonetheless ENDY et al. (2000) have managed to simulate a bacteriophage by stochastic simulation and GIBSON (2000) has improved the stochastic simulation algorithm with his Stochastirator such that whole cell simulation comes within reach.


MODEL COMPLEXITY

If we want to simulate models of whole cells, it is a good idea to estimate the complexity of whole cell models first. A lower limit for the number of different molecule species can be obtained from the number of genes in the genome. For Escherichia coli the total number of genes is currently expected around 4400 (see KARP et al. (2000)). Extrapolating an estimation by GOODSELL (1993), the cytoplasm of E. coli contains about 22.500 proteins, 15.000 ribosomes, 170.000 tRNA molecules, some mRNA molecules, 15 million small organic molecules (like amino acids, nucleotides, sugar, ATP, NAD and others), 25 million ions and about 70% of the volume contains water. The number of biochemical reactions executed during the cell cycle of E. coli has been estimated by ENDY and BRENT (2001) between 1014 and 1016. The complexity of other cell types is about a factor thousand smaller for the smallest cell types (mycoplasms) and about a factor of thousand larger for typical plant and animal cells.


STOCHASTIC ALGORITHM

A general stochastic simulation algorithm may be outlined as follows:
1: Initialisation, 2: Select reaction, 3: Execute reaction, 4: Update data, 5: Termination or repeat 1-3
Existing stochastic simulation algorithms differ basically in the way how the next reaction is selected.

The Direct Method for selecting a reaction was introduced by GILLESPIE (1976). For each reaction a probability is computed by multiplying the rate constant of each reaction with the concentration of its substrates. Then a random number is used to perform a roulette-wheel selection according to the relative probabilities of all reactions, and a second random number determines the execution time used for this reaction.The direct method uses two random numbers for each reaction selection. The computing complexity is linear in time and storage space with respect to the number of reaction types.

The First Reaction Method was also introduced by GILLESPIE (1976). For each reaction a tentative execution time is computed (using a random number for each reaction). Then the reaction with the smallest tentative execution time is selected. The First Reaction Method uses a random number for each reaction and during each iteration and the execution complexity is linear in time and space with respect to the number of reaction types. This method is not efficient due to the many random numbers needed.

The Next Reaction Method is an efficient alternative to the First Reaction Method and was introduced by GIBSON (2000). The tentative reaction times are computed for each reaction once during initialization. The reactions are stored in a priority queue with the smallest tentative execution time in root position. Then the first reaction is selected from the root position and executed. During update only the tentative execution time of the last executed reaction has to be updated using a single random number and the priority queue has to be updated accordingly. Finally all reactions that are affected by the substrate and product changes of the executed reaction can be updated without using a new random number. The next reaction method thus only needs one random number per iteration. If the priority queue is implemented by a heap, the update of a reaction can be done in logarithmic time. This reaction selection and update method thus outperforms the preceding methods when applied for large models with a large number of different biochemical reactions.

The Random Substrate Selection (as I call it) has been introduced by MORTON-FIRTH (1998) and follows a completely different approach. Each molecule in the system is treated individually (i.e. a mesoscopic modelling style is applied). For selecting a reaction, the substrates are selected randomly. Then a lookup in a reaction table determines whether a reaction is available for these reactions to be executed. A third random number determines if this reaction will actually be performed. Besides the update of the substrate and product quantities affected by the executed reaction there is no need for updating any data structure. The Random Substrate Selection uses three random numbers per iteration to determine whether a reaction occurs or not. If the number of molecule species rises, the probability that two randomly picked species can actually react with each other decreases so that an increasingly large number of iterations will be necessary to find the next executable reaction. The Random Substrate Selection is thus not suitable for large numbers of reaction types as they appear in whole cell models.


IMPLEMENTATION

We have implemented a sequential simulator engine in Java. The sequential Java simulator engine implements the stochastic algorithm as described in Section 3 using the Next Reaction Method of GIBSON (2000). A profiling of the algorithm has revealed that the simulator engine spends most of its execution time for maintaining the priority queue of the tentative reaction times.


PERFORMANCE EVALUATION

There is a lack of performance benchmarks for metabolic network simulations. While some simple models like a Simple Cascade, the Volterra Oscillator and the Michaelis Menten Model have intensively been studied on several simulators, a suite of large scale and scaleable model descriptions (preferably in SBML syntax, HUCKA et al. 2001) together with expected simulation results and corresponding measured performance results on different hardware/simulator combinations is missing. To begin with, we have implemented a scalable benchmark based on the simple cascade. This simple benchmark generator takes two arguments: the length of the cascade n and the number of molecules in the first molecule species. All other molecule species are initially set to zero. This benchmark suite is unrealistic in its use of a single type of reaction (substrate->product) only. On the other hand the simplicity of the simple n-cascade allows arbitrary scaling of the model with respect to the number of substance species/reaction types and with respect to the total quantity of molecules in the system.


Performance Results

The performance results for running a simple n-cascade benchmark on the sequential Java simulation engine are summarized in Table 1. The experiments were performed on a 500 Mhz Pentium II PC with 256 Mbyte RAM running Red Hat Linux 6.2 and Java JDK 1.3.0.

Table 1: Performance of the Java Sumulator Engine


Experiments No. 1 to 6 execute a constant number of one million reactions within a Cascade of increasing length (i.e. increasing number of different reactions) Simulation time grows slower than the cascade length as can be expected from the logarithmic complexity of the priority queue update for the next reaction method. Experiments No. 7 - 9 illustrate the performance of a model with a short (simple) cascade and a very large number of molecules. One can observe that the simulation time grows linear with the number of molecules. Experiments No. 10 - 13 are with a constant number of 100 million reactions to be executed and with increasing cascade length. Again the simulation time grows logarithmic. A cascade with one million reactions was not executable due to an out of memory error.

The last column of Table 1 lists the number of reactions executed per second. If we consider a model complexity similar to E.coli as discussed in Section 2, we can expect about 0.25 Million reactions to be executed per second on a Java simulation engine.


CONCLUSION

It is now time to ask the question how far we are away from whole cell simulation. Lets make a very optimistic estimation by ignoring the problem that whole cell models will not be around anytime soon. Lets ask the question - if we had an appropriate whole cell model - could we simulate it? Assume optimistic 1014 biochemical reactions per cell cycle of E. coli (ENDY & BRENT 2001). With the above performance results a stochastic whole cell simulation of a whole cell cycle would take about 4x1012 seconds or about 12 years on a single processor. Stochastic whole cell simulation is thus either the realm of massively parallel computing, or it needs new algorithms which can combine determinstic and stochastic simulation techniques. We will investigate both directions in our future research.


REFERENCES