In Silico Biology 7 S1, 07 (2007); ©2007, Bioinformation Systems e.V.  

Workshop "Storage and Annotation of Reaction Kinetics Data"
May 2007, Heidelberg, Germany


Good publication practice as a prerequisite for comparable enzyme data?


Carsten Kettner




Beilstein-Institut, Trakehner Str. 7 - 9
D-60487 Frankfurt/Main, Germany

Email: ckettner@beilstein-institut.de





Edited by I. Rojas and U. Wittig (guest editors); received and accepted March 21, 2007; published March 27, 2007



Abstract

Systems-level investigation of genomic- and proteomic-scale information places incomparably higher demands on data quality than in previous decades. Truly integrated databases that deal with heterogeneous data need to be developed so that properties of genes, kinetics of enzymes and the behaviour of complex networks can be retrieved, and complex biological processes can be analysed and modelled. Despite the fast-paced global efforts in biological systems research, current analyses are limited by the lack of systematic collections of comparable functional enzyme data. Besides being reliable, these data have to provide defined minimum experimental information, must be available from the literature along with their accepted enzyme names, and must be as comprehensive as possible. The reality, however, reveals a different picture: the quality of experimental enzyme data is insufficient for the needs of systems-level investigations. A working group founded in 2003, called STRENDA, recently published suggestions intended both to improve the quality of reporting of functional enzyme data and to support the comparability of, inter alia, enzyme kinetics for their application in the in silico investigation of biological systems.

Keywords: enzymology, systems biology, functional enzyme data, modelling, simulation, STRENDA, comparability of data, minimum experimental information



The dilemma of modern enzymology

The post-genomic era is characterized by highly integrated, interdisciplinary research drawing on fields as diverse as mathematics, computational biology, bioinformatics, functional genomics and proteomics, and structural biology. Furthermore, the enormous growth in computation speed and data storage capacity has created new opportunities both for the accumulation of massive amounts of sequence, expression and functional data and for the characterization, analysis and comparison of larger biological systems.

Continuous technical and methodological advances have enabled biochemical pathway analyses of increased depth, efficiency and accuracy. The resulting flow of information is so large that the scientist's intuition alone is no longer sufficient to draw conclusions.

The alliance of experimental and theoretical disciplines has also led to the foundation of a new branch within the life sciences, called systems biology. The concept behind this contemporary catchphrase is not new: Ludwig von Bertalanffy applied general systems theory not only to biology but also to psychology, economics and social science [1]. Nowadays, the discipline combines experimental and theoretical biology to investigate mainly metabolic networks, the regulation of developing and developed cells, cell and tissue specification and further highly complex cellular processes [2]. One step further, it aims to understand the responses of a given biological system, and the molecular interactions within it, after external disturbances. One possible vision of systems biologists is not only to depict the cellular metabolic pathways, as in the well-known Boehringer poster, but to do so in three dimensions and with a higher level of information than, for example, the KEGG pathway map (http://www.genome.ad.jp/kegg/pathway.html). Such digitised maps may be applied to the understanding and simulation of the treatment of diseases such as diabetes. The results of systems biological research could be used both for the development of new "intelligent" drugs [3] and for industrial applications such as the directed modification of microbial strains to improve their metabolic performance, the so-called white biotechnology [4].

The means of systems-level investigation is a series of overlapping mathematical models that reconstruct cellular metabolism after theoretical and experimental data have been analysed by computational, mathematical and engineering methods. The aim is to create predictive models of single biochemical processes, cells, tissues or even entire organs [5]. Furthermore, a model that works well will be useful for designing initial or further experiments that verify or refute both the working hypotheses on the physiological functions of the modelled systems and the predictions made previously. This is part of the classical research cycle in modern biology as illustrated in Fig. 1. The entry point into this cycle depends on the researcher's discipline: either theoretical (data- or hypothesis-driven) modelling or experimental design related to a specific scientific problem. Eventually, experimental data are to be reconstructed by models that are determined by a computable set of assumptions and hypotheses, which in turn must be tested or confirmed experimentally. "Dry" experiments, such as simulations, test the validity of assumptions and hypotheses. Inconsistencies between theoretically modelled data and established experimental facts are attributed to inadequate models, which are therefore rejected or modified. Models that pass this test undergo system analysis, resulting in a number of predictions which themselves suggest further "wet" experimental tests with established or modified experimental designs. Data from successful experiments either verify the computational models or reveal their inadequacies, and enter the pool of biological knowledge that is the source of further hypotheses. Likewise, analysed data from "wet" experiments pass into this pool of biological knowledge and can be subjected to modelling and simulation. These dry experiments can range from simulations of the activity of single enzymes to the modelling of biological networks considering various aspects of the regulation of enzymes and metabolic fluxes.


Figure 1: Hypothesis-driven research cycle in modern biology, interrupted by an open working site that indicates a gap between experimental and successful theoretical research (modified from [6]).


The combined power of new experimental and computational methods has resulted in an explosion in the amount of information that is unparalleled in the history of biology.

Ironically, despite the fast-paced global efforts in biological systems research, comprehensive analyses are complicated by the lack of systematic collections of comparable functional enzyme data. Functional data of enzymes include measurements of their behaviour in terms of catalysis, interaction and transport, all of which are described by parameters such as Kd, Km, Ki and rate constants that depend on pH, temperature, ionic strength, inhibiting and activating compounds, substrate specificity, etc. These are usually numerical data which are required to describe, for example, the kinetics of a given enzyme and, subsequently, of entire pathways. If a number of enzymes in a given pathway are investigated under (at least) comparable experimental conditions, the resulting data will also be comparable and suitable to feed further steps of analysis, modelling and experiment. However, researchers encounter increasing numbers of collections of enzyme characterization data, all of which should be used cautiously. This leads to the incredible situation that sound numerical analysis of pathways is not possible: the circle is broken, or at least interrupted, by an open working site which indicates a growing gap between experimental and theoretical research. Consequently, researchers require comprehensive, comparable, valid and reliable experimental enzyme data.
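
To make the dependence on such parameters concrete, the following minimal Python sketch evaluates the Michaelis-Menten rate law, v = Vmax [S] / (Km + [S]), for the same enzyme with two different Km values. The condition labels, the Km values and the function names are hypothetical and chosen only for illustration; they are not taken from any database or from the survey described in this article.

```python
def michaelis_menten_rate(substrate_mm: float, vmax: float, km_mm: float) -> float:
    """Initial rate v = Vmax * [S] / (Km + [S]) for a single-substrate enzyme."""
    if substrate_mm < 0:
        raise ValueError("substrate concentration must be non-negative")
    if km_mm <= 0:
        raise ValueError("Km must be positive")
    return vmax * substrate_mm / (km_mm + substrate_mm)


# Hypothetical Km values (in mM) for one enzyme, as they might be reported
# by two laboratories working under different assay conditions.
KM_BY_CONDITION_MM = {
    "pH 7.0, 25 C": 0.10,
    "pH 8.0, 37 C": 0.40,
}


def predicted_rates(substrate_mm: float, vmax: float = 1.0) -> dict:
    """Rate predicted for each hypothetical parameter set at one substrate level."""
    return {
        condition: michaelis_menten_rate(substrate_mm, vmax, km_mm)
        for condition, km_mm in KM_BY_CONDITION_MM.items()
    }


if __name__ == "__main__":
    for condition, rate in predicted_rates(0.10).items():
        print(f"{condition}: v/Vmax = {rate:.2f}")
```

With these assumed values and a substrate concentration of 0.10 mM, the two parameter sets give 50% and 20% of Vmax, which illustrates how values determined under different conditions lead to different model predictions for the same enzyme.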

Whilst international genome sequencing projects have spent much effort and money on the creation of huge sequence databases and have attracted considerable public attention, and whilst protein structural information has also been rapidly accumulated in such databases, it has become increasingly apparent that little effort has been invested in the systematic characterization of enzyme functions. Two major reasons can easily explain this situation: (1) deriving data from experimental work for databases is expensive and very time-consuming, and (2) it is inherently very difficult to collect, interpret and standardize published data, since they are widely distributed among journals covering a number of fields, and the data themselves often depend on the experimental conditions. Additionally, research on functional enzyme characteristics currently appears to be underrepresented. Therefore, comprehensive functional enzyme data are either of limited availability or non-existent, even though a few projects are concerned with the collection of functional and kinetic enzyme data: the BRENDA database (www.brenda.uni-koeln.de) for enzyme functions and properties, SABIO-RK (http://sabio.villa-bosch.de/SABIORK/) [7] for biochemical reactions within metabolic pathways along with their kinetic equations, and KEGG (http://www.genome.ad.jp/kegg/pathway.html), BioCyc (http://www.biocyc.org) [8] and BioCarta (http://www.biocarta.com/genes/index.asp) for the representation of metabolic pathways.

What the enzyme data collections in particular have in common is that their functional data are not comparable, because values and information from functional characterizations are usually determined with laboratory-specific applications and implementations of the experimental designs, which may lead to misinterpretation of laboratory findings. In silico analysis and representation of metabolic systems are certainly impossible under these circumstances [9].



A brief insight into the enzyme data minefield

To illustrate the current disappointing situation, in which functional enzyme data are not comparable and their quality can hardly be appraised, we undertook a brief, but certainly not representative, examination of some enzymological and methodological data for the key enzymes involved in glycolysis, i.e. glucokinase (EC 2.7.1.1), 6-phosphofructokinase (EC 2.7.1.11) and pyruvate kinase (EC 2.7.1.40), from baker's yeast Saccharomyces cerevisiae and Escherichia coli as well as from Bacillus stearothermophilus and the slime mould Dictyostelium discoideum [10]. The BRENDA database was chosen as the preferred data repository for the survey of these enzymes.

The glycolysis pathway was selected because it is almost certainly one of the best understood metabolic pathways, and we believed that for this pathway the best information on the functional characteristics of the enzymes involved would be available for most organisms. The main criteria for the functional description of these enzymes were data on turnover kinetics and information about activating and/or inhibiting compounds and molecules such as cofactors, allosterically acting compounds and ions. We were also interested in temperature and pH profiles. In detail, data for the following characterizing parameters were investigated: Km, substrates, products, activating compounds, inhibitors, molecular weight, co-enzymes, specific activity, temperature optimum, temperature stability, pH optimum, and pH range.

Finally, the experimental conditions were extracted from the original literature as referenced in BRENDA. We wanted to see whether the functional data are comparable and suitable for modelling and simulation, and whether the descriptions of the experimental conditions would allow comparison of the functional data.

However, reality looks less encouraging.

The result of our survey is that the functional enzyme data are fragmentary and that for some enzymes there is no functional information at all [10]. This is clearly not the fault of the BRENDA database but arises from the inadequacy of the data in the literature. Even our initial assumption that a huge amount of data would be available for glycolysis was wrong. The study began with queries for yeast and E. coli. However, since the retrieval results were zero for glucokinase and low for pyruvate kinase, it was necessary to extend the investigation to Bacillus and Dictyostelium (Tab. 1).


Table 1: Functional enzyme data of the key enzymes of glycolysis from 4 different organisms.

Organism                Glucokinase   6-Phosphofructokinase   Pyruvate kinase
S. cerevisiae           0             10 (77%)                7 (54%)
E. coli                 0             9 (69%)                 6 (46%)
B. stearothermophilus   11 (84%)      10 (77%)                13 (100%)
D. discoideum           9 (69%)       0                       0

Each entry gives the number of the 13 chosen parameters for the functional characterization of the key enzymes of glycolysis for which data were available, with the corresponding percentage in parentheses; 0 means no data. Only in one case have all 13 parameters been investigated (pyruvate kinase of B. stearothermophilus, 100%).
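
The percentages in Table 1 follow from the parameter counts and the total of 13 parameters. The short Python sketch below recomputes them; the dictionary layout and the function names are illustrative assumptions, and only the counts themselves are taken from the table. Note that rounding 11/13 to the nearest integer gives 85%, whereas Table 1 lists 84%; the sketch does not attempt to resolve this difference.

```python
TOTAL_PARAMETERS = 13  # number of parameters chosen for the survey

# Parameter counts per organism and enzyme, copied from Table 1.
COUNTS = {
    "S. cerevisiae": {"Glucokinase": 0, "6-Phosphofructokinase": 10, "Pyruvate kinase": 7},
    "E. coli": {"Glucokinase": 0, "6-Phosphofructokinase": 9, "Pyruvate kinase": 6},
    "B. stearothermophilus": {"Glucokinase": 11, "6-Phosphofructokinase": 10, "Pyruvate kinase": 13},
    "D. discoideum": {"Glucokinase": 9, "6-Phosphofructokinase": 0, "Pyruvate kinase": 0},
}


def coverage_percent(count: int, total: int = TOTAL_PARAMETERS) -> int:
    """Share of the chosen parameters with data, rounded to a whole percent."""
    if not 0 <= count <= total:
        raise ValueError("count must lie between 0 and total")
    return round(100 * count / total)


def coverage_table(counts: dict = COUNTS) -> dict:
    """Coverage in percent for every organism/enzyme pair."""
    return {
        organism: {enzyme: coverage_percent(n) for enzyme, n in enzymes.items()}
        for organism, enzymes in counts.items()
    }


if __name__ == "__main__":
    for organism, enzymes in coverage_table().items():
        cells = ", ".join(f"{enzyme}: {pct}%" for enzyme, pct in enzymes.items())
        print(f"{organism}: {cells}")
```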

In particular, the results for S. cerevisiae and E. coli were surprising, since these two organisms have long been well studied by both biochemical and molecular biological methods. Yeast plays an important economic role, its genome was completely sequenced ten years ago, and since then functional and structural proteomics have made great strides [11, 12]. E. coli is the main "workhorse" of molecular biologists, who use this organism in expression studies and as the main transformation vector [13]. Thus, comprehensive data on their fundamental metabolic pathway were expected. The best-investigated enzyme of both organisms seems to be 6-phosphofructokinase, as is reflected by several publications. In contrast, many data are missing for pyruvate kinase in both organisms: for example, data on the inhibitors, temperature and pH range are missing for E. coli as well as for S. cerevisiae. In addition, there are no kinetic data, such as Km value and specific activity, for E. coli. Interestingly, the best-studied organism with regard to glucose degradation seems to be B. stearothermophilus, a prominent extremophile. All criteria for the functional description of the glycolytic key enzymes seem to be fulfilled. Finally, the slime mould D. discoideum provided the least data; only a few functional data are available for the first step of glucose degradation, carried out by the enzyme glucokinase.

An examination of the materials and methods sections of the relevant publications revealed fundamental differences in the application of apparently commonly used methods. At first glance, the functional data of each enzyme from all the organisms were obtained by comparable methods: the coupled optical test and/or the pH-stat assay. However, the decisive differences lie in the basic experimental conditions. Measurements were performed at different temperatures; different wavelengths were used to record NADH oxidation (which might be less critical); and, finally, the composition of the assay buffers ranges from simple (e.g., for pyruvate kinase of E. coli) to rather "complicated" (e.g., for pyruvate kinase of S. cerevisiae) with respect to the number and types of compounds, which makes it hard to ascribe possible side effects to single components within the assay.

In conclusion, the result of our short study was surprising and alarming to us. We observed both incomplete descriptions of materials and methods in the papers and difficulties arising from method-dependent results. These issues are most apparent in measurements of the rates of enzyme-catalysed reactions, which depend strongly on the experimental method used. In particular, pH, temperature, assay buffers and substrate availability affect the kinetic behaviour of enzymes. The key enzymes of glycolysis are not unknown species within a metabolic pathway. Quite the contrary, all these enzymes are well investigated with respect to their structures and sequences. Databases such as Swiss-Prot, KEGG, PDB and PIR provide a comprehensive collection of information on protein identification, subunit composition and stoichiometry, as well as information about the isolation and storage of purified proteins. Given such poor availability of functional data, fragmentary descriptions of methods and broad differences in the application of commonly used methods, it is hard to imagine that subsequent metabolic simulation and modelling could be carried out successfully. However, this problem affects not only researchers studying metabolism by in silico means; experimentalists, too, face ranges of method-specific enzyme data that are associated with individual methods, and often poor descriptions in databases or in the scientific literature.
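
The kind of check a modeller would have to perform before combining values from different publications can be sketched as follows. This is only an illustration under stated assumptions: the record fields, the tolerances and the example values are hypothetical and do not reproduce the conditions reported in any of the surveyed papers.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class AssayConditions:
    """Hypothetical minimal record of the conditions behind one kinetic value."""
    method: str
    temperature_c: float
    ph: float
    buffer_components: FrozenSet[str]


def condition_differences(
    a: AssayConditions,
    b: AssayConditions,
    temp_tol_c: float = 1.0,  # assumed tolerance, not a published standard
    ph_tol: float = 0.1,      # assumed tolerance, not a published standard
) -> List[str]:
    """List the conditions in which two assays differ; empty if comparable."""
    differences = []
    if a.method != b.method:
        differences.append("method")
    if abs(a.temperature_c - b.temperature_c) > temp_tol_c:
        differences.append("temperature")
    if abs(a.ph - b.ph) > ph_tol:
        differences.append("pH")
    if a.buffer_components != b.buffer_components:
        differences.append("buffer composition")
    return differences


# Two invented assay descriptions for the same enzyme from different organisms.
simple_assay = AssayConditions(
    "coupled optical test", 25.0, 7.0, frozenset({"Tris-HCl", "MgCl2", "KCl"})
)
complex_assay = AssayConditions(
    "coupled optical test", 30.0, 7.5,
    frozenset({"HEPES", "MgCl2", "KCl", "EDTA", "glycerol"}),
)

if __name__ == "__main__":
    print(condition_differences(simple_assay, complex_assay))
```

An empty result would mean that, within the assumed tolerances, the two values were obtained under comparable conditions. A record with missing fields could not be checked at all, which is the situation that incomplete methods descriptions create.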



Good publication practice as a first step out of the dilemma

Under these circumstances, it is obvious that these problems must be solved: as long as the quality of the input data and of the resulting modelling data cannot be improved, the chances of systems biology escaping from the overused "-omics" sciences are poor. About five years ago the Beilstein-Institut, which has a tradition in the systematization of data for chemists, became concerned about the inconsistent way in which functional enzyme data are reported. It therefore decided both to organize a workshop series to moderate a broad discussion about the necessity of standardizing experimental designs and data reporting, and to support the newly inaugurated STRENDA initiative. STRENDA stands for Standards for Reporting Enzyme data and aims both at defining guidelines for an adequate practice of scientific publication and at creating a comprehensive data acquisition system. This system allows authors to submit their experimental data electronically to public databases such as BRENDA prior to publication [14]. Because kinetic data are necessarily extracted from the literature manually, a process that is expensive and time-consuming, systems biologists recognized gaps in the available parameters and developed "high-throughput" extraction methods such as text mining to complete enzyme data sets. The alternative is to establish a deposition system to which authors can submit their data, to ensure maximal accuracy and accessibility and possibly to replace the traditional retrospective process of manual data extraction.

As a long-term vision, STRENDA aims at establishing standard experimental conditions to ensure that reliable, validated and comparable enzyme data are generated. It would be a great asset to standardize experimental conditions that mimic the conditions the enzyme experiences in its natural habitat, where it is active. Furthermore, it will be necessary to adopt these standards for a set of model organisms in order to define the required parameters or even assay conditions. This is a worthwhile goal, but it requires the input and support of the entire scientific community, since a series of potential problems has to be addressed.

The STRENDA commission decided to start its standardization work several steps short of its long-term vision: because the descriptions of experimental materials and methods, as well as of the results, are often incomplete, the commission developed guidelines for the reporting of data in publications. These guidelines are intended to pave the way to Good Publication Practice, ensuring data quality and data identification in order to achieve comparability of enzyme data. There are, in fact, two lists of recommendations. Both lists are published on the STRENDA web site (www.strenda.org/documents.html) and are open for comments and advice from the scientific community.

The first list specifies those parameters which are required in the materials and methods sections of publications, since any enzyme data should be linked to a description of the experimental conditions to allow assessment of the data themselves and of their quality. Thus, it is recommended that the following details be given:

The second list is concerned with the way in which data are reported. It aims at describing the experimental results comprehensively, together with adequately applied statistics, to allow a quality check on the data and comparison of their values with others. This list comprises



Outlook

It is hoped that the recommendations suggested by STRENDA will be adopted by the scientific journals, since strictly comparable and reliable experimental and computational data of high quality are much needed for

However, it should be emphasized that these recommendations on Good Publication Practice are not mandatory for the scientific community; rather, the community is encouraged to keep these guidelines in mind as one way of creating comparable enzyme data sets. The guidelines should also be of interest to publishers as a means of ensuring high data quality and maintaining the scientific impact of their journals.




References


  1. von Bertalanffy, L. (1969). General System Theory: Foundations, Development, Applications. George Braziller, New York.

  2. Ideker, T., Galitski, T. and Hood, L. (2001). A new approach to decoding life: systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343-372.

  3. Werner, E. (2002). Systems biology: the new darling of drug discovery? Drug Discov. Today 7, 947-949.

  4. Lee, S. Y., Lee, D.-Y. and Kim, T. Y. (2005). Systems biotechnology for strain improvement. Trends Biotechnol. 23, 349-358.

  5. Noble, D. (2002). Modeling the heart - from genes to cells to the whole organ. Science 295, 1678-1682.

  6. Kitano, H. (2002). Systems biology: a brief overview. Science 295, 1662-1664.

  7. Wittig, U., Golebiewski, M., Kania, R., Krebs, O., Mir, S., Weidemann, A., Anstein, S., Saric, J. and Rojas, I. (2006). SABIO-RK: Integration and Curation of Reaction Kinetics Data. In: Proceedings of the 3rd International workshop on Data Integration in the Life Sciences 2006 (DILS'06). Hinxton, UK. Lecture Notes in Computer Science 4075, 94-103.

  8. Karp, P. D., Ouzounis, C. A., Moore-Kochlacs, C., Goldovsky, L., Kaipa, P., Ahrén, D., Tsoka, S., Darzentas, N., Kunin, V. and López-Bigas, N. (2005). Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res. 33, 6083-6089.

  9. Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S. and Gilles, E. D. (2002). Metabolic network structure determines key aspects of functionality and regulation. Nature 420, 190-193.

  10. Kettner, C. and Hicks, M. G. (2005). The dilemma of modern functional enzymology. Current Enzyme Inhibition 1, 171-181.

  11. Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H. and Oliver, S. G. (1996). Life with 6000 genes. Science 274, 563-567.

  12. Fisk, D. G., Ball, C. A., Dolinski, K., Engel, S. R., Hong, E. L., Issel-Tarver, L., Schwartz, K., Sethuraman, A., Botstein, D. and Cherry, J. M.; The Saccharomyces Genome Database Project. (2006). Saccharomyces cerevisiae S288C genome annotation: a working hypothesis. Yeast 23, 857-865.

  13. Blattner, F. R., Plunkett, G. III, Bloch, C. A., Perna, N. T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J. D., Rode, C. K., Mayhew, G. F., Gregor, J., Davis, N. W., Kirkpatrick, H. A., Goeden, M. A., Rose, D. J., Mau, B. and Shao, Y. (1997). The complete genome sequence of Escherichia coli K-12. Science 277, 1453-1474.

  14. Apweiler, R., Cornish-Bowden, A., Hofmeyr, J.-H. S., Kettner, C., Leyh, T. S., Schomburg, D. and Tipton, K. (2005). The importance of uniformity in reporting protein-function data. Trends Biochem. Sci. 30, 11-12.