|In Silico Biology 2, 0007 (2002); ©2002, Bioinformation Systems e.V.|
|Dagstuhl Seminar "Functional Genomics"|
European Media Laboratory,
Edited by E. Wingender; received October 1, 2001; revised January 17, 2002, accepted January 18, 2002; published February 14, 2002
To provide support for the analysis of biochemical pathways, a database system based on a model that represents the characteristics of the domain is needed. This domain has proven difficult to model using conventional data modelling techniques. We are building an ontology for biochemical pathways, which acts as the basis for the generation of a database on the same domain, allowing the definition of complex queries and complex data representation. The ontology is used as a modelling and analysis tool which allows the expression of complex semantics based on a first-order logic representation language. The induction capabilities of the system can help the scientist in formulating and testing research hypotheses that are difficult to express with the standard relational database mechanisms. An ontology representing the shared formalisation of the knowledge in a scientific domain can also be used as a data integration tool, clarifying the mapping of concepts to the developers of different databases. In this paper we describe the general structure of our system, concentrating on the ontology-based database as the key component of the system.
Keywords: ontology; database; biochemical pathways; Mycoplasma pneumoniae
The biology and biochemistry research communities have long recognised the need for the creation of database systems to support their research activities. This recognition is motivated by the ever-growing amount of biochemical data, generated by the use of new high-throughput experimental technologies and methods. Already, the analysis of biological data using traditional database systems is leading to the discovery of new relationships among concepts, the definition of new experiments, the creation of new methods and the isolation and definition of new concepts. However, much remains to be done to improve data integration, representation and understanding. Researchers need sophisticated access to this wealth of biochemical and biological data, a demand that traditional data management techniques struggle to meet. A biochemical database system needs to support research and scientific activities, helping in the formulation of hypotheses which are subsequently to be corroborated or falsified by experiments.
Biochemical pathways are subject to intense experimental research due to their role in cell metabolism. Thus, when designing a system to support their modelling and analysis it is necessary to consider the rapid evolution of concepts and relations being represented in the database model.
Data associated with biochemical pathways is stored in many data repositories, with different access mechanisms, formats and structures. In order to make sense of these data it is necessary to integrate them in a comprehensive manner. This is a hard and cumbersome process, which in most cases is carried out by hand. The development of automatic parsers for the different data sources has been and is being addressed by many groups, e. g. [3, 5, 11]. These parsers provide important support, but in many cases the actual data integration and (sometimes) data curation has to be carried out by biologists or biochemists, i.e. people with knowledge in the field who can infer the semantics of the data.
Data Alive is a project that focuses on the development of an integrated database system for the computational analysis and visualisation of biochemical pathways. It is defined within the frame of the ELSA project (http://projects.eml.org/elsa). The information included in the system relates to genes and genomes, proteins, enzymes and pathways, as well as experimental data related to these concepts, e. g. gene expression data (e. g. transcription analysis), data derived from protein analysis (2D-gel electrophoresis, mass spectrometry etc.) and detected activities of enzymes catalysing biochemical reactions. The initial prototype aims at building an integrated database system with information related to Mycoplasma pneumoniae. However, the system has been designed in such a way that it can be used to store, handle and compare data from many organisms. Projects similar to ours are PathDB (focused on yeast data) and MetaCyc (focused on E. coli).
As the basis of the system we have built a domain-specific ontology, which in turn is based on a core ontology that includes basic (general) concepts related to events, time, space and complex relationships. The domain ontology includes the concepts and relationships characteristic of biochemical pathways and the related concepts. This conceptualisation of the domain is done independently of the data available and of the structure of these data. This semantic specification facilitates the handling and interpretation of the data, as will be explained later.
The ontology is used to automatically generate a deductive database system, together with the associated application programming interface (API) to access the database. Both the development of the ontology and the generation of the database and API are done using the ontology manager system (OMS) from Ontology Works Inc. Apart from the creation of a deductive database, we have done a parallel implementation of the model in a relational database system, in order to facilitate the process of database population and data curation.
The ontology also serves as the basis for the process of data integration and for the automatic extraction of information about biochemical pathways from free text sources.
In this paper we present a general description of the Data Alive system. We will first present a general overview of the system, highlighting the main features of the system's components. We then describe the main characteristics of the ontology-based database model that has been built. In section "User interfaces" we present an overview of the different query and interface mechanisms that have been implemented. The main issues related to the population of the database are discussed in section "Database population". Finally we conclude our work with some remarks about the future directions of the project.
In Figure 1 we present the overall structure of the system. The core of the system is a database consolidating the data extracted from different data sources, such as KEGG and SWISS-PROT, and containing references to complementary information in other databases. The data used to populate the database is obtained from databases available on the internet, from experimental results provided by project partners and from literature sources. We will elaborate on the population of the database in the section "Database population".
|Figure 1: General structure of the Data Alive system to support the computational analysis and visualisation of biochemical pathways.|
The database has been implemented in a deductive database system and in parallel in a relational database system, namely Oracle. We took this approach for various reasons:
biochemPathway(biochempath_1).
reactionName(biochempath_1,'Glycolysis').
chemicalReaction(chemicalReaction_1).
subReaction(biochempath_1,chemicalReaction_1).
reactionReactant(chemicalReaction_1,chemical_25,1).
reactionReactant(chemicalReaction_1,chemical_34,1).
reactionProduct(chemicalReaction_1,chemical_26,1).
reactionProduct(chemicalReaction_1,chemical_35,1).
Our final goal is to combine the data stored in the relational database with the induction, deduction and reasoning capabilities offered by deductive database systems.
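As a rough illustration of what facts of this kind encode, the following Python sketch stores the same predicates as tuples and answers one simple question by joining them. The names `FACTS`, `match` and `reactants_of_pathway` are hypothetical and not part of the Data Alive system; a deductive database would derive such answers from declared rules rather than from hand-written loops.

```python
# Facts from the example above, encoded as (predicate, arg1, arg2, ...) tuples.
FACTS = {
    ("biochemPathway", "biochempath_1"),
    ("reactionName", "biochempath_1", "Glycolysis"),
    ("chemicalReaction", "chemicalReaction_1"),
    ("subReaction", "biochempath_1", "chemicalReaction_1"),
    ("reactionReactant", "chemicalReaction_1", "chemical_25", 1),
    ("reactionReactant", "chemicalReaction_1", "chemical_34", 1),
    ("reactionProduct", "chemicalReaction_1", "chemical_26", 1),
    ("reactionProduct", "chemicalReaction_1", "chemical_35", 1),
}


def match(predicate, *pattern):
    """Yield the argument tuples of all facts for `predicate` that fit `pattern`.

    A `None` in the pattern acts as a wildcard.
    """
    for fact in FACTS:
        name, args = fact[0], fact[1:]
        if name != predicate or len(args) != len(pattern):
            continue
        if all(p is None or p == a for p, a in zip(pattern, args)):
            yield args


def reactants_of_pathway(pathway_name):
    """Return the sorted reactants of every sub-reaction of the named pathway."""
    found = set()
    for (pathway, _name) in match("reactionName", None, pathway_name):
        for (_pathway, reaction) in match("subReaction", pathway, None):
            for (_reaction, chemical, _coeff) in match(
                "reactionReactant", reaction, None, None
            ):
                found.add(chemical)
    return sorted(found)


print(reactants_of_pathway("Glycolysis"))
```

The three nested loops correspond to a join over `reactionName`, `subReaction` and `reactionReactant`, which is the kind of query either the relational or the deductive implementation would have to answer.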
Pathway modelling and analysis programs can interact directly with the database or access data by means of an application programming interface or of the user interfaces provided. Both retrieval and input of data are supported. We plan to incorporate an XML-based platform to facilitate the exchange of data between the system and other applications and, at the same time, to describe the semantics of the data that is extracted or that needs to be stored.
As mentioned in the introduction, from the ontology we automatically generate a Java-based application programming interface (API). This API works together with the deductive database. In parallel we have implemented the database model in a relational database system and are adjusting the API to be able to work with both the deductive and the relational database. Figure 2 shows the general idea behind the generation of the deductive database and its corresponding API.
|Figure 2: General structure of the ontology-based system.|
From the end user's point of view, the system offers a diversity of methods for querying and visualising the data. The first of these is based on predefined queries, corresponding to the most common data requests. This is a very common mechanism used in biochemical pathway database systems (e. g. KEGG, SRS, SWISS-PROT), given that the number and types of operations that users tend to require is in general very limited. However, the main objective of this system is to offer the possibility of handling the data in non-conventional ways, in the hope of finding new relationships which could then lead to new methods or even concepts, i.e. to support the process of scientific discovery. Therefore, we also offer very flexible query mechanisms by which the user can build "original" but meaningful queries, limited by the model representing the domain. In the section "User interfaces" we will elaborate on the user interfaces offered by the system.
The lack of precision, the presence of inconsistencies and the evolution of concepts in the area of biochemical pathways are some of the problems that we face when defining a database model for this domain. These problems are compounded by the presence of exceptions and of multiple, poorly defined classifications. Moreover, sparsity of the data and inconsistency between different data sources complicate matters even further when considering implementation issues.
The use of an ontology as a model for a given domain (biochemical pathways in this case) allows the definition of complex relationships amongst objects. These definitions are cumbersome or even in some cases not possible under conventional (relational, object-oriented or object-relational) database modelling approaches. In our case, we define an ontology as a formal description of a set of concepts and relationships in a domain of interest, rigorous enough to support reasoning.
The definition of an ontology, by using a language that is very precise and formally defined, compels both the biologists and the computer scientists to write precise and formal definitions of the entities and relationships of the domain being modelled. Thus, ontology modelling of the biochemical domain contributes to the field of knowledge representation and can also, with the introduction of new concepts, contribute to the development of the theory of biology.
Ontologies have become very popular in the bioinformatics area, mainly motivated by the need to integrate and reason about the vast amount of data available [7, 12]. They are used (amongst other uses) to describe controlled vocabularies (e. g. the Gene Ontology), facilitating knowledge exchange, to describe genes and metabolism (e. g. EcoCyc), or to provide transparent access to multiple databases (e. g. TAMBIS).
Although it could be said that every database designer has an ontology in his or her mind, most of it does not make it to the database model. For example, when talking about a reaction we refer to it not only in terms of its components (substrates and products), but also as an event in time and space, something that takes some time to occur and that takes place at a certain location. This subtlety of a reaction seen as spatio-temporal event is normally not reflected in the data model, where we often limit the modelling to the representation of concepts and relations of direct interest. Other examples are the concept of "substance" that has a whole chemical background behind it that is often not expressed explicitly in the model, or the concept of pathway or reaction chains that can be viewed as graphs as we mentioned already.
In addition to the concepts and relations strictly related to the domain, we have a set of concepts and relationships that form the basis upon which the domain-related concepts are built. For example, in the case of a biochemical pathway, we take a hyper-graph (graphs where two or more edges can be combined to form hyperedges with multiple origins and ends) as the underlying structure. In this way we can benefit from the definition and algorithms related to graph theory. Other underlying concepts are the mathematical properties of relationships, of numbers, of logic, set theory, formal partonomy, temporal concepts, as well as space and time ontologies.
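To make the hyper-graph idea concrete, here is a minimal Python sketch in which each reaction is a hyperedge joining a set of substrates to a set of products. The class name `HyperEdge`, the function `reachable` and the two toy reactions are illustrative assumptions, not part of the ontology described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperEdge:
    """A reaction seen as a hyperedge with several origins and several ends."""
    name: str
    sources: frozenset   # substrates
    targets: frozenset   # products


def reachable(start, edges):
    """Return the compounds producible from `start`.

    An edge fires only when all of its sources are available, which is what
    distinguishes a hyperedge from a bundle of ordinary edges.
    """
    available = set(start)
    changed = True
    while changed:
        changed = False
        for edge in edges:
            if edge.sources <= available and not edge.targets <= available:
                available |= edge.targets
                changed = True
    return available


# Toy data: compound names are placeholders, not curated biochemical facts.
EDGES = [
    HyperEdge("r1", frozenset({"A", "B"}), frozenset({"C", "D"})),
    HyperEdge("r2", frozenset({"C", "E"}), frozenset({"F"})),
]

print(sorted(reachable({"A", "B"}, EDGES)))       # r2 cannot fire: E is missing
print(sorted(reachable({"A", "B", "E"}, EDGES)))  # both reactions fire
```

Treating a reaction as a single hyperedge keeps the requirement that all substrates be present together, which would be lost if each substrate-product pair were stored as an independent edge.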
To be able to reason about data, it is necessary to "understand" it, to be able to deduce its semantics. It is then necessary to represent the characteristics of the biochemical objects and processes as closely as possible. For example, to represent an enzymatic reaction (as an event that can occur at a certain place and time), one also needs to represent the fact that the catalyst of such a reaction is an enzyme and that an enzyme in turn is a protein; a protein is "obtained" from a gene, which is contained in the genome of a given organism. This requires that, apart from representing the concepts and the relationships amongst them, we include explicit rules that define the objects, processes and their relationships in a clear and unambiguous manner.
We use logic statements in the form of Horn clauses to represent: hierarchies of concepts and of predicates, specification of database facts, rules (including potentially recursive definitions and event-condition-action rules), integrity constraints and even queries. Thus, we have a unique representation method for all the structures of the model. For example, we can express rules such as:
(=> (and (enzymeKindPred ?P ?E1 ?O)
         (reactECClassificationPred ?R ?E2 ?O)
         (or (ecSuperClass ?E1 ?E2)
             (ecSuperClass ?E2 ?E1)))
    (isPotentialEnzymeIn ?P ?R ?O))
meaning that a protein is a potential enzyme of a reaction if the EC-classification of the reaction is a subset of the EC-classification of the protein or vice-versa. Or we can formulate a hypothesis:
(=> (and (isPotentialEnzymeIn ?P1 ?R ?O1)
         (reactionIn ?R ?O2)
         (orgFamily ?F ?O1)
         (orgFamily ?F ?O2))
    (and (homolog ?P1 ?P2)
         (isPotentialEnzymeIn ?P2 ?R ?O2)))
to check whether it is true that, if P1 is a protein that is a potential enzyme of a reaction R in an organism O1 and we know that this reaction also takes place in an organism O2 belonging to the same family F as O1, then there must exist a homologous protein P2 in O2, such that it is a potential enzyme of the reaction R in O2.
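The rule and the hypothesis can be traced on a small example. The Python sketch below applies the `isPotentialEnzymeIn` rule to toy facts and then searches for counterexamples to the hypothesis. All proteins, reactions, organisms and EC assignments are invented, and three choices are assumptions of this sketch rather than statements about the ontology: `ecSuperClass` is treated as reflexive and derived from the dotted EC prefix, `homolog` is treated as symmetric, and the hypothesis is only checked for pairs of distinct organisms.

```python
# Toy facts; every identifier and EC assignment here is invented for illustration.
ENZYME_KIND = {                       # enzymeKindPred ?P ?E ?O
    ("protA", "2.7.1.1", "org1"),
    ("protB", "2.7.1", "org2"),
    ("protC", "1.1.1.1", "org1"),
}
REACT_EC = {                          # reactECClassificationPred ?R ?E ?O
    ("rxn1", "2.7.1.1", "org1"), ("rxn1", "2.7.1.1", "org2"),
    ("rxn2", "1.1.1.1", "org1"), ("rxn2", "1.1.1.1", "org2"),
}
# reactionIn ?R ?O, derived from REACT_EC here only to keep the toy data short.
REACTION_IN = {(r, o) for (r, _e, o) in REACT_EC}
ORG_FAMILY = {("fam1", "org1"), ("fam1", "org2")}   # orgFamily ?F ?O
HOMOLOG = {("protA", "protB")}                      # homolog ?P1 ?P2


def ec_super_class(e1, e2):
    """True if e1 equals e2 or is a dotted-prefix ancestor of it (assumed reflexive)."""
    return e1 == e2 or e2.startswith(e1 + ".")


def potential_enzymes():
    """Apply the rule: derive all (protein, reaction, organism) triples."""
    return {
        (p, r, o)
        for (p, e1, o) in ENZYME_KIND
        for (r, e2, o2) in REACT_EC
        if o == o2 and (ec_super_class(e1, e2) or ec_super_class(e2, e1))
    }


def are_homologs(p1, p2):
    """Homology is treated as symmetric here (an assumption)."""
    return (p1, p2) in HOMOLOG or (p2, p1) in HOMOLOG


def hypothesis_counterexamples():
    """Return (P1, R, O1, O2) tuples for which no suitable homolog P2 exists."""
    potential = potential_enzymes()
    failures = []
    for (p1, r, o1) in sorted(potential):
        for (r2, o2) in sorted(REACTION_IN):
            if r2 != r or o2 == o1:   # assumption: only compare distinct organisms
                continue
            same_family = any(
                (f, o2) in ORG_FAMILY for (f, o) in ORG_FAMILY if o == o1
            )
            if not same_family:
                continue
            has_homolog = any(
                are_homologs(p1, p2)
                for (p2, rr, oo) in potential
                if rr == r and oo == o2
            )
            if not has_homolog:
                failures.append((p1, r, o1, o2))
    return failures


print(sorted(potential_enzymes()))
print(hypothesis_counterexamples())
```

In this toy data set `protC` has no recorded homolog, so the hypothesis fails for `rxn2`. In practice such a result would point either to a genuine exception or to missing data, which is the kind of lead the hypothesis mechanism is meant to surface.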
Apart from constituting a semantic specification of the domain and acting as a data model for a database system, the ontology-based system offers a series of advantages in terms of data handling, analysis and inference. For example, we can group all the relationships associated with assigning an EC-classification (one for each level) by defining a type of predicate EC-class and by making all these predicates instances of this class (notice that we talk about instances and not sub-types). We can then refer to all predicates that are EC-class predicates. The inference and analysis properties are supported by the deductive engine on which the ontology is implemented.
We also needed to introduce new concepts to describe and characterize sets of objects. This was the case, for example, for the term "genome fragment", which is any section of a genome that has been sequenced. In general, when we think of a genome fragment we would think of a chromosome (at least for eukaryotes). However, in reality we can refer to even smaller sections of the genome, just a sector that has been sequenced. We found that the creation of new concepts should be given more than superficial attention. They must have an ontological meaning and must not be something abstract or non-observable that is added as a convenient structure for storing data concepts. They need to have a serious epistemological status, because these concepts should refer (as in our simple case of genome fragment) to an entity that "really exists".
By having tools to turn ontologies into databases, it is then possible to use the ontology as a modelling tool for a database, including the "implicit" ontology that modellers have brought into the model. Ontologies are hard to build, due to their level of generality, but it is precisely this that makes them useful. By using the Ontology Works system we profit from the clear and rigorous semantics of the extended "Semantics for the Knowledge Interchange Language" (SKIF) (used in the system) and from the core ontology of higher level concepts. This core ontology allows us to express and make explicit all the assumptions constituting and implicit in the modelled reality. Our domain-specific ontology is then defined on top of this core ontology.
The domain model
The domain ontology that has been created is composed of a set of sub-ontologies, each associated with a certain "main" concept or group of concepts (see Figure 3). The first of these sub-ontologies is associated with the concepts of biochemical pathways. A biochemical pathway can be seen as the highest expression of a biochemical reaction, having some characteristics which are commonly not associated with biochemical reactions, such as its classification with respect to its cellular function (catabolic, anabolic, etc.) and its classification with respect to its metabolic function (carbohydrate, amino acid metabolism, etc.). A biochemical pathway can be composed of one or several biochemical reactions, which constitute the core of the second sub-ontology. A biochemical reaction is defined by its participants (substrates and products), its regulators (if any: catalysts, activators and inhibitors) and co-factors, all of which are compounds. The same biochemical reaction can participate in one or several biochemical pathways, and which compounds are considered the main compounds and which the co-factors may vary from one pathway to another. The third sub-ontology relates to compounds. Compounds are proteins/peptides, genes, chains of simple molecules, simple molecules, atomic ions and complex compounds. The two remaining sub-domains are those of the genome and of the organism.
|Figure 3: General structure of the ontology for biochemical pathways.|
To support the analysis of the dynamics of biochemical pathways and their components, we have included and are still working on the inclusion of concepts and relationships related to the kinetics of reactions, structure of proteins and protein-protein interactions.
Experimental data is also of great importance for the process of analysis and understanding of biochemical pathways. We are working on the development of an ontology for experimental data. The difficulty in this model is the lack of standards and the diversity in the definition of experiments. For this reason we have initially focused on four types of experiments, corresponding to those being carried out by our experimental partners, namely: 2D-electrophoresis, mass spectrometry, simulations and enzyme activity measurements. We have created a basic (general) concept of an experiment composed of multiple phases, each with initial conditions, parameters and results. The results in turn can be associated with sampling instances and with particular compounds or groups of compounds. This general conceptual frame has been designed so that other types of experiments can be incorporated without substantial changes to the model.
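The general experiment concept described above can be pictured with a few Python data classes. This is only a sketch of the stated structure (an experiment made of phases, each with initial conditions, parameters and results, and results tied to sampling instances and compounds); the class names, field names and all values in the toy instance are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Result:
    """One measured value, optionally tied to a sampling instance and compounds."""
    value: float
    unit: str
    sample_id: Optional[str] = None
    compounds: List[str] = field(default_factory=list)


@dataclass
class Phase:
    """One phase of an experiment with its own conditions, parameters and results."""
    name: str
    initial_conditions: Dict[str, str] = field(default_factory=dict)
    parameters: Dict[str, float] = field(default_factory=dict)
    results: List[Result] = field(default_factory=list)


@dataclass
class Experiment:
    """Generic experiment; `kind` distinguishes e.g. mass spectrometry from simulation."""
    kind: str
    phases: List[Phase] = field(default_factory=list)

    def results_for(self, compound: str) -> List[Result]:
        """Collect, across all phases, the results associated with `compound`."""
        return [r for p in self.phases for r in p.results if compound in r.compounds]


# Toy instance: every value below is a placeholder, not measured data.
assay = Experiment(
    kind="enzyme activity measurement",
    phases=[
        Phase(
            name="incubation",
            initial_conditions={"buffer": "placeholder"},
            parameters={"temperature_celsius": 37.0},
            results=[
                Result(0.8, "arbitrary", sample_id="s1", compounds=["enzyme_X"]),
                Result(1.1, "arbitrary", sample_id="s2",
                       compounds=["enzyme_X", "enzyme_Y"]),
            ],
        )
    ],
)

print([r.sample_id for r in assay.results_for("enzyme_Y")])
```

Because the experiment type is a plain attribute rather than a separate class per technique, a new kind of experiment can be recorded without changing the structure, which mirrors the extensibility goal stated above.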
Scientists need to know about the origin of the data being used in their analysis or deduction processes. For example, experiments being carried out by different laboratories could yield conflicting results, thus it is very important to know the origin and characteristics of these experiments in order to be able to resolve such conflicts. Therefore, we have incorporated a small ontology referring to the meta-data, containing concepts related to the origin of the data stored (e. g., references and links to the original sources).
One of the main objectives of the system being developed is to provide the mechanisms for the user to query the data in different ways. Therefore, we are working on the development of several query-interfaces both text- and visually based.
The system offers a navigational (window-based) interface, with which the user can go from one window displaying information about a particular object, for example a biochemical reaction, to another window with information about a related object, for example information about the enzyme that catalyses the reaction. This user interface is closely related to the conceptual model on which the database is based. That is, the relevant information about a particular object may differ according to the conceptual point of view. For example an enzyme is defined as a protein that catalyses a reaction, therefore one can view an enzyme as a catalyst, providing its catalytic properties, or as a protein.
|Figure 4: Example of a query using the domain-specific query language for biochemical pathways.|
Additionally, the system offers a "query by example"-type interface and a domain-specific query language interface. The first allows the user to query for the set of objects that satisfy a series of conditions describing the characteristics of the particular type of object. For example by providing information about the compounds that should participate or not in a reaction one can obtain the set of reactions that satisfy these specifications. The domain-specific query language is a high-level query language based on the definition of the objects and their relations. For example (as shown in Figure 4) to ask for all the reactions that have 'ATP' as a reactant one can write:
Reaction hasReactant (Compound hasName 'ATP')
Queries are forced to be well-formulated following a defined grammar.
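To show how a grammar can enforce well-formed queries of this shape, the following Python sketch tokenises, parses and evaluates such a query against a tiny in-memory object store. The grammar (`query := TYPE RELATION ( "(" query ")" | LITERAL )`), the function names and the toy objects are assumptions inferred from the single example query; the actual Data Alive query language is not specified here and may differ.

```python
import re

# Toy object store: identifiers, names and relations are invented for illustration.
OBJECTS = {
    "c1": {"type": "Compound", "hasName": ["ATP"]},
    "c2": {"type": "Compound", "hasName": ["ADP"]},
    "r1": {"type": "Reaction", "hasReactant": ["c1"], "hasProduct": ["c2"]},
    "r2": {"type": "Reaction", "hasReactant": ["c2"], "hasProduct": ["c1"]},
}

# A token is a parenthesis, a single-quoted literal, or an identifier.
TOKEN = re.compile(r"\s*(\(|\)|'[^']*'|\w+)")


def tokenize(text):
    """Split a query into tokens; reject anything the grammar does not know."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(tokens):
    """Parse one query from the front of `tokens`, consuming them in place."""
    if len(tokens) < 3:
        raise ValueError("incomplete query")
    obj_type, relation, head = tokens.pop(0), tokens.pop(0), tokens.pop(0)
    if head == "(":
        target = parse(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing closing parenthesis")
    elif head.startswith("'"):
        target = head[1:-1]            # strip the surrounding quotes
    else:
        raise ValueError(f"expected literal or sub-query, got {head!r}")
    return (obj_type, relation, target)


def evaluate(query):
    """Return ids of objects of the queried type related to a matching target."""
    obj_type, relation, target = query
    wanted = evaluate(target) if isinstance(target, tuple) else {target}
    return {
        oid for oid, obj in OBJECTS.items()
        if obj["type"] == obj_type and wanted & set(obj.get(relation, []))
    }


def run(text):
    """Tokenise, parse and evaluate a query string."""
    tokens = tokenize(text)
    query = parse(tokens)
    if tokens:
        raise ValueError("trailing input after query")
    return sorted(evaluate(query))


print(run("Reaction hasReactant (Compound hasName 'ATP')"))
```

A query with an unterminated literal or an unbalanced parenthesis raises `ValueError` before any data is touched, which is one simple way of guaranteeing that only well-formulated queries reach the database.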
Based on the old saying "a picture says more than a thousand words", a main part of the Data Alive project is the development of visualization techniques for the representation and understanding of biochemical pathways. We have developed the first prototype of PathVis, a tool that dynamically displays pathways (or connected sets of reactions) according to the information supplied and requested by the user. This information can be the result of a query to the database, i.e. PathVis acts both as a viewer for biochemical pathways and as a database browser.
A database containing data about biochemical networks holds a vast amount of different kinds of information about the participants of the biochemical reactions. On the one hand, a visualisation tool for biochemical pathways should be able to generate a clear drawing of the pathways; on the other hand, it should also be able to display all relevant information related to the displayed pathway elements. One of the challenges in the development of a biochemical pathway visualisation tool is to deal with the conflict between the amount of information to be displayed and the demand for a clear drawing. We used a combination of two strategies: the first is to offer the user the possibility of selecting the information to be displayed in the labels, and the second is to use temporarily displayed labels. This last feature is implemented in the form of the so-called "ToolTipInfo", which shows up if the mouse pointer is positioned over a pathway element (chemical compound, catalyst, chemical reaction) to display additional information. The main advantage of the "ToolTipInfo" is the possibility to display space-consuming information such as the stoichiometric equation without cluttering the pathway drawing (see Figure 5).
|Figure 5: The PathVis pathway visualisation tool.|
We are also developing methods for the visualisation of experimental data. These data include results from the analysis of gene expression in Mycoplasma pneumoniae and from the organism-specific analysis of proteins using 2D-gel electrophoresis and mass spectrometry.
Apart from the problems concerning the modelling of the domain we must consider that in order to integrate data originating from different (heterogeneous) sources we have to cope with the fact that each has a different scope, representation, level of completeness, functionality and accuracy.
At the moment the database which is created based on the ontology is fed with curated data from the database implemented in Oracle. In the near future we hope to be able to support the curation process by using the constraints and rules defined in the ontology as "partial-curators" of the data. We believe, however, that fully automated curation is still a long way down the road.
The database is populated by using data stored in publicly available databases and literature, as well as with data from our collaboration partners. So far we have mainly limited our work to data associated with Mycoplasma pneumoniae, but we are already extending it to cover other organisms.
Public or licensed databases available on the Internet, such as KEGG, ExPASy-ENZYME, SWISS-PROT, etc., constitute the main source of data for the population of the database. Parsers to extract data from these databases have been and are still being developed.
Before the parsed data is incorporated into the database, it undergoes a manual curation process to detect possible errors and inconsistencies. As previously mentioned, our aim is to automate part of this curation process by using semantic information included in the ontology. Although many potential errors could be detected using these types of methods, the resolution of inconsistencies will still remain with the domain experts. However, just the highlighting of these inconsistencies is a great support for the curation process.
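The idea of constraints acting as "partial curators" can be sketched as a set of checks that flag suspicious records and leave the decision to a domain expert. The Python example below is a minimal illustration under stated assumptions: the record layout, the function names, the two source labels and the simplified EC-number pattern are all hypothetical, and the checks stand in for constraints that would in practice be derived from the ontology.

```python
import re

# Simplified assumption: four dot-separated fields, digits or "-" after the first.
EC_PATTERN = re.compile(r"^\d+(\.(\d+|-)){3}$")


def check_reaction(record):
    """Return a list of human-readable problems found in one parsed record."""
    problems = []
    if not record.get("reactants"):
        problems.append("no reactants")
    if not record.get("products"):
        problems.append("no products")
    ec = record.get("ec")
    if ec is not None and not EC_PATTERN.match(ec):
        problems.append(f"malformed EC number {ec!r}")
    return problems


def find_conflicts(records):
    """Flag reaction ids that carry different EC numbers in different sources."""
    seen = {}
    for rec in records:
        if rec.get("ec"):
            seen.setdefault(rec["id"], set()).add((rec["source"], rec["ec"]))
    return {
        rid: sorted(pairs)
        for rid, pairs in seen.items()
        if len({ec for (_source, ec) in pairs}) > 1
    }


# Toy records: identifiers, sources and EC numbers are invented for illustration.
RECORDS = [
    {"id": "rxn1", "source": "sourceA", "ec": "2.7.1.1",
     "reactants": ["c1", "c2"], "products": ["c3", "c4"]},
    {"id": "rxn1", "source": "sourceB", "ec": "2.7.1.2",
     "reactants": ["c1", "c2"], "products": ["c3", "c4"]},
    {"id": "rxn2", "source": "sourceA", "ec": "1.1.1",
     "reactants": ["c5"], "products": []},
]

for rec in RECORDS:
    print(rec["id"], rec["source"], check_reaction(rec))
print(find_conflicts(RECORDS))
```

The checks only report; they never repair. This matches the division of labour described above, in which automated methods highlight inconsistencies and their resolution stays with the curators.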
Another very important source of information is scientific text, such as abstracts in literature databases (Medline), full text papers or comment lines in databases. Together with the group of U. Reyle at the Institute for Computational Linguistics, University of Stuttgart, we are working on the development of linguistic methods for the automatic extraction of biochemical information from scientific texts (http://www.ims.uni-stuttgart.de/projekte/GenIE/). The extracted information will be used to fill the biochemical databases in a semi-automatic way, considering that data curation by domain experts cannot be completely omitted. The specificity of the information to be extracted substantially exceeds the capabilities of most of the existing information extraction systems in the field of molecular biology and biochemistry, which are mainly interested in extracting interactions between chemical compounds (such as protein-protein interactions). Our goal is to extract all types of information specified in the biochemical ontology. For the development of a grammar/parser for biochemical terminology, a corpus of names (systematic, recommended and others) and semi-formal descriptions of enzymes (catalyzed reaction, substrate specificity etc.) has been extracted from enzyme databases and preprocessed. Additionally, a corpus of full papers was built up and tagged.
We are working on the development of a database system to support the modelling and analysis of biochemical pathways. Our prototype concentrates on metabolic pathways, although many of the defined concepts and relations are common to other types of pathways such as signalling and transport. The system is mainly populated with data related to Mycoplasma pneumoniae, but has been designed in an organism-independent way.
The database is modelled using an ontology that reflects the concepts associated with the domain of biochemical pathways and the way these are related. This model is being extended to include experimental and kinetic data.
A deductive database, together with the API to access the database, is automatically generated from the ontology with the use of externally acquired tools. Using the Java-based API, several methods for querying the data have been developed: navigational, query by example and a domain-specific query language. We have also developed a visualisation platform for biochemical pathway information and experimental data such as 2D-gel electrophoresis and mass spectrometry.
The ontology is also used to support the process of automatic extraction of biochemical information from scientific text. In the near future we hope to use it to support the curation of the information included in the database. We also aim to develop and apply methods for ontology-based data mining, defining hypotheses and rules using the same first-order logic language used to define the ontology.
The authors would like to thank the whole team that has participated in the elaboration of this system during the last two years: Andreas Kohlbecker, Moritz Becker and Hella Knaeble, for their work on the query and visual interfaces; Bill Andersen, Jennifer Williams and Brian Peterson, from Ontology Works, for their contributions to the elaboration of the ontology; and Ursula Kummer, Ralph Gauges, Anne DeBeuckelaer and Erich Bornberg-Bauer for their indispensable discussions on the representation of biochemical information. Last but not least, we thank Kerstin Schneider for her useful comments.
We are very thankful to the Klaus Tschira Foundation (KTF) and the BMBF (Project Bioregio 12212) for their financial support.