In Silico Biology 4, 0009 (2003); ©2003, Bioinformation Systems e.V.
Ontology Workshop Tokyo 2003
Scientific Databases and Visualisation Group,
EML Research, Heidelberg, Germany
Email: Isabel.Rojas@eml-r.villa-bosch.de
* corresponding author
Edited by H. Michael; received February 17, 2004; revised and accepted March 08, 2004; published March 15, 2004
This paper presents the main flavours and uses of the term ontology in the bio-domains. It is not intended as a thorough review of the existing work in the area; rather, it highlights the uses that are given to ontologies in the Scientific Databases and Visualisation Group at EML Research, in Heidelberg.
Key words: formal ontology, databases, information extraction, biochemical information
The term ontology is becoming increasingly popular in biology and related fields. It is used to refer to many things, amongst them controlled vocabularies, taxonomies (typically with is-a or part-of relationships, as in the case of the Gene Ontology [1]), conceptual models of a given domain (as description of rules to infer new knowledge), or a combination of some or all of the above [2]. The information included in an ontology strongly depends on the uses that are given to it. Although in principle an ontology should reflect the facts of a given domain or sub-domain, in practice its construction is mainly guided by the intended need: the level of detail at which certain properties or relationships are specified is strongly influenced by the intended use or research interests. This paper discusses some of the uses that are given, and that can be given, to ontologies, focussing primarily on the work carried out at the Scientific Databases and Visualisation group (SDBV) at EML Research. Building an ontology with several different intended uses in mind allows a more neutral or general approach, diminishing the influence of any single application. Different applications have different needs (specified in the application ontology), but the concepts and their relationships should hold for all (they can be found in the domain ontology).
The complexity of the biochemistry and molecular biology domains makes the modelling, handling and exchange of data very difficult. This complexity is reflected in some of their common characteristics, amongst them:
The term "ontology" is already extensively used to refer, for example, to controlled vocabularies, conceptual models and taxonomies. Controlled vocabularies are lists of terms or phrases (a naming convention) that the users of a certain domain agree upon and use for a special purpose: the relevant concepts or terms within that domain. In many cases these vocabularies are organised as taxonomies (or hierarchical classifications), which specify generalisations and specialisations between the terms. Typically these taxonomies include so-called "is-a" relationships, which define types of concepts, and "part-of" relationships, describing relations like "a gene is a part of a genome" or "phosphorylation of Stat is a part of the Jak-Stat pathway". A conceptual model goes further than a taxonomy, and can be defined as a semantically consistent specification of the concepts, the sub-concepts, the distinguishing properties of these, and the relations between them [2]. We see an ontology more as the last of these, but appreciate the importance of all the other forms of ontologies. Constructing an ontology is hard and tedious work, so all forms of organised knowledge contribute to the specification of the domain.
The most prominent example of an ontology in the bio-sciences is undoubtedly the Gene Ontology (GO) [1]. The main goal of GO is to offer a controlled vocabulary that can be applied to all organisms. GO's controlled vocabulary has become a standard reference for databases and biological systems. A reference to a GO term is used both to extend the information about the related object and to order that object under the GO ontology. This association is also used to relate or integrate objects from different data sources that refer to the same GO term. However, the criteria by which GO terms are classified are not always clear. In addition, the properties of the concepts described are limited to "is-a" or "part-of" descriptions, in which multiple types of these relationships are combined (see [5]). These problems have been recognised, and several projects are being carried out to extend the semantics of the GO terms and the relationship types between the terms [6]. There are other, probably less famous, ontologies, such as the one behind EcoCyc [7] or the one behind Tambis (and associated projects) [8; 9], that go beyond "type-of" and "part-of" relationships between the objects and describe complex relationships and concepts in a more formal framework. There is also the BioPax initiative [10], which is creating a data exchange format for biological pathways. The development of such ontologies is complex and time-consuming work, which requires the participation of knowledge modellers and domain experts. The development of ontologies in the fields of biochemistry and molecular biology needs to consider the characteristics of the data in these fields. On the one hand it is important to have a clear conceptual description of the sub-domain being represented; on the other, one has to consider the characteristics of the existing data and of the way data (and thus information) are generated in these fields.
Within the SDBV, ontologies are conceived as conceptual models of specific sub-domains, created by "extracting" as much knowledge as possible from the experts' minds. Three very closely related ontologies are under development, namely an ontology of biochemical pathways, another of compound classifications, and an ontology of protein interactions.
The use of formal ontology for the building of models of biochemical information and databases offers several advantages. The first is the clear specification of exactly what is meant by the terms used to express information in the biochemical domain. Secondly, the use of such ontologies enables additional capabilities, e.g. support for induction to help the scientist formulate and test research hypotheses, for natural language processing, and for data integration. The consolidation of this technology is still, however, hampered by the complexity involved in the modelling of knowledge and in the formalisms used. Efforts are already being made to facilitate this process so that domain experts can define their own ontologies, but these are still not in common use. Within the GONG project [6] a formal ontology language is used to give reasoning-based support to the development of the GO ontology. The growing size and complexity of GO requires the use of computational techniques in order to maintain its consistency.
In the SDBV group, computer scientists, computational linguists and biology domain experts work together on the creation of the domain ontologies. This interdisciplinary approach facilitates and enriches the development of the ontology. Another very important factor is the use of a core ontology (also called a top-level ontology) on which one can build. A core ontology that contains formal definitions of basic elements such as processes, events, situations, and relation types (binary, transitive, etc.) is essential for building rules and concepts, which can then be used to reason in the domain defined. Examples of such core ontologies are the DOLCE ontology [11] and the Ontology Works ontology [12], both of which have been used in the group. The importance of using such ontologies is, first, that they have been developed by ontologists with vast experience in the representation of knowledge; second, that they are consistent and well founded; and third, that they describe many concepts, relations and rules on which one can build in a consistent way. OWL (the Web Ontology Language) [13] has been designed to offer the basis for the semantic interoperability of web-based resources. Such interoperability relies on machine-interpretable semantic descriptions. In the following we highlight some of the uses that we give to the ontologies being developed in the group.
Supporting data integration and validation
As mentioned before, the semantics provided by ontologies allow the integration of heterogeneous databases. However, when "joining" two objects that are supposed to be the same, it is necessary to be sure that they actually are. Taking the GO example, there are many systems that join information based on the associated GO classification. Joining two objects based on their respective GO terms means that we can assign the same gene function, for example, to two genes. This does not mean that they are the same gene (of course, you may say). Another common approach is to integrate information based on two objects having the same name (syntactical equivalence). This is not necessarily a good criterion, because names of biological objects (e.g. genes, proteins, pathways) are sometimes assigned by several labs in an arbitrary manner. Thus we have to use other criteria based on the characteristics of the object. For a gene, for example, one could consider the sequence, the organism, the proteins that are associated with it, and even the characteristics of these proteins. An ontological frame that describes the characteristics of the objects and their relationships allows reasoning about whether two objects are potentially the same object (we think absolute certainty cannot be achieved completely automatically), and it can be used by many to describe objects on a common ontological base. Such a common frame can also facilitate the exchange, integration and validation of information. At the SDBV we have developed, and continue working on, an ontology of biochemical compounds. This ontology describes and classifies compounds according to their functional groups. With such an ontology it is possible to integrate information referring to a compound or a class of compounds. Two compounds will be the same if they have the same structure, but without even going to the structure level we can say how they are related based on the functional groups that they contain.
The use of this ontology lies not only in the integration of information but also in its validation. More information on this topic can be found in [14].
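The comparison of compounds described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the SDBV compound ontology: the `Compound` record, the helper names, the example compounds and the use of a plain string as a stand-in for a canonical structure are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Compound:
    name: str
    structure: str      # assumption: a canonical structure string (here SMILES-like)
    groups: frozenset   # functional groups the compound contains

def same_compound(a, b):
    """Treat two compounds as the same only if their structures match;
    names are deliberately ignored."""
    return a.structure == b.structure

def shared_groups(a, b):
    """Relate two compounds through the functional groups they have in common."""
    return a.groups & b.groups

# Illustrative records only, not taken from any database.
ethanol_a = Compound("ethanol", "CCO", frozenset({"hydroxyl"}))
ethanol_b = Compound("ethyl alcohol", "CCO", frozenset({"hydroxyl"}))
glycine = Compound("glycine", "NCC(=O)O", frozenset({"amino", "carboxyl"}))
lactic = Compound("lactic acid", "CC(O)C(=O)O", frozenset({"hydroxyl", "carboxyl"}))

print(same_compound(ethanol_a, ethanol_b))  # different names, same structure
print(shared_groups(glycine, lactic))       # related through a common group
```

Comparing structure strings is a simplification; a real system would need a proper structure canonicalisation step before such a test is meaningful.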
Supporting information extraction
In the process of information extraction we use ontologies mainly at two levels: first, as controlled vocabularies that help in the detection of named entities and their properties or their relationships with other named entities; second, as a knowledge base containing the domain-specific rules, properties and relationships. The main idea behind information extraction is a cascaded processing of rules (i.e. linguistic patterns), which work on a linguistically pre-processed set of texts (i.e. word and sentence boundaries are detected, relevant multi-words are detected and part-of-speech tags are annotated for each word). For example, [actprot the GCN4 activator protein] is identified as referring to an activator protein, and the nested structure [bs a binding site for [actprot the GCN4 activator protein]] as referring to a binding site for an activator protein. The definition of rules and relationships can then be used to find relationships amongst named entities, e.g. [cont [prom the ATR1 promoter region] contains [bs a binding site for [actprot the GCN4 activator protein]]].
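The cascade described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the rules are plain regular expressions written for this one example sentence, the linguistic pre-processing (tokenisation, part-of-speech tagging) is skipped, and the `RULES` table and `annotate` function are hypothetical names, not part of the system described here.

```python
import re

# Hypothetical rule cascade: each rule is (label, pattern). Rules are applied
# in order, so later rules can match the bracketed output of earlier ones.
RULES = [
    ("actprot", r"the \w+ activator protein"),
    ("prom", r"the \w+ promoter region"),
    ("bs", r"a binding site for \[actprot [^\]]*\]"),
    ("cont", r"\[prom [^\]]*\] contains \[bs [^\[]*\[actprot [^\]]*\]\]"),
]

def annotate(text):
    """Apply every rule in turn, wrapping each match as [label match]."""
    for label, pattern in RULES:
        text = re.sub(
            pattern,
            lambda m, label=label: f"[{label} {m.group(0)}]",
            text,
        )
    return text

sentence = ("the ATR1 promoter region contains "
            "a binding site for the GCN4 activator protein")
print(annotate(sentence))
```

The first two rules detect named entities, the third builds a nested structure on top of one of them, and the last relates the two entities, mirroring the bracketed examples in the text.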
The ontology can be used to support the extraction of knowledge from text, and at the same time the linguistic patterns filled with information can be exploited to enrich the ontology. In [15] Cimiano presents results of an experiment (carried out in collaboration with SDBV members) showing that properties of objects can be used to resolve anaphoric relations (co-reference of one linguistic expression with another) such as bridging phenomena (a special type of co-reference, where both co-referring entities are related in a way which is not explicitly stated). An example, taken from Swiss-Prot [16]: "BINDS STEM LOOP I OF U1 SNRNA ... THIS INTERACTION IS REQUIRED FOR THE SUBSEQUENT BINDING OF U2 SN-RNP..."
To detect that "this interaction" in the second sentence refers to "binds" in the first sentence, we have to infer that binding is a type of interaction. This is an inference that can be made with an ontology reflecting protein interactions.
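The inference that binding is a type of interaction can be sketched as a walk up an is-a hierarchy. The Python fragment below is a minimal illustration, not the method used in [15]: the `IS_A` table, the `LEXICON` mapping word forms to concepts, and the function names are all assumptions made for this example.

```python
# Hypothetical fragment of a protein-interaction ontology: child -> parent.
IS_A = {"binding": "interaction", "interaction": "process"}

# Hypothetical lexicon mapping word forms in the text to ontology concepts.
LEXICON = {"binds": "binding", "binding": "binding", "interaction": "interaction"}

def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

def resolve(anaphor, preceding_words):
    """Return the most recent preceding word whose concept is a kind of the
    concept named by the anaphor, or None if there is no such word."""
    target = LEXICON.get(anaphor)
    if target is None:
        return None
    for word in reversed(preceding_words):
        concept = LEXICON.get(word)
        if concept is not None and is_a(concept, target):
            return word
    return None

first_sentence = ["binds", "stem", "loop", "i", "of", "u1", "snrna"]
print(resolve("interaction", first_sentence))
```

Here the anaphor "interaction" is linked to "binds" only because the ontology states that binding is an interaction; without that fact no antecedent would be found.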
Support for the formulation and test of research hypotheses
The formalisation of a given sub-domain by means of an ontology allows knowledge to be expressed in a computer-readable way. Furthermore, the fact that most of the formalisms used to describe (formal) ontologies have associated induction engines allows the induction of knowledge from the facts (concepts and relationships) and rules included in the ontology. The concepts and relationships of an ontology can be used as building blocks for the formulation of hypotheses that can be verified, rejected, or (in some cases of the open world assumption, see Footnote 1) neither. This of course depends very much on the rules and facts of the ontology. This querying process differs from the formulation of an ad-hoc query to a database in that the solution path is not known a priori and can vary over time according to the facts included in the ontology. Let us take, for example, an ontology about protein interactions or, more precisely, about gene transcription. Suppose that we have amongst the concepts protein, gene, binding and activates (these last two typically represented as relations). In many biological contexts the distinction between a gene and a protein is not too strict; it is common to find papers referring to a gene by the same name as the protein transcribed from it. Thus, to find out if a gene A is directly or indirectly activated (or regulated) by a gene B, a biologist may write something like (using Prolog-type notation):
?- activates(gene(B),gene(A)).
Suppose further that our model contains rules like the following, specifying the different types of activation:
activates(X,Y) :- protein(X) AND protein(Y) ...
activates(X,Y) :- protein(X) AND gene(Y) AND has_gene_product(Y,Y') ...
Under these rules the query formulated above would be rejected, because the relationship activates can only be defined either between two proteins or between a protein and a gene. So before looking for an answer to a query or hypothesis, the system can verify the "validity" of the claim according to the rules that define it. If the query satisfies the rules defined, the system can continue looking for the different solutions to it. Even more useful feedback would be to tell the user why the query is not valid and which rules it breaks, and, better still, to offer suggestions on how to formulate the query. This last we consider a very difficult, but interesting, task that we have not even started to address.
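The validity check described above, including a simple explanation of why a query is rejected, can be sketched in Python as follows. This is a hedged illustration rather than an existing implementation: the `KINDS` table of typed objects, the `SIGNATURES` table summarising the two activation rules, and the `check_query` function are all hypothetical, and the sketch does not attempt to suggest a reformulated query.

```python
# Hypothetical typed objects of the ontology (names are illustrative only).
KINDS = {
    "gene_a": "gene", "gene_b": "gene",
    "protein_a": "protein", "protein_b": "protein",
}

# Argument types allowed by the two activation rules:
# protein-protein and protein-gene.
SIGNATURES = {"activates": [("protein", "protein"), ("protein", "gene")]}

def check_query(relation, subject, obj):
    """Return (valid, reason) before any search for solutions is started."""
    allowed = SIGNATURES.get(relation)
    if allowed is None:
        return False, f"unknown relation '{relation}'"
    actual = (KINDS.get(subject), KINDS.get(obj))
    if actual in allowed:
        return True, "query is consistent with the rules"
    expected = " or ".join(f"{relation}({s}, {o})" for s, o in allowed)
    return False, (f"{relation}({actual[0]}, {actual[1]}) is not defined; "
                   f"expected {expected}")

print(check_query("activates", "gene_b", "gene_a"))     # rejected, with reason
print(check_query("activates", "protein_b", "gene_a"))  # accepted
```

Only the argument types are checked here; the remaining conditions of each rule (elided in the text) would still have to be evaluated by the reasoning engine.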
Using the same ontology or ontologies to integrate data, extract information, and formulate and test research hypotheses would allow hypotheses to be verified or rejected using data integrated from multiple sources, including text-based ones. This is the overall goal of the use of ontologies in our group.
In this paper we have presented the main uses of ontologies in the bio-domains, mainly based on the issues addressed by the SDBV (Scientific Databases and Visualisation) group at EML Research in Heidelberg. The paper does not intend to present a state-of-the-art review of all efforts and projects in the area. By describing some of the common characteristics of biological data, we aimed at showing the difficulty of using "conventional" techniques to model the domain and, furthermore, to reason in this domain. We believe that the use of ontologies in all their forms (controlled vocabularies, taxonomies or more formal conceptual models) is contributing to the distributed description of different aspects of biochemistry and molecular biology. Each ontology developed tends to focus on a certain aspect or sub-domain, so the combination of ontologies can provide a more extended domain description. To integrate different ontologies it is necessary to have a formal ground on which these can be described. OWL (the Web Ontology Language) seems to be a good candidate for this. However, it lacks the definition of top-level concepts and relationships (such as those defined within DOLCE), which are also necessary to make the multiple ontologies more semantically compatible.
At the SDBV we are using ontologies mainly in three areas: information integration, information validation and information extraction. We are currently working mainly on three sub-domains, namely biochemical pathways (and within this the classification of biochemical compounds), protein interactions (in general) and transcription factors (as a particular case of protein interactions).
The ontology of biochemical pathways has served as the basis for the construction of a database for the storage of information in this sub-domain. We are now working on extensions of the database to store information related to the simulation of biochemical reactions. Although there is an XML standard (SBML, the Systems Biology Markup Language) used for the exchange of information on simulation models, this is far from an ontology that can be used to integrate, compare and extract information about these models. One of our immediate goals is to formalise the schema defined and to present it as a basic ontology (a first effort towards one) for the description of simulation models.
Footnote 1: The knowledge universe is not constrained to the facts and rules in the ontology.