|In Silico Biology 2, 0017 (2002); ©2002, Bioinformation Systems e.V.|
|G C B ' 0 1|
RZPD Deutsches Ressourcenzentrum für Genomforschung GmbH
D-14059 Berlin, Germany
Edited by E. Wingender; received and accepted January 18, 2002; published March 18, 2002
About five years ago, ontology was almost unknown in bioinformatics, even more so in molecular biology. Nowadays, many bioinformatics articles mention it in connection with text mining, data integration or as a metaphysical cure for problems in standardisation of nomenclature and other applications. This article attempts to give an account of what concept ontologies in the domain of biology and bioinformatics are; what they are not; how they can be constructed; how they can be used; and some fallacies and pitfalls creators and users should be aware of.
Keywords: domain ontology, biology, bioinformatics, bio-ontologies, design, guidelines, semantics, philosophy
There is a multitude of heterogeneous and autonomous data resources accessible over the Internet that cover genomic, cellular, structure, phenotype and other types of biologically relevant information. Even for one type of information, e.g. DNA sequence data, there exist several databases of different scope and organisation [1,6,7].
There exist terminological differences (synonyms, aliases, formulae), syntactic differences (file structure, separators, spelling) and semantic differences (intra- and interdisciplinary homonyms). Data integration is impeded by different meaning of identically named categories, overlapping meaning of different categories and conflicting meaning of different categories. Naming conventions of data objects, object identifier codes and record labels differ between databases and do not follow a unified scheme. Even the meaning of important high level concepts that are fundamental to molecular biology is ambiguous.
One prominent example is the concept gene. For GDB, a gene is a "DNA fragment that can be transcribed and translated into a protein". For GenBank and GSDB, however, a gene is a "DNA region of biological interest with a name and that carries a genetic trait or phenotype" which includes nonstructural coding DNA regions like intron, promoter and enhancer. There is a clear semantic distinction between those two notions of gene, but both continue to be used, thereby adding another level of complexity to data integration. Another term with multiple meanings is protein function (biochemical function, e.g. enzyme catalysis; genetic function, e.g. transcription repressor; cellular function, e.g. scaffold; physiological function, e.g. signal transducer).
If a user queries a database with some ambiguous term, she has so far had full responsibility for verifying the semantic congruence between what she asked for and what the database returned. Even if a semantic incompatibility is known, it still must be sorted out for each search result. Ontologies could help here to localise the right type of concept to be searched for, as opposed to identifying a mere label naming a search table.
The advent of microarray technology for mRNA expression analysis requires additional standardisation in terminology, especially for characterising experimental setup, mathematical post-processing of raw measurements, genes, tissues and samples. A comparison between different experiments is only feasible if consistent terminology and standardised input forms are used. The development of suitable ontologies is currently pursued in the MGED consortium.
Another reason for demanding standardised nomenclature in biology is the merging of subfields that historically started rather independently but must now, with a more integrated approach to biology, be closely integrated. This concerns e.g. genetics, protein chemistry and pharmacology. Since these areas have developed quite distinct terminologies, large pharmaceutical companies in particular feel an urgent need to harmonise the technical language in order to store their corporate knowledge in a central, unified database.
The fast growth of sequence, structure, expression, metabolic and regulatory data of many organisms adds additional pressure to utilise standardised and compatible nomenclature in molecular biology.
Text mining and natural language understanding in biology can also profit from ontologies. Where currently mostly statistical and proximity approaches are applied to text analysis, ontologies can support parsing and disambiguating sentences by constraining grammatically compatible concepts.
To eliminate semantic confusion in molecular biology, it will therefore be necessary to have a list of the most important and frequently used concepts coherently defined, so that e.g. database managers, curators and annotators can use such a set of definitions to create new software and database schemata, to provide an exact, semantic specification of the concepts used in an existing schema, and to curate and annotate existing database entries consistently.
It is important to understand that semantic ambiguities also arise between human experts. However, in the course of a conversation usually enough background knowledge and context is available so that semantic ambiguities are mostly resolved before they are even consciously recognised. This is possible because of our intelligent capabilities, of which computers, programs and databases will fall short at least for the near future.
First, one should be aware of the distinction between ontology, the study of being as a branch of philosophy and individual (domain) ontologies, which are the result of the analysis of a particular domain of interest (possibly as broad as the universe) and the instantiation of a concrete ontological model of that domain. Such an individual ontology represents a system of categories accounting for a particular vision of the world (or parts of it).
Ontologies are to a large extent in principle language independent, e.g. there can be a German equivalent to an English domain ontology, even if the actual translation process would not be trivial since subtle connotations of terms and definitions must be precisely understood and appropriately retold in terms of the other language.
Domain ontologies can be of varying scope and content. One can distinguish between
Since the world around us in general, and molecular biology and bioinformatics in particular, are in many respects enormously complex, it is important to understand the intended use of a new ontology well before developing it. Otherwise there is a great risk of losing focus and being overwhelmed by the multitude of facets, and thus of failing to finish a sufficiently complete, useful ontology. This aspect is acknowledged by the term "situated ontologies", which emphasises the fact that a domain ontology should always be evaluated with respect to its intended use.
Certainly, ontologies cannot remain constant but will need to be updated in light of new experimental evidence, new focus of knowledge and shifting semantics in our language. The good news, however, is that an ontology is much more stable than e.g. a database schema, which depends on a database representation formalism, a database management system and requirements from the applications which access the data. Since an ontology can easily be translated from one knowledge representation formalism into another (given equivalent expressive capability), it can also be converted into a database schema. Since a domain ontology primarily addresses the basic, fundamental underlying relations of an application domain, there is less need to modify an ontology than to modify actual knowledge bases.
The main semantic stages in information retrieval in the past were:
Nowadays concept based search on a curated set of concepts is becoming more common, e.g. Ontolingua or GeneOntology. The interplay between ontologies, biology, computer science and philosophy is depicted in Figure 1.
|Figure 1: Molecular biologists discover facts that need to be organised and stored in databases. Computer scientists provide techniques for data representation and manipulation. Philosophers and linguists help organise the meaning behind database labels.|
Probably the first notable ontologist was Aristotle (384-322 BC) who among many other things pursued the question of what can be known about something - or even anything. His solution is presented in his "Categories" and can be seen as the first upper-level ontology.
In Aristotle's view these ten categories suffice to say anything that can be known about something. They present the essential qualities that matter. Everything else can be subsumed into one of those. Of course, for annotation of molecular biological entities this list of concepts seems too short and the concepts too general. However, if one subscribes to this set of categories as the essential fundamental ones, one could continue and further subclassify these categories into more specific ones until they reach the realm of molecular biology.
Another design feature of Aristotle's ontology is the missing interconnection between his ten categories. If each of these is assumed to be an "atomic" category, i.e. it cannot be meaningfully decomposed into smaller concepts, then there cannot be much structure on top of them. However, if one might want to know more about how these ten categories basically relate to each other this information should also go into the ontology. Other ontologies try to be more explicit about the relations between their concepts (see below for examples).
Entity
    Particular (e.g. "large molecule", "green spot")
        Concrete particular
            Location
            Object
        Abstract particular
            Set
            Structure
    Universal (e.g. "largeness", "color")
        Property
            Property Kinds...
        Relation
N. Guarino offered this hierarchically composed version of an upper-level ontology. The hierarchical link between indented concepts means "is subclass of". This upper-level ontology is also rather small and stops well before biologically relevant concepts are reached.
One of the first computational ontologies was Cyc. Cyc is an ontology originally developed to cover everyday common-sense knowledge. A subset of about 6000 concepts is publicly available as HTML hypertext with ample documentation. Cyc was not built to support a specific application but with the intention to cover even subtle semantic distinctions that a person has to consider when communicating in daily life. The complete version of Cyc is commercially available. Cyc contains a large and detailed collection of well documented concepts but is of limited use for molecular biology for several reasons. Cyc does not include a significant portion of concepts relevant to molecular biology since it was designed to be a universal ontology and only very basic knowledge about chemistry and biology has been added.
Although the authors of Cyc state that they "generally only list a nonredundant series of supersets" or "the incommensurably most specific (i.e., smallest) supersets of each collection", this rule is violated on several occasions. For example, Collection has listed the supersets Intangible, Thing and Set, of which Thing is a superset of Intangible which in turn is a superset of Set. There are also several cases where two concepts are listed to be the superset of each other, e.g. Stuff and IndividualObject. Thing, the "universal set of everything", has as its immediate subclasses IndividualObject, Intangible and Role, all three of which overlap because there exist intangible IndividualObject(s) and a Role is something both individual and intangible (Figure 2). The definition of Thing as the set of everything also faces Russell's set dilemma.
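Redundant and mutually circular superset listings of this kind can be detected mechanically. The following is a minimal Python sketch, assuming a toy table that contains only the superset relations named in the paragraph above; it is not Cyc data, and the function names `ancestors`, `redundant_supersets` and `mutual_supersets` are illustrative choices rather than part of any Cyc interface.

```python
# Toy "is a subset of" table: concept -> directly listed supersets.
# Assumption: only the relations mentioned in the text are included.
LISTED_SUPERSETS = {
    "Collection": {"Intangible", "Thing", "Set"},
    "Set": {"Intangible"},
    "Intangible": {"Thing"},
    "Stuff": {"IndividualObject"},
    "IndividualObject": {"Stuff", "Thing"},
    "Thing": set(),
}


def ancestors(concept, table):
    """Return every superset reachable from `concept` (safe against cycles)."""
    seen = set()
    stack = list(table.get(concept, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(table.get(current, ()))
    return seen


def redundant_supersets(concept, table):
    """Listed supersets that are already implied by another listed superset."""
    listed = table.get(concept, set())
    return {
        s
        for s in listed
        if any(s in ancestors(other, table) for other in listed if other != s)
    }


def mutual_supersets(table):
    """Pairs of concepts that are listed as supersets of each other."""
    return {
        frozenset((a, b))
        for a, supers in table.items()
        for b in supers
        if a != b and a in table.get(b, set())
    }
```

On this toy table, `redundant_supersets("Collection", LISTED_SUPERSETS)` flags Intangible and Thing, leaving Set as the only most specific superset, and `mutual_supersets(LISTED_SUPERSETS)` reports the pair Stuff and IndividualObject.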
Though most definitions in Cyc seem philosophically well established, what is visible to the public is counterintuitive in some places. For example, Situation is defined to be "a state of affairs" with superclass IndividualObject which is a "discrete, not abstract entity that can have parts but not elements or subsets", suggesting that not only objects involved in a Situation but also Situation itself is a tangible entity since no link to Intangible exists.
The concept Stuff, defined as a discrete object that "when divided into pieces remains of the same type" (e.g. water) includes "physical entities like wood", "temporal entities like the event of a person running" and abstract things like "a piece of English text". One problem with the definition of Stuff is its granularity: on a molecular scale wood can well be divided into components that no longer are wood. Similarly, English text can be divided into letters which are neither distinctively English nor text anymore.
The criterion used to subclassify a concept in Cyc is not always stated explicitly. In many cases, subclasses in one class overlap semantically or are created using different subclassifying criteria. No homonyms are found in Cyc. Naming of concepts is sometimes confusing, e.g. Thing vs. SomethingExisting; PartiallyTangible vs. PartiallyIntangible; IntangibleObject vs. IntangibleStuff. Cyc contains a hierarchy of classes containing only classes that in some cases mirrors a similar hierarchy of classes containing instances but which does not convey any new information. This adds to the confusion when searching for a concept. All these properties of the Cyc ontology make it difficult to locate the appropriate position for an existing concept or for a new one to be added.
|Figure 2: Upper Level of Cyc Ontology. Straight lines indicate "is a subclass of" relation, arrows and italics denote "is a member of" relation (instances).|
Another philosophically motivated upper-level ontology is the author's. Like Guarino's upper-level ontology it starts from a single node but also extends into physical and abstract concepts that are relevant for biology and bioinformatics.
The upper level of a prospective Ontology for Molecular Biology is shown in Figure 3. Starting from the root node Being, which includes anything that is, the classes Object and Event are disjoint and discriminated based on their temporal extent. An Object remains an Object even in a single moment in time, whereas an Event, when dissected into single moments, loses its identity. This also holds for all subclasses of Object and Event. The class Object is further subclassified into Individual Object and Property. Both can be thought of as instantaneous, i.e. they keep their identity even if looked at only for one moment. The two are discriminated based on self-contentment. An Individual Object can stand alone whereas a Property always needs another Object or Event to refer to. A Property is further subclassified based on arity into Attribute, a property with only one argument, and Relation, a property relating two or more Beings.
|Figure 3: Upper Level of a prospective Molecular Biology Ontology. Links represent the "is a subclass of" relation. No instances are present; discriminating criteria have arrows and boxes; thick lines denote disjunct subclasses.|
Hereby, the logical grammar of words, not their surface structure must be considered. For example, in the statement "Paris is beautiful", beautiful is not a logical attribute to Paris because this statement necessarily involves a second entity, the speaker and thus becomes one binary and one unary relation: "She thinks, Paris is beautiful".
Attribute can be subclassified into Identifier and Descriptor based on whether it just labels an entity or whether it carries additional information about it. Relation can be subclassified analogous to Locke into Secondary Property relations that involve personal judgement and Primary Property factuals describing intersubjective measurable relations.
Individual Object is subclassified based on physicality into Abstract Object, which has no physical equivalent per se (except capable of being represented neurologically or in writing, etc.) and Physical Object, which must have a defined spatial extension and/or energy content and is similar to Popper's "World 1".
Abstract Object is further subclassified based on mentality, i.e. whether it refers to an object within the mind or to an object in the outside world, into Mental Object (similar to Popper's "World 2") and Worldly Object (similar to Popper's "World 3"). Although energy and matter are equivalent in nuclear physics a given object can be only of one type at a time. Hence, Physical Object has been subclassified based on mass content into Energy and Matter.
On the other branch of the ontology, Event is subclassified based on activity into Occurrence, where at least one object participates, and (pure) Time, where nothing happens. This is the notion of absolute time, which is no longer valid in relativistic physics and astronomy. Holding on to absolute time here is nevertheless justified by the intended scope of the ontology for molecular biology: physical processes in living organisms have so far never been known to reach the realm of relativistic physics.
Time is further subclassified according to direction into Past and Future. Because the present strictly lasts one moment only, it does not appear in this branch. Analogous to abstract and physical objects, Occurrence is subclassified based on physicality into Abstract Event and Physical Event, and Abstract Event is further subclassified based on mentality into Mental Event (similar to Popper's "World 2") and Worldly Event (similar to Popper's "World 3"). Physical Event is similar to Popper's "World 1" and is subclassified, based on whether it is done or initiated by human intention, into Human Activity and Natural Process.
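The upper level described in the preceding paragraphs can be written down as a small data structure in which every class carries its discriminating criterion. The following Python sketch is one possible encoding, not the author's implementation. The class names and most criterion labels follow the prose; the labels for Attribute, Relation and Physical Event are paraphrased from the text (an assumption), and `parent_map` and `lineage` are hypothetical helper names.

```python
# class -> (discriminating criterion, subclasses), transcribed from the prose.
# Assumption: the criterion labels for Attribute, Relation and Physical Event
# are paraphrases; the text does not give them a short name.
UPPER_LEVEL = {
    "Being": ("temporal extent", ["Object", "Event"]),
    "Object": ("self-contentment", ["Individual Object", "Property"]),
    "Property": ("arity", ["Attribute", "Relation"]),
    "Attribute": ("labelling vs. describing", ["Identifier", "Descriptor"]),
    "Relation": ("personal judgement", ["Secondary Property", "Primary Property"]),
    "Individual Object": ("physicality", ["Abstract Object", "Physical Object"]),
    "Abstract Object": ("mentality", ["Mental Object", "Worldly Object"]),
    "Physical Object": ("mass content", ["Energy", "Matter"]),
    "Event": ("activity", ["Occurrence", "Time"]),
    "Time": ("direction", ["Past", "Future"]),
    "Occurrence": ("physicality", ["Abstract Event", "Physical Event"]),
    "Abstract Event": ("mentality", ["Mental Event", "Worldly Event"]),
    "Physical Event": ("human intention", ["Human Activity", "Natural Process"]),
}


def parent_map(hierarchy):
    """Invert the hierarchy; fail if any class has more than one superclass."""
    parents = {}
    for parent, (_criterion, children) in hierarchy.items():
        for child in children:
            if child in parents:
                raise ValueError(f"{child} has more than one superclass")
            parents[child] = parent
    return parents


def lineage(concept, hierarchy):
    """Return the chain of superclasses from the root down to `concept`."""
    parents = parent_map(hierarchy)
    path = [concept]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]
```

For example, `lineage("Matter", UPPER_LEVEL)` walks from Being through Object, Individual Object and Physical Object to Matter. Storing the criterion next to the subclasses keeps the reason for every split explicit.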
This section addresses domain or concept ontologies only. No statements should be applied to ontology as the branch of philosophy except where explicitly noted. Since there is no a priori definition of a domain ontology, this section necessarily contains personal opinion but tries to give rational explanation wherever possible.
Here are three definitions of domain ontologies.
(i) "System of categories accounting for a particular vision of the world." 
(ii) "Specification of a conceptualization." 
(iii) "Concise and unambiguous description of principle relevant entities with their potential, valid relations to each other." 
Definition (i) is in the spirit of Aristotle's ontology and characterises well many ontological systems from philosophy but fails to impose any structure or form on them. Definition (ii) says: analyse your domain of interest, find out the basic concepts that are instantiated and specify them (but not the actual instantiations). Although this describes well in broad terms several main stages in ontology development, the definition itself is not self-explanatory.
For that reason, definition (iii) was conceived. It attempts to summarise definitions (i) and (ii) and to explain at least in some detail the scope of an ontology and a few constraints to be observed. Which are the principle relevant entities is determined by experts of the domain.
Any domain ontology should meet the following requirements.
The building blocks of an ontology are the following.
What is not an ontology?
An ontology is not a collection of facts arising from a specific situation but it provides all semantic entities (e.g. classes) to describe that situation. A concrete description of a situation uses those concepts to create instances and annotates them with their predicates.
An ontology is not a model of an application domain but a compendium of all building blocks with their valid modes of combination required to express a theory. An entire model of an application domain (e.g. enzyme chemistry) would be a set of (possibly verified) hypotheses or a theory.
An ontology is not a database schema which describes the categories and their data types and organisation in a database but not necessarily the relations between the actual entities in the real world they stand for. A database schema can be derived from an ontology by adding data type information and translating the knowledge representation formalism into a database management paradigm (e.g. relational). Vice versa, a database schema can be used as a starting point to create an ontology. The categories and their attributes can be taken as an initial set of concepts to populate an ontology.
An ontology is not a knowledge base which gathers knowledge about actual individual objects, events, situations, experiments etc but it holds a collection of the types of objects, events etc used to specify those objects in an actual situation. Alternatively, one could say an ontology is a particular knowledge base filled with knowledge about concepts and their ontological relations.
An ontology is not a taxonomy which knows only about superclass and subclass relations whereas an ontology is open to many types of relations between concepts (e.g. mereological, topological, compositional).
An ontology is not a vocabulary or dictionary since the words in a dictionary do not necessarily describe the hierarchy and relation between each and every concept and are not organised in a way that supports computational inference. In an ontology one can follow a path from any one concept to another along the edges of some IS-A hierarchy or other relations.
An ontology is not a semantic net which is a more general representation formalism that can be used to implement an ontology but is not the only choice for that.
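The difference between a taxonomy and an ontology with several relation types, and the idea of following a path between two concepts, can be illustrated with a small graph search. The sketch below is a hedged illustration only: the concepts and relations in `EDGES` are hypothetical examples chosen for this purpose, not taken from any published ontology, and `find_path` is an assumed helper name. It uses a breadth-first search and allows edges to be traversed in either direction.

```python
from collections import deque

# Hypothetical typed edges: (subject, relation, object).
EDGES = [
    ("Exon", "part_of", "Gene"),
    ("Gene", "is_a", "DnaRegion"),
    ("Promoter", "is_a", "DnaRegion"),
    ("Promoter", "regulates", "Gene"),
]


def find_path(start, goal, edges, allowed=None):
    """Shortest path from start to goal as (concept, relation, concept) steps.

    `allowed` optionally restricts the search to a set of relation names,
    e.g. {"is_a"} to mimic a pure taxonomy. Returns None if no path exists.
    """
    neighbours = {}
    for subj, rel, obj in edges:
        if allowed is not None and rel not in allowed:
            continue
        neighbours.setdefault(subj, []).append((rel, obj))
        neighbours.setdefault(obj, []).append((f"inverse {rel}", subj))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in neighbours.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None
```

With all relation types available, Exon and Promoter are connected via Gene. Restricting the search to `allowed={"is_a"}` finds no path from Exon, because in this toy graph Exon is linked only by a part-whole relation.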
As an example of ontological distinctions consider the following. When we say "DNA" we can actually mean several quite different entities. First, there is the actual substance, which is physical and can drop on your foot. Second, DNA can refer to a particular class of chemical substance, which includes general features common to all DNA molecules and is used e.g. in molecular modelling. Third, DNA can mean a certain type of sequence or string, which is an abstract mathematical concept and can be subject to certain mathematical operations but cannot drop on your foot. Fourth, DNA is often used in the lab to refer to a particular instance of a sequence, e.g. the DNA sequence of E. coli K12, which can be stored in a database and needs a carrier (memory chip, paper) to survive. There are probably other connotations of DNA in everyday life than those listed here.
Due to the various notions and uses of ontologies there are several ways to build an ontology (e.g. stage-based, iterative evolving prototypes). In the following, the one of  is described.
Given the components described above (set of concepts, propositions about concepts, axioms, knowledge representation formalism, "is a subset" relation, "is a member" relation), apply the following steps to each concept.
Concept naming guidelines
The following rules make an ontology more readable.
The benefits of this methodology are significant. When adding a new concept one can use the discriminating criteria of the ontology as a decision tree to travel down from the root and at each branch deterministically decide where the new concept belongs. Either one finds that the concept is already there (possibly under another name); then the insertion process merely adds another alias to the existing concept. Or one ends at some point in the hierarchy where no alternative seems appropriate anymore. This is then the place where the new concept should be added, either directly or using some intermediary concepts to separate the existing concepts from the new branch. This also guarantees consistency of the existing ontology and of the generalisation/specialisation hierarchy after inserting a new concept. Searching for a known or even unknown concept can be done in the same way, i.e. by traversing the decision tree of discriminating criteria.
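The insertion procedure described above can be sketched as a walk down a tree of discriminating criteria. The following Python code is a simplified illustration under stated assumptions, not the author's tooling. The names `Concept`, `locate`, `place` and `build_toy_tree` are hypothetical, the answer labels are invented for the example, and reaching a leaf is treated as "concept already exists". In practice a curator would still compare definitions and might introduce a further criterion instead.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Concept:
    name: str
    criterion: Optional[str] = None  # criterion that discriminates the subclasses
    children: Dict[str, "Concept"] = field(default_factory=dict)  # answer -> subclass
    aliases: Set[str] = field(default_factory=set)


def locate(root, answers):
    """Walk down from the root, following the answer given for each criterion."""
    node = root
    while node.criterion is not None:
        child = node.children.get(answers.get(node.criterion))
        if child is None:
            break  # no existing alternative fits: stop here
        node = child
    return node


def place(root, name, answers):
    """Add `name` as an alias of an existing concept or as a new subclass."""
    node = locate(root, answers)
    if node.criterion is None:
        # A leaf was reached: the ontology cannot tell the concepts apart,
        # so the new name is recorded as an alias (simplifying assumption).
        if name != node.name:
            node.aliases.add(name)
        return node
    answer = answers.get(node.criterion)
    if answer is None:
        raise ValueError(f"no answer given for criterion {node.criterion!r}")
    new_concept = Concept(name)
    node.children[answer] = new_concept
    return new_concept


def build_toy_tree():
    """A deliberately incomplete fragment of an upper level (illustrative)."""
    occurrence = Concept("Occurrence")
    event = Concept("Event", "activity", {"at least one object participates": occurrence})
    obj = Concept("Object")
    return Concept(
        "Being",
        "temporal extent",
        {"keeps identity in a single moment": obj, "needs duration": event},
    )
```

In this toy tree, placing "Time" with the answers "needs duration" and "nothing happens" creates a new subclass of Event. Placing a hypothetical synonym such as "Process" with the answers "needs duration" and "at least one object participates" ends at Occurrence and only adds an alias.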
Ontological commitment refers to the choice of axioms for an ontology, i.e. the background belief which is not explicit in an ontology; the choice of granularity in the selection of concepts and definitions (coarse abstractions vs. finer details); and the choice of subclassifying criterion (content; priority). All these decisions influence the final appearance of an ontology and should at least be stated explicitly.
There are several difficulties to be overcome when building an ontology. Some difficulties are inherent to the ontology building process, others arise mainly from the application area at hand.
Since there is no definite rule to determine the "best" (e.g. most informative) subclassifying criterion for a given class one is left with a necessarily arbitrary decision on how to subclassify that class. This implies there will not be an optimal nor best ontology for a given set of concepts but only (in)consistent and (un)useful ontologies. Also, since the information content of the concepts that still need to be added to an ontology is not precisely known in advance the choice of subclassifying criteria can lead to more complex inheritance structure than necessary.
Other difficulties arising in the ontology building process are the following.
Of the domain specific difficulties in ontology building, the most common ones are ill-defined technical terms, controversial technical terms, difficulty in analysing and separating homonyms, and imprecise or lacking documentation of database categories.
In toto, this leads to the conclusion that one main degree of freedom when building an ontology, i.e. the degree of abstraction, granularity and detail of the domain to be modelled determines the practical quality of an ontology in a range from useless (too abstract, does not give sufficiently detailed information) to impossible (ultimate granularity and coverage).
The feasibility and desirability of one comprehensive ontology for molecular biology versus several smaller task oriented ontologies has been extensively debated in the community. On the one hand one comprehensive domain ontology would certainly be very helpful if it could be achieved and maintained. On the other hand, it seemed much more efficient and effective to have several smaller task or subdomain ontologies which take less time and expertise to grow and maintain and therefore are in the position to be put to use much sooner.
In principle, the approach of smaller subdomain ontologies is the more practical one with the exception of a situation where eventually the goal is to combine all subdomain ontologies. In that case, much work will have to be redone since the integration of ontologies as described above can hardly be automated. Each concept must be located and identified in the various subdomain ontologies which involves manual search, reading and comparing concept definitions. A decision must be made whether the concepts are similar enough to be merged into one or if several similar concepts need to be saved. Then, these concept(s) must be added to a new ontology that will incorporate all subdomain ontology concepts.
In the special case where the root or some top-level concepts of one ontology exactly match concepts in another ontology these branches could be merged. However, in this case the data format (syntax, representation formalism) and the relations between concepts of the two ontologies still need to be checked and verified.
Since this process of ontology integration is quite laborious, it might be more sensible to start off with an ontology that has a rather general upper level and can accommodate all of the diverse ontological types that are to be expected from the application domain. This was exactly the motivation for starting the MBO ontology.
Ontologies can provide to computer programs much of the common sense and background knowledge that human experts use. Therefore, their range of applicability is rather broad as was indicated already in the introductory paragraphs of this document. Two examples, database integration and data annotation will be discussed here briefly.
Data integration faces the problems of syntactic and semantic heterogeneity. While regular syntactic incompatibilities can easily be aligned with pattern matching software, semantic heterogeneity needs a unified semantic repository to be resolved. For n databases, the traditional approach requires each table, object etc. of one database to be manually aligned with the structure and contents of every other database. Because a word can carry different meanings, the mapping between database tables and attributes can be non-symmetrical, so this amounts to on the order of n*n integration attempts. However, if one ontology exists that can be placed "in the middle" of those n databases, the integration effort is reduced from n*n to n, since each database only has to be mapped to the ontology, where general inference algorithms can figure out identical or similar concepts in any other database.
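The arithmetic and the "ontology in the middle" idea can be made concrete with a short Python sketch. Everything in it is an assumption for illustration: the database names, category labels and ontology concept names are hypothetical, and `translate` only matches identical concepts, whereas the text also envisages inference over similar ones. Counting directed mappings between distinct databases gives n(n-1), which is what the n*n estimate approximates.

```python
# Hypothetical mappings: local category label -> shared ontology concept.
# Note that "gene" means different things in db_a and db_b (a homonym).
TO_ONTOLOGY = {
    "db_a": {"gene": "ProteinCodingRegion", "locus": "NamedDnaRegion"},
    "db_b": {"gene": "NamedDnaRegion", "cds": "ProteinCodingRegion"},
    "db_c": {"coding_gene": "ProteinCodingRegion"},
}


def translate(label, source, target, mappings):
    """Labels in `target` that denote the same concept as `label` in `source`."""
    concept = mappings[source].get(label)
    if concept is None:
        return []
    return sorted(
        other for other, mapped in mappings[target].items() if mapped == concept
    )


def pairwise_mappings(n):
    """Directed database-to-database mappings needed without a shared ontology."""
    return n * (n - 1)


def hub_mappings(n):
    """Mappings needed when every database is mapped to one ontology."""
    return n
```

Here `translate("gene", "db_a", "db_b", TO_ONTOLOGY)` yields `cds` rather than `gene`, because the lookup goes through the shared concept and not through the label. For five databases, the sketch counts 20 direct mappings against 5 mappings to the ontology.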
For data annotation, a fully fledged ontology as described above is in principle not required; a controlled vocabulary suffices, since the main purpose is to provide constant and unique reference points. Such a controlled vocabulary is developed in the GeneOntology (GO) project. GeneOntology attempts to provide continuity in the so-called GO identifiers (GO IDs). This means that new concepts get new GO IDs, old concepts keep their GO IDs even if they are moved to another location within the hierarchy, and GO IDs of deleted concepts are not reused.
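The identifier policy just described (new concepts get new identifiers, moved concepts keep theirs, identifiers of deleted concepts are never reused) can be sketched as a small registry. This is a minimal Python illustration, not the way the GO project manages its identifiers. The class `ConceptRegistry`, its methods and the `EX:` prefix are hypothetical, and refusing to delete a concept that still has subclasses is a simplifying assumption.

```python
class ConceptRegistry:
    """Toy registry that issues stable, never reused concept identifiers."""

    def __init__(self, prefix="EX"):
        self._prefix = prefix
        self._next = 1       # counter only ever increases
        self.name = {}       # live id -> concept name
        self.parent = {}     # live id -> parent id (None for a root)
        self.retired = set() # ids of deleted concepts, kept so they stay reserved

    def add(self, name, parent=None):
        """Register a new concept and return its freshly issued identifier."""
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent}")
        new_id = f"{self._prefix}:{self._next:07d}"
        self._next += 1
        self.name[new_id] = name
        self.parent[new_id] = parent
        return new_id

    def move(self, concept_id, new_parent):
        """Relocate a concept in the hierarchy; its identifier is unchanged."""
        if concept_id not in self.parent:
            raise KeyError(f"unknown concept {concept_id}")
        if new_parent is not None and new_parent not in self.parent:
            raise KeyError(f"unknown parent {new_parent}")
        self.parent[concept_id] = new_parent

    def delete(self, concept_id):
        """Retire a concept; its identifier is remembered and not issued again."""
        if any(p == concept_id for p in self.parent.values()):
            raise ValueError("concept still has subclasses")
        del self.name[concept_id]
        del self.parent[concept_id]
        self.retired.add(concept_id)
```

Because the counter only ever increases, a concept added after a deletion receives a fresh identifier, so annotations that still point at a retired identifier cannot silently change meaning.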
However, the design principles of GO did not prevent the following shortcomings.
All in all, GO currently seems to be more a nomenclature or controlled vocabulary for molecular biology than a fully fledged gene ontology.
Finally, some resources are listed here that could be relevant to work on bio-ontologies.