The Semantic Metadatabase (SEMEDA):
Ontology Based Integration of Federated Molecular Biological Data Sources

Jacob Köhler1 and Steffen Schulze-Kremer2

1 TU Berlin, Germany
2Resource Center DHGP, Berlin, Germany


A system for "intelligent" semantic integration and querying of federated databases is being implemented by using three main components: A component which enables SQL access to integrated databases by database federation (MARGBench), an ontology based semantic metadatabase (SEMEDA) and an ontology based query interface (SEMEDA-query). In this publication we explain and demonstrate the principles, architecture and the use of SEMEDA. Since SEMEDA is implemented as 3 tiered web application database providers can enter all relevant semantic and technical information about their databases by themselves via a webbrowser. SEMEDA' s collaborative ontology editing feature is not restricted to database integration, and might also be useful for ongoing ontology developments, such as the "Gene Ontology" [1]. A "static" demo frontend can be found at


In conjunction with the rapid progress of biotechnologies and the human genome project, an increasing amount of data is being generated. The amount of new data is that big that human genetics journals are increasingly reluctant to publish mutation reports [2]. However much data is often published in publicly accessible data sources. The annual list of current database systems from the Journal of Nucleic Acid Research (January issue) lists at the moment about 300 systems with molecular biological data. Since every database uses its more or less powerful and user friendly proprietary user interface, querying these databases can be tedious. The fact that databases do not only vary in their interfaces, and data formats, but also in the semantic content makes it especially difficult to combine information from several databases. Therefore systems like SRS from Lion Bioscience are successful, although SRS does not semantically integrate databases, i.e. in SRS the user still has to know the schema of each database and figure out himself suitable and valid queries.

Based on principles and requirements (multi-user support, importing of ontologies, web supported ontology editing and database metadata editing etc.) as described in [3] and [4], a system for "intelligent" semantic integration and querying of federated databases is being implemented by using three main components: A component which enables SQL access to integrated databases by database federation (MARGBench [5]), an ontology based semantic metadatabase (SEMEDA) and an ontology based query interface (SEMEDA-query). Subsequently we explain and demonstrate the principles, architecture and the use of SEMEDA. The principle idea of ontology based semantic database integration is to define database attributes by referencing them to ontological concepts. In a next step which will not be described in this publication, the structure of the ontology can be used for "intelligent" and user-friendly database querying.


General Considerations

What are Ontologies in SEMEDA. The notion what ontologies are and how they should be implemented varies between people and research groups [6]. In SEMEDA, ontologies can be considered as a set of concepts which are connected by binary relations. Concepts are well defined entities: they have a unique meaning, properties like a name (label), a description and an identifier. The terms concept and node will be used as synonyms in this publication and so will the terms relation and edge. Relation type (edge type) addresses the characteristics of a relation. The term hierarchy applies to the closure (in its mathematical sense) of a concept over a given relation type.

SEMEDA allows to assign relational algebraic properties such as symmetry, reflexivity and transitivity to relation types. Especially transitivity and symmetry are used to derive the semantic entailments of concepts, and will be especially important for SEMEDAs database query interface.

More formally in SEMEDA an ontology is a pair:
 O := (N, E)
 N: set of concepts, E: set of edges, E (a, b, t) where a, b N, and t is the type of the relation which defines the semantic (is-a, is-part etc.) and the algebraic properties (transitivity, symmetry etc.) of the relation.

Elaborate concepts for "formal ontologies" based on lattices exist [7], but since many "real world" ontologies did not follow these design principles, SEMEDA supports them but does not require that they are used.

Support for Collaborative Ontology Editing. Several users have to be able to collaboratively edit the ontology and database meta-information. Problems related to transaction management are solved by using locks and permissions.

3 groups of user with different permissions exist:

Admins: full permission on everything. Only people who are responsible for maintaining the system should use administrative accounts.

DB Provider: Objects which are generated by the DB provider will be treated as "suggested objects", i.e. they will not be fully integrated to the ontology before an administrative account releases access. Therefor a database provider may: a) add nodes, edge-types and edges. b) edit nodes and edge-types which he generated c.) delete nodes and edge-types which he generated d.) add a database and edit delete their own database meta-information e) define table attributes as nodes.
After an administrative account has released objects which were suggested by a database provider, these objects can no longer be edited by the database provider.

Everybody: everybody has permission to browse the ontology. However, confidential database information (host, port, login/password) can only be browsed by the database provider and by admins.


Because querying large ontologies can be computing intensive, performance was important even though our intention was to develop "only" a prototype. We implemented the system as a 3 tiered system consisting of a relational database (backend) and jsp 1.1 (java server pages) as the middle tier which dynamically generates the html frontend. Using this architecture has several advantages: data (ontologies and database metainformation) can be consistently stored independently from the application and can also be retrieved or imported by using the various built in interfaces and tools of the DBMS.

Backend. Oracle 8i was used as the relational DBMS system, but any other database system could also have been used. Whereas often the "application logic" is located in the middletier, we implemented it partly in JSP and partly by using DBMS features such as constraints and PL/SQL (trigger, procedures, functions). This results in better performance (fewer JDBC calls, query optimisation by the DBMS) and has the advantage that much of the implemented functionality can be used by other applications without having to use the middletier. On the other hand, the use of DBMS features makes it more difficult to port SEMEDA to other DBMS. However, this does not affect raw data exchange of ontologies and database metadata between SEMEDA and other systems.

Fig. 1 shows a simplified ER Schema of the backend. The actual implementation of the schema contains more features which are not displayed in Fig. 1: Adaptations for synonyms exist and a relationship CONTAINS_INFO_ABOUT connects TABLEs and NODEs and allows further definition of the content of a database table. This allows for example to specify if a database table contains data about just one species.

Figure 1: Simplified ER-schema of SEMEDAs backend.

As can be seen in the ER schema (Fig.1), we basically save ontologies as a set of nodes which are connected by edges. However, a tree or a net representation of the ontology is needed when a user browses an ontology, or when all "child nodes" of a given hierarchy have to be selected. For this, we used the "connect by prior" statement, a DBMS specific SQL extension [8]. In [9] several methods for "treewalking" in databases were investigated and it is stated that this proprietary SQL extension performs "excellent".

Middle -Tier. The middletier indirectly connects the frontend with the backend, i.e. maps HTTP GET and POST requests to the appropriate SQL/DML statements and PL/SQL procedures via JDBC. In addition, JSP is used for session tracking.

Frontend. For the frontend, dynamically generated HTML and JavaScript was used. Internet Explorer 4.x and Netscape 4.x versions are supported. By using Cascading Style Sheets, the layout of the frontend can easily be adjusted.

Ontologies are often large. Thus for a visual representation of the ontology, only a subset of all concepts can be displayed at a given time. For visual representation of the ontology a by depth and by relation type filtered tree is used, which is equivalent to a depth limited closure of a node over a given hierarchy.

At a static demo of SEMEDAs user interface can be browsed (see also Fig. 2). A simple ontology browser which is connected to the backend and which can be used to browse an imported version of the Gene Ontology [1] is available at However, by October, i.e. the date of the GCB) the full version which will be connected to the backend will be available at

Figure 2: Screenshot of SEMEDAs Demo main window in edit mode. Left: Database metadata editor, middle: Ontology Editor. Right: context dependent frame. The biologically "naive" fish definition was adopted from WordNet 1.7 [10]. See for this static demo which is not yet connected to the relational backend

The frontend consists of three frames: the left frame is a database metadata editor, the middle frame is used for ontology editing and browsing and the right frame is used context dependent on both frames.

In both the ontology and metadata editor frames, concept, database, table and attribute information can be viewed in the right frame by clicking the appropriate objects. These objects can be edited by selecting the appropriate radio-button before clicking on the appropriate "add", "edit" or "delete" button. Whereas all concepts should be part of the is-a relation, other hierarchies can be browsed (browse hierarchy) and new hierarchies can be defined (edit hierarchies). Concepts which are not connected by an is-a relation, can be found by selecting the "lost-concepts" option in the browse hierarchies drop-down menu.

Database attributes can be defined using the "define" button after selecting both the attribute and the appropriate concept in the ontology. By "defining" a database attribute as a concept, the database provider states: "entries of this database attribute are is-a children of the selected concept". A database attribute can be defined more than once, since database attribute contents sometimes are heterogeneous: for example the attribute "source" in the Protein Data Bank [11] may contain tissue, species, cell line etc. information. Defining an attribute only as one concept means, that all attribute entries are is-a children of the concept, defining it both as a specific concept and as the root concept (thing) means that some entries are is-a children of the concept.

The "refine" button refines the content of a database table, i.e. makes statements about all entries of a table. Refining a table for example as mouse means that all data in this table is mouse data. Tables can also be refined to several concepts since for example a table may contain protein data from mice.

A clear distinction between browse and edit mode exists and the user always sees in which mode he is working. "Suggested objects" are visually differentiated and the username of the owner, i.e. the user who suggested an object, is displayed in the edit mode, thus a user can see if he has the permission to modify an object. Objects which do not "belong" to somebody specific, can only be edited by administrative accounts. Such nodes should be 100 % correct and not be modified. Thus a DB provider can rely on the semantics of the concepts in SEMEDA, and he can be sure that the semantics will not change after he defined database attributes as a concepts.


Whereas we implemented SEMEDA with molecular biological databases in mind, databases from other knowledge domains could also be integrated (ecology, chemistry, agriculture, GIS, stock-market related databases, socio-economic etc.).

SEMEDA is a system which allows database providers to provide all technical and semantical meta-information which is necessary to access their database. The next step after completion of SEMEDA, will be to implement support for integrated schema modeling [5] and "intelligent" database queries (SEMEDA-query) as described in [4]. SEMEDA can handle several ontologies, and several ontologies can be imported by using the DBMS built in features. Therefore, SEMEDA' s collaborative ontology editing feature is not restricted to database integration, and might also be useful for ongoing ontology developments, such as the "Gene Ontology" [1].


  1. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.
  2. Krawczak, M., et al., Human gene mutation database-a biomedical information and research resource. Hum Mutat, 2000. 15(1): p. 45-51.
  3. Schulze-Kremer, S., Adding semantics to genome databases: towards an ontology for molecular biology. Ismb, 1997. 5: p. 272-5.
  4. Köhler, J., et al. Logical and Semantic Database Integration. in Bioinformatics and Biomedical Engineering. 2000. Arlington, Virginia, USA.
  5. Freier, A., et al. MARGBench - An Approach for Integration, Modeling and Animation of Metabolic Networks. in Proceedings of the German Conference on Bioinformatics. 1999. Hannover.
  6. Noy, N.F. and C. Hafner, The State of the Art in Ontology Design: A Survey and Comparative Review. AI Magazine, 1997. 18(3): p. 53-74.
  7. Ganter, B. and R. Wille, Formal concept analysis : mathematical foundations. 1999, Berlin ; New York: Springer. x, 284.
  8. Oracle-Corp., Oracle8i SQL Reference. 1997.
  9. NHS Information Authority, The Clinical Terms Version 3 (The Read Codes) - Reference Manual. 2000, Woodgate.
  10. Fellbaum, C., WordNet : an electronic lexical database. Language, speech, and communication. 1998, Cambridge, Mass: MIT Press. xxii, 423.
  11. Berman, H.M., et al., The Protein Data Bank. Nucleic Acids Res, 2000. 28(1): p. 235-42.