Linking experimental results, biological networks and sequence analysis methods using Ontologies and Generalised Data Structures
In Silico Biology 5, 0005 (2004); ©2004, Bioinformation Systems e.V.  
Ontology Workshop Göttingen 2004



Jacob Koehler1,*, Chris Rawlings1, Paul Verrier1, Rowan Mitchell1, Andre Skusa2, Alexander Ruegg2 and Stephan Philippi3




1 Rothamsted Research, BAB division, Harpenden, UK
2 University of Bielefeld, Technical Faculty, Germany
3 University of Koblenz, Department of Computer Science, Germany



* Corresponding author
   Email: jacob.koehler@bbsrc.ac.uk



Edited by H. Michael; received October 19, 2004; revised and accepted December 23, 2004; published December 31, 2004



Abstract

The structure of a closely integrated data warehouse is described that is designed to link different types and varying numbers of biological networks, sequence analysis methods and experimental results such as those coming from microarrays. The data schema is inspired by a combination of graph based methods and generalised data structures and makes use of ontologies and meta-data. The core idea is to consider and store biological networks as graphs, and to use generalised data structures (GDS) for the storage of further relevant information. This is possible because many biological networks can be stored as graphs: protein interactions, signal transduction networks, metabolic pathways, gene regulatory networks etc. Nodes in biological graphs represent entities such as promoters, proteins, genes and transcripts whereas the edges of such graphs specify how the nodes are related. The semantics of the nodes and edges are defined using ontologies of node and relation types. Besides generic attributes that most biological entities possess (name, attribute description), further information is stored using generalised data structures. By directly linking to underlying sequences (exons, introns, promoters, amino acid sequences) in a systematic way, close interoperability with sequence analysis methods can be achieved. This approach allows us to store, query and update a wide variety of biological information in a way that is semantically compact without requiring changes at the database schema level when new kinds of biological information are added.

We describe how this data warehouse is being implemented by extending the text-mining framework ONDEX to link, support and complement different bioinformatics applications and research activities such as microarray analysis, sequence analysis and modelling/simulation of biological systems. The system is developed under the GPL license and can be downloaded from http://sourceforge.net/projects/ondex/

Keywords: graph database, ontology, Generalised Data Structures, semantic data integration



Introduction

In many organisations, an important challenge for bioinformatics research and infrastructure groups is having to work with data from many different species and accommodate a wide variety of different types of data collected in different ways.

If the research involves organisms that are not central to medical research there are additional complexities including:

Fig. 1 gives an overview of how integrated data sources and different bioinformatics disciplines can be applied to derive new biological knowledge and to support the analysis of experimental data. Biological networks, such as protein interactions, metabolic pathways and signal transduction networks, can be exploited to generate templates for the modelling and simulation of biological systems, which can subsequently be refined in collaboration with experts in the respective application domain.

Sequence analysis methods and biological networks are interrelated because proteins and genes are the key elements in such biological networks. Analysing newly sequenced genomes, ESTs and other data sources in the context of such biological networks offers the opportunity to predict putative pathways and protein interactions based on sequence homology. Furthermore, such biological networks can be exploited for the visualisation and analysis of experimental results such as those from micro-array experiments, mass spectrometry and NMR. In order to enable this kind of data analysis, the acquisition, extraction and close semantic integration of biological networks from pre-existing databases or text mining techniques is a prerequisite.



Figure 1: Bioinformatics disciplines and how they can be related.


To successfully integrate data in a meaningful way, technical, semantic and organisational integration issues must all be addressed [Köhler, 2004]. While many of the major life science data sources can be accessed using established data integration tools such as SRS [Zdobnov et al., 2002] or DiscoveryLink [Haas et al., 2001] these systems address only technical heterogeneity among data sources. They do not support the semantic integration of databases which recent research shows can be achieved with significant benefit [Köhler et al., 2000; Köhler et al., 2003; Ludäscher et al., 2001; Philippi and Kohler, 2004; Stevens et al., 2000].

In the following, we will describe principles and methods for the close semantic integration of biological networks, and their extraction from heterogeneous data sources to support several bioinformatics activities. The described methods are the basis for the ongoing extension of the text mining and data integration framework ONDEX.



Principles and methods


Requirements

A biological data integration system should support:

When individual data sources have to be updated, it is assumed that all databases to be integrated are re-imported and integrated. In consequence, this avoids having to deal with conflicts that arise when an entry that was linked to its equivalent in a different data source was modified or removed in the updated version of the data source. In the longer term, consistency control and belief maintenance will be an important extension to such systems.


Principles

We propose using graph based methods for representing semantic data models. A graph is one of the most general data structures available and the basis for all definitions of networks. Biology in turn is often described as a science of relations and thus, of networked entities. Most biologically relevant data can be seen as a network and stored as a graph. In the following we use graph related terminology in the way it is normally used in graph theory.

Graphs have a strong theoretical background [Harary, 1969] and graph manipulation is supported by a wealth of algorithms. A broad range of algorithms is therefore available for graph representation, analysis and visualisation, and much biological information can be visualised as a graph in a way that is meaningful to users. As a result, computational methods can be applied to analyse large amounts of data. In contrast to relational databases and most public biological databases, it is also possible to visualise whole databases in one graph instead of just displaying individual entries one by one. Current research shows that the human eye is able to spot relevant information in big graphs consisting of thousands of nodes and edges, provided that appropriate graph visualisation and navigation methods are used [Hughes et al., 2004; Risden et al., 2000].

The main idea in ONDEX is to represent biological databases using graphs in which the nodes and edges have different attributes. In a protein interaction database, it is possible to represent the proteins as nodes, and their interactions as edges that connect these nodes. Likewise, in a metabolic network, the metabolites, enzymes and reactions can be represented by different types of nodes that are connected by directed edges. However, the meanings of the nodes and the edges differ. To reflect this, it is necessary to define the semantics of the nodes and edges, i.e. whether a node represents a protein or a metabolite, or whether an edge has the meaning "binds_to", "produced_by" or "consumed_by". In the following we will refer to such typed nodes and edges as concepts and relations, which reflects that the nodes and edges are meant to represent real world entities that are interlinked by semantically well defined relations. This generic graph based representation of different types of biological data means that any kind of biological data can be seen as an ontology, which also consists of concepts that are linked through relations. This gives the ability to store ontologies, biological databases and experimental data using the same graph based representation. Representing ontologies and databases with the same data structure assists in linking, integrating and relating data from different data sources, because algorithms and methods can be applied in the same way to ontologies and databases. Although, from a technical point of view, the differences between ontologies and databases are resolved by considering both data structures as graphs, many links and equivalent entries in different databases can only be established through the graphs that are derived from ontologies.
Going back to the above mentioned examples (protein-protein interaction networks and metabolic networks), it becomes clear that the concepts and relations may have different attributes: enzymes may have distinct sets of properties describing their activity (Km, Vmax, pH optimum, pH range) whereas the relations that link proteins in protein interaction networks may have properties that describe the nature of the binding (covalent binding, hydrogen bonds etc). Since it cannot be foreseen which properties will have to be represented, a mechanism is required that allows such properties to be represented without the need to change the underlying database schema. In order to be able to store this kind of information along with the concepts and relations, generalised data structures will be used (see below).
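
The notion of concepts and relations as typed nodes and edges can be illustrated with a minimal sketch. The class and method names (Concept, Relation, ConceptGraph, relate, neighbours) and the example entities are assumptions made for illustration only; they are not taken from the ONDEX code base.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A typed node: a real-world entity such as a protein or metabolite."""
    name: str
    concept_class: str  # e.g. "protein", "enzyme", "metabolite"


@dataclass(frozen=True)
class Relation:
    """A typed, directed edge between two concepts."""
    source: Concept
    relation_type: str  # e.g. "binds_to", "produced_by", "consumed_by"
    target: Concept


@dataclass
class ConceptGraph:
    """Hypothetical container holding concepts and the relations between them."""
    concepts: set = field(default_factory=set)
    relations: list = field(default_factory=list)

    def relate(self, source, relation_type, target):
        """Add a typed edge and register both end points as concepts."""
        self.concepts.update((source, target))
        self.relations.append(Relation(source, relation_type, target))

    def neighbours(self, concept, relation_type=None):
        """Targets reachable from `concept`, optionally filtered by edge type."""
        return [r.target for r in self.relations
                if r.source == concept
                and (relation_type is None or r.relation_type == relation_type)]


# Illustrative entities only: a protein interaction and a metabolic step
# live in the same graph and are told apart by their types.
graph = ConceptGraph()
protein_a = Concept("protein A", "protein")
protein_b = Concept("protein B", "protein")
glucose = Concept("glucose", "metabolite")
hexokinase = Concept("hexokinase", "enzyme")

graph.relate(protein_a, "binds_to", protein_b)
graph.relate(glucose, "consumed_by", hexokinase)

print(graph.neighbours(glucose, "consumed_by"))
```

Because the type information is carried by the data rather than by the structure, a protein interaction network and a metabolic network can share one representation.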

Therefore, the general goal is to import different data in such a way that it fits into a graph template. If a mechanism exists that ensures the correct semantics of the edges, i.e. the relation links between the entities, then different networks from different levels of biological hierarchy can be integrated into the same database schema. Furthermore, not only relations between single entities can be represented by this mechanism, but by using different semantics for the edges, the relations (e.g. hierarchies) between entire networks can be maintained. Biological entities can be involved in networks of different levels and correspondingly a node in the integrated graph can exhibit different kinds of links that connect to these different levels. Once the data is imported into this structure and the links are created, it will be easy to retrieve integrated information that has been extracted from different sources.

The next step will be to go beyond importing pre-defined relations as edges between nodes, to the creation of new edges by automated annotation and discovery methods. For example, there will be many cases where the same biological entity will be represented by different subgraphs because it will have come from different data sources where it has a different name, or has different properties etc. Methods to align equivalent entries from different data sources are under development.

These methods are being developed so that different data sources can be aligned in a fully automated way. By aligning (and not integrating) the different data sources, it is possible to trace which data originates from which source, and at the same time find further information in different graphs. This mechanism also allows for the alignment of graphs using relations other than equivalence, for example the alignment of homologs, where the mapping between the two concepts receives the relation type "is homolog of". In a similar way, concepts that have a broader meaning could be mapped to concepts that have a narrower meaning. In such a case an "is_a" relation would represent this kind of mapping in an appropriate way.

We are working on different methods and algorithms to support the graph based semantic integration of heterogeneous databases:

    1.) Import of mapping lists: In many cases, databases provide linkouts to equivalent entities in different databases. Such mapping lists are simply imported.
    2.) Methods based on graph structure (structalign): The main idea of the structalign algorithm is to align concepts that have the same name and are directly or indirectly linked to other concepts that also have an identical name.
    3.) Compare concept names (2syn): Many biological entities have more than one name (genes, species etc). By comparing the sets of all names to each other, equivalent concepts can be identified without mapping homonyms (computer mouse versus biological mouse).
    4.) Sequence analysis: By using algorithms such as those implemented in BLAST [Altschul et al., 1990] and Vmatch (http://www.vmatch.de/), it is possible to identify equivalent concepts when equivalent proteins have different names in different databases.
    5.) Transitive mapping (trans): This links concepts indirectly through other concepts, i.e. A equivalent B, and B equivalent C implies A equivalent C.
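
As an illustration of the transitive mapping idea in the last item, the following minimal sketch groups concepts into equivalence sets from pairwise equivalences. It uses a standard union-find structure, which is an assumption chosen for this example and not necessarily how the ONDEX "trans" method is implemented; the function name transitive_groups is hypothetical.

```python
def transitive_groups(equivalences):
    """Group items so that A~B and B~C place A, B and C in the same set."""
    parent = {}

    def find(item):
        # Follow parent links to the representative, shortening the path.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b in equivalences:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two equivalence sets

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


# A~B and B~C imply A~C; D~E forms a separate group.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
groups = transitive_groups(pairs)
print(sorted(sorted(g) for g in groups))
```

A real mapping step would also need to record which method produced each pairwise link, so that the provenance of an inferred equivalence remains traceable.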

The mapping methods are still under development. However, preliminary evaluations indicate a precision of >0.95 for the alignment of several taxonomies and ontologies.



Results and implementation

The system is being developed through a collaboration between Bielefeld University and Rothamsted Research. The initial development began in 2001 at Bielefeld with the development of the ONDEX application for ontology based biological text mining. This text mining framework uses a core set of ontologies which are aligned using several fully automated methods. This core is currently being extended to support alignment and integration of different kinds of heterogeneous data sources. The system is now under rapid development to underpin a number of applications that require the services of a data warehouse of collections of biological data. The ONDEX system is developed under the GPL license and can be downloaded from http://sourceforge.net/projects/ondex/. It runs under LINUX. Hardware requirements depend on the number of data sources to be imported and on the mapping algorithms used. When more than a few data sources are to be integrated, we recommend a multiprocessor computer and RAID storage.

Behind the ONDEX system, the underlying data structure for the data warehouse is of central importance. The schema (Fig. 2) of the ONDEX core reflects the representation of biological data as a graph consisting of nodes (Concept) and edges (Relation). It is designed to provide fast data retrieval of connected graph nodes and their meta-data.



Figure 2: Data structure of ONDEX (Entity Relationship diagram).


The schema can store imported data sets extracted or otherwise derived from existing databases and from experimental results such as those coming from the analysis of a series of microarray experiments. Each imported data set is given a unique identifier within the system and is described in the Dataset table. The data is split into concepts which represent entities such as substrate(s), enzyme, activator(s), inhibitor(s), co-enzyme(s) or product(s) of a specific biosynthetic reaction and the reaction itself. The name and synonyms of a concept are held in the Concept_Name table and the type of concept is held in the Concept_Class table (whether it's a transcription factor, enzyme, metabolite etc). In addition, a set of accession numbers that uniquely identify a given concept can be stored in Concept_Acc. The Concept table holds information that represents real world entities (genes, proteins, diseases). The Relation table links concepts and assigns these links a meaning through Relation_Types, i.e. defines how concepts are related to each other. An example would be [Chen et al., 2004]:

transcription factor StBEL5 regulates gene GA20ox1

"transcription factor" and "gene" are Concept_Classes, StBEL5 and GA20ox1 are Concepts which are linked by a Relation which has the Relation_Type regulates.

For the relation types we are using the syntax and algebraic relational properties as defined in [Smith et al., 2004] to ensure that the ONDEX system will benefit when other data sources adopt the same set of relation types. However, we have found that the relation types defined in this paper are too limited to represent all types of relations that commonly occur in biological databases without significant loss of information. Therefore we are extending this basic set whenever a new relation type is required for a new data source.
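
The StBEL5 example above can be sketched as a small set of tables. This is only an approximation of the schema in Fig. 2: the column names are assumptions, the Dataset, Concept_Name and Concept_Acc tables are omitted, and SQLite is used here in place of PostgreSQL purely to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified tables; all column names are assumed for illustration.
conn.executescript("""
CREATE TABLE concept_class (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE relation_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE concept (
    id INTEGER PRIMARY KEY,
    name TEXT,
    class_id INTEGER REFERENCES concept_class(id)
);
CREATE TABLE relation (
    from_concept INTEGER REFERENCES concept(id),
    to_concept INTEGER REFERENCES concept(id),
    type_id INTEGER REFERENCES relation_type(id)
);
""")

conn.executemany("INSERT INTO concept_class VALUES (?, ?)",
                 [(1, "transcription factor"), (2, "gene")])
conn.execute("INSERT INTO relation_type VALUES (1, 'regulates')")
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)",
                 [(1, "StBEL5", 1), (2, "GA20ox1", 2)])
conn.execute("INSERT INTO relation VALUES (1, 2, 1)")

# Reassemble the statement from the typed nodes and the typed edge.
rows = conn.execute("""
SELECT sc.name, s.name, rt.name, tc.name, t.name
FROM relation r
JOIN concept s ON s.id = r.from_concept
JOIN concept t ON t.id = r.to_concept
JOIN concept_class sc ON sc.id = s.class_id
JOIN concept_class tc ON tc.id = t.class_id
JOIN relation_type rt ON rt.id = r.type_id
""").fetchall()

for row in rows:
    print(" ".join(row))
```

Adding a new relation type in this layout only requires a new row in relation_type, which is why the set of relation types can be extended for a new data source without a schema change.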

The data structure of ONDEX will also be the basis for graph based visualisation and analysis of the underlying data sources: concepts can be visualised as nodes in graphs, and the relations can be represented as the edges that connect the nodes. The information stored in Concept_Class and Relation_Type can be utilised to further refine the way that such graphs are visualised and analysed (visualisation: different colours, icons etc.; analysis: graph traversal, clustering of nodes etc.). Mappings may be established to link concepts from different data sets. This may optionally be performed by the mapping algorithms embodied in the ONDEX software. These mapping algorithms generate links between concepts and record, as Evidence, which of the different mapping algorithms has generated a given link. This enables ONDEX and the applications that utilise this database integration system to connect disparate data sets.

The ONDEX system is implemented with PostgreSQL, shell scripts and a set of Java™ programs. All these are managed through Unix makefiles. The ONDEX import mechanism does not rely on direct access to external data sources. All parsers access the flat-file versions of the databases in the form in which they are distributed by the database provider. Whereas some parsers were developed around a more or less standardised exchange format (OBO format, old DAG-Edit format), most parsers are specific to a given data source. This is necessary because, on the one hand, each data set is distributed in a different format and, on the other hand, the parsers themselves have to resolve semantic heterogeneities in the underlying data sources. This means that the parsers select only the "relevant" information of a data source, e.g. they convert free text species names to NCBI Taxonomy IDs [Wheeler et al., 2003], and when a data source contains links to external accession numbers, the parsers have to resolve to which name space the given accession number belongs, i.e. what type of accession number it is (NCBI LocusLink, RefSeq etc.). The parsers are normally written in Java and convert the flat-file distribution of a third-party data source into tab-delimited flat files, in which the data is pre-organised so that it can be imported efficiently into the previously described database schema. No data import relies on a previously imported data set, so that any specific data set may be loaded as a free-standing system. Thus a data warehouse consisting of any combination of data sources may be built from scratch using scripts to define the data tables, run the database parsers and load the new ONDEX database. The mapping algorithms are then executed as a once-off task to provide a comprehensive cross-linking of the installed data.
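
The parser step described above can be sketched roughly as follows. The ONDEX parsers are written in Java; this illustration uses Python, and the record layout, the column order of the output and the function name to_tab_delimited are all assumptions. The two NCBI Taxonomy IDs are given as a tiny hand-written lookup, whereas a real parser would consult the full taxonomy.

```python
import csv
import io

# Hand-written lookup for illustration only.
TAXONOMY_IDS = {
    "arabidopsis thaliana": "3702",
    "triticum aestivum": "4565",
}


def to_tab_delimited(records, out):
    """Write accession, name and taxonomy ID as tab-delimited rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for rec in records:
        # Resolve the free-text species name to a taxonomy ID.
        taxid = TAXONOMY_IDS.get(rec["species"].strip().lower())
        if taxid is None:
            continue  # unresolved species names are skipped in this sketch
        writer.writerow([rec["accession"], rec["name"], taxid])


# Hypothetical source records with inconsistent species spellings.
records = [
    {"accession": "X1", "name": "gene one", "species": "Arabidopsis thaliana"},
    {"accession": "X2", "name": "gene two", "species": " triticum aestivum "},
    {"accession": "X3", "name": "gene three", "species": "unknown plant"},
]

buffer = io.StringIO()
to_tab_delimited(records, buffer)
print(buffer.getvalue(), end="")
```

Skipping unresolved records is a simplification made here; whether such entries should be dropped, logged or imported without a species link is a design decision for each parser.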

A problem with the graph based data structure for the storage of biological networks described so far is that different data sources assign different kinds of attributes to concepts and their relations. If we consider for example the biological concept 'enzyme', usually attributes like 'km' and 'vmax' are used to describe entities of this concept in more detail. With classical data storage a relation would have to be built with 'km', 'vmax' and other attributes in order to characterise a specific enzyme. Since a concept in the ONDEX system is a node in an arbitrary biological network, a classical table structure is not well suited for the storage of attributes describing those concepts or relations between them. The insertion of new attributes into already integrated data sources as well as the integration of data sources with entirely new semantic concepts into the system would inevitably lead to modifications of the underlying database schema. In order to avoid these problems and foster fully automatic handling of data storage from arbitrary sources, the database schema presented in Fig. 2 is extended with generalised structures for the storage of arbitrary kinds of data assigned to concepts and relations [Philippi, 2003].

The main idea of generalised data storage, or GDS for short, is to split classical relational tuples into attribute/value pairs, which are then stored as tuples in their own right within a generalised structure. Fig. 3 illustrates this principle with parts of two classical tuples stored in a generalised data structure.



Figure 3: Classical relational data storage vs generalised data storage in ONDEX exemplified for enzyme data. In a conventional relational representation, for each enzyme a set of attributes exists for defining properties of enzymes. For the GDS representation in ONDEX (see also GDS_concept in Fig. 4), a table consisting of basically three columns is sufficient to store any kind of information that characterises a given concept.


The generalised relation of this scenario, also illustrated in Fig. 4, consists of only three attributes, no matter how many attributes a classical relation would need. The main difference to classical data storage is that the attributes of the generalised structure are not at all linked to the semantics of the application domain. To be more specific, the generalised structure stores a triple consisting of an identifier (property_of_concept) as well as the name (property_name) and the value of the attribute to be represented (see Fig. 3). This way, each classical attribute value is represented by a tuple in the generalised structure. The 'id' attribute (Property_of_concept / Property_of_relation) is used to indicate that sets of generalised tuples actually belong to the same concept or relation, i.e. in the example they describe properties of a specific enzyme. In contrast to the traditional use of relational databases, there is no need to store NULL values with generalised data storage. Attributes without a value, like 'pH Range' in the second tuple of the example in Fig. 3, are simply not stored at all.
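
The splitting of classical tuples into attribute/value triples can be sketched in a few lines. The enzyme values below are invented for illustration and are not those shown in Fig. 3; the function names to_gds and from_gds are likewise hypothetical.

```python
def to_gds(concept_id, classical_tuple):
    """Split one classical tuple into (id, property_name, value) triples.

    Attributes without a value (None) are not stored at all.
    """
    return [(concept_id, name, value)
            for name, value in classical_tuple.items()
            if value is not None]


def from_gds(rows, concept_id):
    """Collect all stored properties that belong to one concept."""
    return {name: value for cid, name, value in rows if cid == concept_id}


# Two classical tuples with invented values; the second has no pH range.
enzyme_1 = {"Km": 0.5, "Vmax": 12.0, "pH optimum": 7.2, "pH range": "6-8"}
enzyme_2 = {"Km": 1.3, "Vmax": 4.0, "pH optimum": 6.5, "pH range": None}

gds_concept = to_gds(1, enzyme_1) + to_gds(2, enzyme_2)

for triple in gds_concept:
    print(triple)
print(from_gds(gds_concept, 2))
```

A new property, for example an inhibition constant, would simply appear as additional triples, whereas the classical layout would require a new column.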

Currently, the ONDEX system uses GDS to represent all kinds of data types (real, integer, string etc.) as strings in a single generalised table (see Fig. 4). However, the original data type is stored along with the data, which allows the data to be converted back to its proper type when required. Thus, it is not necessary to duplicate the GDS table for each data type, but this comes at the cost of data type conversions.
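
Storing every value as a string together with its original type might look like the following minimal sketch. The set of supported types and the names CONVERTERS, encode and decode are assumptions made for this example, not the ONDEX implementation.

```python
# Map a stored type name back to the function that restores the value.
CONVERTERS = {"int": int, "float": float, "str": str}


def encode(value):
    """Return the value as a string together with its original type name."""
    type_name = type(value).__name__
    if type_name not in CONVERTERS:
        raise TypeError(f"unsupported type for this sketch: {type_name}")
    return str(value), type_name


def decode(text, type_name):
    """Convert a stored string back to its original type."""
    return CONVERTERS[type_name](text)


stored = [encode(7), encode(0.5), encode("hexokinase")]
print(stored)

restored = [decode(text, type_name) for text, type_name in stored]
print(restored)
```

The conversion on every read is the cost mentioned above; the benefit is that a single table serves all data types.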

Generally, the main benefit of the described generic structure for data storage is that there is no need to map attributes characterizing semantic concepts to application specific relational structures. In consequence, arbitrary kinds of biological networks can be represented with this extension in ONDEX without any loss of information and without the need to alter the underlying database schema in case of structural changes in source databases or if new data sources are to be integrated into the system.



Figure 4: Generalised Data Structure in ONDEX.



Data sources

At present, more than 500 molecular biological databases exist [Discala et al., 2000; Galperin, 2004]. The various data sources are maintained by many different institutions and companies and vary widely in their content, formats and access methods. They contain data about metabolic pathways, protein structures, DNA sequences, organisms, diseases, etc. Many biological questions require that the right combination of data from several sources is queried, searched and integrated. Considering this high number of existing databases, even the identification of relevant data sources is not trivial.

Table 1 lists a set of databases that is of special interest in the context of the research at Rothamsted Research, which focuses on crop plants and related pathogens and pests. A subset of these data sources is presently being integrated to solve current problems of biologists at Rothamsted Research and serves to illustrate the power of the graph oriented network as a tool for extending biological knowledge through a combination of comparison, similarity, inference and modelling. At its core, the system will integrate genotype-phenotype databases, pathway data etc. This combination of different data sources makes it possible to bridge genetics, metabolomics, transcriptomics and signal transduction to treatments and traits.

Whereas the described combination of data sources meets the requirements at Rothamsted Research, in a similar way different data sources can be selected and combined to meet different requirements resulting from different application scenarios.

Table 1: Database sources that are currently being integrated to meet the requirements at Rothamsted Research.
| Database | URL | Parser available | Description |
|---|---|---|---|
| AGRIS | http://arabidopsis.med.ohio-state.edu/ | No | Arabidopsis promoter sequences, transcription factors and their target genes. |
| AraCyc | http://www.arabidopsis.org/tools/aracyc/ | Yes | Biochemical pathways of Arabidopsis thaliana |
| BRENDA | http://www.brenda.uni-koeln.de/ | Yes | Comprehensive collection of enzyme functional data |
| Drastic | http://wwwexternal.scri.sari.ac.uk/TiPP/PPS/DRASTIC/search.asp | Yes | Plant genes regulated in response to biotic and abiotic stress |
| IGF-wheat | http://www.cerealsdb.uk.net/database.htm | No | Wheat ESTs |
| KEGG PATHWAY | http://www.genome.jp/kegg/pathway.html | No | Molecular interaction networks, including metabolic pathways, regulatory pathways, and molecular complexes in many species. |
| PDB | http://www.rcsb.org/pdb/ | No | 3-D biological macromolecular (especially protein) structure. |
| PlantsP | http://plantsp.sdsc.edu/ | No | Functional genomics of plant phosphorylation |
| PlantsT | http://plantst.sdsc.edu/ | No | Functional genomics of plant transporters |
| SGD | http://www.yeastgenome.org/ | No | Molecular biology and genetics of the yeast Saccharomyces cerevisiae |
| TRANSCompel | http://www.gene-regulation.com/ | No | Composite regulatory elements affecting gene transcription in eukaryotes. |
| TRANSFAC | http://www.gene-regulation.com/ | Yes | Eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles. |
| TRANSPATH | http://www.gene-regulation.com/ | Yes | Signal transduction pathways, in particular those that aim at transcription regulatory components. |
| unnamed | www.plantphysiol.org/cgi/doi/10.1104/pp.014134 | No | Phenotypes of Arabidopsis mutants |
| Most OBO ontologies | http://obo.sourceforge.net/ | Yes | Generic parser for all ontologies that use the old or the new DAG-edit format |
| EC Nomenclature | http://www.chem.qmul.ac.uk/iubmb/enzyme/ | Yes | Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzyme-Catalysed Reactions |
| WordNet | http://www.cogsci.princeton.edu/~wn/ | Yes | WordNet® is a lexical reference system in which English nouns, verbs, adjectives and adverbs are organised into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. |
| MeSH | http://www.nlm.nih.gov/mesh/meshhome.html | Yes | MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. |
| NCBI TAXONOMY | http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html | | The NCBI taxonomy database contains the names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence. |


Implementation status

The current implementation meets all seven requirements that are introduced at the beginning of the "Principles and Methods" section. Currently, any combination of databases and ontologies can be imported and integrated, provided that parsers for these databases exist (see Table 1). This also includes parsers for "linguistic ontologies", thesauri and taxonomies such as WordNet, MeSH and the NCBI TAXONOMY database. These resources are useful in resolving situations where different databases use different names for equivalent entities. Thus, they play an important role in the semantic integration of databases. The next step in the development of ONDEX is towards exploitation of the integrated data sources through graph based analysis and visualisation methods. This also requires the development and maintenance of new and existing import parsers. Further ongoing and planned activities are directed towards the evaluation, improvement and quality assurance of the generated mappings between the heterogeneous data sources.



Discussion

Integrating data using one of the most general data structures, a graph, opens up a broad range of possible applications. Networks (or more formally, graphs) are used to describe biological relations on different levels, such as metabolic networks and gene or protein interactions. On the one hand, integration based on a graph data structure will help to improve the quality of networks at each level of organisation as well as to combine networks of different levels by facilitating the discovery of overlaps and inconsistencies. On the other hand, a standard data structure makes it easier to analyse and visualise different kinds of biological networks using the same algorithms and software.

What makes ONDEX distinct from major database integration products such as SRS [Zdobnov et al., 2002] or DiscoveryLink [Haas et al., 2001] is the fact that it also deals with the semantic heterogeneity of data sources. When compared to advanced systems for semantic integration of databases such as those described in [Köhler et al., 2000; Köhler et al., 2003; Ludäscher et al., 2001; Philippi and Kohler, 2004; Stevens et al., 2000], ONDEX differs in the following aspects:

For improving the quality of networks, the comparison between experimental data and already annotated data in databases is important. Here it is possible to retrieve for each component of the network, i.e. each node of the graph, the available information from different sources. By linking all information to the same concept name and its synonyms, such information can be directly compared.

Another application example on which we are currently working is the visualisation and analysis of differentially expressed genes from microarray results in the context of biological networks such as the whole metabolism or all known signal transduction pathways of a given species. This will support molecular biologists and geneticists in interpreting and analysing causal relationships in the large number of genes that statistical analysis of microarray experiments often produces.

Network integration on different levels could be achieved in the same way. Concepts will occur in different contexts, indicated by the different kinds of relations they exhibit. For example, gene concepts may have regulatory links to transcription factors, but are also involved in signalling pathways (as the final target) or protein and metabolic networks (as substrates for proteins, enzymes). These different kinds of relationships are reflected in the database schema. Thus, a far more integrated view could be achieved, starting from extracellular primary messenger signals and combining the intracellular networks.

In the context of plant science, such information can be used to formulate hypotheses on which genes are key targets for manipulation or reverse genetics to introduce desirable traits into crops. For example a comparison of the transcriptome of genetic lines differing in a quality trait may reveal many differentially expressed genes, and only by linking with additional information is it possible to reduce the set of candidates for the genetic cause to a manageable level.

In the future, more challenging applications of our approach can be envisaged in which the integrated data network is used to annotate and validate experimental data. It is also possible to develop methods that use automated inference (such as that employed in machine learning) to predict new methods and relations over the data. These networks could then in turn be aligned with experimental results, and inconsistencies as well as common subgraphs identified.



Acknowledgements

JK, CR, PV and RM gratefully acknowledge support from the Biotechnology and Biological Sciences Research Council of the United Kingdom including that from Grant BBS/B/13640.




References