In Silico Biology 2, 0019 (2002); ©2002, Bioinformation Systems e.V.  
GCB '01

A system architecture for genomic data analysis

Änne Glass and Lothar Gierl




University of Rostock, Faculty of Medicine, Institute for Medical Informatics and Biometry
Rembrandt-Str. 16 / 17
D – 18055 Rostock
Phone: ++49(0)381 494 7310
Fax: ++49(0)381 494 7203
Email: aenne.glass@medizin.uni-rostock.de





Edited by E. Wingender; received December 13, 2001; revised and accepted January 31, 2002; published March 12, 2002


Motivation

Most diseases are caused by a set of gene defects that occur in a complex association. The association scheme of expressed genes can be modelled by genetic networks. Genetic networks are an efficient means of understanding the dynamics of pathogenic processes because they model the molecular reality of cell conditions. In this sense a genetic network consists of, first, a set of genes of specified cells, tissues or species and, second, causal relations between these genes that determine the functional condition of the biological system, i.e. under disease. A relation between two genes exists if both are directly or indirectly associated with the disease [Oliver, 2000]. Our goal is to characterize diseases (especially autoimmune diseases such as chronic pancreatitis (CP), multiple sclerosis (MS) and rheumatoid arthritis (RA)) by genetic networks generated by a computer system. We want to introduce this practice as a bioinformatic approach for finding targets.

Keywords: genetic networks, model, functional genomics, proteomics, genomic data, expression data, chip data, data mining, analysis, bioinformatics, software system, complex association, causal relation, interaction, targets, artificial intelligence, AI, ART, parser engine



Architecture of the software system GENESYS

We are working on the computer system GENESYS, which will allow the import and analysis of genomic data in order to generate and present genetic networks. This paper mainly addresses the design of GENESYS (Fig. 1).

Figure 1: Design of software system GENESYS

The main components of our system are (1) an import tool for gene expression data, (2) an expression data analysis tool using artificial intelligence (AI) methods, (3) a parser engine for automatically mining information about causal gene relations from internet databases and (4) a visualization tool for presenting genetic networks [Glass, 2000].

(1) Gene expression data from microarray experiments (cDNA chip technology) are provided by several university research groups as well as by the research project "BMBF-Leitprojektverbund Proteom-Analyse des Menschen". The data are validated and standardized for import into GENESYS by a filter tool (written in Excel Visual Basic for Applications) that works directly on the microarray data exported by the labs.

(2) An artificial neural network is used to classify diseases into specific diagnostic categories based on their gene expression signatures. We chose a neural network based on adaptive resonance theory (ART). An ART net works like a self-organizing neural pattern recognition machine. The five major properties of the ART system are plasticity, stability, sensitivity to novelty, attentional mechanisms and complexity. The ART1 network architecture self-organizes and self-stabilizes its recognition codes and categorizes arbitrarily many and arbitrarily complex binary input patterns [Carpenter and Grossberg, 1987]. We obtain the input patterns for ART1 by binary coding of the raw gene expression data from different samples of the same disease. The result of the ART1 analysis is a specific pattern of co-expressed genes, which is taken to be typical of the disease under consideration.
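The binary coding and ART1 categorization described above can be illustrated with a minimal sketch. It is not the GENESYS implementation: the threshold-based coding, the parameter values and the names `binarize`, `ART1` and `train` are assumptions made for illustration, and the network is a simplified fast-learning ART1 variant (choice function, vigilance test, prototype update by logical AND).

```python
from typing import List, Sequence


def binarize(expression: Sequence[float], threshold: float) -> List[int]:
    """Code a gene as 1 if its expression exceeds the threshold, else 0.

    Assumption: the paper does not state how raw data are binary coded;
    a single fixed threshold is used here only for illustration.
    """
    return [1 if value > threshold else 0 for value in expression]


class ART1:
    """Simplified fast-learning ART1 for binary patterns (illustrative only)."""

    def __init__(self, vigilance: float = 0.7, beta: float = 1.0) -> None:
        self.vigilance = vigilance  # minimum match ratio to accept a category
        self.beta = beta            # small constant in the choice function
        self.prototypes: List[List[int]] = []

    def train(self, pattern: Sequence[int]) -> int:
        """Present one binary pattern and return the index of its category."""
        norm = sum(pattern)
        if norm == 0:
            raise ValueError("pattern must contain at least one active bit")

        def overlap(j: int) -> int:
            return sum(p & w for p, w in zip(pattern, self.prototypes[j]))

        def choice(j: int) -> float:
            return overlap(j) / (self.beta + sum(self.prototypes[j]))

        # Try existing categories in order of decreasing choice value.
        for j in sorted(range(len(self.prototypes)), key=choice, reverse=True):
            if overlap(j) / norm >= self.vigilance:
                # Resonance: shrink the prototype to the shared active genes.
                self.prototypes[j] = [
                    p & w for p, w in zip(pattern, self.prototypes[j])
                ]
                return j
        # No category passed the vigilance test: commit a new one.
        self.prototypes.append(list(pattern))
        return len(self.prototypes) - 1
```

In this sketch each prototype that remains after training plays the role of the pattern of co-expressed genes: it keeps only those genes that were active in every sample assigned to the category. A higher vigilance value yields more, and more specific, categories.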

In addition to ART1 we apply the AI method of case-based reasoning. Because case-based reasoning has been applied successfully in several domains such as diagnostics, prediction, control and planning [Heindl et al., 1997; Schmidt et al., 1997], we want to use this technique for the incremental modelling of genetic networks. Each genetic network is considered as a case within the human genome. Similar cases represent similar genetic networks. Each identified case stored in the case base facilitates the retrieval of further cases, i.e. genetic networks. The individual cases have to be structured so that similar cases can be retrieved very quickly and new cases can be integrated into the case base. Inconsistency and incompleteness are characteristic features of genetic networks, because knowledge about the human proteome is growing steadily and incrementally. As a result, the revise phase is particularly important within the retrieve-reuse-revise-retain loop of case-based reasoning systems, so that the case base is checked and revised continuously. For this task a set of practicable techniques is available from our previous work [Gierl et al., 1998] and from international research (e.g. the contrast model by Tversky) [Aamodt and Plaza, 1994; Tversky, 1977]. In this way we will obtain a similarity tree [Steffen et al., 2000] whose nodes are prototypes of the genetic networks of different diseases.
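As an illustration of how the retrieve step could rank stored cases with Tversky's contrast model, the following sketch treats each genetic network as a set of directed gene relations and scores a pair of networks as S(A, B) = theta * |A and B| - alpha * |A without B| - beta * |B without A|. The set-of-edges representation, the weights and the names `tversky_contrast` and `retrieve` are assumptions for illustration; the paper leaves the similarity measure for genetic networks as an open question.

```python
from typing import Dict, Set, Tuple

# Assumption: a genetic network (case) is a set of (source, target) relations.
Edge = Tuple[str, str]


def tversky_contrast(
    a: Set[Edge],
    b: Set[Edge],
    theta: float = 1.0,
    alpha: float = 0.5,
    beta: float = 0.5,
) -> float:
    """Tversky contrast model with set cardinality as the salience function."""
    common = len(a & b)      # relations shared by both networks
    only_a = len(a - b)      # relations distinctive of the first network
    only_b = len(b - a)      # relations distinctive of the second network
    return theta * common - alpha * only_a - beta * only_b


def retrieve(query: Set[Edge], case_base: Dict[str, Set[Edge]]) -> str:
    """Return the name of the stored case most similar to the query."""
    if not case_base:
        raise ValueError("case base is empty")
    return max(case_base, key=lambda name: tversky_contrast(query, case_base[name]))
```

With unequal `alpha` and `beta` the measure becomes asymmetric, which may be useful when an incomplete query network is compared with a more complete stored case.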

(3) A parser engine mines information about causal gene relations from internet databases (e.g. GeNet) for a set of genes under consideration. It is coded in Java and consists of three components that work in sequence. First, a database adapter connects to the internet database, queries the data and stores all query results locally. Next, a parser tool analyses the locally stored information and extracts well-defined data. Finally, a filter tool searches for redundant and inconsistent data and prepares the resulting gene relation information for import into GENESYS and for visualization.
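The three-stage sequence (adapter, parser, filter) can be sketched as follows. The original engine is written in Java; this Python sketch is not that code. The record format "source target effect", the `Relation` type and the function names are hypothetical, and the adapter is represented only by a caller-supplied `fetch` function, so no real database interface is implied.

```python
from typing import Callable, Dict, List, NamedTuple, Set, Tuple


class Relation(NamedTuple):
    source: str   # regulating gene
    target: str   # regulated gene
    effect: str   # hypothetical label, e.g. "activates" or "represses"


def parse(raw_text: str) -> List[Relation]:
    """Parser stage: keep only lines that form a well-defined record."""
    relations = []
    for line in raw_text.splitlines():
        parts = line.split()
        if len(parts) == 3:          # assumed format: source target effect
            relations.append(Relation(*parts))
    return relations


def filter_relations(
    relations: List[Relation],
) -> Tuple[List[Relation], Set[Tuple[str, str]]]:
    """Filter stage: drop duplicates and set aside contradictory gene pairs."""
    effects: Dict[Tuple[str, str], Set[str]] = {}
    for r in relations:
        effects.setdefault((r.source, r.target), set()).add(r.effect)
    # A pair is inconsistent if it was reported with more than one effect.
    conflicts = {pair for pair, seen in effects.items() if len(seen) > 1}

    unique: List[Relation] = []
    emitted: Set[Relation] = set()
    for r in relations:
        if (r.source, r.target) in conflicts or r in emitted:
            continue
        emitted.add(r)
        unique.append(r)
    return unique, conflicts


def run_pipeline(
    fetch: Callable[[], str],
) -> Tuple[List[Relation], Set[Tuple[str, str]]]:
    """Run adapter (fetch), parser and filter in sequence."""
    raw_text = fetch()               # adapter stage: returns locally stored text
    return filter_relations(parse(raw_text))
```

Returning the conflicting pairs separately, rather than silently discarding them, leaves the decision about inconsistent relations to a later revision step.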

(4) The resulting genetic networks, each consisting of a set of genes and the causal relations between them, are presented as 3D structures by a visualization tool that is developed in Inprise Delphi and integrates OpenGL technology. Genes are presented as globes labelled, at the user's choice, with expression labels or with identifiers from the relevant internet databases (the members of the tripartite collaboration GenBank, EMBL and DDBJ, or GeNet). Related genes are linked by arrows. In future we will develop interactive components that let users choose a set of related genes and zoom into the genetic network.

First results from using several components of GENESYS separately are available.



Results

A similarity tree of experimental expression data is available (Fig. 2). Data obtained by scientists from DKFZ Heidelberg, Stanford and the Universities of Bochum, Rostock and Greifswald were validated for import into GENESYS. The first genetic networks serving as nodes of the similarity tree (Drosophila, sea urchin) were generated with single GENESYS components; further ones, such as genetic networks of CP, of intestinal inflammation or of NF-κB interactions as an immune response in MS and RA, will follow soon. The available networks are nodes of the similarity tree that have only one leaf so far. In other words, each node is in the same state as its leaf. These networks are to be considered a starting point for GENESYS. They demonstrate a prototype version of a software system architecture for genomic data analysis and show that the system components function properly.

Figure 2: Similarity tree of chip expression data.

We obtained the networks of Drosophila and sea urchin from information in the internet database GeNet. The gene relation information for Drosophila and sea urchin is mined from GeNet automatically by our parser engine: a special database adapter connects to GeNet and imports the relevant HTML pages with gene relation information for local storage, and a locally installed parser extracts information about the genes and their regulatory connections. After the HTML pages have been parsed, all data (genes and the relations between them) can be presented as a genetic network by the visualization tool. The Drosophila network obtained in this way was compared with, and agrees with, the regulatory connection information for Drosophila genes that is displayed online in GeNet. For sea urchin we could not obtain comparable maps from GeNet so far. The network of NF-κB interactions (Fig. 3) was compiled from scientific publications [Miterski et al., 2002; Deng et al., 2000; Yang et al., 2001].

Figure 3: Genetic network of NF-κB interactions generated by GENESYS.



Future work

Our immediate future work will focus on two principal tasks: first, on applying and linking AI methods such as the neural ART net and case-based learning methods and implementing them for categorizing diseases, and second, on successively extending our case base in order to adapt existing networks and generate new ones. We have to train the neural network and the case-based learning methods with the expression data shown in Fig. 2 to realize our idea of generating nodes from more than one leaf. Results from preliminary investigations conducted with comparable data and a neural ART net make us optimistic that AI methods are suitable for analysing array data in order to discover gene patterns typical of a disease and, as a consequence, potential target genes. In this context we will have to deal with questions such as how to measure the similarity of genetic networks. As a complex working software architecture, GENESYS will facilitate decisions on diagnosis and therapy on the basis of genomic knowledge and, moreover, the discovery of targets for drugs. Conventional clustering methods that exclude biological background knowledge do not suffice for that purpose.



References