In Silico Biology 2, 0034 (2002); ©2002, Bioinformation Systems e.V.
GCB '01
1Department of Medical Informatics, Biometrics and Epidemiology (IBE)
University of Munich
Marchioninistr. 15
D-81377 Munich, Germany
Tel: +49-89-7095-4497
Fax: +49-89-7095-7491
Email: dug@ibe.med.uni-muenchen.de
http://martin-dugas.de
2Department of Internal Medicine III
University Hospital of Munich, Germany
Marchioninistr. 15
D-81377 Munich, Germany
*To whom correspondence should be addressed
Edited by E. Wingender; received November 30, 2001; accepted December 21, 2001; published March 15, 2002
To assess the relevance of molecular markers, clinical and genetic information must be combined. For reliable assessment of parameters relevant to diagnostics and therapy, large patient cohorts must be characterized with respect to both phenotype and genotype. Matching genetic data such as gene expression profiles, molecular genetics and cytogenetics with clinical data such as follow-up, morphological findings and diagnoses requires the integration of complex databases.
In the context of a nationwide leukemia research network in Germany we designed an integrated database covering both genetic and clinical data of patients. The system contains follow-up data and relevant laboratory modalities, i. e. cytomorphology, cytogenetics, molecular genetics, FISH, immunophenotyping and gene expression profiling.
So far 13541 cases from 7746 patients treated by 1225 physicians are documented. The data structure consists of up to 888 variables per case. In our experience, integrating clinical and genetic information requires significant effort, including attention to data protection, but it is feasible and improves data quality, leading to faster and more reliable research results for the benefit of patients.
Key words: data integration, patient data, microarray, gene expression, cytogenetics, molecular genetics, leukemia
"Obtaining the sequence of the human genome is the end of the beginning" [1] stated Collins and McKusick in their recent publication about "Implications of the Human Genome Project for Medical Science".
Genetic methods have the potential to change medicine fundamentally. However, it is important to distinguish between surrogate markers and prognostically relevant parameters, which are associated with important medical outcomes such as a patient's quality of life and survival. A wealth of both clinical and genetic information must be taken into account to select molecular markers with clinical relevance.
From a computer science point of view, the advantages of integrated databases covering clinical and genetic information in medical research are quite obvious. However, no comprehensive software products are on the market, and such databases hardly exist at all. This is caused by methodological problems when building data models for highly complex and dynamic biomedical systems. These issues concerning the formalization of medical knowledge are well known in medical informatics [2].
In the context of a nationwide German leukemia research project we designed a database combining clinical and genetic aspects of this disease, which was integrated into the routine workflow of a large clinical laboratory serving as a national reference center.
Leukemia is a model disease for cancer, and its pathogenesis is currently being investigated at the molecular level. Special attention is therefore given to genetic parameters such as the detection of gene rearrangements or mutations, the analysis of chromosomes, and gene expression profiling.
Laboratory methods
Cytomorphology
The microscopic analysis of blood and bone marrow cells (Fig. 1) was based on May-Grünwald-Giemsa stain, myeloperoxidase reaction, and non-specific esterase reaction using alpha-naphthyl acetate. All stainings from bone marrow and blood were performed routinely according to standard procedures [3].
Immunophenotyping
Immunophenotyping [4] by multiparameter flow cytometry is an essential part of the diagnostic work-up of hematologic diseases. Using monoclonal antibodies against membrane-bound and cytoplasmic antigens, this method identifies and sub-classifies acute lymphoblastic leukemias, chronic lymphatic leukemia, and acute myeloid leukemias based on their specific antigen expression profiles. Because aberrant antigen expression profiles in leukemias differ from those of normal bone marrow, minimal residual disease can be recognized and quantified.
Cytogenetics
Chromosome analyses were performed on bone marrow or peripheral blood samples according to standard protocols [5]. The chromosomes were interpreted according to the International System for Human Cytogenetic Nomenclature [6].
Fluorescence in situ hybridization (FISH) was performed on interphase nuclei on bone marrow smears or on slides prepared for cytogenetic analysis.
Molecular Genetics
Mononuclear cells were isolated by Ficoll gradient separation. 1 × 10^7 cells were lysed and total RNA was extracted with an RNeasy kit. About 1 µg of RNA was reverse transcribed. PCR for the specific leukemia fusion transcripts was performed. Amplification products were analyzed on agarose gels. For quantification of the fusion gene in the individual samples, real-time PCR using the LightCycler® technology was performed.
Gene expression profiling
For gene expression profiling the GeneChip® System (Affymetrix, Santa Clara, CA, USA) was used. Lysates of the leukemia samples were homogenized and total RNA was extracted. 10 µg of total RNA isolated from 1 × 10^7 cells was used as starting material in the subsequent cDNA synthesis. For detection of the hybridized target nucleic acid, biotin-labeled ribonucleotides were incorporated during the in vitro reaction. Before hybridization onto U95Av2 microarrays, Test3 microarrays were used to monitor the integrity of the cRNA. The Affymetrix software (Microarray Suite, Version 4.0.1) extracted fluorescence intensities from each element on the microarrays as detected by confocal laser scanning.
An overview of molecular diagnostics is provided in Fig. 2. The general goal of these methods is a precise characterization of leukemic cells at different levels (cell, chromosome, DNA, mRNA) in terms of phenotype and genotype, in order to support diagnostics and therapy of leukemia.
Computer system
The system is implemented with server-side PERL programs (http://www.perl.com) running on a Linux computer (http://www.suse.de) with an Apache web server (http://www.apache.org) and a PostgreSQL database (http://www.postgresql.org/). On the client side a standard web browser is used (e. g. Netscape Communicator™ or Internet Explorer™). To ensure patient data security the system is protected by a firewall.
mediwww
For rapid prototyping of ergonomic, highly adaptive web forms a dedicated software tool (mediwww) has been applied as described earlier [7, 8, 9]. It allows a data structure (e. g. a database table) to be defined interactively. A preview of the web forms can be generated and presented to the clinical user. Once the data structure is defined, all PERL programs and database tables are generated from templates, i. e. no line of code is programmed manually. The function of the tool is similar to the UltraDev™ extension of Macromedia Dreamweaver™ (http://www.macromedia.com/), but it is adapted to the needs of medical databases (e. g. specific templates).
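The template-driven generation described above can be illustrated with a minimal sketch. It is written in Python rather than the PERL used by the authors, and the field definitions, table name and template texts are hypothetical; the actual mediwww templates are not shown in this paper.

```python
import html
from string import Template

# Hypothetical field definitions: (column name, SQL type, form label).
FIELDS = [
    ("lab_number", "text", "Laboratory number"),
    ("sample_date", "date", "Sample date"),
    ("blast_percent", "numeric", "Blasts (%)"),
]

# Hypothetical templates standing in for the tool's own templates.
TABLE_TEMPLATE = Template("CREATE TABLE $table (\n$columns\n);")
INPUT_TEMPLATE = Template('<label>$label <input name="$name"></label>')


def generate_table_sql(table, fields):
    """Render a CREATE TABLE statement from the field definitions."""
    columns = ",\n".join(f"    {name} {sql_type}" for name, sql_type, _ in fields)
    return TABLE_TEMPLATE.substitute(table=table, columns=columns)


def generate_form_html(fields):
    """Render one labelled input element per field for a form preview."""
    return "\n".join(
        INPUT_TEMPLATE.substitute(label=html.escape(label), name=name)
        for name, _, label in fields
    )


if __name__ == "__main__":
    print(generate_table_sql("cytomorphology", FIELDS))
    print(generate_form_html(FIELDS))
```

Because both the table definition and the form are derived from the same field list, a change to the data structure only has to be made in one place.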
To define the detailed data structure we applied an iterative software engineering approach. Through regular user meetings and rapid prototyping, a suitable database structure was defined after approximately 20 iteration cycles.
To embed the system into the routine workflow of the laboratory, Microsoft Word™ documents for written reports as well as adhesive labels for samples are generated directly from the database by means of Microsoft Word™ templates, which are completed with the appropriate item values.
Customizable data entry
To enable flexible documentation, all forms are highly customizable by the user. The content of certain pull-down menus can be adjusted as needed to keep pace with the continual progress in molecular biology, which frequently produces e. g. new PCR primers. In the immunophenotyping module even the names and order of the parameters can be modified by the user. To support the generation of summary reports, text blocks can be easily adapted. This was accomplished by specific extensions of the PERL-based web tool.
Data management
A very difficult and time-consuming task has been the integration of preexisting records of all laboratory modalities and clinical information covering approximately five years of operation. Special attention was given to the matching of cases from different sources to enable patient-specific evaluations. Surname, first name, date of birth and laboratory number were used as matching criteria. Lists of unmatched cases were created and verified manually to ensure correct assignment.
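The matching step can be sketched as follows. This is an illustration in Python, not the authors' program: the record layout is hypothetical, and the sketch assumes that all four criteria must agree after simple normalization (case and whitespace), which the text does not specify.

```python
import unicodedata


def match_key(record):
    """Build a normalized key from the four matching criteria named in the text."""
    def norm(value):
        text = unicodedata.normalize("NFKC", str(value)).strip().casefold()
        return " ".join(text.split())

    return (
        norm(record["surname"]),
        norm(record["first_name"]),
        norm(record["date_of_birth"]),
        norm(record["lab_number"]),
    )


def match_cases(reference, incoming):
    """Return (matched pairs, unmatched incoming records)."""
    index = {match_key(rec): rec for rec in reference}
    matched, unmatched = [], []
    for rec in incoming:
        ref = index.get(match_key(rec))
        if ref is None:
            unmatched.append(rec)  # goes onto the list for manual verification
        else:
            matched.append((ref, rec))
    return matched, unmatched
```

Records that fail exact matching are not discarded but collected, mirroring the manually verified lists of unmatched cases described above.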
Data concerning cytogenetics, cytomorphology and FISH were exported from a Windows™-based desktop database application (Cybase® from MetaSystems; built with Paradox™) into DBase/XBase file format. By means of a dedicated PERL program the data was adjusted according to the new schema. Data from other modalities (molecular genetics, immunophenotyping) was provided in the form of Excel™ files, which were exported into tab-separated text format and transferred to the new database by means of PERL programs.
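A conversion of tab-separated exports into a new schema might look like the sketch below. It uses Python instead of the original PERL programs; the legacy column headers, the target field names and the dd.mm.yyyy date format are assumptions made for illustration only.

```python
import csv
import io
from datetime import datetime

# Hypothetical mapping from legacy column headers to new schema names.
COLUMN_MAP = {
    "Name": "surname",
    "Vorname": "first_name",
    "GebDat": "date_of_birth",
    "LabNr": "lab_number",
}


def to_iso_date(value):
    """Convert dd.mm.yyyy (assumed legacy format) to yyyy-mm-dd; empty stays empty."""
    if not value:
        return ""
    return datetime.strptime(value, "%d.%m.%Y").date().isoformat()


def convert_rows(tsv_text):
    """Read tab-separated text and return rows renamed to the new schema."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    converted_rows = []
    for row in reader:
        converted = {
            new: (row.get(old) or "").strip() for old, new in COLUMN_MAP.items()
        }
        converted["date_of_birth"] = to_iso_date(converted["date_of_birth"])
        converted_rows.append(converted)
    return converted_rows
```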
Clinical information, especially follow-up data, is provided by external sources like the AML-CG [10] study. We implemented a program to synchronize this external data with our database on a regular basis.
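A regular synchronization of this kind amounts to updating existing follow-up rows and inserting new ones. The following Python sketch uses an in-memory SQLite database as a stand-in for PostgreSQL; the table layout and field names are hypothetical, and the actual exchange format of the external source is not described in this paper.

```python
import sqlite3


def create_follow_up_table(conn):
    """Create a hypothetical follow-up table keyed by patient."""
    conn.execute(
        "CREATE TABLE follow_up ("
        "patient_id INTEGER PRIMARY KEY, status TEXT, last_contact TEXT)"
    )


def sync_follow_up(conn, external_rows):
    """Update existing rows, insert missing ones; returns (updated, inserted)."""
    updated = inserted = 0
    for row in external_rows:
        cur = conn.execute(
            "UPDATE follow_up SET status = ?, last_contact = ? WHERE patient_id = ?",
            (row["status"], row["last_contact"], row["patient_id"]),
        )
        if cur.rowcount == 0:
            # No existing row for this patient: insert a new one.
            conn.execute(
                "INSERT INTO follow_up (patient_id, status, last_contact) "
                "VALUES (?, ?, ?)",
                (row["patient_id"], row["status"], row["last_contact"]),
            )
            inserted += 1
        else:
            updated += 1
    conn.commit()
    return updated, inserted
```

Returning the two counts gives a simple log line for each synchronization run.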
A multi-user database with a web frontend consisting of the following modules was implemented: patient demographics and follow-up, cytomorphology, cytogenetics, FISH, molecular genetics, immunophenotyping, microarrays and summary report.
So far the system contains information on 13541 cases from 7746 different patients (November 2001). The leukemia laboratory of the University of Munich acts as a nationwide reference center; therefore patient data from 1225 physicians located at 302 hospitals are available online.
Precise diagnosis and follow-up (Fig. 3) - i. e. what is the status of the patient several years after the first diagnosis - is the key information to assess the medical relevance of molecular markers. To collect it in a reliable manner, the organisational infrastructure of clinical trials is required. Fig. 4 presents the cytogenetics module. In addition to the karyotype and its aberrations details of the measurement procedures are stored. This is a very important issue, because standards in molecular biology are changing continuously. Before a statistical analysis is performed, it must be ensured that the findings were obtained in a comparable manner.
Figure 4: Cytogenetic module of the leukemia database. The karyotype and its aberrations as well as technical parameters are presented.
Fig. 5 describes the microarray documentation module. Experimental conditions and parameters for quality control such as the 3'/5' ratio are stored. We decided to store the expression data as binary large objects to avoid any loss of information and to remain flexible when the manufacturer changes the data format. The gene expression values needed for statistical analysis can be extracted from these files automatically.
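Storing the unmodified file as a binary large object and parsing it only on demand can be sketched as follows. The sketch is in Python with SQLite standing in for PostgreSQL; the table layout is hypothetical, and the file content is assumed to be tab-separated lines of probe set identifier and signal value, which is a simplification and not the manufacturer's actual format.

```python
import sqlite3


def create_microarray_table(conn):
    """Create a hypothetical table holding one raw result file per sample."""
    conn.execute(
        "CREATE TABLE microarray (sample_id TEXT PRIMARY KEY, raw_file BLOB)"
    )


def store_expression_file(conn, sample_id, raw_bytes):
    """Store the file content unchanged so that no information is lost."""
    conn.execute(
        "INSERT INTO microarray (sample_id, raw_file) VALUES (?, ?)",
        (sample_id, raw_bytes),
    )
    conn.commit()


def load_expression_values(conn, sample_id):
    """Extract {probe set: signal} from the stored file (assumed format)."""
    row = conn.execute(
        "SELECT raw_file FROM microarray WHERE sample_id = ?", (sample_id,)
    ).fetchone()
    if row is None:
        raise KeyError(sample_id)
    values = {}
    for line in bytes(row[0]).decode("utf-8").splitlines():
        if not line.strip():
            continue
        probe_set, signal = line.split("\t")
        values[probe_set] = float(signal)
    return values
```

If the file format changes, only the parsing function has to be adapted; the stored originals remain valid.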
Data structure
The data structure - including administrative items - consists of 15 tables and altogether 888 variables. For each sample 15 cytogenetic items, 10 PCR markers, 10 FISH probes, 8 MRD (minimal residual disease) markers, 72 immunophenotype measurements and a gene expression profile can be handled; most parameters can be customized by the user. The main data structure, which is also a proposal for international standardisation in the field of leukemia research, is available in XML format on the Internet: http://mdplot.ibe.med.uni-muenchen.de/
Data analysis
By integration of the follow-up data from the AML-CG [10] study the prognostic relevance of specific cytogenetic or molecular genetic anomalies could be confirmed [11, 12, 13, 14, 15, 16]. The detection of new chromosomal aberration patterns is supported by a specific program, which parses the karyotype to determine the breakpoints. Owing to the integration with clinical data, a frequency distribution of chromosome alterations ordered by disease can be generated.
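How breakpoints could be extracted from a karyotype string and counted per disease is sketched below in Python. This is not the authors' program: it handles only a small subset of the ISCN short notation (t, inv, del, dup and ins with explicit bands), and the function names and the input layout of (diagnosis, karyotype) pairs are hypothetical.

```python
import re
from collections import Counter

# Aberration type, chromosome list and band list, e.g. t(8;21)(q22;q22).
ABERRATION = re.compile(r"(t|inv|del|dup|ins)\(([^()]+)\)\(([^()]+)\)")
# A single band such as q22 or p13.1.
BAND = re.compile(r"[pq]\d+(?:\.\d+)?")


def parse_breakpoints(karyotype):
    """Return (aberration type, breakpoint) pairs found in a karyotype string."""
    breakpoints = []
    for kind, chrom_part, band_part in ABERRATION.findall(karyotype):
        chromosomes = chrom_part.split(";")
        band_groups = band_part.split(";")
        for chromosome, group in zip(chromosomes, band_groups):
            for band in BAND.findall(group):
                breakpoints.append((kind, chromosome + band))
    return breakpoints


def breakpoint_frequencies(cases):
    """Count breakpoints per diagnosis; cases are (diagnosis, karyotype) pairs."""
    counts = {}
    for diagnosis, karyotype in cases:
        counter = counts.setdefault(diagnosis, Counter())
        counter.update(bp for _, bp in parse_breakpoints(karyotype))
    return counts
```

Numerical changes such as +8 carry no breakpoint and are ignored by this sketch; a complete parser would need the full ISCN grammar [6].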
Linking clinical and genetic data
The Human Genome Project is a major driving force in the evolution of the new discipline of bioinformatics. However, there is a substantial gap between DNA sequencing, gene function and proteomics on the one hand and clinically relevant information on the other. To identify genomic patterns which are relevant to patients in general, genomic data must be analyzed in conjunction with clinical data in the context of clinical trials with an appropriate sample size. Data protection and ethical considerations are important issues in this context. Surrogate markers and prognostically relevant parameters must be distinguished to answer the question: Is it relevant to the patient?
For this reason a close cooperation between Medical Informatics and Bioinformatics is necessary, as stated by Kohane [17], Altman [18] and Miller [19].
Impact of Data Integration
Biomedical databases are characterized by both complex and dynamic data structures. For the leukemia database more than 800 variables per patient were needed. This requires professional software engineering and project management. Integration of clinical and scientific documentation is laborious but feasible, and it provides better data quality and therefore faster research results. Before the integrated database was available, we lost up to 50% of cases when we combined several data sources, due to mismatches in patient demographic data and other inconsistencies. Now comprehensive reports on data quality are possible, which help to detect missing or presumably wrong values.
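A data quality report of this kind can be reduced to two checks per record: required items that are empty, and numeric items outside a plausible range. The Python sketch below illustrates the idea; the field names and the range used in the example are hypothetical and not taken from the leukemia database.

```python
def quality_report(records, required_fields, plausible_ranges):
    """Return (lab number, field, problem) tuples for missing or implausible values."""
    issues = []
    for rec in records:
        for field in required_fields:
            if rec.get(field) in (None, ""):
                issues.append((rec.get("lab_number"), field, "missing"))
        for field, (low, high) in plausible_ranges.items():
            value = rec.get(field)
            if value not in (None, "") and not (low <= value <= high):
                issues.append((rec.get("lab_number"), field, "implausible"))
    return issues


if __name__ == "__main__":
    sample = [
        {"lab_number": "A1", "diagnosis": "AML", "blast_percent": 140},
        {"lab_number": "A2", "diagnosis": "", "blast_percent": 35},
    ]
    for issue in quality_report(sample, ["diagnosis"], {"blast_percent": (0, 100)}):
        print(issue)
```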
Adaptive and highly-customizable data entry is a key factor for success in bioinformatic systems. Forms consisting of up to a hundred items cannot be filled in manually from scratch. Therefore intelligent methods to speed up the data entry process must be implemented - e. g. adjustable default values and customizable text blocks.
Future directions
Functional genomics by means of microarrays is a focus of ongoing research. The number of publications concerning gene expression analysis is growing rapidly. For example, Alizadeh [20] and Hedenfalk [21] recently demonstrated that new disease entities can be characterized by distinct genetic pathways. So far, the number of patients involved is limited. We plan to analyze this data in an integrated manner to improve the understanding of the biology of leukemia at the molecular level and - in the long run - to apply this knowledge to improve diagnostics and therapy for patients.
Supported by a grant from the German Ministry of Education and Research (BMBF), Kompetenznetz: Akute und Chronische Leukämien - 01 GI 9980/6 and by a grant from 'Deutsche José Carreras Stiftung e.V.'