In Silico Biology 2, 0034 (2002); ©2002, Bioinformation Systems e.V.  
GCB '01


Impact of integrating clinical and genetic information

Martin Dugas1*, Claudia Schoch2, Susanne Schnittger2, Alexander Kohlmann2, Wolfgang Kern2, Torsten Haferlach2 and Karl Überla1




1Department of Medical Informatics, Biometrics and Epidemiology (IBE)
University of Munich
Marchioninistr. 15
D-81377 Munich, Germany
Tel: +49-89-7095-4497
Fax: +49-89-7095-7491
Email: dug@ibe.med.uni-muenchen.de
http://martin-dugas.de

2Department of Internal Medicine III
University Hospital of Munich, Germany
Marchioninistr. 15
D-81377 Munich, Germany

*To whom correspondence should be addressed





Edited by E. Wingender; received November 30, 2001; accepted December 21, 2001; published March 15, 2002


Abstract

Assessing the relevance of molecular markers requires combining clinical and genetic information. For reliable assessment of parameters relevant to diagnostics and therapy, large patient cohorts must be characterized with respect to both phenotype and genotype. Matching genetic data such as gene expression profiles, molecular genetics and cytogenetics with clinical data such as follow-up, morphological findings and diagnoses involves the integration of complex databases.

In the context of a nationwide leukemia research network in Germany we designed an integrated database covering both genetic and clinical data of patients. The system contains follow-up data and relevant laboratory modalities, i. e. cytomorphology, cytogenetics, molecular genetics, FISH, immunophenotyping and gene expression profiling.

So far, 13541 cases from 7746 patients treated by 1225 physicians are documented. The data structure consists of up to 888 variables per case. In our experience, integrating clinical and genetic information requires significant effort, including attention to data protection issues, but it is feasible and improves data quality, leading to faster and more reliable research results for the benefit of patients.

Key words: data integration, patient data, microarray, gene expression, cytogenetics, molecular genetics, leukemia


Introduction

"Obtaining the sequence of the human genome is the end of the beginning" [1] stated Collins and McKusick in their recent publication about "Implications of the Human Genome Project for Medical Science".

Genetic methods have the potential to change medicine fundamentally. However, it is important to distinguish between surrogate markers and prognostically relevant parameters, which are associated with important medical outcomes such as the patient's quality of life and survival. There is a wealth of both clinical and genetic information that must be taken into account to select molecular markers with clinical relevance.

From a computer science point of view, the advantages of integrated databases covering clinical and genetic information in medical research are quite obvious. However, no comprehensive software products are on the market, and such databases hardly exist at all. This is due to methodological problems in building data models for highly complex and dynamic biomedical systems. These issues concerning the formalization of medical knowledge are well known in medical informatics [2].

In the context of a nationwide German leukemia research project we designed a database combining clinical and genetic aspects of this disease which was integrated into the routine workflow of a large clinical laboratory serving as a national reference center.

Leukemia is a model disease for cancer, and its pathogenesis is currently being investigated at the molecular level. Special attention is therefore given to genetic parameters such as the detection of gene rearrangements or mutations, the analysis of chromosomes and gene expression profiling.


Methods

Laboratory methods

Cytomorphology

The microscopic analysis of blood and bone marrow cells (Fig. 1) was based on May-Grünwald-Giemsa stain, myeloperoxidase reaction, and non-specific esterase reaction using alpha-naphthyl acetate. All stainings of bone marrow and blood were performed routinely according to standard procedures [3].

Figure 1: Microscopic picture of leukemic cells (AML M2, t(8;21)).

Immunophenotyping

Immunophenotyping [4] by multiparameter flow cytometry is an essential part of the diagnostic work-up of hematologic diseases. Using monoclonal antibodies against membrane-bound and cytoplasmic antigens, this method identifies and sub-classifies acute lymphoblastic leukemias, chronic lymphatic leukemia, and acute myeloid leukemias based on their specific antigen expression profiles. Because aberrant antigen expression profiles in leukemias differ from those of normal bone marrow, minimal residual disease can be recognized and quantified.

Cytogenetics

Chromosome analyses were performed on bone marrow or peripheral blood samples according to standard protocols [5]. The chromosomes were interpreted according to the International System for Human Cytogenetic Nomenclature [6].

Fluorescence in situ hybridization (FISH) was performed on interphase nuclei on bone marrow smears or on slides prepared for cytogenetic analysis.

Molecular Genetics

Mononuclear cells were isolated by Ficoll gradient separation. 1 x 10^7 cells were lysed and total RNA was extracted with an RNeasy kit. About 1 µg of RNA was reverse transcribed. PCR for the specific leukemia fusion transcripts was performed. Amplification products were analyzed on agarose gels. For quantification of the fusion gene in the individual samples, real-time PCR using the LightCycler® technology was performed.

Gene expression profiling

For gene expression profiling the GeneChip® System (Affymetrix, Santa Clara, CA, USA) was used. Lysates of the leukemia samples were homogenized and total RNA was extracted. 10 µg of total RNA isolated from 1 x 10^7 cells was used as starting material in the subsequent cDNA synthesis. For detection of the hybridized target nucleic acid, biotin-labeled ribonucleotides were incorporated during the in vitro reaction. Before hybridization onto U95Av2 microarrays, Test3 microarrays were used to monitor the integrity of the cRNA. The Affymetrix software (Microarray Suite, Version 4.0.1) extracted fluorescence intensities from each element on the microarrays as detected by confocal laser scanning.

An overview of molecular diagnostics is provided in Fig. 2. The general goal of these methods is a precise characterization of leukemic cells at different levels (cell, chromosome, DNA, mRNA) in terms of phenotype and genotype, in order to support diagnostics and therapy of leukemia.

Figure 2: Overview of Molecular Diagnostics: Immunophenotyping is performed on the cell level, cytogenetics detects genetic features of chromosomes, molecular genetics analyses DNA and gene expression profiling measures mRNA.


Computer system

The system is implemented with server-side PERL programs (http://www.perl.com) running on a Linux computer (http://www.suse.de) with an Apache web server (http://www.apache.org) and a PostgreSQL database (http://www.postgresql.org/). On the client side a standard web browser is used (e. g. Netscape Communicator™ or Internet Explorer™). To ensure patient data security the system is protected by a firewall.

mediwww

For rapid prototyping of ergonomic, highly adaptive web forms a dedicated software tool (mediwww) has been applied as described earlier [7, 8, 9]. It allows a data structure (e. g. a database table) to be defined interactively. A preview of the web forms can be generated and presented to the clinical user. Once the data structure is defined, all PERL programs and database tables are generated from templates, i. e. no line of code is programmed manually. The function of the tool is similar to the UltraDev™ extension of Macromedia Dreamweaver™ (http://www.macromedia.com/), but is adapted to the needs of medical databases (e. g. specific templates).
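The template-driven generation described above can be illustrated with a minimal sketch. It is written in Python rather than PERL, and it is not the mediwww code: the field list, the function names and the generated markup are assumptions chosen for illustration only.

```python
import html

# Hypothetical field definitions: (column name, SQL type, form label).
FIELDS = [
    ("lab_number", "text", "Laboratory number"),
    ("karyotype", "text", "Karyotype"),
    ("blast_percent", "integer", "Blasts (%)"),
]


def create_table_sql(table, fields):
    """Render a CREATE TABLE statement from the field definitions."""
    columns = ",\n".join(f"    {name} {sql_type}" for name, sql_type, _ in fields)
    return f"CREATE TABLE {table} (\n{columns}\n);"


def html_form(table, fields):
    """Render a plain HTML data entry form from the same definitions."""
    rows = [
        f'  <label>{html.escape(label)} <input name="{name}"></label>'
        for name, _, label in fields
    ]
    return f'<form method="post" action="/{table}">\n' + "\n".join(rows) + "\n</form>"


if __name__ == "__main__":
    print(create_table_sql("cytogenetics", FIELDS))
    print(html_form("cytogenetics", FIELDS))
```

Because the table and the form are derived from one definition, a change requested in a user meeting only has to be made in one place before the next preview is generated.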

To define the detailed data structure we applied an iterative software engineering approach. By means of regular user meetings and rapid prototyping after approximately 20 iteration cycles a suitable database structure was defined.

To embed the system into the routine workflow of the laboratory, Microsoft Word™ documents for written reports as well as adhesive labels for samples are generated directly from the database by means of Microsoft Word™ templates, which are completed with the appropriate item values.
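The idea of completing a template with item values can be sketched as follows. This is a plain-text illustration in Python using the standard library, not the Word template mechanism of the system; the placeholder names and the example record are hypothetical.

```python
from string import Template

# Hypothetical report template; placeholders are named after database items.
REPORT_TEMPLATE = Template(
    "Patient: $surname, $first_name\n"
    "Laboratory number: $lab_number\n"
    "Karyotype: $karyotype\n"
)


def render_report(record):
    """Fill the template with the item values of one database record.

    safe_substitute leaves a placeholder visible if an item is missing,
    so incomplete records can be spotted in the printed report.
    """
    return REPORT_TEMPLATE.safe_substitute(record)


if __name__ == "__main__":
    example = {
        "surname": "Example",
        "first_name": "Pat",
        "lab_number": "01-0001",
        "karyotype": "46,XX,t(8;21)(q22;q22)",
    }
    print(render_report(example))
```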

Customizable data entry

To enable flexible documentation, all forms are highly customizable by the user. The content of certain pull-down menus can be adjusted as needed to keep pace with the continual progress in molecular biology, which frequently produces, for example, new PCR primers. In the immunophenotyping module even the names and order of the parameters can be modified by the user. To support the generation of summary reports, text blocks can be easily adapted. This was accomplished by specific extensions of the PERL-based web tool.


Data management

A very difficult and time-consuming task has been the integration of preexisting records of all laboratory modalities and clinical information covering approximately five years of operation. Special attention was given to matching cases from different sources to enable patient-specific evaluations. Surname, first name, date of birth and laboratory number were used as matching criteria. Lists of unmatched cases were created and verified manually to ensure correct assignment.
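A minimal sketch of such a matching step is given below in Python. It assumes exact agreement on all four criteria after simple normalization (trimming and case folding); the actual matching rules of the system are not described in that detail, and the record layout and function names are hypothetical.

```python
from datetime import date


def match_key(record):
    """Build a normalized key from the four matching criteria."""
    return (
        record["surname"].strip().casefold(),
        record["first_name"].strip().casefold(),
        record["date_of_birth"],
        record["lab_number"].strip(),
    )


def match_cases(source_a, source_b):
    """Return matched pairs and the records of source_a left unmatched.

    The unmatched list corresponds to the cases that would be
    verified manually.
    """
    index = {match_key(r): r for r in source_b}
    matched, unmatched = [], []
    for record in source_a:
        partner = index.get(match_key(record))
        if partner is None:
            unmatched.append(record)
        else:
            matched.append((record, partner))
    return matched, unmatched


if __name__ == "__main__":
    a = [{"surname": " Example", "first_name": "Pat",
          "date_of_birth": date(1950, 1, 1), "lab_number": "01-0001"}]
    b = [{"surname": "EXAMPLE", "first_name": "pat",
          "date_of_birth": date(1950, 1, 1), "lab_number": "01-0001"}]
    print(match_cases(a, b))
```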

Data concerning cytogenetics, cytomorphology and FISH were exported from a Windows™-based desktop database application (Cybase® from MetaSystems; built with Paradox™) into dBase/xBase file format. By means of a dedicated PERL program the data were adjusted to the new schema. Data from other modalities (molecular genetics, immunophenotyping) were provided as Excel™ files, which were exported into tab-separated text format and transferred to the new database by means of PERL programs.
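The conversion of tab-separated exports to a new schema might look roughly like the following Python sketch. The legacy column headers, the target column names and the sample row are invented for illustration; the original PERL programs are not reproduced here.

```python
import csv
import io

# Hypothetical mapping from legacy column headers to new schema names.
COLUMN_MAP = {"LabNr": "lab_number", "Name": "surname", "Marker": "pcr_marker"}


def convert_rows(tsv_text, column_map=COLUMN_MAP):
    """Yield one dictionary per data row, keyed by the new schema names.

    Values are trimmed; columns missing from the export become empty strings.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        yield {new: (row.get(old) or "").strip() for old, new in column_map.items()}


if __name__ == "__main__":
    export = "LabNr\tName\tMarker\n01-0001\t Example \tAML1-ETO\n"
    for converted in convert_rows(export):
        print(converted)
```

The resulting dictionaries could then be passed to parameterized INSERT statements of whatever database driver is in use.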

Clinical information, especially follow-up data, is provided by external sources like the AML-CG [10] study. We implemented a program to synchronize this external data with our database on a regular basis.


Results

A multi-user database with a web frontend consisting of the following modules was implemented: patient demographics and follow-up, cytomorphology, cytogenetics, FISH, molecular genetics, immunophenotyping, microarrays and summary report.

So far the system contains information on 13541 cases from 7746 different patients (November 2001). The leukemia laboratory of the University of Munich acts as a nationwide reference center, therefore patient data from 1225 physicians located at 302 hospitals are available online.

Precise diagnosis and follow-up (Fig. 3) - i. e. what is the status of the patient several years after the first diagnosis - is the key information to assess the medical relevance of molecular markers. To collect it in a reliable manner, the organisational infrastructure of clinical trials is required. Fig. 4 presents the cytogenetics module. In addition to the karyotype and its aberrations details of the measurement procedures are stored. This is a very important issue, because standards in molecular biology are changing continuously. Before a statistical analysis is performed, it must be ensured that the findings were obtained in a comparable manner.

Figure 3: Clinical data including demographics, diagnosis and follow-up. Precise follow-up information - i. e. what is the status of the patient several years after the first diagnosis - is hard to obtain, but is very important to assess the relevance of molecular markers.

Figure 4: Cytogenetic module of the leukemia database. The karyotype and its aberrations as well as technical parameters are presented.


Fig. 5 shows the microarray documentation module. Experimental conditions and quality control parameters such as the 3'/5' ratio are stored. We decided to store the expression data as binary large objects to avoid any loss of information and to remain flexible when the manufacturer changes the data format. The gene expression values needed for statistical analysis can be extracted from these files automatically.

Figure 5: Gene expression module of the leukemia database. Experimental conditions and parameters for quality control like 3'/5' ratio are stored. The expression data are stored as binary large objects.
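Storing a raw result file as a binary large object and extracting values from it on demand can be sketched as follows. The example uses Python with an in-memory SQLite database instead of PostgreSQL so that it is self-contained. The table layout, the function names and the simple two-column file format are assumptions; the latter is a stand-in and not the Affymetrix file format.

```python
import sqlite3


def store_profile(conn, lab_number, raw_bytes, ratio_3_5):
    """Store the unmodified result file together with a quality parameter."""
    conn.execute(
        "INSERT INTO microarray (lab_number, ratio_3_5, raw_file) VALUES (?, ?, ?)",
        (lab_number, ratio_3_5, sqlite3.Binary(raw_bytes)),
    )


def load_expression_values(conn, lab_number):
    """Read the stored file back and parse it into {probe set: signal}.

    Assumes a stand-in format with one 'name<TAB>signal' pair per line.
    """
    (blob,) = conn.execute(
        "SELECT raw_file FROM microarray WHERE lab_number = ?", (lab_number,)
    ).fetchone()
    values = {}
    for line in bytes(blob).decode("ascii").splitlines():
        probe_set, signal = line.split("\t")
        values[probe_set] = float(signal)
    return values


# In-memory database with a hypothetical table for the microarray module.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE microarray ("
    "lab_number TEXT PRIMARY KEY, ratio_3_5 REAL, raw_file BLOB)"
)

if __name__ == "__main__":
    store_profile(conn, "01-0001", b"probe_a\t12.5\nprobe_b\t3.0\n", 1.2)
    print(load_expression_values(conn, "01-0001"))
```

Since the file is kept byte for byte, only the parsing function has to be adapted if the file format changes; the stored data remain untouched.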


Data structure

The data structure, including administrative items, consists of 15 tables and altogether 888 variables. For each sample, 15 cytogenetic items, 10 PCR markers, 10 FISH probes, 8 MRD (minimal residual disease) markers, 72 immunophenotype measurements and a gene expression profile can be handled; most parameters can be customized by the user. The main data structure, which is also a proposal for international standardisation in the field of leukemia research, is available in XML format on the Internet: http://mdplot.ibe.med.uni-muenchen.de/


Data analysis

By integrating the follow-up data from the AML-CG [10] study, the prognostic relevance of specific cytogenetic or molecular genetic anomalies could be confirmed [11, 12, 13, 14, 15, 16]. The detection of new chromosomal aberration patterns is supported by a dedicated program that parses the karyotype to determine the breakpoints. Because the karyotypes are integrated with clinical data, a frequency distribution of chromosome alterations ordered by disease can be generated.
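How breakpoints can be derived from a karyotype string, and then counted per disease, is illustrated by the following Python sketch. It is not the program used in the project: it handles only a small subset of ISCN notation (simple structural aberrations with explicit bands), and the example karyotypes and diagnoses are textbook-style illustrations, not records from the database.

```python
import re
from collections import Counter

# Structural aberration such as t(8;21)(q22;q22) or inv(16)(p13q22).
ABERRATION = re.compile(
    r"(?<![a-z])(?P<kind>t|inv|del|dup|ins|add|der)"
    r"\((?P<chromosomes>[^()]+)\)\((?P<bands>[^()]+)\)"
)
# A single band such as q22 or p13.1.
BAND = re.compile(r"[pq]\d+(?:\.\d+)?")


def breakpoints(karyotype):
    """Return breakpoints such as '8q22' found in an ISCN karyotype string."""
    found = []
    for m in ABERRATION.finditer(karyotype):
        chromosomes = m["chromosomes"].split(";")
        band_groups = m["bands"].split(";")
        # Pair each chromosome with its band group; one group may hold two bands.
        for chromosome, group in zip(chromosomes, band_groups):
            found.extend(chromosome + band for band in BAND.findall(group))
    return found


def breakpoint_frequencies(cases):
    """Count cases per (diagnosis, breakpoint); each case counts once per breakpoint."""
    counts = Counter()
    for diagnosis, karyotype in cases:
        for bp in set(breakpoints(karyotype)):
            counts[(diagnosis, bp)] += 1
    return counts


if __name__ == "__main__":
    cases = [
        ("AML M2", "46,XX,t(8;21)(q22;q22)"),
        ("AML M2", "45,X,-Y,t(8;21)(q22;q22)"),
        ("AML M4Eo", "46,XY,inv(16)(p13q22)"),
    ]
    for key, n in sorted(breakpoint_frequencies(cases).items()):
        print(key, n)
```

Numerical aberrations such as +8 or -Y are ignored by this sketch because they have no breakpoint; a full parser would also need to cover derivative chromosomes, terminal bands and composite karyotypes.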


Discussion

Linking clinical and genetic data

The Human Genome Project is a major driving force in the emergence of the new discipline of bioinformatics. However, there is a substantial gap between DNA sequencing, gene function and proteomics on the one hand and clinically relevant information on the other. To identify genomic patterns which are relevant to patients in general, genomic data must be analyzed in conjunction with clinical data in the context of clinical trials with an appropriate sample size. Data protection and ethical considerations are important issues in this context. Surrogate markers and prognostically relevant parameters must be distinguished to answer the question: Is it relevant to the patient?

For this reason a close cooperation between Medical Informatics and Bioinformatics is necessary, as stated by Kohane [17], Altman [18] and Miller [19].


Impact of Data Integration

Biomedical databases are characterized by both complex and dynamic data structures. For the Leukemia database more than 800 variables per patient were appropriate. This requires professional software engineering and project management. Integration of clinical and scientific documentation is laborious, but feasible and provides better data quality and therefore faster research results. Before the integrated database was available, we lost up to 50% of cases when we combined several data sources due to mismatch of patient demographic data and other inconsistencies. Now comprehensive reports on data quality are possible which help to detect missing or presumably wrong values.
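A report of missing or presumably wrong values could be produced along the lines of the following Python sketch. The field names, the list of required items and the plausibility range are hypothetical; the checks actually used in the system are not specified in this paper.

```python
def quality_report(records, required, plausible_ranges):
    """List (case, field, problem) for missing or implausible values.

    required:         field names that must not be empty
    plausible_ranges: {field name: (lowest, highest) plausible value}
    """
    issues = []
    for record in records:
        case_id = record.get("lab_number", "?")
        for field in required:
            if record.get(field) in (None, ""):
                issues.append((case_id, field, "missing"))
        for field, (low, high) in plausible_ranges.items():
            value = record.get(field)
            if value is not None and value != "" and not (low <= value <= high):
                issues.append((case_id, field, "out of range"))
    return issues


if __name__ == "__main__":
    records = [
        {"lab_number": "01-0001", "diagnosis": "AML", "blast_percent": 85},
        {"lab_number": "01-0002", "diagnosis": "", "blast_percent": 140},
    ]
    for issue in quality_report(records, ["diagnosis"], {"blast_percent": (0, 100)}):
        print(issue)
```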

Adaptive and highly-customizable data entry is a key factor for success in bioinformatic systems. Forms consisting of up to a hundred items cannot be filled in manually from scratch. Therefore intelligent methods to speed up the data entry process must be implemented - e. g. adjustable default values and customizable text blocks.


Future directions

Functional genomics by means of microarrays is a focus of ongoing research. The number of publications concerning gene expression analysis is growing rapidly. For example, Alizadeh [20] and Hedenfalk [21] recently demonstrated that new disease entities can be characterized by distinct genetic pathways. So far, the number of patients involved is limited. We plan to analyze these data in an integrated manner to improve the understanding of the biology of leukemia at the molecular level and, in the long run, to apply this knowledge to improve diagnostics and therapy for patients.



Acknowledgments

Supported by a grant from the German Ministry of Education and Research (BMBF), Kompetenznetz: Akute und Chronische Leukämien - 01 GI 9980/6 and by a grant from 'Deutsche José Carreras Stiftung e.V.'




References

  1. Collins, F. S. and McKusick, V. A. (2001). Implications of the Human Genome Project for Medical Science. JAMA 285, 540-544.

  2. Moorman, P. W., van Ginneken, A. M., van der Lei, J. and van Bemmel, J. H. (1994). A Model for Structured Data Entry Based on Explicit Descriptional Knowledge. Meth. Inform. Med. 33, 454-463.

  3. Löffler, H. and Rastetter, J. (1999). Atlas of clinical hematology. Springer, Berlin.

  4. Jennings, C. D. and Foon, K. A. (1997). Recent advances in flow cytometry: application to the diagnosis of hematologic malignancy. Blood 90, 2863-2892.

  5. Stollmann, B., Fonatsch, C. and Havers, W. (1985). Persistent Epstein-Barr virus infection associated with monosomy 7 or chromosome 3 abnormality in childhood myeloproliferative disorders. Br. J. Haematol. 60, 183-196.

  6. Mitelman, F. (1995). ISCN 1995, Guidelines for Cancer Cytogenetics, Supplement to: An International System for Human Cytogenetic Nomenclature. S. Karger, Basel.

  7. Dugas, M. (1997). Clinical applications of Intranet-Technology. Stud. Health Technol. Inform. 45, 115-118.

  8. Dugas, M., Bosch, R., Paulus, R. and Lenz, T. (1999). Intranet-based multi-purpose medical records in Orthopaedics. Med. Inform. Internet Med. 24, 269-275.

  9. Dugas, M. and Überla, K. (1999). Intranet Based Clinical Workstations. In: Medical Informatics, Biostatistics and Epidemiology for Efficient Health Care and Medical Research (Victor et al., eds.), Urban und Vogel, München, pp. 235-238.

  10. Büchner, T., Hiddemann, W., Wörmann, B., Löffler, H., Gassmann, W., Haferlach, T., Fonatsch, C., Haase, D., Schoch, C., Hossfeld, D., Lengfelder, E., Aul, C., Heyll, A., Maschmeyer, G., Ludwig, W. D., Sauerland, M. C. and Heinecke, A. (1999). Double Induction Strategy for Acute Myeloid Leukemia: The Effect of High-Dose Cytarabine With Daunorubicin and 6-Thioguanine: A Randomized Trial by the German AML Cooperative Group. Blood 93, 4116-4124.

  11. Schnittger, S., Kinkelin, U., Schoch, C., Heinecke, A., Haase, D., Haferlach, T., Büchner, T., Wörmann, B., Hiddemann, W. and Griesinger, F. (2000). Screening for MLL tandem duplication in 387 unselected patients with AML identify a prognostically unfavorable subset of AML. Leukemia 14, 796-804.

  12. Haferlach, T., Winkemann, M., Löffler, H., Schoch, R., Gassmann, W., Fonatsch, C., Schoch, C., Poetsch, M., Weber-Matthiesen, K. and Schlegelberger, B. (1996). The abnormal eosinophils are part of the leukemic cell population in acute myelomonocytic leukemia with abnormal eosinophils (AML M4Eo) and carry the pericentric inversion 16: a combination of May-Grunwald-Giemsa staining and fluorescence in situ hybridization. Blood 87, 2459-2463.

  13. Schoch, C., Kern, W., Krawitz, P., Dugas, M., Schnittger, S., Haferlach, T. and Hiddemann, W. (2001). Dependence of age-specific incidence of acute myeloid leukemia on karyotype. Blood 98, 3500.

  14. Kern, W., Schoch, C., Haferlach, T., Braess, J., Unterhalt, M., Wörmann, B., Büchner, T. and Hiddemann, W. (2000). Multivariate analysis of prognostic factors in patients with refractory and relapsed acute myeloid leukemia undergoing sequential high-dose cytosine arabinoside and mitoxantrone (S-HAM) salvage therapy: relevance of cytogenetic abnormalities. Leukemia 14, 226-231.

  15. Schoch, C., Haase, D., Haferlach, T., Gudat, H., Büchner, T., Freund, M., Link, H., Lengfelder, E., Wandt, H., Sauerland, M. C., Löffler, H. and Fonatsch, C. (1996). Fifty-one patients with acute myeloid leukemia and translocation t(8;21)(q22;q22): an additional deletion in 9q is an adverse prognostic factor. Leukemia 10, 1288-1295.

  16. Schoch, C., Haferlach, T., Haase, D., Fonatsch, C., Löffler, H., Schlegelberger, B., Staib, P., Sauerland, M. C., Heinecke, A., Büchner, T. and Hiddemann, W. (2001). Patients with de novo acute myeloid leukaemia and complex karyotype aberrations show a poor prognosis despite intensive treatment: a study of 90 patients. Br. J. Haematol. 112, 118-126.

  17. Kohane, I. S. (2000). Bioinformatics and Clinical Informatics - The Imperative to Collaborate. J. Am. Med. Inform. Assoc. 7, 512-516.

  18. Altman, R. B. (2000). The Interactions between Clinical Informatics and Bioinformatics. J. Am. Med. Inform. Assoc. 7, 439-443.

  19. Miller, P. L. (2000). Opportunities at the Intersection of Bioinformatics and Health Informatics: A Case Study. J. Am. Med. Inform. Assoc. 7, 431-438.

  20. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson jr., J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

  21. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A. and Trent, J. (2001). Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med. 344, 539-548.