In Silico Biology 3, 0020 (2003); ©2002, Bioinformation Systems e.V.  
BGRS 2002

A database on alternative splice forms on the Integrated Genetic Map Service (IGMS)

Heike Pospisil*, Alexander Herrmann, Harald Pankow and Jens G. Reich




Max-Delbrueck-Center for Molecular Medicine,
D-13125 Berlin, Germany
Email: pospisil@mdc-berlin.de

* corresponding author





Edited by H. Michael; received September 26, 2002; revised and accepted November 21, 2002; published December 16, 2002



Abstract

The IGMS is a comprehensive information system that combines knowledge from genomic sequence, genetic map and genetic disorder databases. The system is updated weekly and focuses on the analysis of EST data. Its first application identifies UniGene clusters that are differentially expressed in different types of cancer with respect to different reference tissues. The results can be combined with clinical data to assess the potential relevance of specific genes for patient survival or metastatic spread. The second application maps ESTs with a specific expression profile. The third application generates a database of alternative splice forms for nine organisms from EST and mRNA sequence data. The results can be used to find splicing patterns specific for certain tissues or tumour types. Availability: http://www.bioinf.mdc-berlin.de/igms/.

Key words: alternative splicing, ESTs, database, gene expression profiles, colon cancer



Introduction

Many databases of DNA sequences are available on the Internet, each focusing on different biological or medical properties. GenBank holds over 18 million sequence records ([1], GenBank Release 131.0, September 2002 [2]), including more than 12 million expressed sequence tags (ESTs) (September 2002 [3]), which represent a wide variety of organisms and tissue types, including diseased and normal cell lines. This large amount of data raises many possible questions for investigating the complexity of living organisms. At the same time, it makes the search for specific (often linked) information difficult with regard to (a) the completeness and (b) the specificity of the retrieved information.

We present a weekly updated online service that combines knowledge from genomic sequence, genetic map and genetic disorder databases with a database of potentially alternatively spliced forms. The Integrated Genetic Map Service (IGMS) system focuses on the analysis of EST data and enables users to extract sequences of interest from the large number of entries. It is available at http://www.bioinf.mdc-berlin.de/igms/.



Methods

The IGMS system is an integrated database retrieval system that accesses genetic information from several sources (see Tab. 1). It is available at http://www.bioinf.mdc-berlin.de/igms/.


Table 1: Overview of the sources integrated in the IGMS system, with a short explanation of each source.

GeneMap99 (GeneMap99_gb4 / GeneMap99_sg3): This human gene map is the result of the collaboration of the International RH Mapping Consortium. Two Radiation Hybrid (RH) panels were used: the Genebridge4 (GB4) panel and the Stanford G3 panel. GB4 provides long-range map continuity while G3 gives higher local resolution. GeneMap'99 gives the locations of more than 30,000 genes. It is accessible from NCBI.
WHI: The final STS-based genetic linkage map resulting from the Human Physical Mapping Project at the Whitehead Institute/MIT Genome Center.
MGD: The Mouse Genome Database at Jackson Laboratory.
HMGD: The Human-Mouse Genome Database at Jackson Laboratory.
LDB: The Genetic Location Database (LDB) gives locations for expressed sequences and polymorphic markers. Locations are obtained by integrating data of different types (genetic linkage maps, radiation hybrid maps, physical maps, cytogenetic data and mouse homology) and constructing a single 'summary' map.
OMIM: The Online Mendelian Inheritance in Man database, a catalog of human genes and genetic disorders.
GenBank: The GenBank sequence database at NCBI.
RefSeq: The NCBI reference sequences (RefSeq) provide standards for complete genomic nucleic acids, assembled contigs, transcripts and proteins. RefSeq records are derived from GenBank and the literature to provide a non-redundant set of sequences that facilitate sequence identification and information retrieval. (Note: The IGMS system includes only the human XM_* mRNA sequences (e.g. XM_066987). These sequence records represent genes of unknown function.)
UniGene: The UniGene database is a collection of non-redundant sequence clusters derived from known genes, ESTs and their high-scoring GenBank sequence homologies. Note that the UniGene project did not attempt to build an overlapping consensus sequence or contig.
CGAP: All EST libraries for human and mouse created by the Cancer Genome Anatomy Project at NCBI were copied into the IGMS system.
Affymetrix GeneChips: The GeneChips HU95A, HU95B, HU95C, HU95D and HU95E of Affymetrix Inc.


The IGMS is divided into four major parts:

  1. The GenBank part: copies complete sequences or coding sequences
  2. The Gene Expression Profiles: useful functions for the analysis of the UniGene clusters, e.g. tissue histology or Affymetrix GeneChip probe set numbers
  3. Integrated Genetic Maps: genetic map locations, genetic disorders, sequence homologies, tissue type profiles on a specific chromosomal region
  4. Alternative Splice Forms: shows the possible alternative splice forms based on ESTs mapped onto mRNA sequences

Furthermore, we added a function that summarizes all available information for a selected sequence, together with all available links (see Fig. 1).


Figure 1: Overview of the UniGene cluster Hs.790. The IGMS summarizes the histology information, gives the Affymetrix GeneChip probe set number (if available), the description of the cluster, genetic symbol, location on the chromosome, the size of the UniGene cluster and the corresponding tissue types.


The Alternative Splice Database was created by an algorithm [4, 5, 6] that identifies possible alternative splice forms by comparing high-scoring ESTs to mRNA sequences using BLAST. Filtering programs compare the ends of each aligned sequence pair for deletions or insertions in the EST sequence, which suggest the existence of alternative splice forms. The database provides a list of all possible alternative splice forms for nine organisms. Each entry in this list links directly to the information described above. Furthermore, a specific mRNA sequence can be searched to check whether it could be alternatively spliced (e.g. D00726, see Fig. 2).
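The filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `HSP` record, the function names and the `max_slack` tolerance are hypothetical, and the 30 bp minimum gap is an assumption based on the thresholds quoted elsewhere in this paper. The sketch assumes that the BLAST hits of one EST against one mRNA are already available as coordinate pairs and looks at the junction between consecutive hits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HSP:
    """One aligned segment of an EST against an mRNA (1-based, inclusive)."""
    est_start: int
    est_end: int
    mrna_start: int
    mrna_end: int


def classify_junction(left: HSP, right: HSP,
                      min_gap: int = 30, max_slack: int = 5) -> Optional[str]:
    """Classify the junction between two consecutive segments.

    min_gap and max_slack are illustrative parameters, not values
    confirmed for the original filtering programs.
    """
    est_gap = right.est_start - left.est_end - 1
    mrna_gap = right.mrna_start - left.mrna_end - 1
    # EST is contiguous, mRNA has extra sequence: the EST lacks a segment.
    if abs(est_gap) <= max_slack and mrna_gap >= min_gap:
        return "skipped"
    # mRNA is contiguous, EST has extra sequence: the EST carries an insert.
    if abs(mrna_gap) <= max_slack and est_gap >= min_gap:
        return "inserted"
    return None


def find_candidates(hsps: List[HSP], min_gap: int = 30) -> List[str]:
    """Return the junction types found along one EST/mRNA alignment."""
    ordered = sorted(hsps, key=lambda h: h.est_start)
    found = []
    for left, right in zip(ordered, ordered[1:]):
        kind = classify_junction(left, right, min_gap)
        if kind is not None:
            found.append(kind)
    return found
```

For example, two segments that are adjacent on the EST but 50 bp apart on the mRNA would be reported as a skipped sequence, while the reverse situation would be reported as an inserted sequence.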

The Alternative Splice Database is integrated in the IGMS service, but it is also available as a stand-alone tool at http://www.bioinf.mdc-berlin.de/splice/db/.


Figure 2: The possible alternative splice form of D00726 (Human mRNA for ferrochelatase (EC 4.99.1.1)). The aligned positions, the type of alternative splice form (skipped or inserted sequence) and the corresponding EST are given.




Results

The IGMS was used to investigate alternative splicing in colon cancer. In a first step we examined all human UniGene clusters with at least 30 ESTs per cluster using the "gene expression profile" function. From these 13,954 different clusters, all sequences expressed in colon were then selected ("extract ESTs+Tissue+Chr"). In a second step we used our human alternative splice database, which was created as described in [4] (the average alignment identity is at least 98% and each HSP is at least 30 bp long). One example is shown in Fig. 3. (The complete table is available at http://www.bioinf.mdc-berlin.de/splice/colon/.)
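The two alignment criteria quoted in parentheses can be written down as a small check. This is a sketch under stated assumptions, not the code used to build the database: the function name and the input layout are hypothetical, and the "average" identity is assumed here to be weighted by segment length, which the text does not specify.

```python
from typing import List, Tuple


def passes_quality_filter(hsps: List[Tuple[int, float]],
                          min_identity: float = 98.0,
                          min_length: int = 30) -> bool:
    """hsps is a list of (length_in_bp, percent_identity) per HSP."""
    if not hsps:
        return False
    # Every single HSP must reach the minimum length.
    if any(length < min_length for length, _ in hsps):
        return False
    # Assumption: the average identity is weighted by HSP length.
    total_length = sum(length for length, _ in hsps)
    weighted = sum(length * identity for length, identity in hsps) / total_length
    return weighted >= min_identity
```

An unweighted mean would be an equally plausible reading of the criterion; only the averaging line would change.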


Figure 3: All possible alternative splice forms of mRNA BC005923 (microsomal glutathione S-transferase 1). The mRNA is indicated in orange, the ESTs in red. An alternative splice form is indicated by a deletion of >30 bp (e.g. AA314967) or an insert of >30 bp (e.g. BE566462). The EST AA314967 is derived from an HCC cell line in colon and belongs to UniGene cluster Hs.790 (see Fig. 1).


We found that 1707 colon ESTs indicate alternative splicing of 2857 different mRNAs. The number of mRNAs exceeds the number of ESTs because one EST can match more than one mRNA if these sequences are nearly (but not completely) redundant (cf. Fig. 4). Hence, a better parameter for estimating the frequency of alternative splicing is the number of independent UniGene clusters. In our case we found 963 different clusters. These UniGene clusters, including the corresponding mRNAs and ESTs, are given in the list mentioned above (http://www.bioinf.mdc-berlin.de/splice/colon/).
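The difference between counting ESTs, mRNAs and UniGene clusters can be made concrete with a short sketch. The function and the record layout are hypothetical. The example data mirror the situation of EST AA085284 and cluster Hs.258551, but the mRNA identifiers are placeholders because the accessions are not listed in the text.

```python
from typing import Dict, Iterable, Tuple


def summarize_hits(hits: Iterable[Tuple[str, str, str]]) -> Dict[str, int]:
    """Count distinct ESTs, mRNAs and clusters among (est, mrna, cluster) hits."""
    ests, mrnas, clusters = set(), set(), set()
    for est, mrna, cluster in hits:
        ests.add(est)
        mrnas.add(mrna)
        clusters.add(cluster)
    return {"ests": len(ests), "mrnas": len(mrnas), "clusters": len(clusters)}


# One EST matching five nearly redundant mRNAs of the same cluster.
# "mRNA_1" ... "mRNA_5" are placeholder names, not real accessions.
example_hits = [("AA085284", f"mRNA_{i}", "Hs.258551") for i in range(1, 6)]
summary = summarize_hits(example_hits)
```

Here the mRNA count is inflated to five although only one EST and one cluster are involved, which is why the cluster count is the more conservative estimate.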


Figure 4: All alternative splice forms indicated by a single EST sequence (AA085284). The five different mRNAs are not completely redundant, and all five are grouped into the same UniGene cluster (Hs.258551).




Discussion

We present here a retrieval system for extracting genomic data in combination with knowledge from genomic sequence, genetic map and genetic disorder databases from several sources (see Tab. 1). The main advantage of this system is the combination of genomic information with expression information and alternative splice information. This approach makes it possible to filter out the sequences and information of interest very selectively by successive queries. The so-called current list method uses the result of one query as input for the next query.
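The principle of feeding the result of one query into the next can be sketched as a simple chain of filters. This only illustrates the idea and says nothing about how the IGMS implements it: the record layout, the query functions and the cluster names are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Cluster = Dict[str, Any]
Query = Callable[[List[Cluster]], List[Cluster]]


def min_ests(n: int) -> Query:
    """Query keeping clusters with at least n ESTs (hypothetical field name)."""
    return lambda clusters: [c for c in clusters if c["n_ests"] >= n]


def expressed_in(tissue: str) -> Query:
    """Query keeping clusters with at least one EST from the given tissue."""
    return lambda clusters: [c for c in clusters if tissue in c["tissues"]]


def run_current_list(clusters: List[Cluster], queries: List[Query]) -> List[Cluster]:
    """Apply the queries in order; each one works on the previous result."""
    current = list(clusters)
    for query in queries:
        current = query(current)
    return current


# Placeholder records, not real UniGene data.
clusters = [
    {"id": "cluster_A", "n_ests": 45, "tissues": {"colon", "liver"}},
    {"id": "cluster_B", "n_ests": 12, "tissues": {"colon"}},
    {"id": "cluster_C", "n_ests": 80, "tissues": {"brain"}},
]
selected = run_current_list(clusters, [min_ests(30), expressed_in("colon")])
```

Because every query takes and returns a list of the same kind, queries can be combined in any order and the selection narrows step by step.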

Furthermore, users can create their own database of sequences or UniGene clusters of interest. The IGMS includes three novel functions: (a) identification of UniGene clusters that are differentially expressed, (b) mapping of ESTs with a specific expression profile and (c) a database of alternative splice forms for nine organisms.

Compared with other websites, the information class "Gene Expression Profiles" offers some special functions that are very useful for the analysis of gene expression, e.g. tissue histology and Affymetrix GeneChip probe set numbers. For this purpose we have integrated the complete collection of UniGene cluster sets and the CGAP EST libraries created by the Cancer Genome Anatomy Project at NCBI into the IGMS system. As a new function, it is now possible to identify UniGene clusters that are differentially expressed in different types of cancer with respect to different reference tissues. Possible criteria are, for example, a defined ratio of the number of ESTs found in tumour tissues to the number found in normal tissues, and a defined number of ESTs per cluster. One could, for instance, retrieve all human UniGene clusters with at least 30 ESTs, of which more than 90% are found in cancerous tissues and at least one is expressed in a specific tissue (e.g. colon). The results can be combined with clinical data to assess the potential relevance of specific genes for patient survival or metastatic spread.
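The example criteria in this paragraph (at least 30 ESTs, more than 90% from cancerous tissue, at least one EST from colon) translate into a short predicate. This is a minimal sketch with a hypothetical record type and function name. It assumes that each EST is annotated with a tissue and a tumour flag, and it is not the query code of the IGMS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EstRecord:
    tissue: str       # tissue of origin, e.g. "colon"
    is_tumour: bool   # True if the library is from cancerous tissue


def is_candidate_cluster(ests: List[EstRecord],
                         tissue: str = "colon",
                         min_ests: int = 30,
                         min_tumour_fraction: float = 0.9) -> bool:
    """Apply the three example criteria to the ESTs of one cluster."""
    if len(ests) < min_ests:
        return False
    tumour_count = sum(1 for est in ests if est.is_tumour)
    # "More than 90%" is read as a strict inequality.
    if tumour_count / len(ests) <= min_tumour_fraction:
        return False
    return any(est.tissue == tissue for est in ests)
```

The thresholds are parameters so that other ratios or tissues can be tested with the same function.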

The "Integrated Genetic Maps" part of the IGMS system maps ESTs with a specific expression profile. One example is to map all genes overexpressed in breast cancer to the corresponding regions of the genome or, vice versa, to find all genes on chromosome 8 that are overexpressed in breast cancer.

The Alternative Splice Database represents all possible splice forms for the nine organisms Arabidopsis thaliana, Bos taurus, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Xenopus laevis and Homo sapiens, derived from EST and mRNA GenBank sequence records. It is possible to check whether an alternative splice form was found for a sequence of interest. A second way to select alternative splice forms is the function "Search for Alternative Splice Forms" in the "Gene Expression Profiles" subsection. This function allows the user to select all possible alternative splice forms within a (preselected) current list of UniGene clusters. A further (essential) selection criterion is the (minimal) number of alternative splice forms within a UniGene cluster.



References

  1. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. and Wheeler, D. L. (2002). GenBank. Nucleic Acids Res. 30, 17-20.

  2. NCBI-GenBank, Flat File Release 131.0. Distribution Release Notes. ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

  3. GenBank and dbEST. Overview 2002. http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html

  4. Brett, D., Kemmner, W., Koch, G., Roefzaad, C., Gross, S. and Schlag, P. (2001). A rapid bioinformatic method identifies novel genes with direct clinical relevance to colon cancer. Oncogene 20, 4581-4585.

  5. Brett, D., Hanke, J., Lehmann, G., Haase, S., Delbruck, S., Krueger, S., Reich, J. and Bork, P. (2000). EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 474, 83-86.

  6. Brett, D., Pospisil, H., Valcarcel, J., Reich, J. and Bork, P. (2002). Alternative splicing and genome complexity. Nat. Genet. 30, 29-30.