School of Computing Science, Middlesex University,
Bounds Green,
London N11 2NQ, UK
Phone: +44-208-411 6183
Fax: +44-208-411 5924
Email: n.x.khan@mdx.ac.uk, s.rahman@mdx.ac.uk
For biomedical and drug-development research, it is essential to analyse the complete set of human gene mutation data in order to understand the underlying molecular mechanisms of disease. The gene mutation database [Cooper et al. 1998] is a rich source of information and contains a very large number of entries. However, it lacks the complete data sets that are essential for analysing traits for indirect association with diseases. A typical example in this context would be the analysis of the Recombination Fraction (θ) and Relative Dinucleotide Mutabilities (rdm) [Cooper and Krawczak, 1993].
Paton et al. (2000) developed a conceptual model for genomic and related functional data sets, but they implemented the model using genome data warehouses, and data warehouses produce data redundancy and overlapping attributes. This paper proposes the theoretical concept for an object model of gene mutation data. An initial outline of this conceptual model, interpreted as an integrated co-operative framework, was proposed in an earlier paper [Khan et al., 2001].
At present, interoperability between databases is achieved by designing generic classes and attributes applicable to the general community. This results in overlapping attributes and in the same objects being represented by different classes in different databases. For example, Gene classes exist in both the Genome Database (GDB) and the Genome Sequence Database (GSD).
Our research focuses on designing a schema for a genetic disorder database that will have unique classes, attributes and objects for variance analysis. The schema will be unique in that it will have no components overlapping with other genetic databases. That is, if attributes ambd1, ambd2, ambd3, ..., ambdn exist in classes Cmbd1, Cmbd2, Cmbd3, ..., Cmbdn of public domain molecular biology databases with schemas Smbd1, Smbd2, Smbd3, ..., Smbdn, then the same attributes (ambd1, ..., ambdn) and classes (Cmbd1, ..., Cmbdn) will not exist in the schema Sgd of the genetic disorder database. The classes Cmbd1, Cmbd2, Cmbd3, ..., Cmbdn may intersect with one another, but the intersection of any of Cmbd1, ..., Cmbdn with any class of Sgd will be empty (Table 1). The classes Cgd1, ..., Cgdn, which belong to Sgd, will be unique among all the heterogeneous databases of scientific interest. The classes Cgd1, ..., Cgdn can be subclasses of the classes Cmbd1, Cmbd2, Cmbd3, ..., Cmbdn, but they cannot be subsets of these classes. Likewise, the classes Cgd1, ..., Cgdn can be superclasses of the classes Cmbd1, Cmbd2, Cmbd3, ..., Cmbdn, but they cannot contain any union of attributes belonging to those classes (Table 1). This will ensure that Cgd can never be a part of any derived subclass or superclass. The attributes of the classes Cgd1, ..., Cgdn will not be composite attributes built from the values of attributes of the Cmbd1, ..., Cmbdn classes. This will ensure that no data redundancy arises when data warehouses are constructed and when they interoperate with heterogeneous databases.
Table 1: Non-Redundant Schema Integration
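The non-overlap condition on Sgd described above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes each schema can be reduced to a mapping from class names to sets of attribute names, and the function name `overlapping_components` as well as all attribute names in the example are hypothetical. Only the fact that a Gene class exists in both GDB and GSD comes from the text.

```python
from typing import Dict, Set

# Assumption: a schema is modelled as {class name -> set of attribute names}.
Schema = Dict[str, Set[str]]


def overlapping_components(
    s_gd: Schema, public: Dict[str, Schema]
) -> Dict[str, Dict[str, Set[str]]]:
    """For each public schema, report class and attribute names shared with s_gd."""
    gd_classes = set(s_gd)
    gd_attrs = set().union(*s_gd.values())
    overlaps: Dict[str, Dict[str, Set[str]]] = {}
    for db_name, schema in public.items():
        shared_classes = gd_classes & set(schema)
        shared_attrs = gd_attrs & set().union(*schema.values())
        if shared_classes or shared_attrs:
            overlaps[db_name] = {
                "classes": shared_classes,
                "attributes": shared_attrs,
            }
    return overlaps


def is_non_redundant(s_gd: Schema, public: Dict[str, Schema]) -> bool:
    """True when s_gd shares no class or attribute name with any public schema."""
    return not overlapping_components(s_gd, public)


# Hypothetical attribute names; the public schemas may overlap with each other.
public_schemas = {
    "GDB": {"Gene": {"symbol", "locus"}},
    "GSD": {"Gene": {"symbol", "sequence"}},
}
s_gd_ok = {"MutatedGene": {"coding_region", "promoter_region"}}
s_gd_bad = {"Gene": {"symbol", "coding_region"}}

print(is_non_redundant(s_gd_ok, public_schemas))
print(overlapping_components(s_gd_bad, public_schemas))
```

The check compares names only; deciding whether two differently named attributes describe the same biological concept would need a curated mapping, which this sketch does not attempt.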
Figure 1 presents the object data model of human gene mutation data, based on the non-redundant schema integration concept. Data are represented as objects, which are instances of classes, and the emphasis is on the domain of mutation data. All the superclasses and subclasses are representations of this concept. The MutatedGene class represents the common features of a gene that has been mutated. It will store the coding region, promoter region, consensus sequence and the transcribed area. The two subclasses of this superclass are the DiseaseCausing class and the IndirectAssociation class. The DiseaseCausing class is further subdivided into two subclasses, SequenceFeature and ProductFeature. SequenceFeature stores information on mutation site characteristics, e.g., cross-over and gap. ProductFeature stores information on protein structure, affinity site modification and quantity of products. The AssayTechnique class supports this information with laboratory protocols and evidence, e.g., radiography images and ideograms.
The object classes described here are independent, except that the two main classes, WildtypeGene and MutatedGene, are related to each other. This relation is shown by the arrow in Fig. 1 and is itself represented as a class, Mutation. The relation is a functional dependency: a mutated gene depends on a particular gene sequence. The Mutation class is specialized into two subclasses, MutationTypes and MutationMeasurement.
The MutationTypes class will categorize the mutation data and store the information in two distinct subclasses. Mutation data are stored either in the class LesssynthesizedGeneProduct or in the class AbnormalGeneProduct, depending on the nature and characteristics of the mutation. LesssynthesizedGeneProduct holds the information when the mutation causes reduced synthesis of a normal gene product. The mutation type could be deletions (frame shifts), insertions, duplication, splice junction mutations, etc. The AbnormalGeneProduct class gathers information on gene structural defects, i.e., an elongated or a shortened gene product. It also stores information related to post-translational defects, i.e., modification of processing and instability of protein products. MutationMeasurement will store the qualitative and quantitative measurements of the mutation, i.e., Mutation Rate and Mutation from public domain databases, and it will depend on the mutation data entry.
Figure 1: A partial class hierarchy of human gene mutation data model.
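The class hierarchy described above can be sketched in code to make the superclass, subclass and relation structure concrete. This is an illustrative sketch only, assuming that "subclass" maps onto ordinary class inheritance. The class names are those used in the text. All field names and the example sequences are hypothetical placeholders, and AssayTechnique and IndirectAssociation details are omitted because the text does not specify their attributes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WildtypeGene:
    sequence: str = ""  # hypothetical field


@dataclass
class MutatedGene:
    # Common features of a mutated gene, as listed in the text.
    coding_region: str = ""
    promoter_region: str = ""
    consensus_sequence: str = ""
    transcribed_area: str = ""


@dataclass
class DiseaseCausing(MutatedGene):
    pass


@dataclass
class IndirectAssociation(MutatedGene):
    pass


@dataclass
class SequenceFeature(DiseaseCausing):
    # Mutation site characteristics, e.g. "cross-over" or "gap".
    site_characteristics: List[str] = field(default_factory=list)


@dataclass
class ProductFeature(DiseaseCausing):
    protein_structure: str = ""
    affinity_site_modification: str = ""
    product_quantity: str = ""


@dataclass
class Mutation:
    # The relation between the two main classes, modelled as a class.
    wildtype: WildtypeGene
    mutated: MutatedGene


@dataclass
class MutationTypes(Mutation):
    pass


@dataclass
class LesssynthesizedGeneProduct(MutationTypes):
    mutation_type: str = ""  # e.g. "deletion", "insertion", "duplication"


@dataclass
class AbnormalGeneProduct(MutationTypes):
    structural_defect: str = ""  # e.g. "elongated" or "shortened"
    post_translational_defect: str = ""


@dataclass
class MutationMeasurement(Mutation):
    mutation_rate: float = 0.0  # hypothetical field


# Usage with made-up placeholder sequences.
example = LesssynthesizedGeneProduct(
    wildtype=WildtypeGene(sequence="ATGGCC"),
    mutated=SequenceFeature(coding_region="ATGCC", site_characteristics=["gap"]),
    mutation_type="deletion",
)
print(isinstance(example, Mutation), isinstance(example.mutated, MutatedGene))
```

Modelling Mutation as a class that holds both a WildtypeGene and a MutatedGene mirrors the statement that the relation itself is represented as a class; other encodings, such as a separate association table, would be equally consistent with the description.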
This paper proposes the development and implementation of a conceptual database model of gene mutation data. The proposed conceptual model is intended to make the search for defective genes within human genome databases more effective. The hyperlink access and open-ended design of the database will allow it to act as a component of a community or federated database for a specialised research community. WWW links will establish correlations with global databases and will present all the information corresponding to a specific gene mutation locus (protein structural details, gene map and disease information) through a single interface.