|In Silico Biology 8, 0006 (2007); ©2007, Bioinformation Systems e.V.|
1 Bio-IT Business Promotion Center, NEC Corporation, 34 Miyukigaoka, Tsukuba, Ibaraki 305-8501, Japan
2 Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan
3 Database Center for Life Science, Research Organization of Information and Systems, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
4 Bio-IT Business Promotion Center, NEC Corporation, 5-7-1 Shiba, Minato-ku, Tokyo 108-8001, Japan
5 Internet Systems Research Laboratories, NEC Corporation, 8916-46 Takayama-cho, Ikoma, Nara 630-0101, Japan
6 Fundamental and Environmental Research Laboratories, NEC Corporation, 34 Miyukigaoka, Tsukuba, Ibaraki 305-8501, Japan
* Corresponding author
Edited by E. Wingender; received May 01, 2007; revised August 17, October 16, and December 17, 2007; accepted December 17, 2007; published December 22, 2007
Microarray technology is now widely used by biological researchers to identify genes associated with conditions such as diseases and drugs. To date, many methods have been developed to analyze data covering a large number of genes, but they focus only on statistical significance and cannot interpret the data in terms of biological concepts. Gene Ontology (GO) is used to interpret such data biologically; however, it is restricted to specific ontologies, namely biological process, molecular function, and cellular component. Here, we apply MeSH (Medical Subject Headings) to interpret groups of genes from a biological viewpoint. To assign MeSH terms to genes, contexts associated with genes are retrieved from the full set of MEDLINE data using machine learning, and MeSH terms are then extracted from the retrieved articles. Using this method, we implemented a software tool called BioCompass. Using the MeSH and GO trees, it generates high-scoring lists and hierarchical lists of disease MeSH terms associated with groups of genes, and it draws a wiring diagram by linking genes through associations extracted from articles. Researchers can easily retrieve genes and keywords of interest, such as diseases and drugs, associated with groups of genes. By using the retrieved MeSH terms in conjunction with OMIM, we could obtain more disease information associated with a target gene. BioCompass helps researchers to interpret groups of genes, such as those obtained from microarray data, from a biological viewpoint.
Keywords: MeSH terms, Gene Ontology, MEDLINE, text mining, OMIM, machine learning, microarray
Microarray technology is one of the most powerful tools for genome-wide analysis of gene expression changes under multiple conditions. Many methods, such as noise reduction [1, 2], hierarchical clustering, self-organizing map (SOM), and k-means, have been developed to decipher microarray data. For further interpretation, the data must be understood not only in terms of statistical significance but also in terms of biological phenomena. For example, researchers try to relate genes in specific clusters to diseases and pathways annotated in biological databases. However, it is often difficult for biologists to grasp a biological overview of their own microarray data.
To overcome this difficulty, Gene Ontology (GO) has been developed for eukaryotic model organisms. In conjunction with GO annotations, which are associations made between gene products and GO terms, GO helps in the biological interpretation of genes [7, 8]. However, GO terms come only from three specific ontologies (biological process, cellular component, and molecular function), so they do not contain species-specific terms, especially those for human (disease and anatomy fields), or terms for chemical substances, which are very useful for drug discovery.
In this paper, we propose a novel method for microarray data analysis that introduces MeSH (Medical Subject Headings) terms from the MEDLINE database in association with GO terms. MeSH contains terms for concepts that GO does not cover, such as "Diseases", "Chemicals and Drugs", and "Biological Phenomena". We extracted relationships between genes and these three categories. Here, we focus on gene-disease associations and present them as results, because we were interested in this relationship for integrated analyses with the OMIM (Online Mendelian Inheritance in Man) database. Although the use of MeSH to interpret microarray data has been reported [11, 12], the terms in those studies came only from articles referred to in sequence databases, such as GenBank, RefSeq, and Swiss-Prot. We therefore used the full set of MEDLINE data to retrieve articles related to each gene and extracted MeSH terms from these articles. Identifying the contexts that refer to each gene in articles is a major research theme, and many text-mining approaches have been reported [13-16]. We applied machine learning to extract contexts and collected gene-disease associations. To implement this method, we developed a software tool called BioCompass. With BioCompass, researchers can easily search for genes and keywords of interest and construct networks from those genes. A slimmed-down version of BioCompass is available at http://gendoo.dbcls.jp/.
Entrez Gene was downloaded as gene/protein data from the NCBI (National Center for Biotechnology Information) FTP site (ftp://ftp.ncbi.nlm.nih.gov/gene/) in January 2006. MEDLINE article data available in January 2006 were obtained from NLM (National Library of Medicine). MeSH terms and Substance Names (2006 Release) were also obtained from NLM (http://www.nlm.nih.gov/mesh/meshhome.html). GO terms were downloaded from the Gene Ontology Consortium web site (http://www.geneontology.org/).
Extraction of articles related to each gene by text-mining technology
We applied machine learning (ML) to extract all articles related to each gene from the full set of MEDLINE data. Fig. 1 shows a schematic view of the pipeline for extracting articles describing each gene. The pipeline consists of three steps.
|Figure 1: Schematic view of the pipeline for collecting contexts, PMIDs, and Gene IDs. The pipeline consists of three steps. First, contexts are collected by utilizing Entrez Gene and MEDLINE. Next, contexts are collected without text mining, and a decision tree is generated with the collected contexts as a training set. More contexts are then re-collected using the constructed decision tree.|
In the first step, PubMed IDs (PMIDs) of articles referred to in Entrez Gene were retrieved from gene2pubmed, because these articles are expected to describe the gene. Contexts, PMIDs, and Gene IDs were then extracted from MEDLINE data and stored as a "gold standard".
In the second step, contexts that clearly indicate each gene were extracted from the full set of MEDLINE data. Contexts that have spelling variations or refer to several genes were not extracted in this step.
In the last step, a decision tree was generated using the stored contexts, MeSH terms, and co-occurring words as a training set. Articles were then extracted from the full set of MEDLINE data by applying the decision tree to contexts in each article, especially candidate abbreviation/long-form pairs. This step was performed several times to refine the rules of the decision tree. After these steps, MeSH terms were extracted from the stored articles and assigned to each gene.
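The first step can be illustrated with a minimal Python sketch that builds the seed mapping from Gene IDs to PMIDs. It assumes that gene2pubmed is a tab-separated file with the columns tax_id, GeneID, and PubMed_ID and with header lines starting with "#"; this layout should be checked against the file actually downloaded. The function name `parse_gene2pubmed` and the toy records (including the PMIDs) are hypothetical and are not part of BioCompass.

```python
from collections import defaultdict


def parse_gene2pubmed(lines, tax_id=None):
    """Map each Gene ID to the set of PMIDs listed for it.

    Assumes tab-separated columns tax_id, GeneID, PubMed_ID and
    '#'-prefixed header lines (an assumption about the file layout).
    """
    gene_to_pmids = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        tax, gene_id, pmid = line.split("\t")[:3]
        if tax_id is not None and tax != str(tax_id):
            continue  # keep only the requested organism
        gene_to_pmids[gene_id].add(pmid)
    return dict(gene_to_pmids)


# Toy records: the PMIDs and the second Gene ID are placeholders.
sample = [
    "#tax_id\tGeneID\tPubMed_ID",
    "9606\t1836\t11111111",
    "9606\t1836\t22222222",
    "10090\t99999\t33333333",
]
seed = parse_gene2pubmed(sample, tax_id=9606)
```

The resulting PMID sets would then be used to pull the corresponding abstracts from MEDLINE as the seed contexts for training.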
Scoring of genes and MeSH terms
MeSH terms associated with each gene were ranked according to information gain, which reflects both the frequency of co-occurrence of a gene and a term and the specificity of the term.
Information gain is defined as the change in information entropy from a prior state to a state in which some information has been received. Here, the change in the information entropy of a term was calculated before and after the corresponding gene was given as information. The information entropy is calculated as
$$H = -\sum_{i} p_i \log p_i ,$$
where $p_i$ is the probability that the term is or is not assigned to articles in MEDLINE.
The information gain $I(t, g)$ of term $t$ and gene $g$ is defined as
$$I(t, g) = H(T) - \sum_{j} \frac{|G_j|}{|T|} H(G_j) ,$$
where $|T|$ refers to the total number of articles, $|G_j|$ represents the number of articles describing or not describing gene $g$, $H(T)$ is the entropy of term $t$ over all articles, and $H(G_j)$ is the entropy of term $t$ within the subset $G_j$.
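As a worked illustration of this score, the following Python sketch computes the information gain of a term with respect to a gene from four article counts. It follows the standard definition of information gain for a binary term and a binary gene split; the base-2 logarithm, the function names, and the example counts are assumptions made for illustration and are not taken from the BioCompass implementation.

```python
import math


def entropy(probabilities):
    """Shannon entropy (base 2 assumed) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def binary_entropy(k, n):
    """Entropy of 'term assigned / not assigned' when k of n articles have the term."""
    if n == 0:
        return 0.0
    p = k / n
    return entropy([p, 1.0 - p])


def information_gain(n_total, n_term, n_gene, n_both):
    """Information gain of a term given a gene.

    n_total: all articles (|T|)
    n_term:  articles annotated with the term
    n_gene:  articles describing the gene
    n_both:  articles describing the gene and annotated with the term
    """
    prior = binary_entropy(n_term, n_total)
    n_not_gene = n_total - n_gene
    # Weighted entropy of the term within each subset G_j
    # (articles describing the gene, articles not describing it).
    conditional = (
        (n_gene / n_total) * binary_entropy(n_both, n_gene)
        + (n_not_gene / n_total) * binary_entropy(n_term - n_both, n_not_gene)
    )
    return prior - conditional


# Hypothetical counts: a term independent of the gene gains nothing,
# while a term that always co-occurs with the gene gains the most.
independent = information_gain(100, 50, 20, 10)
co_occurring = information_gain(100, 20, 20, 20)
partial = information_gain(100, 20, 20, 10)
```

With these counts, the term that is independent of the gene scores zero, and the fully co-occurring term scores higher than the partially co-occurring one, which matches the intended ranking behavior.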
BioCompass was developed using Python and Perl under Linux. Users access the BioCompass server with Internet Explorer in Windows. They can search for a few genes interactively via a web interface of BioCompass.
When a user is interested in many genes at a time, it is possible to upload files to BioCompass, including additional attributes for these genes such as accession numbers and microarray probe IDs. In this case, BioCompass outputs the results as Microsoft Excel files. The Excel files are generated with the Perl module Spreadsheet::WriteExcel. The Excel report files are divided into several parts, each containing no more than 200 genes, because Excel can handle only up to 65,536 rows and 256 columns in one sheet. (The latest version of Excel, Excel 2007, can handle up to 1,048,576 rows and 16,384 columns, but it is not yet widely used.)
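The splitting of a large gene list into report parts can be sketched as follows. This is only an illustration in Python of the 200-gene limit described above; the function name `split_into_reports` and the placeholder gene names are hypothetical, and the actual reports are written with Spreadsheet::WriteExcel in Perl.

```python
def split_into_reports(genes, max_genes=200):
    """Split a gene list into consecutive parts of at most max_genes genes."""
    return [genes[i:i + max_genes] for i in range(0, len(genes), max_genes)]


# Placeholder gene names for illustration only.
genes = [f"gene_{i}" for i in range(450)]
parts = split_into_reports(genes)
part_sizes = [len(part) for part in parts]
```

For 450 uploaded genes, this gives three parts of 200, 200, and 50 genes, each of which would be written to its own report file.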
We also developed a client application called BioCompass Network Viewer, which generates a wiring-diagram file as another way of viewing the results. BioCompass Network Viewer was developed using Visual Studio on Microsoft Windows.
Availability of MeSH
MeSH is a controlled vocabulary, curated by NLM, for indexing articles in MEDLINE. It contains more than 23,000 terms hierarchically categorized into 15 concepts such as "Anatomy", "Chemicals and Drugs", and "Diseases". NLM also provides terms for the names of chemical compounds, called "Substance Names". More than 156,000 terms are listed as Substance Names.
To ascertain which MeSH terms represent the biological properties of groups of genes, we examined what kinds of links, and how many links, are assigned to terms in MEDLINE.
First, we checked how many genes can be annotated with MeSH, using the mouse data of Entrez Gene as an example. We counted the number of genes that have reference articles, because MeSH terms are assigned not directly to genes but to articles. Fig. 2 shows a Venn diagram illustrating the number of genes to which we could assign GO or MeSH terms: 38,007 entries (60.1%) could be assigned MeSH terms, but only 16,753 entries (26.5%) could be assigned GO terms. This result shows that MeSH covers more genes than GO and is thus more effective for the biological interpretation of gene function.
|Figure 2: Venn diagram for comparison of the number of MeSH and GO terms associated with genes. The numbers of genes having reference articles and assigned GO terms in Entrez Gene are compared. Gene data for mouse was used for this comparison. The coverage of annotated genes is raised from 26.5% to 60.1% by using MeSH terms in addition to GO terms.|
Next, we checked what kinds of terms were assigned to a gene, taking the gene "SLC26A2" (DTDST, diastrophic dysplasia sulfate transporter, Gene ID: 1836) as an example. We retrieved the 59 articles annotated with the Substance Name "SLC26A2 protein, human" from MEDLINE. We then extracted MeSH terms from the retrieved articles and counted the number of articles in which each term appears. We also extracted the corresponding GO terms from Entrez Gene.
Tab. 1 lists the extraction results. We ranked MeSH terms by their frequency (Tab. 1b). The assigned GO terms indicate that the product of the gene SLC26A2 has sulfate-transporter activity (more precisely, SLC26A2 encodes a member of the sulfate-transporter family), and the MeSH terms indicate that SLC26A2 is related to developmental disorders of bone. This result suggests that, by using MeSH in addition to GO, we can understand not only gene functions but also species-specific information (for example, diseases) related to each gene.
|Table 1: Lists of MeSH and GO terms related to gene "SLC26A2" retrieved without text mining|
|(a) GO terms|
|Ontology||GO term|
|Molecular Function||sulfate transport activity|
|Biological Process||sulfate transporter|
|Cellular Component||integral to membrane|
|(b) MeSH terms|
|Category||MeSH term||Frequency|
|Chemicals and Drugs||Sulfates||25|
|Diseases||Bone Diseases, Developmental||6|
|GO terms were retrieved from Entrez Gene (a) and MeSH terms by searching MEDLINE by inputting Substance Name, "SLC26A2 protein, human" (b). The frequency in (b) refers to the number of retrieved articles (i.e., 59).|
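The frequency ranking used for Tab. 1b can be sketched in Python as follows. This is a minimal illustration that assumes each retrieved article is available as a collection of its MeSH terms; the function name `rank_mesh_terms` and the three toy articles are hypothetical and do not reproduce the 59 articles analyzed here.

```python
from collections import Counter


def rank_mesh_terms(articles):
    """Rank MeSH terms by the number of articles in which they appear.

    articles: iterable of MeSH-term collections, one per article.
    """
    counts = Counter()
    for terms in articles:
        counts.update(set(terms))  # count each term at most once per article
    return counts.most_common()


# Toy input for illustration only.
articles = [
    ["Sulfates", "Bone Diseases, Developmental"],
    ["Sulfates"],
    ["Sulfates", "Sulfates", "Osteochondrodysplasias"],
]
ranked = rank_mesh_terms(articles)
```

Counting each term once per article keeps the frequency comparable to the number of retrieved articles rather than to the number of term occurrences.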
Extraction of articles related to each gene by text-mining technology
To assign MeSH terms to genes, we utilized articles from MEDLINE, which contains more than 16 million articles published since the mid-1950s. In other works [11, 12], associated articles were extracted from the descriptions in sequence databases, such as GenBank, RefSeq, and Swiss-Prot, so most articles contained in MEDLINE remained to be analyzed. We therefore attempted to retrieve all articles related to each gene from the full set of MEDLINE data. Gene names appear mainly in articles from the 1980s onward, but including older articles made little difference to the processing time for extracting associations, so we included articles published since the 1950s.
Many text-mining methods and tools for retrieving information from articles have been reported [13-16]. The major difficulty in linking a context to a gene is that gene names are often described ambiguously in the text. One problem is that a gene often has many aliases, such as "PDS", "pendrin", and "SLC26A4". Another problem is that exactly the same abbreviation can refer to several genes, diseases, chemical substances, and so on; for example, "PDS" means "pendrin" (gene), "personality disorder" (disease), or "polydioxanone" (chemical substance). Abbreviations are usually defined by their long-form names in the article, but long-form names have spelling variations such as "protein" and "proteins", "induce" and "inducing", and "ubiquitin specific" and "ubiquitin-specific". In addition, variations in the description of subtypes, ion names, and species names must be taken into account. Tab. 2 lists examples of long-form names of the NHE3 gene found by searching PubMed. All of these descriptions must be recognized in order to extract articles describing the NHE3 gene. Subtypes are indicated not only by Arabic numerals and letters but also by Roman numerals, as in "carbonic anhydrase isozyme II".
We therefore applied machine learning to extract articles related to each gene from the full set of MEDLINE data. Fig. 1 shows a schematic view of the pipeline for extracting articles describing each gene. The pipeline consists of three steps. We collected contexts from articles linked to each gene in sequence databases and then generated a decision tree using these contexts as a training set. After that, we collected more contexts from the full set of MEDLINE data with the generated decision tree. After these steps, we extracted MeSH terms from the articles and assigned the terms to the genes.
|Table 2: Description variations of the "NHE3" gene|
|Long-form names of NHE3|
|Na(+)/H(+) exchanger 3|
|Na+/H+ exchanger 3|
|human Na(+)/H(+) exchanger (NHE)3|
|Na(+)/H(+) exchanger isoform 3|
|Na(+)/H(+) exchanger NHE3|
|Na(+)-H(+) exchanger NHE3|
|Na/H exchanger isoform 3|
|Na/H exchanger isoform-3|
|type 3 Na(+)/H(+) exchanger|
|type 3 sodium hydrogen exchanger|
|sodium hydrogen exchanger type 3|
|sodium/hydrogen exchanger isoform, NHE3|
|sodium/proton exchanger NHE3|
|Long-form names of the NHE3 gene are listed. We extracted these descriptions from the 20 most recent articles retrieved by searching PubMed. To extract long-form names from articles, we should take into account not only spelling variations but also descriptions of subtypes and ion names.|
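To make the kinds of variation in Tab. 2 concrete, the following Python sketch collapses several of the listed long-form names to a single key with a few hand-written rules. This is a simple rule-based illustration, not the decision-tree approach used in this study; the function name `normalize_long_form` and the small lookup tables are hypothetical and cover only the variants shown here.

```python
import re

# Hypothetical lookup tables covering only the examples in this section.
ION_NAMES = {"sodium": "na", "hydrogen": "h", "proton": "h"}
ROMAN_NUMERALS = {"ii": "2", "iii": "3", "iv": "4"}
FILLER_WORDS = {"isoform", "type", "human"}


def normalize_long_form(name):
    """Reduce a long-form gene name to an order-independent token key."""
    text = name.lower()
    text = text.replace("(+)", "").replace("+", "")  # drop ion charges
    text = re.sub(r"[/\-,()]", " ", text)            # unify separators
    tokens = []
    for token in text.split():
        token = ION_NAMES.get(token, token)          # "sodium" -> "na"
        token = ROMAN_NUMERALS.get(token, token)     # "ii" -> "2"
        if token not in FILLER_WORDS:
            tokens.append(token)
    return tuple(sorted(tokens))


variants = [
    "Na(+)/H(+) exchanger 3",
    "Na+/H+ exchanger 3",
    "Na(+)/H(+) exchanger isoform 3",
    "Na/H exchanger isoform-3",
    "type 3 Na(+)/H(+) exchanger",
    "type 3 sodium hydrogen exchanger",
    "sodium hydrogen exchanger type 3",
]
keys = {normalize_long_form(v) for v in variants}
```

All seven variants above map to the same key, but names that embed the gene symbol, such as "Na(+)/H(+) exchanger NHE3", do not, and sorting the tokens could merge names that differ only in word order. Such cases show why hand-written rules alone are insufficient and a learned classifier is attractive.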
Scoring for pairs of genes and MeSH terms
We ranked the MeSH terms related to each gene according to the frequency of co-occurrence of the gene and the term and the specificity of occurrence of the term. Tab. 3 lists high-scoring disease MeSH terms as an example of the output of a search with "SLC26A2" as input. In this search, we were able to retrieve more disease terms than in the search without text mining (shown in Tab. 1).
|Table 3: List of top-ten disease MeSH terms related to "SLC26A2" including the extension by text-mining technology|
|Bone Diseases, Developmental||0.66|
|We extracted contexts related to each gene from the full set of MEDLINE data, and estimated scores of gene-term associations. Here, we show the scores normalized by the highest-scoring term (here, Osteochondrodysplasias). This list is an example of the result of searching disease MeSH terms by inputting "SLC26A2".|
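The normalization used in Tab. 3 amounts to dividing every score by the highest score in the list. A minimal Python sketch is given below; the function name `normalize_scores`, the rounding to two decimals, and the raw scores are assumptions for illustration, since the raw scores behind Tab. 3 are not given here.

```python
def normalize_scores(scores, digits=2):
    """Scale scores so that the highest-scoring term becomes 1.0."""
    top = max(scores.values())
    if top <= 0:
        raise ValueError("the highest score must be positive")
    return {term: round(value / top, digits) for term, value in scores.items()}


# Placeholder terms and raw scores for illustration only.
raw_scores = {"term A": 4.0, "term B": 1.0, "term C": 3.0}
normalized = normalize_scores(raw_scores)
```

After normalization the top term has a score of 1.0 and the other terms are expressed relative to it, which makes lists for different genes easier to read side by side.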
This result suggests relationships between genes and diseases that are not mentioned directly in the OMIM (Online Mendelian Inheritance in Man) and Entrez Gene databases.
Additionally, we investigated whether we could obtain more disease information related to the gene of interest by searching the OMIM database with the disease MeSH terms retrieved by this method. Tab. 4 lists OMIM entries related to SLC26A2. Tab. 4a shows that disease information can easily be retrieved by extracting the OMIM entries referred to in Entrez Gene. In addition, as shown in Tab. 4b, we could retrieve further relations between diseases and genes by searching OMIM with the disease MeSH terms listed in Tab. 3. We conclude that, by using this method in conjunction with OMIM, researchers can retrieve more information about diseases associated with a target gene.
|Table 4: Lists of OMIM terms associated with gene "SLC26A2"|
|(a) OMIM entries referred to in Entrez Gene|
|OMIM ID||OMIM entry|
|*606718||Solute carrier family 26 (sulfate transporter), member 2; SLC26A2|
|#600972||Achondrogenesis, type IB; ACG1B|
|#256050||Neonatal osseous dysplasia I|
|#226900||Epiphyseal dysplasia, multiple, 4; EDM4|
|(b) OMIM entries retrieved by searching OMIM with disease MeSH terms|
|MeSH terms as input||Retrieved OMIM entry||OMIM ID|
|Osteochondrodysplasias||Osteochondrodysplasia, rhizomelic, with callosal agenesis, thrombocytopenia, hydrocephalus, and hypertension||166990|
|Dwarfism||Microcephalic osteodysplastic primordial dwarfism, type I||%210710|
|Dwarfism||Microcephalic osteodysplastic primordial dwarfism, type II||%210720|
|Dwarfism||Microcephalic osteodysplastic primordial dwarfism, type III||%210730|
|We retrieved OMIM entries from Entrez Gene (a) and by searching the OMIM database with the disease MeSH terms listed in Tab. 3, for example, "Osteochondrodysplasias", "Dwarfism", and "Achondroplasia" (b).|
We have presented a novel method for functional analysis of groups of genes that introduces MeSH in association with GO and can be applied to microarray data. The method aims to help researchers understand groups of genes in a biological context. Researchers can easily retrieve genes and keywords of interest from a gene list. The method is useful for interpreting a group of genes by searching for biological features, such as genes, diseases, and chemical substances, associated with the genes of interest and by constructing a wiring diagram of the retrieved genes.
We thank Kohei Tomizuka for the implementation of BioCompass. This work was supported by NEDO (New Energy and Industrial Technology Development Organization) as part of a project for developing biotechnology IT integration equipment (Focus21).