| In Silico Biology 4, 0005 (2003); ©2003, Bioinformation Systems e.V. |
| Ontology Workshop Tokyo 2003 |
1 Biological Information Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
Email: t-hishiki@jbirc.aist.go.jp
2 National Institute of Genetics
Email: oogasawa@lab.nig.ac.jp, kousaku@genomatrix.com
3 CREST, JST (Japan Science and Technology Corporation)
Email: tsuruoka@is.s.u-tokyo.ac.jp
* corresponding author
Edited by E. Wingender; received September 22, 2003; revised and accepted December 24, 2003; published December 28, 2003
As a first step toward the quantitative comparison of clinical features of diseases, we indexed the text descriptions in the Clinical Synopsis section of the Online Mendelian Inheritance in Man (OMIM) with concepts for the body parts, organs, and tissues contained in the Metathesaurus of the Unified Medical Language System (UMLS). We also indexed the text with the diseases and disorders having links to body parts specified in the thesaurus. The vocabulary comprised 177,540 representations for 81,435 concepts, and 2,161 concepts were indexed to 3,779 OMIM entries. The indexed concepts included 134 concepts for the adjectival forms of anatomical concepts and 985 concepts for diseases and disorders, which were linked to 132 and 408 anatomical concepts, respectively. We report herein that the retrieval of OMIM entries for diseases affecting specific organs can be made more comprehensive through the anatomical concepts indexed to the Clinical Synopsis or linked to the indexed concepts, as compared to simply matching organ names to the Clinical Synopsis text. The recall and precision of identifying relevant body parts in the Clinical Synopsis were calculated as 78% and 92.5%, respectively, based on random sampling. The examination of the unidentified body parts due to lack of indexed diseases and disorders showed that although most of the concepts for diseases and disorders were contained in the Metathesaurus, their relations to body parts were not. The indexing result proved the effectiveness of the Metathesaurus as a resource for the identification of concepts indicating body parts, diseases, and disorders.
Key words: text mining, automated indexing, OMIM, UMLS
Online Mendelian Inheritance in Man (OMIM) [1], a database of human genes and genetic disorders, has been widely used for research and education in the field of human genomics and the practice of clinical genetics. It provides a summary of genes and genetically determined phenotypes, as well as many databases of gene and genome sequences. Clinical manifestations caused by mutations or disorders are summarized in the Clinical Synopsis (CS) section of a record in the form of free-text. The free-text format is flexible and adaptable to expansion and revision of our system of knowledge. OMIM would be one of the primary resources for investigating the relationship between human phenotypes and the features of genes and gene products. Toward this goal, however, we must compare the free-text description in a quantitative way; i.e. we must judge the degree of similarity between features of diseases.
The motivation for the present study was to introduce an underlying structure to free-text CS that would enable this comparison. With this in mind, we identified two tasks. The first is to map or index pre-defined concepts to the free-text descriptions. The extent to which relations between the concepts are structured is not questioned at this point. The second is the definition of the distance relations between mapped concepts, the merger of closely related concepts, and the calculation of the similarity between descriptions of the diseases which have been defined as a collection of such concepts. The present paper addresses the first task. As a by-product of completing this task, a more comprehensive search of some of the concepts would become possible, because the concepts to be retrieved would be indexed based on the content of the text, regardless of the representation of the text.
We chose to map anatomical structures such as tissues, organs, and body parts on CS. We believe that indexing is important, because anatomical description has been at the basis of medicine. In addition, indexing will provide a link to genome-wide research that focuses on transcriptome or proteome from specific types of tissue. Combining genome-wide research with comprehensively indexed descriptions of clinical manifestations would facilitate the computation of the relationship between gene features and clinical features.
The structure of CS is loosely classified according to the affected systems or organs, and employs headers such as Lung and Liver. However, the headers are used inconsistently. For example, descriptions of emphysema are distributed under the headings of Lung, Pulmonary, RESPIRATORY/Lung ('/' indicates the subheading structure of an OMIM entry and is introduced in the present study, not in OMIM), and Resp. Similarly, descriptions of hepatomegaly appear under ABDOMEN, ABDOMEN/Liver, Abdomen, Abd, GI, Liver, MISCELLANEOUS, and MOLECULAR BASIS. Simple string matching of organ names (e.g., Lung or Liver) against these headers may therefore be insufficient for comprehensive retrieval of relevant descriptions of the diseases and disorders affecting the organs, nor would these variations be convenient for comparing the descriptions. Consequently, we did not use the structures existing in OMIM, but rather chose to index concepts directly on the free-text description.
The main problems were vocabulary for indexing terms and the indexing method. Our approach to the indexing method problem was to index the text while making maximum use of automation, and we planned to simply match a large vocabulary of organs, tissues, and body parts with the free-text description of the CS. We addressed the vocabulary issue in a way that avoided the re-invention of such a vocabulary. We chose to compile a subset of the Metathesaurus, a large vocabulary maintained in the Unified Medical Language System (UMLS) project [2] led by the National Library of Medicine (NLM) (http://www.nlm.nih.gov/research/umls/). The vocabulary has been built to include numerous biomedical terms, and links have been assigned among the terms. The Metathesaurus is the largest thesaurus in the biomedical domain that provides a representation of biomedical knowledge consisting of concepts classified by semantic types, as well as relationships among the concepts. It has been tested in the representation of biomedical specialties including bioinformatics [3], large-scale automated indexing targeting all the MEDLINE records [4], and natural language processing of biomedical texts [5].
OMIM Clinical Synopsis: An OMIM text-format file was downloaded from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/) in August, 2003. The CS fields were extracted and converted into a table with each row representing a line (usually a sentence or a phrase) of a description. Each row of the table was set to have four columns: MIM ID, the higher-level subheading, the lower-level subheading, and the free-text description. The CS table had 51,549 rows corresponding to 4,523 MIM IDs.
Vocabulary for indexing: Metathesaurus files as of July 2003 were downloaded from the UMLS Knowledge Server. We selected the following nine semantic types as likely to represent the tissues, organs, and body parts to which clinical manifestations would be mapped: 'Anatomical Structure', 'Embryonic Structure', 'Fully Formed Anatomical Structure', 'Body System', 'Body Part, Organ, or Organ Component', 'Tissue', 'Body Location or Region', 'Body Space or Junction', and 'Spatial Concept'. Terms of these semantic types were collected. The adjectival forms of these terms were linked to their noun forms using the 'noun_form_of' relation defined in the Metathesaurus. Although several terms of the 'Spatial Concept' semantic type were found to contain body part expressions, a number of non-biomedical terms were also included. Therefore, we manually excluded 306 representations of this semantic type from the collection.
Some concepts do not represent anatomical objects, but are tightly bound to anatomical structures. For example, hepatomegaly is the pathologically enlarged state of the liver. This type of relation is defined in the Metathesaurus as the 'location_of' relation. Using this relation, we collected non-anatomical terms related to the terms of the nine semantic types listed above, and merged the two collections.
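The vocabulary compilation described above can be illustrated with a minimal sketch. It assumes the Metathesaurus content has already been loaded into plain Python dictionaries and a list of relation triples; the function name `build_vocabulary`, the input layout, the direction of the triples, and the placeholder identifiers beginning with `X_` are all hypothetical and are not taken from the UMLS distribution files.

```python
from collections import defaultdict

# The nine semantic types selected in the text.
ANATOMICAL_TYPES = {
    "Anatomical Structure", "Embryonic Structure",
    "Fully Formed Anatomical Structure", "Body System",
    "Body Part, Organ, or Organ Component", "Tissue",
    "Body Location or Region", "Body Space or Junction",
    "Spatial Concept",
}
LINKING_RELATIONS = {"location_of", "noun_form_of"}


def build_vocabulary(concept_types, concept_names, relations):
    """Collect anatomical concepts plus concepts linked to them.

    concept_types: dict CUI -> set of semantic type names
    concept_names: dict CUI -> list of representations (strings)
    relations:     iterable of (source CUI, relation, anatomical CUI);
                   this triple direction is an assumption of the sketch.
    Returns (vocabulary, links): vocabulary maps a lower-cased
    representation to a set of CUIs, links maps a source CUI to the
    anatomical CUIs it points to.
    """
    anatomical = {cui for cui, types in concept_types.items()
                  if types & ANATOMICAL_TYPES}
    links = defaultdict(set)
    for source, relation, target in relations:
        if relation in LINKING_RELATIONS and target in anatomical:
            links[source].add(target)
    vocabulary = defaultdict(set)
    for cui in anatomical | set(links):
        for name in concept_names.get(cui, []):
            vocabulary[name.lower()].add(cui)
    return dict(vocabulary), dict(links)


# Toy data: only the liver CUI comes from the paper; X_* are placeholders.
toy_types = {
    "C0023884": {"Body Part, Organ, or Organ Component"},
    "X_HEPATOMEGALY": {"Sign or Symptom"},
    "X_FEVER": {"Sign or Symptom"},
}
toy_names = {
    "C0023884": ["Liver"],
    "X_HEPATOMEGALY": ["Hepatomegaly", "Enlarged liver"],
    "X_FEVER": ["Fever"],
}
toy_relations = [("X_HEPATOMEGALY", "location_of", "C0023884")]
toy_vocabulary, toy_links = build_vocabulary(toy_types, toy_names, toy_relations)
```

In this toy run, 'Fever' stays out of the vocabulary because it has no link to an anatomical concept, which mirrors the situation later identified as the main cause of missed body parts.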
Indexing: Each term representation in the vocabulary was matched to a free-text description field of the CS table with the word-based complete match in a case-insensitive manner. The row, offset of the start of the match from the beginning of the text description, and the length of the match were recorded. The longest matches were selected from overlapping matches. Any links to the noun forms from adjectival form anatomical terms were added, as were links to anatomical terms from non-anatomical terms.
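The matching step can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used in the study: words are taken to be runs of word characters, multi-word representations are compared after joining words with single spaces, and the names `index_line` and `max_words` are hypothetical.

```python
import re


def index_line(text, vocabulary, max_words=8):
    """Word-based, case-insensitive matching with longest-match selection.

    vocabulary: dict mapping a lower-cased representation to a set of CUIs.
    Returns a list of (offset, length, representation, cuis) sorted by offset.
    """
    # Each token keeps its character span so offsets can be recorded.
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    candidates = []
    for i in range(len(tokens)):
        last = min(i + max_words, len(tokens))
        for j in range(i + 1, last + 1):
            phrase = " ".join(tok[0] for tok in tokens[i:j])
            if phrase in vocabulary:
                start, end = tokens[i][1], tokens[j - 1][2]
                candidates.append((start, end - start, phrase, vocabulary[phrase]))
    # Prefer longer matches; discard any candidate overlapping a kept one.
    candidates.sort(key=lambda c: (-c[1], c[0]))
    kept = []
    for start, length, phrase, cuis in candidates:
        end = start + length
        if all(end <= k[0] or start >= k[0] + k[1] for k in kept):
            kept.append((start, length, phrase, cuis))
    return sorted(kept, key=lambda c: c[0])


demo_vocabulary = {
    "liver": {"C0023884"},
    "lung": {"C0024109"},
    "chronic obstructive lung disease": {"C0024117"},
}
demo_hits = index_line("Chronic obstructive lung disease; enlarged Liver",
                       demo_vocabulary)
```

Here the shorter match 'lung' is dropped because it lies inside the longer match, and 'Liver' is found despite its capitalization. Because whole words are compared, a string such as 'Delivery' does not match 'liver'.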
Validation of the indexing: We used two types of metrics to check the overall appropriateness of indexing: the recall, or the comprehensiveness with which appropriate organs or tissues were indexed; and the precision, or the ratio of correct indexing. We selected a small subset of the entire table: in each of ten iterations we randomly selected 100 lines (1,000 random selections in total).
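A minimal sketch of this sampling scheme is given below. It assumes that rows are drawn without replacement within an iteration but independently across iterations, which is one reading of the description; the function name and the seed are hypothetical.

```python
import random


def sample_unique_rows(n_rows, iterations=10, per_iteration=100, seed=None):
    """Draw per_iteration row numbers in each iteration and pool them.

    Rows drawn in more than one iteration are counted once, so the pooled
    sample can be smaller than iterations * per_iteration.
    """
    rng = random.Random(seed)
    selected = set()
    for _ in range(iterations):
        selected.update(rng.sample(range(n_rows), per_iteration))
    return sorted(selected)


sampled = sample_unique_rows(51_549, seed=2003)
print(len(sampled))  # at most 1,000; the exact value depends on the seed
```

Under this assumption a small number of duplicates across iterations is expected for a table of 51,549 rows, which would be in line with the 990 unique rows reported in the validation results.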
The effect of the indexing: We measured the effect of indexing by the increase in obtained OMIM entries in response to an anatomical term as a query. The numbers of retrieved OMIM entries were compared between two search methods: one method searched for anatomical concepts that were indexed to CS directly or linked from indexed concepts, and the other executed a text search through Entrez (http://www.ncbi.nih.gov/entrez/) with 'query terms'. As example queries, we arbitrarily selected 'liver', having the UMLS unique concept identifier (CUI) C0023884, and 'lung' (C0024109).
The effect of ambiguous representations: We evaluated the extent of the ambiguity problem, or the variation in word sense depending on the context, because our indexing method was a simple matching of a large vocabulary on the text. We selected 'ambiguous' representations from our vocabulary as follows: First, we selected representations denoting more than one CUI; then, for each representation, we examined all pairs of concepts for any relation with each other as defined in the Metathesaurus 'MRREL' table, and eliminated the pairs having any relation; as a result, we excluded the representations in which all of the represented concepts are inter-related. We performed further filtering by excluding concept pairs such that one concept concerns a procedure on the organs/tissues and the other concept relates to the neoplasm or tumors affecting them, through literal matching of the concept titles with, e.g., procedure, neoplasm, tumor, and cancer. Representations actually indexed to OMIM CS were selected together with the indexed description fields. We judged whether the CUI attached to the representations was correct.
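The two filtering steps can be sketched as below, assuming the vocabulary, the related concept pairs from the 'MRREL' table, and the concept titles are already available as Python objects. The function names, the input layout, and the placeholder identifiers `X1` to `X4` are hypothetical; the keyword lists follow the examples given in the text.

```python
from itertools import combinations

PROCEDURE_WORDS = ("procedure",)
NEOPLASM_WORDS = ("neoplasm", "tumor", "cancer")


def _mentions(title, words):
    """True if any keyword occurs literally in the concept title."""
    title = title.lower()
    return any(word in title for word in words)


def is_procedure_neoplasm_pair(title_a, title_b):
    """True if one title names a procedure and the other a neoplasm."""
    return (
        (_mentions(title_a, PROCEDURE_WORDS) and _mentions(title_b, NEOPLASM_WORDS))
        or (_mentions(title_b, PROCEDURE_WORDS) and _mentions(title_a, NEOPLASM_WORDS))
    )


def ambiguous_representations(vocabulary, related_pairs, titles):
    """Keep representations with at least one unrelated concept pair.

    vocabulary:    dict representation -> set of CUIs
    related_pairs: set of frozenset({cui_a, cui_b}) with any defined relation
    titles:        dict CUI -> concept title
    """
    ambiguous = {}
    for representation, cuis in vocabulary.items():
        pairs = [
            (a, b)
            for a, b in combinations(sorted(cuis), 2)
            if frozenset((a, b)) not in related_pairs
            and not is_procedure_neoplasm_pair(titles.get(a, ""), titles.get(b, ""))
        ]
        if pairs:
            ambiguous[representation] = pairs
    return ambiguous


# Toy data: the two CUIs for COLD come from the paper; X1..X4 are placeholders.
toy_vocab = {
    "cold": {"C0024117", "C0009443"},
    "liver": {"C0023884"},
    "hepatic": {"X1", "X2"},
    "colon": {"X3", "X4"},
}
toy_related = {frozenset(("X1", "X2"))}
toy_titles = {
    "C0024117": "Chronic Obstructive Lung Disease",
    "C0009443": "Common Cold",
    "X3": "Colon procedure",
    "X4": "Colon neoplasm",
}
toy_ambiguous = ambiguous_representations(toy_vocab, toy_related, toy_titles)
```

Only 'cold' survives in this toy example: 'liver' denotes a single concept, the two concepts for 'hepatic' are related to each other, and the pair for 'colon' is removed by the procedure-neoplasm filter.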
Vocabulary and indexing: The size of our vocabulary for indexing amounted to 177,540 representations (counted in a case-sensitive manner, decreasing to 168,661 when counted in a case-insensitive manner) for 81,435 concepts. We found that one of the terms (BIOPSY with UMLS CUI C0005558) had predefined links to various possible target organs and produced unnecessary mapping to anatomical structures. Therefore, we excluded the term before obtaining the statistics for the matching. The CS text description had 21,274 lines for 3,779 MIM IDs that matched 2,161 concepts (in 2,956 representations). Table 1 shows the major semantic types in the vocabulary and how often they matched the CS, along with the sizes of these collections of terms. The table shows that the number of indexed concepts for diseases and disorders (985 unique concepts) is large relative to the number of indexed anatomical concepts.
| Table 1: | Distribution of major UMLS semantic types in the vocabulary and the matched representations. The number of unique matched concepts for the diseases and disorders (those having semantic type ID as T047, T191, T019, T037, T046, T184, T020, T190, or T033) was 985. |
| Semantic Type ID | Semantic Type | Number of representations in the vocabulary | Number of concepts in the vocabulary | Number of matches on Clinical Synopsis | Number of matched representations | Number of matched concepts |
| T023 | Body Part, Organ, or Organ Component | 85686 | 49194 | 7474 | 1035 | 705 |
| T047 | Disease or Syndrome | 28459 | 4417 | 3961 | 734 | 550 |
| T029 | Body Location or Region | 15890 | 9801 | 4056 | 233 | 140 |
| T191 | Neoplastic Process | 12088 | 2787 | 314 | 104 | 99 |
| T030 | Body Space or Junction | 9261 | 4703 | 577 | 106 | 79 |
| T019 | Congenital Abnormality | 8089 | 2142 | 4224 | 446 | 318 |
| T037 | Injury or Poisoning | 7637 | 2646 | 20 | 12 | 10 |
| T082 | Spatial Concept | 3958 | 2480 | 2606 | 168 | 152 |
| T046 | Pathologic Function | 2107 | 219 | 757 | 80 | 49 |
| T024 | Tissue | 1770 | 713 | 1202 | 105 | 60 |
| T184 | Sign or Symptom | 1388 | 76 | 488 | 69 | 41 |
| T020 | Acquired Abnormality | 1372 | 292 | 122 | 27 | 22 |
| T190 | Anatomical Abnormality | 1076 | 98 | 269 | 46 | 31 |
| T022 | Body System | 1052 | 443 | 0 | 0 | 0 |
| T017 | Anatomical Structure | 914 | 648 | 626 | 19 | 11 |
| T018 | Embryonic Structure | 884 | 361 | 55 | 28 | 24 |
| T033 | Finding | 753 | 174 | 46 | 18 | 13 |
The total number of concepts for the organs, tissues, and body parts indexed to the text or linked from the indexed concepts was 1,233. A total of 985 diseases and disorders were associated with 408 anatomical concepts. Table 2 shows how the concepts for diseases and disorders were associated with anatomical concepts. For example, 441 out of 550 matched concepts for the diseases and syndromes were linked to the concepts for the body parts, organs, or organ components. A total of 134 anatomical concepts were found to represent the adjectival forms of 132 anatomical concepts.
| Table 2: | Links from concepts for diseases and disorders to anatomical concepts. The rows show the distribution of concepts for diseases and disorders linked to each type of anatomical concept. Multiple semantic types for the destination concepts make the sum larger than the number of matched concepts. |
| Number of the matched concepts linked to concepts with each Semantic Type | ||||||||
| Semantic Types for diseases and disorders | Number of matched concepts | T017: Anatomical Structure | T018: Embryonic Structure | T022: Body System | T023: Body Part, Organ, or Organ Component | T024: Tissue | T029: Body Location or Region | T030: Body Space or Junction |
| T019:Congenital Abnormality | 318 | 6 | 14 | 19 | 249 | 1 | 54 | 9 |
| T020:Acquired Abnormality | 22 | 5 | 18 | 1 | 2 | 2 | ||
| T033:Finding | 13 | 1 | 2 | 10 | 1 | |||
| T037:Injury or Poisoning | 10 | 4 | 8 | 1 | 1 | |||
| T046:Pathologic Function | 49 | 1 | 1 | 8 | 33 | 5 | 1 | 4 |
| T047:Disease or Syndrome | 550 | 18 | 3 | 101 | 441 | 38 | 29 | 16 |
| T048:Mental or Behavioral Dysfunction | 1 | 1 | ||||||
| T184:Sign or Symptom | 41 | 6 | 7 | 29 | 3 | 5 | ||
| T190:Anatomical Abnormality | 31 | 1 | 1 | 26 | 7 | 3 | ||
| T191:Neoplastic Process | 99 | 2 | 8 | 84 | 3 | 4 | 3 | |
Validation of the indexing: In order to check the correctness of indexing, we collected 990 unique rows after ten iterations of random sampling of 100 rows out of the total of 51,549 rows. The sampled collection included 399 text description fields (40% of the sample collection) containing indexed terms. Manual checking of the collection revealed that 482 text fields (49% of the sample collection) actually contained terms (body parts, organs, tissues, diseases, etc.) that appeared to have a clear link to specific body parts, organs, or tissues. We judged that in 377 text fields (78% of the text fields containing terms linked to anatomical concepts) the links have been correctly identified by automated indexing. In addition, we judged that 30 text fields (7.5% of the text fields containing indexed terms) contained inappropriately indexed terms. In summary, the recall of the relevant body part concept was 78%, and the precision was 92.5%.
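The reported figures can be reproduced from the counts given in this paragraph. The short calculation below is a worked check only; the variable names are ours, and precision is computed at the level of text fields, as the quoted percentages imply.

```python
# Counts reported in the validation paragraph.
fields_with_indexed_terms = 399     # sampled fields containing indexed terms
fields_with_anatomical_terms = 482  # fields judged manually to contain linked terms
fields_correctly_linked = 377       # fields whose links were correctly identified
fields_wrongly_indexed = 30         # indexed fields with inappropriate terms

recall = fields_correctly_linked / fields_with_anatomical_terms
precision = 1 - fields_wrongly_indexed / fields_with_indexed_terms

print(f"recall    = {recall:.1%}")     # recall    = 78.2%
print(f"precision = {precision:.1%}")  # precision = 92.5%
```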
In order to explain the comparatively lower recall, we analyzed the 83 text fields (17% of the text fields judged to have links to body parts) to which no indexing term had been assigned, although the text fields contained terms to be indexed, and found that 22 of these terms remained unassigned due to the lack of anatomical term representations in our vocabulary. For example, six fields had the representation of retinal, used as the adjectival form of retina. The defined semantic type of the represented concept was not Body Part, Organ, or Organ Component, as assigned to retina, but rather Organic Chemical and Biologically Active Substance. Other cases of adjectival forms (e.g. acetabular for acetabulum and palmar for palm), as well as plural forms (e.g. scapulae for scapula), were identified. However, the larger factor leading to the lower recall was diseases and disorders, because 60 of the unassigned terms (72% of the total unassigned terms in the sample) were due to diseases that were judged to be mapped to specific organs/tissues, although at present no definition of the location has been assigned in the Metathesaurus. Supplementary Table 1 lists the diseases and disorders as well as the organs/systems to which OMIM has assigned them, in the form of subheadings. The table shows that most of these diseases are actually contained in the entire set of the Metathesaurus, although they are not related to specific organs or tissues.
Effect of indexing: Using Entrez databases, we retrieved 193 OMIM entries related to the liver. We searched for the UMLS CUI for the liver in the indexing results, and retrieved 238 entries. The 238 entries contained 60 entries retrieved only through our indexing (56 correct retrievals and 4 false retrievals), and 15 entries were retrieved only by Entrez. By means of our concept-based method, we thus retrieved 41 (= 60 - 15 - 4) additional OMIM entries for which disorders/mutations may affect the liver. Supplementary Table 2 shows the results.
Meanwhile, we retrieved 149 lung-related OMIM entries by Entrez, including 24 entries retrieved only by Entrez, and 251 lung-related entries by the UMLS CUI for the lung in the indexing results. The 251 entries included 126 entries retrieved only by our indexing. However, 26 of the 126 entries were retrieved incorrectly through the representation COLD. In our vocabulary, COLD had been assigned to C0024117 (short for 'Chronic Obstructive Lung Disease') and to C0009443 ('Common Cold'), which had been linked to the lung and to 'Nose, accessory sinus and nasopharynx', respectively. The match typically occurred in the context of 'cold exposure' or 'cold-induced', which was much more likely to represent 'cold temperature' (C0009264). Supplementary Table 3 shows the results.
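The retrieval counts for both organs can be cross-checked with simple set arithmetic. The helper below is a sketch for that check only; its name and the returned keys are hypothetical, and it uses nothing beyond the counts reported in the text.

```python
def overlap_summary(n_entrez, only_entrez, n_indexed, only_indexed):
    """Derive shared, combined, and net-gain counts for two retrieval methods.

    Raises ValueError if the two ways of computing the shared entries disagree.
    """
    shared = n_entrez - only_entrez
    if shared != n_indexed - only_indexed:
        raise ValueError("reported counts are inconsistent")
    return {
        "shared": shared,
        "combined": shared + only_entrez + only_indexed,
        "net_gain": n_indexed - n_entrez,
    }


liver = overlap_summary(n_entrez=193, only_entrez=15, n_indexed=238, only_indexed=60)
lung = overlap_summary(n_entrez=149, only_entrez=24, n_indexed=251, only_indexed=126)

# Liver: subtracting the 4 false retrievals from the net gain gives 41.
liver_additional = liver["net_gain"] - 4
print(liver, liver_additional)
print(lung)
```

For the liver, 178 entries are shared by the two methods and the net gain of 45 entries reduces to 41 once the four false retrievals are removed. For the lung, 125 entries are shared.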
The influence of ambiguous representations: The example of the lungs suggests that additional cases of indexed terms that point to incorrect organs may exist due to the ambiguity of the representations. Our vocabulary included 861 representations (1% of the representations in our vocabulary) that represented more than one CUI. After the exclusion of representations denoting inter-related concepts, 203 representations remained. After the exclusion of the procedure-neoplasm concept pairs, 177 representations remained, 84 of which were actually indexed to 1,196 CS description fields.
We found that 18 representations, including COLD, were used in contexts in which no assigned anatomical concept was appropriate (Supplementary Table 4). For example, none of the contexts to which axis was indexed (axis vertebra was not indexed) were related to 'Second cervical vertebra', but a large number of the contexts were related to 'Electrocardiographic axis'; none of the contexts indexed by anterior represented 'anterior lobe of hypophysis', having been assigned to the representation in the Metathesaurus; none of the contexts indexed by dilation represented 'Endoscopic dilatation', but, rather, all were related to 'Pathological Dilatation', e.g. 'Cardiac dilation'. The table also shows that seven representations are included in the entire set of the Metathesaurus, although these representations are not located to specific organs or tissues.
The concepts assigned to 12 representations were judged to be either correct or incorrect, depending on the context. For example, chest and thoracic represented both the upper part of the trunk and a specific region of organs (e.g. the skin and the spine), overlapping with the body part. Another example was parathyroid and thyroid, which represented both the hormone-secreting organs and part of the names for the hormones. Nineteen of the descriptions of parathyroid and 18 descriptions of thyroid in the sampled text were related to the hormones secreted by the organs, and their assignment to the concept for the organs was judged to be incorrect. For example, 'Serum parathyroid hormone elevated' does not always indicate a disorder of the parathyroid. The Metathesaurus also contains representations for these hormones.
The increasing knowledge of gene function has revealed the need to describe phenotypes in a comparative manner, which is the motivation behind the construction of controlled vocabularies for phenotypes. For example, the Mouse Genome Database (MGD) [6] introduced 105 terms for higher-level classification of phenotypes that will be used to group or compare phenotypes. Similarly, the increasing knowledge of human diseases prompted interest in the general principles of human diseases and the investigation into the relations between gene function and the features of the diseases. Thus, the description of the clinical manifestation of the diseases in a manner that will enable their quantitative comparison has been proposed. For example, Freudenberg and Propping [7] manually indexed the diseases from OMIM in accordance with their episodic occurrence, primary etiology, primary tissue, mode of inheritance, and age of onset. In particular, the reported tissue categories were central nervous system, peripheral nervous system, eye, lens, cornea, ear, heart, lung, kidney, gastro-intestinal, liver, bone-marrow derived cells, endocrine tissue, connective tissue, muscle, skin, and bone. Our approach shares the motivation and is similar to theirs in that the output is the OMIM description indexed with the organs and tissues; however, the indexed concepts are generally of much finer granularity.
Another difference in our approach is the automation of the indexing processes and the application of existing thesaurus resources to the indexing. We indexed the free-text description of Clinical Synopsis sections with not only anatomical terms and their adjectival variants, but also terms that have a defined relation to the anatomical terms. Application of the UMLS Metathesaurus to the vocabulary building process made the process efficient and achieved a recall of 78% and a precision of 92.5%. However, we think that this level of performance is marginal for the purpose of converting all of the text descriptions to a collection of relevant body parts. The use of OMIM subheadings as an alternative to indexed concepts would be a short-term solution to the absence of indexed concepts. At the same time, adding relations between the concepts is necessary for specific applications such as that reported herein. Most of the descriptions that lacked indexed diseases linked to body parts actually had potential matches with diseases contained in the remainder of the Metathesaurus, indicating that the problem is not the coverage of diseases by the Metathesaurus, but rather the comparatively sparse mapping of diseases to organs or tissues.
The simple matching of vocabulary on the target text would usually lead to an ambiguity problem, or variation in word sense depending on the context. However, because this section of an OMIM record is prepared for a specific application (i.e. the clinical description of diseases), we predicted that the need to consider the ambiguity of the indexed terms would be smaller than in the case of indexing free text with no restriction on scope. We evaluated the influence of ambiguity by inspecting the indexed contexts of known ambiguous representations. The results based on the UMLS Metathesaurus indicated that only about one percent of our vocabulary was assigned multiple concepts. Moreover, we found that a large portion of the incorrect indexing that appeared to have arisen due to problems in disambiguation may stem from the lack of exact representations in our vocabulary, and therefore could be solved by applying the entire set of representations in the Metathesaurus. Whether this applies to the indexing of the other, less structured parts of OMIM outside the Clinical Synopsis has not yet been determined.
The second task of introducing an underlying structure to free-text CS, or the definition of conceptual closeness between mapped anatomical structures, would require resources suitable for working on a large scale. One approach would be to use hierarchical relations defined in the Metathesaurus. Another approach could be to use a system of ontology targeted specifically to anatomical concepts and to map existing anatomical terms to the hierarchy defined in the ontology. The candidates for such a system would be eVoc [8] and Tissue DB (http://tissuedb.ontology.ims.u-tokyo.ac.jp/tissuedb/index.html). For the unification of gene expression data, the ontologies are designed to computationally deal with semantic differences between the materials for gene expression studies. This approach would combine the coverage of representations of concepts achieved by UMLS Metathesaurus and the well-defined hierarchical structures in the anatomical ontologies. This would require development of methods for efficiently mapping the vocabularies to each other.
We would like to express our sincere appreciation to NCBI and NLM for maintaining the servers from which the OMIM and UMLS, respectively, can be downloaded, and for their permission to use the data.