| In Silico Biology 5, 0008 (2004); ©2005, Bioinformation Systems e.V. |
| Ontology Workshop 2004 |
1 School of Knowledge Science, Japan Advanced Institute of Science and Technology
2 BIRD, Japan Science and Technology Agency (JST)
3 CREST, Japan Science and Technology Agency (JST)
* Corresponding author
Email: ken@jaist.ac.jp
Edited by E. Wingender; received January 14, 2005; revised and accepted March 03, 2005; published March 18, 2005
Since biomedical texts contain a wide variety of domain specific terms, building a large dictionary to perform term matching is of great relevance. However, due to the existence of null boundary between adjacent terms, this matching is not a trivial problem. Moreover, it is known that generative words cannot be comprehensively included in a dictionary because their possible variations are infinite.
In this study, we report our approach to dictionary building and term matching in biomedical texts. Large amount of terms with/without part-of-speech (POS) and/or category information were gathered, and a completion program generated ~1.36 million term variants to avoid stemming problems when matching terms. The dictionary was stored in a relational database management system (RDBMS) for quick lookup, and used by a matching program. Since the matching operation is not restricted to a substring surrounded by space characters, we can avoid the problem of null boundaries. This feature is also useful for generative words. Experimental results on GENIA corpus are promising: nearly half of the possible terms were correctly recognized as a meaningful segment, and most of the remaining half could be correctly recognized by some post-processing process, like chunking and further decomposition. It should be remarked that although we have not used term cost, connectivity cost, or syntactic information, reasonable segmentation and dictionary lookup were performed in most cases.
Keywords: text processing, dictionary building, dictionary lookup and matching, sentence segmentation, term boundary
Text processing is rapidly increasing its importance in bioinformatics. For instance, one of the tasks in KDD Cup 2002 (a competition of knowledge discovery in computer science) was information extraction in biomedical text [1]. Also, BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) was held in 2003 [2]. This growing importance of text processing in bioinformatics is probably associated with the gradual shift of focus in molecular biology from genome sequencing to functional genomics. The function of biomolecules like genes and proteins is, in its ultimate essence, a comprehensive and complex relationship with other molecules that interact with it. Since this kind of functional information is often described in natural language texts, it is desirable to extract useful information on biomolecular function in a formal and structured form [3].
Most studies on biomedical text processing have been usually conducted by using tools and algorithms from natural language processing (NLP) and machine learning fields. On the other hand, some domain experts are focusing on the development of linguistic resources. For example, Unified Medical Language System (UMLS) [4] provides three knowledge sources (Metathesaurus, Semantic Network, and SPECIALIST lexicon). Especially, the UMLS SPECIALIST lexicon contains a large amount of terms including biomedical concepts and vocabularies with POS information, and NLP researchers in bioinfomatics are currently using it as a trustworthy dictionary. On the other hand, GENIA project provides a valuable tagged corpus built from 2000 MEDLINE abstracts [5]. The GENIA corpus has become a de-facto standard for the evaluation of various extraction and prediction algorithms in biomedical text processing.
Aside from dictionary and corpus, Gene Ontology (GO) [6, 7] is being widely used for clustering, rearrangement, and interpretation of various data sets including databases, search results, and experimental data. In addition to the structured concepts curated by domain experts, link information provided via the index files (*2go) makes GO an extremely powerful tool in bioinformatics. However, controlled vocabulary in GO is not enough to cover most of names and concepts in biomedical texts. In addition, it requires enormous efforts by domain experts to update it. Therefore, a massive ontology automatically constructed from texts and dictionaries is desired as a necessary complement to a curated ontology like GO.
Despite these linguistic resources and the application of powerful algorithms, biomedical text processing has not yet accomplished sufficient accuracy and coverage for practical purposes. The improvement of results is mainly prevented by the complicated names (named entities) contained in biomedical texts. For example, research articles in life sciences contain several problematic names:
The amount of such kind of names is growing due to increasing publication of new articles, so small-sized dictionaries are not well suited for biomedical text processing since they provoke a chain effect of cumulative errors in every level: tokenization, part-of-speech tagging, parsing, and semantic analysis. Among various approaches to term recognition classified in [8], we believe that dictionary-based approaches are important because we can extract huge amount of names from databases even if errors and noise are included. The development of sufficiently large vocabulary and its lookup based on appropriate matching are therefore essential preprocessing needed before higher levels of text processing can be carried out.
As a first step for automatic construction of massive ontology, we constructed a large vocabulary from various resources including an English dictionary, a life science dictionary, a sequence database, and a taxonomy database. In addition, chemical compound names generated by a program were added to the collection. A large number of terms in this collection have POS information, as well as category information (i. e. gene name, chemical name, etc.). For those terms with POS information, possible variants were generated (NNS from NN, VBZ/VBG/VBD/VBN from VB, and JJR/JJS from JJ Footnote a) [9] and also added to the collection. In total, the vocabulary contains ~5 million of terms stored in RDBMS for quick lookup.
We also developed a program to match the vocabulary against text. In this program, we adopted an approach similar to POS taggers for agglutinative language, so that it can perform correct matching for cases where there are no obvious word delimiters as space characters. This approach is especially effective for recognizing long chemical names not contained in the vocabulary, although some of their component sub-words might be. By using this method, complex notation like "NF-kappa Bp65/p50" can be reasonably explained by the sub-words "NF-kappa B", "p65", and "p50". It indicates that this method might be useful to decrease the failures in dictionary lookup (i. e. recognizing a word or a term as "unknown").
In this paper, we describe the vocabulary construction, matching program, and results of preliminary experiments on texts. Although the matching program does not make use of POS information yet, it nevertheless outputs informative segmentation of sentences.
Term sources
In order to cover most as many biomedical terms as possible, we used the following resources:
From the above resources, we obtained 3,752,867 terms in total. Although UMLS is widely used as a sufficiently large and correct dictionary, we did not make use of UMLS to avoid the risk of overestimating our experimental results (both MEDLINE and UMLS are maintained in the National Library of Medicine, so most probably both resources are not independent).
Filtering and completion
We did a simple filtering of terms extracted in 2.1 to eliminate some apparently incorrect entries (mainly terms containing no alphabetic characters, such as "123", "#456", or "10 - 20%"). Also, multi-word terms starting and/or ending with prepositions (e. g. "be sure to") were eliminated. Such idiomatic terms frequently occur in Eijiro and are harmful in matching.
In addition to filtering, grammatical completion of terms is important for building a dictionary suitable for matching. Given a raw text without POS tagging, it is hard to correctly identify and stem all the grammatical variants of nouns, verbs, and adjectives to their original forms (i. e. NNS -> NN, VBZ/VBG/VBD/VBN -> VB, and JJR/JJS -> JJ). Conversely, if we have a term with POS information of original form (NN, JJ, or VB), it is easy to generate such variants. We constructed a program to implement this kind of completion. To discriminate these generated variants, a suffix ' # ' is added to their POS information (e. g. NNS#). In a similar way, for multi-word terms without POS information, we checked their last word and if the word is known to be NN or NNS, we generated a new multi-word term by replacement (e. g. "Cdk protein" -> "Cdk proteins"). This resulted quite helpful during the matching process, to avoid needless decoupling of semantically cohesive multi-word nouns. POS tags with the suffix '+' (e. g. NNS+) were attached to this kind of generated multi-word nouns. Information about their original form is also attached in these grammatical completions.
Aside from filtering and grammatical completion, we noticed that single words included in extracted multi-word term are frequently not included in the set of extracted terms. We therefore generated these words as new entries in our dictionary. To discriminate these generated words, category information ('part of term') is attached to them. Finally, we obtained 5,110,971 non-unique terms after filtering and completion. Tables 1, 2, and 3 show information about the extracted terms from the viewpoint of data source, POS, and category information, respectively.
| Table 1: | Vocabulary distribution (data source) |
|
|
|
| Table 2: | Vocabulary distribution (POS) |
|
|
|
| Table 3: | Vocabulary distribution (category of noun) |
|
|
|
Storing and indexing
In order to perform quick lookups in the dictionary, terms in Tables 1, 2, and 3 are stored and indexed using SQLite, with the following schema:
| term_head | term_caseless | term | pos | category | source | original_term | original_tag |
In this schema, the attribute 'term_caseless' represents the original term after discarding case information, and 'term_head' represents the first 3 letters of 'term_caseless'. Both of them are indexed. The attributes 'original_term' and 'original_tag' are optional and used for the terms generated by completion.
Problem in term boundary
In the processing of English text, it is widely believed that a space character can be regarded as the most significant delimiter of terms and it has the higher priority among special characters that can be used as delimiters. However, in biology texts this is not necessarily always true. For instance, in the expression "zinc finger/leucine zipper protein", dividing into 4 segments is clearly incorrect ("zinc", "finger/leucine", "zipper", and "protein"). Therefore, dictionary matching after decomposition by space characters can lead to incorrect results. Furthermore, we found that some biomedical texts even contain null boundary between adjacent terms. For example, "NF-kappa Bp65/p50" should be decomposed into "NF-kappa B", "p65", and "p50", with a null boundary appearing between "NF-kappa B" and "p65". As we can see from these examples, dictionary matching in biomedical text processing is not a trivial problem.
Matching algorithm
The problem of null boundary is well known in the processing of texts written in agglutinative languages (e. g. Japanese). This problem is usually solved by generating possible matching paths and evaluating those paths based on costs related to term frequency and connectivity of adjacent terms [17]. In this study, we adopted a similar approach to develop a program for dictionary matching. The matching algorithm is implemented as follows:
|
Figure 1: Generation of possible paths. "PTSD ..." is a given sentence. All the substrings that were not filtered out in Step 2 (written in lower case) are used for path generation ("tsd" was filtered out in Step 2). The partial path [ptsd] in the bottom of this figure has higher score than the others (e. g. [pts][d] and [pt][sd]) since it contains longer substring. |
Examples of matching results
Given the string "NF-kappa Bp65/p50", our matching program returns the following output:
0|9|NF-kappa B|NF-kappa B||moiety bound name|genbank:bound_moiety|| 10|12|p65|p65||gene name|genbank:gene|| 10|12|p65|p65||protein name|genbank:product|| 10|12|p65|P65||gene name|genbank:gene|| 13|13|/|/|SYMBOL 14|16|p50|p50||gene name|genbank:gene|| 14|16|p50|p50||protein name|genbank:product|| 14|16|p50|p50||strain name|genbank:strain|| 14|16|p50|P50||gene name|genbank:gene|| 14|16|p50|P50||protein name|genbank:product|| 14|16|p50|P50||strain name|genbank:strain||
Each line consists of 9 values separated by bars. These values are: start position, end position, substring in the input, matched term in the dictionary, POS, category, data source, original form of term, and POS of original form of term. In this example, it can be observed that our program works well even with an input containing null boundaries. It also illustrates the treatment of case information: although dictionary lookup is done in a case insensitive manner, the entry with same or similar usage of case is displayed prior to others (e. g. 'p50' is first displayed and then 'P50' since the substring in the input was 'p50'). This output order information can be later used in a higher level of text processing if it is required to choose only one dictionary entry for each substring.
The result of matching a more common sentence, "The granules were found inside rough endoplasmic reticulum cisternae.", is shown below:
0|2|The|The|NNP|personal subname|medline:person-subname|| 0|2|The|the|||eijiro-v66:entry|| 0|2|The|the||repeated sequence family name|genbank:rpt_family|| 0|2|The|the|DT||eijiro-v66:entry|| 0|2|The|the|RB||eijiro-v66:entry|| 3|3| | |SYMBOL 4|11|granules|granules|NNS||eijiro-v66:irreg|granule|NN 4|11|granules|granules|NNS#||eijiro-v66:irreg|granule|NN 4|11|granules|granules|NNS#||lsd4:entry|granule|NN 4|11|granules|granules|NNS#||eijiro-v66:entry|granule|NN 12|12| | |SYMBOL 13|16|were|were|||eijiro-v66:entry|| 13|16|were|were|VBD||eijiro-v66:irreg|be|VB 17|17| | |SYMBOL 18|22|found|found|JJ||eijiro-v66:entry|| 18|22|found|found|VB||eijiro-v66:irreg|found|VB 18|22|found|found|VB||eijiro-v66:entry|| 18|22|found|found|VBD||eijiro-v66:irreg|find|VB 18|22|found|found|VBN||eijiro-v66:irreg|find|VB 23|23| | |SYMBOL 24|29|inside|inside|IN||eijiro-v66:entry|| 24|29|inside|inside|JJ||lsd4:entry|| 24|29|inside|inside|JJ||eijiro-v66:entry|| 24|29|inside|inside|NN||eijiro-v66:irreg|inside|NN 24|29|inside|inside|NN||eijiro-v66:entry|| 24|29|inside|inside|RB||eijiro-v66:entry|| 30|30| | |SYMBOL 31|57|rough endoplasmic reticulum|rough endoplasmic reticulum|||eijiro-v66:entry|| 31|57|rough endoplasmic reticulum|rough endoplasmic reticulum|NN||lsd4:entry|| 58|58| | |SYMBOL 59|67|cisternae|cisternae|NNS||lsd4:entry|| 68|68|.|.|SYMBOL
In this example, it is noticeable that a semantically cohesive multi-word noun "rough endoplasmic reticulum" is recognized as a whole. Of course, it is possible to display shorter terms inside it (i. e. "rough", "endoplasmic", "reticulum", and "endoplasmic reticulum") if they are contained in the dictionary (currently, such detailed output is suppressed).
The following result with the input "naphthylaminoethylcarbamoylmethylthioethylcarbamoylmethyl" illustrates that our approach is effective also for long, generative-type of words:
0|7|naphthyl|naphthyl||chemical name|chemical-name-generator|| 0|7|naphthyl|naphthyl|NN||lsd4:entry|| 0|7|naphthyl|naphthyl|NN||eijiro-v66:entry|| 0|7|naphthyl|Naphthyl|NN|chemical name|chemical-name-generator|| 8|17|aminoethyl|aminoethyl||part of term|part-of-term-generator|| 18|32|carbamoylmethyl|carbamoylmethyl|NN|chemical name|chemical-name-generator|| 33|36|thio|thio||chemical name|chemical-name-generator|| 33|36|thio|thio||gene name|genbank:gene|| 33|36|thio|thio|JJ||eijiro-v66:entry|| 33|36|thio|thio|JJ||lsd4:entry|| 33|36|thio|thio|NN|chemical name|chemical-name-generator|| 33|36|thio|Thio|NNP|personal subname|medline:person-subname|| 37|41|ethyl|ethyl||chemical name|chemical-name-generator|| 37|41|ethyl|ethyl|NN|chemical name|chemical-name-generator|| 37|41|ethyl|ethyl|NN||lsd4:entry|| 37|41|ethyl|ethyl|NN||eijiro-v66:entry|| 37|41|ethyl|Ethyl|||eijiro-v66:entry|| 42|56|carbamoylmethyl|carbamoylmethyl|NN|chemical name|chemical-name-generator||
Although the whole word was not contained in the dictionary, the matching program successfully decomposed it into reasonable sub-words.
The result below for the sentence "Surfactants are pyrolysed ..." illustrates a failed matching:
0|10|Surfactants|surfactants|NNS#||lsd4:entry|surfactant|NN 0|10|Surfactants|surfactants|NNS#||eijiro-v66:entry|surfactant|NN 11|11| | |SYMBOL 12|14|are|are|||eijiro-v66:entry|| 12|14|are|are|NN||eijiro-v66:entry|| 12|14|are|are|VBZ||eijiro-v66:irreg|be|VB 12|14|are|Are|NNP|personal subname|medline:person-subname|| 12|14|are|Are|NNP|place name|eijiro-v66:entry|| 12|14|are|ARE|||eijiro-v66:abbrev|asymptotic relative efficiency| 12|14|are|ARE||moiety bound name|genbank:bound_moiety|| 15|15| | |SYMBOL 16|23|pyrolyse|pyrolyse|||eijiro-v66:entry|| 24|24|d|d|||eijiro-v66:abbrev|dead| 24|24|d|d||gene name|genbank:gene|| 24|24|d|d||protein name|genbank:product|| 24|24|d|d||strain name|genbank:strain|| 24|24|d|D|||eijiro-v66:abbrev|Democrat| 24|24|d|D|||eijiro-v66:abbrev|deca-| 24|24|d|D|||eijiro-v66:abbrev|deep| 24|24|d|D|||eijiro-v66:abbrev|dendrite| 24|24|d|D|||eijiro-v66:abbrev|depth| 24|24|d|D|||eijiro-v66:abbrev|deuterium| 24|24|d|D|||eijiro-v66:abbrev|diaphragm| 24|24|d|D||gene name|genbank:gene|| 24|24|d|D||protein name|genbank:product|| 24|24|d|D||repeated sequence family name|genbank:rpt_family|| 24|24|d|D||strain name|genbank:strain|| 24|24|d|D|||lsd4:entry|| 24|24|d|D|NN||lsd4:entry|| 24|24|d|D|NN||eijiro-v66:entry|| 25|25| | |SYMBOL ...
In this example, "Surfactants" was successfully recognized due to the effect of dictionary completion mentioned in 2.2. On the other hand, "pyrolysed" was not. Since only the entry "pyrolyse" without POS information was contained in the dictionary, the entry "pyrolysed" was not generated and "d" was orphaned in this matching. This kind of failure can be avoided by building a more comprehensive dictionary.
Materials
Using GENIA corpus (version 3.02), we evaluated how this approach can recognize semantically cohesive nouns (single- and multi-word). In GENIA corpus, single- and multi-word terms are surrounded by <w></w> and <cons></cons>, respectively. We compared such segmentation described in GENIA corpus with the result of matching our dictionary to the raw text of the corpus.
Results on segmentation
Figures 2, 3, 4, and 5 show the results of evaluation. In Figure 2, it can be seen that the rate of perfect match is quite low. However, in Figure 3 we see how accuracy can be increased by using some post-processing method, like chunking and/or further decomposition. Figures 4 and 5 show how the evaluated accuracy of segmentation is still high even if we strictly count the numbers of matches and mismatches based on unique combinations of term and POS, or unique terms.
|
Figure 4: Accuracy of segmentation. Similar to Figure 3, but the number of matches and mismatches were counted by unique combinations of term and POS. For example, "bound/JJ" and "bound/VBN" in GENIA corpus are counted separately, while the same "receprots/NNS" at different positions are not. |
|
Figure 5: Accuracy of segmentation. Similar to Figure 3 but the numbers of match and mismatch were counted by unique terms (differences in POS and positions are ignored). |
In this study, we presented our approach to dictionary building and matching in biomedical texts. Large amount of terms with and without POS and/or category information were gathered. In addition, a completion program generated ~1.36 million term variants to avoid stemming problem in matching. The dictionary was stored in RDBMS for quick lookup, and used by a matching program. Since the matching is not restricted to a substring surrounded by space characters, we can avoid the problem of null boundaries. This feature is also useful for generative words, such as chemical compound names, which cannot be comprehensively contained in a dictionary because their variation is intrinsically infinite. Experimental results on GENIA corpus were promising: about half of terms were correctly recognized as a meaningful segment, and most of the remaining half could be properly recognized by post-processing techniques, like chunking or further decomposition. It is noticeable that although we have not used term cost, connectivity cost, or syntactic information, reasonable segmentation and dictionary lookup could be performed in most cases. By incorporating such information into the evaluation function for path, we could increase further the accuracy of segmentation. Also, including new term sources may contribute to more correct segmentation and a more informative output. It could be interesting to implement a POS tagger in this system, which would return the POS tag as well as useful information about semantic category, stemmed form of grammatical variants, link information to database entries, and so on.
This work was partly supported by BIRD of Japan Science and Technology Agency (JST) and also partly by the Grant-in-Aid for Scientific Research on Priority Area "Genome Information Science" from the Ministry of Education, Culture, Sports, Science, and Technology of Japan.
a In this paper, POS are denoted using Penn Treebank II tags [9].