In Silico Biology 5, 0008 (2004); ©2005, Bioinformation Systems e.V.  
Ontology Workshop 2004

Utilizing weakly controlled vocabulary for sentence segmentation in biomedical literature


Kenji Satou1, 2,* and Kaoru Yamamoto3




1 School of Knowledge Science, Japan Advanced Institute of Science and Technology
2 BIRD, Japan Science and Technology Agency (JST)
3 CREST, Japan Science and Technology Agency (JST)



* Corresponding author
   Email: ken@jaist.ac.jp





Edited by E. Wingender; received January 14, 2005; revised and accepted March 03, 2005; published March 18, 2005



Abstract

Since biomedical texts contain a wide variety of domain specific terms, building a large dictionary to perform term matching is of great relevance. However, due to the existence of null boundary between adjacent terms, this matching is not a trivial problem. Moreover, it is known that generative words cannot be comprehensively included in a dictionary because their possible variations are infinite.

In this study, we report our approach to dictionary building and term matching in biomedical texts. Large amount of terms with/without part-of-speech (POS) and/or category information were gathered, and a completion program generated ~1.36 million term variants to avoid stemming problems when matching terms. The dictionary was stored in a relational database management system (RDBMS) for quick lookup, and used by a matching program. Since the matching operation is not restricted to a substring surrounded by space characters, we can avoid the problem of null boundaries. This feature is also useful for generative words. Experimental results on GENIA corpus are promising: nearly half of the possible terms were correctly recognized as a meaningful segment, and most of the remaining half could be correctly recognized by some post-processing process, like chunking and further decomposition. It should be remarked that although we have not used term cost, connectivity cost, or syntactic information, reasonable segmentation and dictionary lookup were performed in most cases.

Keywords: text processing, dictionary building, dictionary lookup and matching, sentence segmentation, term boundary



Introduction

Text processing is rapidly increasing its importance in bioinformatics. For instance, one of the tasks in KDD Cup 2002 (a competition of knowledge discovery in computer science) was information extraction in biomedical text [1]. Also, BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) was held in 2003 [2]. This growing importance of text processing in bioinformatics is probably associated with the gradual shift of focus in molecular biology from genome sequencing to functional genomics. The function of biomolecules like genes and proteins is, in its ultimate essence, a comprehensive and complex relationship with other molecules that interact with it. Since this kind of functional information is often described in natural language texts, it is desirable to extract useful information on biomolecular function in a formal and structured form [3].

Most studies on biomedical text processing have been usually conducted by using tools and algorithms from natural language processing (NLP) and machine learning fields. On the other hand, some domain experts are focusing on the development of linguistic resources. For example, Unified Medical Language System (UMLS) [4] provides three knowledge sources (Metathesaurus, Semantic Network, and SPECIALIST lexicon). Especially, the UMLS SPECIALIST lexicon contains a large amount of terms including biomedical concepts and vocabularies with POS information, and NLP researchers in bioinfomatics are currently using it as a trustworthy dictionary. On the other hand, GENIA project provides a valuable tagged corpus built from 2000 MEDLINE abstracts [5]. The GENIA corpus has become a de-facto standard for the evaluation of various extraction and prediction algorithms in biomedical text processing.

Aside from dictionary and corpus, Gene Ontology (GO) [6, 7] is being widely used for clustering, rearrangement, and interpretation of various data sets including databases, search results, and experimental data. In addition to the structured concepts curated by domain experts, link information provided via the index files (*2go) makes GO an extremely powerful tool in bioinformatics. However, controlled vocabulary in GO is not enough to cover most of names and concepts in biomedical texts. In addition, it requires enormous efforts by domain experts to update it. Therefore, a massive ontology automatically constructed from texts and dictionaries is desired as a necessary complement to a curated ontology like GO.

Despite these linguistic resources and the application of powerful algorithms, biomedical text processing has not yet accomplished sufficient accuracy and coverage for practical purposes. The improvement of results is mainly prevented by the complicated names (named entities) contained in biomedical texts. For example, research articles in life sciences contain several problematic names:

The amount of such kind of names is growing due to increasing publication of new articles, so small-sized dictionaries are not well suited for biomedical text processing since they provoke a chain effect of cumulative errors in every level: tokenization, part-of-speech tagging, parsing, and semantic analysis. Among various approaches to term recognition classified in [8], we believe that dictionary-based approaches are important because we can extract huge amount of names from databases even if errors and noise are included. The development of sufficiently large vocabulary and its lookup based on appropriate matching are therefore essential preprocessing needed before higher levels of text processing can be carried out.

As a first step for automatic construction of massive ontology, we constructed a large vocabulary from various resources including an English dictionary, a life science dictionary, a sequence database, and a taxonomy database. In addition, chemical compound names generated by a program were added to the collection. A large number of terms in this collection have POS information, as well as category information (i. e. gene name, chemical name, etc.). For those terms with POS information, possible variants were generated (NNS from NN, VBZ/VBG/VBD/VBN from VB, and JJR/JJS from JJ Footnote a) [9] and also added to the collection. In total, the vocabulary contains ~5 million of terms stored in RDBMS for quick lookup.

We also developed a program to match the vocabulary against text. In this program, we adopted an approach similar to POS taggers for agglutinative language, so that it can perform correct matching for cases where there are no obvious word delimiters as space characters. This approach is especially effective for recognizing long chemical names not contained in the vocabulary, although some of their component sub-words might be. By using this method, complex notation like "NF-kappa Bp65/p50" can be reasonably explained by the sub-words "NF-kappa B", "p65", and "p50". It indicates that this method might be useful to decrease the failures in dictionary lookup (i. e. recognizing a word or a term as "unknown").

In this paper, we describe the vocabulary construction, matching program, and results of preliminary experiments on texts. Although the matching program does not make use of POS information yet, it nevertheless outputs informative segmentation of sentences.




Building the vocabulary dictionary


Term sources

In order to cover most as many biomedical terms as possible, we used the following resources:

[Eijiro]
Eijiro [10] is a dictionary for English/Japanese translation. It contains 1,354,245 entries covering various areas (not limited to biomedical and scientific topics). Although it contains errors, it provides a sufficient amount of entries (347,302 entries) with POS information, as well as a large amount of terms without POS (1,006,943 entries).
[LSD4]
LSD4 (Life Science Dictionary [11], version 4) is also a dictionary for English/Japanese translation, mainly covering terms in life sciences. Compared to Eijiro, LSD4 contains less number of entries (37,946), although most of them have POS information. Also, we expect those entries to include more specific technical terms not contained in Eijiro.
[GenBank]
Using a simple program, we extracted various names from the FEATURES field of GenBank entries [12]. For instance, from descriptions like "/gene=...", we extracted gene names. Including the "/gene=" subfield, we selected 19 subfields and extracted names from them (names of protein, organism, strain, etc). In total, 1,233,995 entries were generated from GenBank database. Gathering names from a database has the advantage of the large amount of terms that can be extracted at a relatively small cost, although the correctness of the gathered names cannot be guaranteed. POS information is not added to the extracted names, but category information (e. g. names of gene, protein, etc.) is.
[Taxonomy]
Taxonomy [13, 14] is also a database maintained and distributed by NCBI. Since it contains 168,488 entries of organism names, we can regard it as a comprehensive collection of names of life. In addition to the scientific term, some variants (e. g. common name, synonym, acronym, etc.) are included, and information about the original scientific name is attached. We can use this information to normalize a set of semantically equivalent names into a single scientific name, in higher level of text processing.
[Personal name]
Personal names frequently occur in research articles as authors' names. To cover them, we extracted personal full names from AU field of 209,860 MEDLINE abstracts of papers published in 2002 (from January to September). We extracted a total of 596,361 full names, from which we generated 245,174 partial names (i. e. first, middle, and last names). We also added category information (personal full or partial name) to these terms.
[Chemical names generated by program]
Chemical molecule names of ligands, sugars, ions, and drugs appear often in biomedical texts. Since chemical names are generative (intrinsically infinite) and there exist standardized nomenclatures for them (CAS [15] and IUPAC [16]), we can recognize chemical molecule names by implementing the nomenclatures as a program and scanning biomedical texts with this program. In this study we obtained a number of chemical names by coding an original program which implements a subset of the rules defined in these nomenclatures. For linear molecules, from a word stem like "pent" (of "pentane"), we can generate various names and adverbs like "pentene", "pentyne", "pentyl", "pentanol", "pentanoic", etc. Also, element and sugar names like "sulfonium" and "glucose" have systematic variations ("sulfa", "sulfic", "sulfide", etc. for former, "glucitol" "gluconic", "glucalic", etc. for latter). In addition to these names, which can be generated systematically, we adopted widely used common names (e. g. "toluene" for methylbenzene"). In total, the generated chemical names and collected chemical common names amounted a total of 116,658 entries.

From the above resources, we obtained 3,752,867 terms in total. Although UMLS is widely used as a sufficiently large and correct dictionary, we did not make use of UMLS to avoid the risk of overestimating our experimental results (both MEDLINE and UMLS are maintained in the National Library of Medicine, so most probably both resources are not independent).


Filtering and completion

We did a simple filtering of terms extracted in 2.1 to eliminate some apparently incorrect entries (mainly terms containing no alphabetic characters, such as "123", "#456", or "10 - 20%"). Also, multi-word terms starting and/or ending with prepositions (e. g. "be sure to") were eliminated. Such idiomatic terms frequently occur in Eijiro and are harmful in matching.

In addition to filtering, grammatical completion of terms is important for building a dictionary suitable for matching. Given a raw text without POS tagging, it is hard to correctly identify and stem all the grammatical variants of nouns, verbs, and adjectives to their original forms (i. e. NNS -> NN, VBZ/VBG/VBD/VBN -> VB, and JJR/JJS -> JJ). Conversely, if we have a term with POS information of original form (NN, JJ, or VB), it is easy to generate such variants. We constructed a program to implement this kind of completion. To discriminate these generated variants, a suffix ' # ' is added to their POS information (e. g. NNS#). In a similar way, for multi-word terms without POS information, we checked their last word and if the word is known to be NN or NNS, we generated a new multi-word term by replacement (e. g. "Cdk protein" -> "Cdk proteins"). This resulted quite helpful during the matching process, to avoid needless decoupling of semantically cohesive multi-word nouns. POS tags with the suffix '+' (e. g. NNS+) were attached to this kind of generated multi-word nouns. Information about their original form is also attached in these grammatical completions.

Aside from filtering and grammatical completion, we noticed that single words included in extracted multi-word term are frequently not included in the set of extracted terms. We therefore generated these words as new entries in our dictionary. To discriminate these generated words, category information ('part of term') is attached to them. Finally, we obtained 5,110,971 non-unique terms after filtering and completion. Tables 1, 2, and 3 show information about the extracted terms from the viewpoint of data source, POS, and category information, respectively.


Table 1: Vocabulary distribution (data source)
Number of
non-unique terms
Data source
252994chemical-name-generator
25516eijiro-v66:abbrev
2117971eijiro-v66:entry
79558eijiro-v66:irreg
2518genbank:bound_moiety
10795genbank:cell_line
6589genbank:cell_type
34199genbank:clone_lib
9293genbank:dev_stage
471120genbank:gene
1304genbank:lab_host
10genbank:organelle
144435genbank:organism
586genbank:phenotype
Number of
non-unique terms
Data source
2595genbank:plasmid
482950genbank:product
6894genbank:rpt_family
8489genbank:specific_host
125809genbank:strain
1632genbank:sub_clone
4859genbank:sub_species
10320genbank:tissue_type
4238genbank:transposon
105863lsd4:entry
597669medline:person-fullname
245284medline:person-subname
183252part-of-term-generator
   
Number of
non-unique terms
Data source
635taxonomy:ACRONYM
238taxonomy:ANAMORPH
137taxonomy:BLAST NAME
4334taxonomy:COMMON NAME
226taxonomy:EQUIVALENT NAME
197taxonomy:IN-PART
922taxonomy:INCLUDES
403taxonomy:MISNOMER
2463taxonomy:MISSPELLING
26taxonomy:PREFERRED ACRONYM
9347taxonomy:PREFERRED COMMON NAME
137635taxonomy:SCIENTIFIC NAME
17657taxonomy:SYNONYM
9taxonomy:TELEOMORPH
Table 2: Vocabulary distribution (POS)
Number of
non-unique terms
POS
2604566(null)
127?? (unidentified)
62CC
7DT
162IN
68613JJ
695JJR
135948JJR#
694JJS
135948JJS#
Number of
non-unique terms
POS
25MD
240480NN
116359NN+
951812NNP
10633NNS
238227NNS#
447346NNS+
133PRP
5755RB
4SYM
Number of
non-unique terms
POS
816UH
29501VB
5357VBD
25412VBD#
5350VBG
25412VBG#
5354VBN
25412VBN#
5349VBZ
25412VBZ#

Table 3: Vocabulary distribution (category of noun)
Number of
non-unique terms
Category
2218096(null)
58TV program name
5band name
2biology name
209book name
1bookstore name
3car name
10795cell line name
6589cell type name
1certification name
252994chemical name
34199clone library name
2comics name
1committee name
520country name
9293developmental stage name
17drama name
1ethnic name
2examination name
Number of
non-unique terms
POS
471120gene name
91japanese aircraft code name
1304laboratory host name
2lesion name
1059magazine name
2518moiety bound name
5337movie name
3musician group name
8489natural host name
1466newspaper name
2onomatopoeia
10organelle name
318664organism name
17698organization name
183252part of term
597669personal fullname
55080personal name
245284personal subname
586phenotype name
Number of
non-unique terms
POS
27488place name
2595plasmid name
10play name
4product name
482950protein name
1region name
6894repeated sequence family name
1ruin name
1song name
77sports team name
549state name
2statistics name
125809strain name
1632sub-clone name
4859sub-species name
10320tissue type name
1066trade name
4238transposon name
53tune name




Storing and indexing

In order to perform quick lookups in the dictionary, terms in Tables 1, 2, and 3 are stored and indexed using SQLite, with the following schema:

term_head term_caseless term pos category source original_term original_tag

In this schema, the attribute 'term_caseless' represents the original term after discarding case information, and 'term_head' represents the first 3 letters of 'term_caseless'. Both of them are indexed. The attributes 'original_term' and 'original_tag' are optional and used for the terms generated by completion.



Matching and lookup


Problem in term boundary

In the processing of English text, it is widely believed that a space character can be regarded as the most significant delimiter of terms and it has the higher priority among special characters that can be used as delimiters. However, in biology texts this is not necessarily always true. For instance, in the expression "zinc finger/leucine zipper protein", dividing into 4 segments is clearly incorrect ("zinc", "finger/leucine", "zipper", and "protein"). Therefore, dictionary matching after decomposition by space characters can lead to incorrect results. Furthermore, we found that some biomedical texts even contain null boundary between adjacent terms. For example, "NF-kappa Bp65/p50" should be decomposed into "NF-kappa B", "p65", and "p50", with a null boundary appearing between "NF-kappa B" and "p65". As we can see from these examples, dictionary matching in biomedical text processing is not a trivial problem.


Matching algorithm

The problem of null boundary is well known in the processing of texts written in agglutinative languages (e. g. Japanese). This problem is usually solved by generating possible matching paths and evaluating those paths based on costs related to term frequency and connectivity of adjacent terms [17]. In this study, we adopted a similar approach to develop a program for dictionary matching. The matching algorithm is implemented as follows:

[Step 1]
Given a sentence as a string of characters, all possible substrings are generated, except the ones starting/ending with space characters. For example, from a simple sentence "A gene expressed", the algorithm generates "A", "g", "e", ..., "ge", "en", "ne", ..., "A g", "gen", "ene", ..., and so on. The longest substring is the given sentence itself. Each substring is associated to its position on the given sentence.
[Step 2]
For each substring, its term head (first 3 letters without case information) is computed. Then, the algorithm looks up terms with the term head from the dictionary by using the indexed attribute 'term_head'. The retrieved terms are then matched to the substring using exact match ignoring case information (case insensitive matching). For example, from a substring "MccA", the term head "mcc" is computed. Then, various terms with the same term head "mcc" are looked up from the dictionary (e. g. "McC", "MccA", "mccA", "MccA heptapeptide", "McCabe Frank", and so on). These terms are filtered by case insensitive matching (e. g. "MccA" and "mccA" match to the original substring "MccA"). Position information of each substring is propagated to the matched terms.
[Step 3]
From the head of the sentence, possible paths of matched substrings with position information at the sentence are generated (Figure 1). If the number of paths exceeds a threshold (100 is adopted in this study), each path is evaluated by a simple cost function based exclusively on the lengths of substrings and gaps among them contained in the path. The paths with low score are discarded and path extension is resumed.
[Step 4]
After the path generation finished, the algorithm selects the path with highest score, and outputs all the term information attached to the substrings contained in the path.


Figure 1: Generation of possible paths. "PTSD ..." is a given sentence. All the substrings that were not filtered out in Step 2 (written in lower case) are used for path generation ("tsd" was filtered out in Step 2). The partial path [ptsd] in the bottom of this figure has higher score than the others (e. g. [pts][d] and [pt][sd]) since it contains longer substring.



Examples of matching results

Given the string "NF-kappa Bp65/p50", our matching program returns the following output:

0|9|NF-kappa B|NF-kappa B||moiety bound name|genbank:bound_moiety||
10|12|p65|p65||gene name|genbank:gene||
10|12|p65|p65||protein name|genbank:product||
10|12|p65|P65||gene name|genbank:gene||
13|13|/|/|SYMBOL
14|16|p50|p50||gene name|genbank:gene||
14|16|p50|p50||protein name|genbank:product||
14|16|p50|p50||strain name|genbank:strain||
14|16|p50|P50||gene name|genbank:gene||
14|16|p50|P50||protein name|genbank:product||
14|16|p50|P50||strain name|genbank:strain||

Each line consists of 9 values separated by bars. These values are: start position, end position, substring in the input, matched term in the dictionary, POS, category, data source, original form of term, and POS of original form of term. In this example, it can be observed that our program works well even with an input containing null boundaries. It also illustrates the treatment of case information: although dictionary lookup is done in a case insensitive manner, the entry with same or similar usage of case is displayed prior to others (e. g. 'p50' is first displayed and then 'P50' since the substring in the input was 'p50'). This output order information can be later used in a higher level of text processing if it is required to choose only one dictionary entry for each substring.

The result of matching a more common sentence, "The granules were found inside rough endoplasmic reticulum cisternae.", is shown below:

0|2|The|The|NNP|personal subname|medline:person-subname||
0|2|The|the|||eijiro-v66:entry||
0|2|The|the||repeated sequence family name|genbank:rpt_family||
0|2|The|the|DT||eijiro-v66:entry||
0|2|The|the|RB||eijiro-v66:entry||
3|3| | |SYMBOL
4|11|granules|granules|NNS||eijiro-v66:irreg|granule|NN
4|11|granules|granules|NNS#||eijiro-v66:irreg|granule|NN
4|11|granules|granules|NNS#||lsd4:entry|granule|NN
4|11|granules|granules|NNS#||eijiro-v66:entry|granule|NN
12|12| | |SYMBOL
13|16|were|were|||eijiro-v66:entry||
13|16|were|were|VBD||eijiro-v66:irreg|be|VB
17|17| | |SYMBOL
18|22|found|found|JJ||eijiro-v66:entry||
18|22|found|found|VB||eijiro-v66:irreg|found|VB
18|22|found|found|VB||eijiro-v66:entry||
18|22|found|found|VBD||eijiro-v66:irreg|find|VB
18|22|found|found|VBN||eijiro-v66:irreg|find|VB
23|23| | |SYMBOL
24|29|inside|inside|IN||eijiro-v66:entry||
24|29|inside|inside|JJ||lsd4:entry||
24|29|inside|inside|JJ||eijiro-v66:entry||
24|29|inside|inside|NN||eijiro-v66:irreg|inside|NN
24|29|inside|inside|NN||eijiro-v66:entry||
24|29|inside|inside|RB||eijiro-v66:entry||
30|30| | |SYMBOL
31|57|rough endoplasmic reticulum|rough endoplasmic reticulum|||eijiro-v66:entry||
31|57|rough endoplasmic reticulum|rough endoplasmic reticulum|NN||lsd4:entry||
58|58| | |SYMBOL
59|67|cisternae|cisternae|NNS||lsd4:entry||
68|68|.|.|SYMBOL

In this example, it is noticeable that a semantically cohesive multi-word noun "rough endoplasmic reticulum" is recognized as a whole. Of course, it is possible to display shorter terms inside it (i. e. "rough", "endoplasmic", "reticulum", and "endoplasmic reticulum") if they are contained in the dictionary (currently, such detailed output is suppressed).

The following result with the input "naphthylaminoethylcarbamoylmethylthioethylcarbamoylmethyl" illustrates that our approach is effective also for long, generative-type of words:

0|7|naphthyl|naphthyl||chemical name|chemical-name-generator||
0|7|naphthyl|naphthyl|NN||lsd4:entry||
0|7|naphthyl|naphthyl|NN||eijiro-v66:entry||
0|7|naphthyl|Naphthyl|NN|chemical name|chemical-name-generator||
8|17|aminoethyl|aminoethyl||part of term|part-of-term-generator||
18|32|carbamoylmethyl|carbamoylmethyl|NN|chemical name|chemical-name-generator||
33|36|thio|thio||chemical name|chemical-name-generator||
33|36|thio|thio||gene name|genbank:gene||
33|36|thio|thio|JJ||eijiro-v66:entry||
33|36|thio|thio|JJ||lsd4:entry||
33|36|thio|thio|NN|chemical name|chemical-name-generator||
33|36|thio|Thio|NNP|personal subname|medline:person-subname||
37|41|ethyl|ethyl||chemical name|chemical-name-generator||
37|41|ethyl|ethyl|NN|chemical name|chemical-name-generator||
37|41|ethyl|ethyl|NN||lsd4:entry||
37|41|ethyl|ethyl|NN||eijiro-v66:entry||
37|41|ethyl|Ethyl|||eijiro-v66:entry||
42|56|carbamoylmethyl|carbamoylmethyl|NN|chemical name|chemical-name-generator||

Although the whole word was not contained in the dictionary, the matching program successfully decomposed it into reasonable sub-words.

The result below for the sentence "Surfactants are pyrolysed ..." illustrates a failed matching:

0|10|Surfactants|surfactants|NNS#||lsd4:entry|surfactant|NN
0|10|Surfactants|surfactants|NNS#||eijiro-v66:entry|surfactant|NN
11|11| | |SYMBOL
12|14|are|are|||eijiro-v66:entry||
12|14|are|are|NN||eijiro-v66:entry||
12|14|are|are|VBZ||eijiro-v66:irreg|be|VB
12|14|are|Are|NNP|personal subname|medline:person-subname||
12|14|are|Are|NNP|place name|eijiro-v66:entry||
12|14|are|ARE|||eijiro-v66:abbrev|asymptotic relative efficiency|
12|14|are|ARE||moiety bound name|genbank:bound_moiety||
15|15| | |SYMBOL
16|23|pyrolyse|pyrolyse|||eijiro-v66:entry||
24|24|d|d|||eijiro-v66:abbrev|dead|
24|24|d|d||gene name|genbank:gene||
24|24|d|d||protein name|genbank:product||
24|24|d|d||strain name|genbank:strain||
24|24|d|D|||eijiro-v66:abbrev|Democrat|
24|24|d|D|||eijiro-v66:abbrev|deca-|
24|24|d|D|||eijiro-v66:abbrev|deep|
24|24|d|D|||eijiro-v66:abbrev|dendrite|
24|24|d|D|||eijiro-v66:abbrev|depth|
24|24|d|D|||eijiro-v66:abbrev|deuterium|
24|24|d|D|||eijiro-v66:abbrev|diaphragm|
24|24|d|D||gene name|genbank:gene||
24|24|d|D||protein name|genbank:product||
24|24|d|D||repeated sequence family name|genbank:rpt_family||
24|24|d|D||strain name|genbank:strain||
24|24|d|D|||lsd4:entry||
24|24|d|D|NN||lsd4:entry||
24|24|d|D|NN||eijiro-v66:entry||
25|25| | |SYMBOL
...

In this example, "Surfactants" was successfully recognized due to the effect of dictionary completion mentioned in 2.2. On the other hand, "pyrolysed" was not. Since only the entry "pyrolyse" without POS information was contained in the dictionary, the entry "pyrolysed" was not generated and "d" was orphaned in this matching. This kind of failure can be avoided by building a more comprehensive dictionary.




Experimental results


Materials

Using GENIA corpus (version 3.02), we evaluated how this approach can recognize semantically cohesive nouns (single- and multi-word). In GENIA corpus, single- and multi-word terms are surrounded by <w></w> and <cons></cons>, respectively. We compared such segmentation described in GENIA corpus with the result of matching our dictionary to the raw text of the corpus.

Results on segmentation

Figures 2, 3, 4, and 5 show the results of evaluation. In Figure 2, it can be seen that the rate of perfect match is quite low. However, in Figure 3 we see how accuracy can be increased by using some post-processing method, like chunking and/or further decomposition. Figures 4 and 5 show how the evaluated accuracy of segmentation is still high even if we strictly count the numbers of matches and mismatches based on unique combinations of term and POS, or unique terms.


Figure 2: Accuracy of segmentation. In this figure, lines in the row with "cons or w" represent single- and multi-word terms in GENIA corpus. Also, lines in the row with "vocabulary" represent the terms found by our algorithm. Positional divergence of the terms at the given sentence is represented as horizontal coordinate of the lines. In this evaluation, only the perfect match is classified as a successful match. The numbers of match and mismatch were counted by occurrence in GENIA corpus (i. e. the same terms at the different sentence or position are counted separately).

Figure 3: Accuracy of segmentation. In this evaluation, a term covered by shorter terms is classified as a successful match. For example, if a term "calcium-mobilizing receptors" surrounded by <cons></cons> in GENIA corpus is recognized as "calcium", "-", "mobilizing", and "receptors" by our algorithm, then it is a successful match (corresponds to the case "<cons> includes terms" in this figure). The numbers of match and mismatch were counted by occurrence again.

Figure 4: Accuracy of segmentation. Similar to Figure 3, but the number of matches and mismatches were counted by unique combinations of term and POS. For example, "bound/JJ" and "bound/VBN" in GENIA corpus are counted separately, while the same "receprots/NNS" at different positions are not.

Figure 5: Accuracy of segmentation. Similar to Figure 3 but the numbers of match and mismatch were counted by unique terms (differences in POS and positions are ignored).





Conclusion and future work

In this study, we presented our approach to dictionary building and matching in biomedical texts. Large amount of terms with and without POS and/or category information were gathered. In addition, a completion program generated ~1.36 million term variants to avoid stemming problem in matching. The dictionary was stored in RDBMS for quick lookup, and used by a matching program. Since the matching is not restricted to a substring surrounded by space characters, we can avoid the problem of null boundaries. This feature is also useful for generative words, such as chemical compound names, which cannot be comprehensively contained in a dictionary because their variation is intrinsically infinite. Experimental results on GENIA corpus were promising: about half of terms were correctly recognized as a meaningful segment, and most of the remaining half could be properly recognized by post-processing techniques, like chunking or further decomposition. It is noticeable that although we have not used term cost, connectivity cost, or syntactic information, reasonable segmentation and dictionary lookup could be performed in most cases. By incorporating such information into the evaluation function for path, we could increase further the accuracy of segmentation. Also, including new term sources may contribute to more correct segmentation and a more informative output. It could be interesting to implement a POS tagger in this system, which would return the POS tag as well as useful information about semantic category, stemmed form of grammatical variants, link information to database entries, and so on.



Acknowledgement

This work was partly supported by BIRD of Japan Science and Technology Agency (JST) and also partly by the Grant-in-Aid for Scientific Research on Priority Area "Genome Information Science" from the Ministry of Education, Culture, Sports, Science, and Technology of Japan.




References


  1. KDD Cup (2002). http://www.biostat.wisc.edu/~craven/kddcup/

  2. BioCreAtIve. http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html

  3. Koike, A., Niwa, Y. and Takagi, T. (2004). Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics (in press).

  4. UMLS. http://www.nlm.nih.gov/research/umls/

  5. Yuka, T. and Tsujii, J. (2004). Part-of-Speech Annotation of Biology Research Abstracts. In: Proceedings of 4th International Conference on Language Resource and Evaluation (LREC2004). IV. pp. 1267-1270.

  6. Ashburner, M. and Lewis, S. E. (2002). On ontologies for biologists: the Gene Ontology - uncoupling the web. In: In Silico Biology. Novartis Found Symp 247: pp. 66-80; discussion pp. 80-83, pp. 84-90, pp. 244-252.

  7. Hill, D. P., Blake, J. A., Richardson, J. E. and Ringwald, M. (2002). Extension and Integration of the Gene Ontology (GO): Combining GO vocabularies with external vocabularies. Genome Res. 12, 1982-1991.

  8. Krauthammer, M. and Nenadic, G. (2004). Term identification in the biomedical literature. J. Biomed. Inform. 37, 512-526.

  9. The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/

  10. Eijiro. http://www.alc.co.jp/

  11. Life Science Dictionary. http://lsd.pharm.kyoto-u.ac.jp/en/

  12. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. and Wheeler, D. L. (2004). GenBank: update. Nucleic Acids Res. 32, D23-D26.

  13. Wheeler, D. L., Chappey, C., Lash, A. E., Leipe, D. D., Madden, T. L., Schuler, G. D., Tatusova, T. A. and Rapp, B. A. (2000). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 28, 10-14.

  14. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. and Wheeler, D. L. (2000). GenBank. Nucleic Acids Res. 28, 15-18.

  15. CAS. http://www.cas.org/

  16. IUPAC. http://www.iupac.org/

  17. Yamamoto, K., Kudo, T., Konagaya, A., Matsumuto, Y. (2003). Protein Name Tagging for Biomedical Annotation in Text ACL. Workshop on Natural Language Processing in Biomedicine (Sapporo, Japan).



Footnote

a In this paper, POS are denoted using Penn Treebank II tags [9].