In Silico Biology 5, 0003 (2004); ©2004, Bioinformation Systems e.V.  
Ontology Workshop Göttingen 2004

PRIME: automatically extracted PRotein Interactions and Molecular Information databasE


Asako Koike1, 2,* and Toshihisa Takagi1




1 Dept. of Computational Biology, Graduate School of Frontier Science, The University of Tokyo, Kiban-3A1(CB01) 5-1-5, Kashiwanoha Kashiwa, Chiba, 277-8561, Japan
2 Central Research Laboratory, Hitachi Ltd. 1-280 Higashi-koigakubo Kokubunji city, Tokyo, 185-8601, Japan



* Corresponding author
   Dept. of Computational Biology Graduate School of Frontier Science,
   Univ. of Tokyo Kiban-3A1(CB01) 5-1-5 Kashiwanoha Kashiwa, Chiba, 277-8561, Japan
   Phone: +81-4-7136-3982;  Fax: +81-4-7136-3975
   Email: akoike@hgc.jp





Edited by H. Michael; received September 23, 2004; revised and accepted November 25, 2004; published December 22, 2004



Abstract

With the exponentially increasing amount of information in the biomedical field, the significance of advanced information retrieval and information extraction, as well as the role of databases, has been increasing. PRIME is an integrated gene/protein informatics database based on natural language processing. It provides automatically extracted protein/family/gene/compound interaction information including both physical and genetic interactions, gene ontology based functions, and graphic pathway viewers. Gene/protein/family names and functional terms are recognized based on dictionaries developed in our laboratory. The interaction and functional information are extracted by syntactic dependencies and various phrase patterns. We have included about 920,000 (non-redundant) protein interactions and 360,000 annotated gene-function relationships for major eukaryotes. By combining the sequence and text information, the pathway comparison between two organisms and simple pathway deduction based on other organism interaction data, and pathway filtering using tissue expression data, are also available. This database is accessible at http://prime.ontology.ims.u-tokyo.ac.jp:8081.

Keywords: protein interaction, biological process, pathway database, natural language processing



Introduction

Owing to rapid progress in molecular biology and the development of high-throughput technologies such as the two-hybrid system, DNA/protein microarrays, and mass spectrometry, an enormous amount of data has been generated, and much of it is stored in individual databases as document files. To interpret newly obtained high-throughput experimental data, we must first extract information from these scattered resources. However, manual information extraction is labor-intensive and sometimes not feasible. Natural language processing (NLP) is a promising solution to this problem, and many NLP-based information retrieval and information extraction systems have been developed. Most of them use only MEDLINE abstracts, because a large portion of useful information is described in scientific journal abstracts. These systems provide various kinds of information, such as protein-protein interactions [1, 2, 3], implicit relationship discovery between genes and diseases [4, 5], microarray interpretation [6], and more general information retrieval, as in MedMiner [7]. One of the major subjects in post-genomic biology is to clarify how genes, proteins, and compounds interact to form signaling/metabolic/regulatory networks and to understand the roles these networks play in biological processes. Protein network extraction and protein function extraction are therefore crucial subjects.

In our laboratory, we developed the "Kinase Pathway Database", which contains protein-protein/gene/compound interactions that are automatically extracted using natural language processing techniques [1]. The target genes/proteins are not limited to protein kinases but include all genes/proteins, because the proteins included in the kinases' protein network are not known in advance. Further, it contains classifications and orthologue definitions of protein kinases in major eukaryotes and provides graphic viewers for protein networks.

We have developed the "PRIME" database, which is an extended version of the "Kinase Pathway Database". Compared to the "Kinase Pathway Database", the PRIME database has been upgraded in at least four ways:

In this report, we will mainly summarize the new features of this database.



System description

Figure 1 represents the overall system structure of the PRIME database. In this figure the hatched parts represent the new features of the PRIME database.



Figure 1: The overall structure of PRIME database system. Hatched parts represent the new features of the PRIME database compared to Kinase Pathway Database.


Information extraction of protein interactions and protein functions

Gene/protein/family/compound interactions and biological functions are automatically extracted using natural language processing techniques. Biological functions are defined according to the Gene Ontology (GO), and GO-IDs are assigned to each gene/protein/family. The basic idea of protein interaction extraction is the same as in the "Kinase Pathway Database" [1]. The main changes are that the gene name recognition process has been replaced by a newly developed method [8] and that family interactions have been added to handle ambiguous names that cannot specify a gene locus. Because ambiguous names are often preferred when paralogous sequences exist (e.g., 14-3-3 instead of 14-3-3 alpha (YWHAA), gamma (YWHAG), epsilon (YWHAE), etc.), they are recognized using the family name dictionary. Our interaction gathering policy covers physical interactions, genetic interactions such as synthetic lethality, and more general relationships. For example, the relationship between MEKK and MAPK is neither a physical nor a genetic interaction, but the two interact indirectly. The biological function extraction process is described in reference [9], while the extraction method and the extraction performance for gene/protein/family/compound interactions are described in the following section. The content numbers are summarized in Tables 1 and 2. In addition, some data are gathered from external sources, such as journal tables of tandem-affinity purification and mass spectrometry complex data [10] and yeast two-hybrid data [11]. Although PRIME original data and external data are distinguished from each other, both are accessible in the PRIME database.

Table 1: The number of abstracts used and extracted interactions (Aug. 2004).

Organism          # of abstracts   # of protein/compound/family kinds   # of extracted interactions (non-redundant)
S. cerevisiae     52,925           3,169/5,311/1,800                    26,861 (18,139) + family 23,030 (13,481) *
C. elegans        5,477            1,039/1,249/636                      2,139 (1,822) + family 2,257 (1,757)
D. melanogaster   19,802           2,048/574/820                        7,288 (5,684) + family 5,178 (3,731)
M. musculus       681,391          6,549/10,325/3,114                   239,667 (143,878) + family 273,457 (102,515)
R. norvegicus     1,049,615        3,585/10,333/2,859                   224,774 (122,448) + family 390,515 (119,680)
H. sapiens        8,386,525        8,704/10,913/3,382                   449,664 (202,217) + family 633,859 (182,872)

* The former represents the number of protein/gene-protein/gene/compound interactions, and the latter represents the number of interactions including family names. The numbers in parentheses represent the non-redundant interaction counts.



Table 2: Function extractions for each organism (Aug. 2004).

Organism          Protein (family) kinds   Extracted function: gene (non-redundant) + family (non-redundant)
S. cerevisiae     2,607 (1,563)            22,426 (13,975) + 16,147 (8,201)
C. elegans        904 (535)                3,511 (2,591) + 1,943 (1,365)
D. melanogaster   1,695 (744)              8,039 (5,739) + 4,108 (2,437)
M. musculus       6,036 (2,725)            160,794 (61,465) + 117,174 (31,183)
R. norvegicus     3,128 (2,241)            107,196 (47,221) + 130,641 (32,885)
H. sapiens        7,155 (3,448)            259,100 (93,030) + 271,777 (56,317)



Sequence information and tissue information

Amino acid sequences were gathered from LocusLink (http://www.ncbi.nlm.nih.gov), WormBase (http://www.wormbase.org/), and FlyBase (http://flybase.bio.indiana.edu/). In addition to protein domain compositions based on InterPro (http://www.ebi.ac.uk/interpro/), tables of protein kinase orthologs, and structural information such as the SCOP classification ID [12] for each protein (calculated using PSI-BLAST [13], as in the Kinase Pathway Database), we added the sequence similarity between organisms, which was calculated simply by BLAST [13]. All sequence data are stored under a unique GENA-ID, linked to literature information, and used in the advanced pathway search. Tissue information was imported from the tissue database (http://tissuedb.ontology.ims.u-tokyo.ac.jp:8082/tissuedb/) developed by Ogasawara et al.


Graphic viewer

As in the Kinase Pathway Database, both simple protein network drawing and orthologous pathway drawing, which uses orthologue information and protein interaction information, are available. In addition, PRIME supports simple pathway deduction and interaction filtering (Figure 2). Snapshots of pathway drawing and pathway deduction are shown in Figure 2d and e. In pathway deduction, the protein network of one organism is drawn based on the protein interactions of another organism, using domain compositions and/or sequence similarities. For example, a two-step D. melanogaster network can be drawn from the H. sapiens two-step interactions from TAK1 to JNK, under the condition that the corresponding mouse proteins are selected using the thresholds of the same domain composition and a sequence similarity E-value < $10^{-2}$ (Figure 2e). When corresponding genes/proteins or interactions exist, the nodes and edges are drawn in red, as shown in Figure 2e. Interaction filtering uses the contents of the tissue database, which include expression data for each organ; its organ ontology has a hierarchical structure. That is, the protein network is drawn using only the proteins that are expressed in the specified organ.
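The deduction rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PRIME implementation: the function name, the data layout, the placeholder protein "MID", the target identifiers, the domain label, and all E-values are hypothetical. A target protein is treated as a counterpart of a source protein when both have the same domain composition and their E-value is below the threshold.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Interaction = Tuple[str, str]


def deduce_interactions(
    source_interactions: List[Interaction],
    domains_src: Dict[str, FrozenSet[str]],
    domains_tgt: Dict[str, FrozenSet[str]],
    evalues: Dict[Tuple[str, str], float],
    max_evalue: float = 1e-2,
) -> Set[Interaction]:
    """Transfer source-organism interactions to a target organism."""

    def counterparts(src_protein: str) -> List[str]:
        hits = []
        for tgt_protein, tgt_domains in domains_tgt.items():
            same_domains = domains_src.get(src_protein) == tgt_domains
            # A missing pair means no BLAST hit, i.e. no similarity.
            evalue = evalues.get((src_protein, tgt_protein), float("inf"))
            if same_domains and evalue < max_evalue:
                hits.append(tgt_protein)
        return hits

    deduced: Set[Interaction] = set()
    for a, b in source_interactions:
        for ta in counterparts(a):
            for tb in counterparts(b):
                deduced.add((ta, tb))
    return deduced


# Hypothetical toy data: "MID" stands for an unnamed intermediate protein.
kinase = frozenset({"protein kinase domain"})
source = [("TAK1", "MID"), ("MID", "JNK")]
domains_src = {"TAK1": kinase, "MID": kinase, "JNK": kinase}
domains_tgt = {"t_TAK1": kinase, "t_MID": kinase, "t_JNK": kinase}
evalues = {
    ("TAK1", "t_TAK1"): 1e-50,
    ("MID", "t_MID"): 1e-40,
    ("JNK", "t_JNK"): 1e-60,
    ("TAK1", "t_JNK"): 0.5,  # too weak, rejected by the threshold
}
deduced = deduce_interactions(source, domains_src, domains_tgt, evalues)
```

In this sketch the deduced edges are the ones a viewer would highlight; loosening `max_evalue` or dropping the domain check corresponds to the "and/or" choice mentioned in the text.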



Figure 2: The snapshots of PRIME database. a) top page, b) interaction information, c) text information, d) pathway drawing, e) pathway deduction.


Each network node and edge can be deleted interactively in the editing mode. Further, each node and edge is linked to the evidential text information.


PRIME interface

As shown in Figure 2a, PRIME provides several search menus: (1) pathway searches, including pathway deductions and comparisons, (2) protein/gene/family/compound interaction data searches, (3) protein data searches (including the biological functions of each protein), (4) orthologue data searches, (5) phylogenetic tree searches, and (6) protein structure searches. We use GENA searches to convert queries into GENA-IDs and accept broadly defined gene/protein/compound name queries. Further, to make it easy to retrieve the desired information, we added many options. The interaction information is searchable with multiple options, such as "negative relation" and "protein-gene interaction". The automatically extracted information is searchable with options such as extraction reliability and is linked to the evidential text information. Filtering of interaction information by specified verb kinds is also possible.


System software configurations

PRIME is implemented on PostgreSQL. The table structure of the relational database is summarized on the Web (http://prime.ontology.ims.u-tokyo.ac.jp:8081/comment/Table.html).



Information extraction

Figure 3 gives a brief overview of the information extraction system. Each step is explained below.



Figure 3: Overview of extraction process of gene/protein/family/compound interactions and gene/protein/family functions.


Step 1a) Gene/protein/family/compound name
To identify the corresponding gene/protein sequences and compound information from the gene/protein/compound names in documents, the names are recognized using the gene name dictionary (GENA) developed by our group. However, GENA is not sufficient to recognize all protein names because many proteins are described in abstracts using ambiguous gene names. For example, "interleukin 1" does not specify whether the protein is "interleukin 1 alpha" or "interleukin 1 beta"; that is, "interleukin 1" is a family name. To address this problem, we have developed a family name dictionary. The current version (Aug. 2004) contains more than 16,000 family names. Compared to the Kinase Pathway Database, the number of compounds registered in GENA has greatly increased, to more than 140,000 names.

In the present study, we used MEDLINE abstracts for the period 1965 - 2004, selected using the MeSH terms of each species, such as "Saccharomyces cerevisiae", "Caenorhabditis elegans", "Drosophila melanogaster", "mice", "rats", and "human". Gene/protein/family/compound names and functional terms were recognized quickly using specially devised "tries" (data structures for fast search) with many heuristics, such as special character treatments.

As far as possible, gene name ambiguity problems (for example, NIK is a synonym for both "Nck interacting kinase" and "NF-kappaB-inducing kinase") were resolved in post-processing, using techniques such as full-name/abbreviation matching and keyword searches. In this step, gene/protein/family IDs were assigned with about 90% precision and recall [8].
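As an illustration of dictionary-based recognition with a trie, the following minimal Python sketch performs a longest-match lookup over whitespace-separated tokens. It is an assumption-laden simplification, not the GENA implementation: the class and function names and the entry IDs are hypothetical, and the heuristics mentioned above (special character handling, abbreviation resolution) are omitted.

```python
_ID = object()  # sentinel key marking the end of a dictionary entry


class NameTrie:
    """Token-level trie mapping multi-word names to entry IDs."""

    def __init__(self):
        self.root = {}

    def add(self, name, entry_id):
        node = self.root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node[_ID] = entry_id

    def longest_match(self, tokens, start):
        """Return (end_index, entry_id) of the longest entry starting at `start`."""
        node, best = self.root, None
        for i in range(start, len(tokens)):
            node = node.get(tokens[i].lower())
            if node is None:
                break
            if _ID in node:
                best = (i + 1, node[_ID])
        return best


def tag_names(sentence, trie):
    """Scan left to right, always preferring the longest dictionary entry."""
    tokens = sentence.split()
    found, i = [], 0
    while i < len(tokens):
        match = trie.longest_match(tokens, i)
        if match:
            end, entry_id = match
            found.append((" ".join(tokens[i:end]), entry_id))
            i = end
        else:
            i += 1
    return found


trie = NameTrie()
trie.add("interleukin 1", "FAMILY:IL1")        # hypothetical family ID
trie.add("interleukin 1 alpha", "GENE:IL1A")   # hypothetical gene ID
hits = tag_names("interleukin 1 alpha and interleukin 1 were compared", trie)
```

Preferring the longest match means the specific name "interleukin 1 alpha" is assigned a gene ID, while the bare "interleukin 1" falls back to the family entry, which mirrors the role of the family name dictionary described above.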
Step 1b) Functional term recognition
To widely assign gene ontology-ID (GO-ID) to each gene/protein/family, the functional terms for biological processes are gathered using the following five methods: (1) related terms having a high co-occurrence score with GO terms; (2) similar terms having similar collocations with GO terms; (3) enzyme name extraction by pattern matching; (4) rule-based generation of syntactic/semantic variations; and (5) verb-technical term combination variations. By using all five methods, we gathered about 240,000 terms. (There were about 10,000 original GO terms.) The functional terms were also quickly recognized from all abstracts using a trie that considered trivial term variations (replacement of special characters with a space, etc.).
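The handling of trivial term variations can be illustrated as follows. This sketch assumes a simple normalization (lower-casing and replacing every run of non-alphanumeric characters with a single space) and uses a plain dictionary in place of the trie; the function names are hypothetical and the actual normalization rules used in PRIME may differ.

```python
import re


def normalize_term(term: str) -> str:
    """Lower-case and replace special characters with single spaces."""
    spaced = re.sub(r"[^a-z0-9]+", " ", term.lower())
    return " ".join(spaced.split())


def build_lookup(terms_to_go):
    """Index functional terms by their normalized form."""
    return {normalize_term(term): go_id for term, go_id in terms_to_go.items()}


def lookup(term, table):
    """Return the GO-ID for a term variant, or None if it is unknown."""
    return table.get(normalize_term(term))


table = build_lookup({"cell recognition": "GO:0008037"})
go_id = lookup("Cell-Recognition", table)
```

Because both the dictionary entries and the query pass through the same normalization, hyphenated or differently capitalized variants resolve to the same GO-ID.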
Step 2) Shallow parsing, noun phrase bracketing
The sentences that included the gene/protein/family/compound IDs assigned in step 1 were shallowly parsed by FDG-Lite (Conexor, http://www.conexoroy.com/products.htm), which assigns a word form, base form, part-of-speech, and light syntactic representation. Using the parts of speech and the syntactic representation, we performed noun phrase bracketing. We then analyzed coordinate clauses, subordinate clauses, and similar structures using various standard rules. More details are described in reference [9].
Step 3) ACTOR-OBJECT relationships extraction
Here, ACTOR and OBJECT mean "doer of action" and "receiver of action", respectively. In functional annotation, in most cases, ACTOR is the protein/family and OBJECT is the functional term. In interaction extraction, both ACTOR and OBJECT are proteins/genes/families/compounds. ACTOR and OBJECT are described not only in subject-object/complement relationships with pre-defined verbs, but also in modification relationships, subject-complement relationships, subject-adverb relationships, and so on. Some examples for gene/protein/family/compound interactions are listed in Table 3. In practice, more complicated patterns were also used. Detailed examples of functional annotations are described in reference [9].
Step 4) Interaction extraction
In this study, protein/gene/family-protein/gene/family/compound interactions were extracted. The interactions included genetic interactions such as synthetic lethality, physical interactions, and more general relationships. Interaction kinds, such as physical (physical or indirect interaction), biological process (activate, regulate, inhibit, react, modify (chemical modification), and others), and protein-protein or protein-gene (promoter region), were automatically assigned using patterns. Each interaction was checked for whether it was negative or affirmative and whether it was a contingent fact (indicated by words such as "predicated", "investigate", "test", "examine", "study", and "design"), and marks were added accordingly. The details regarding reliability and rank/mark are described on the web. Many restrictions were imposed on interaction extraction to avoid errors, and an extraction reliability was assigned according to the extraction process. For example, the extraction reliability of a gene name that follows a preposition, even within the same noun phrase, is set lower to avoid semantic errors and to meet the needs of various researchers. How distant a relationship should be extracted depends largely on the user. For example, in the sentence "Silencing the UPR by gene-A deletion diminished gene-B expression.", most researchers would probably want to extract the relationship between gene-A and gene-B. On the other hand, in the sentence "Silencing the UPR by gene-A deletion was diminished by gene-B expression.", whether to extract the relationship between gene-A and gene-B depends on the research purpose. Accordingly, we extract the latter relationship with the lowest reliability.
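The negative/contingent marking can be sketched as a simple cue-word check. The contingent cue words below are the ones quoted in the text; their inflected forms, the negation cue list, and the function name are assumptions made for illustration, and the actual system relies on parsing patterns rather than a bag of words.

```python
import re

# Cue words quoted in the text, expanded with assumed inflected forms.
CONTINGENT_FORMS = {
    "predicated",
    "investigate", "investigated", "investigates", "investigating",
    "test", "tested", "tests", "testing",
    "examine", "examined", "examines", "examining",
    "study", "studied", "studies", "studying",
    "design", "designed", "designs", "designing",
}

# Assumed negation cues; the paper does not list them.
NEGATION_CUES = {"not", "no", "neither", "nor", "cannot"}


def mark_sentence(sentence):
    """Return negative/contingent marks for one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {
        "negative": any(token in NEGATION_CUES for token in tokens),
        "contingent": any(token in CONTINGENT_FORMS for token in tokens),
    }


marks_tested = mark_sentence(
    "The ability of IFN alpha to induce IgA production was tested in vitro."
)
marks_negative = mark_sentence(
    "The RAP1 protein cannot efficiently stimulate HIS4 transcription."
)
```

A relation extracted from the first sentence would carry the contingent mark, and one from the second the negative mark, so that users can filter on these marks at search time.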
Step 5) Biological function extraction (type-1)
In some GO classes, such as "learning" and "transcription", it is difficult to decide from one sentence whether the assigned function is appropriate. For these classes, a keyword search in the same abstract was also done. After the extraction of the gene-function relationship, it was checked whether the relationship was negative or affirmative and whether it was a contingent fact, and marks were added accordingly.
Step 6) Biological function extraction (type-2)
To consider the wide variety of terms (for example, "GO:0008037 cell recognition" and "cell recognition of neural cell"), a keyword search in the OBJECT was done. The score for each word of a functional term (TermScore[i], the i-th term score) was defined as $\mathrm{TermScore}[i] = 1/(1 + \log(\mathrm{frequency}_i + 1))$, where $\mathrm{frequency}_i$ is the frequency of appearance of the word in abstracts over two years. The sum score of each collocation was calculated as $\mathrm{SumScore} = \sum_{i} \mathrm{TermScore}[i]$. The score for each collocation with the given keywords was defined as $\mathrm{CollScore} = \sum_{j \in \text{given keywords}} \mathrm{TermScore}[j] / \mathrm{SumScore}$. When the top CollScore was over 0.75, the corresponding GO-ID was accepted.
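The scoring in this step can be written out as a short Python sketch. Several points are assumptions rather than facts from the text: the logarithm is taken as the natural logarithm, SumScore is taken as the sum of TermScore over the distinct words of the functional term, the word frequencies are invented, and the function names are hypothetical.

```python
import math


def term_score(frequency):
    """TermScore = 1 / (1 + log(frequency + 1)); natural log is assumed."""
    return 1.0 / (1.0 + math.log(frequency + 1))


def coll_score(term_words, object_words, frequencies):
    """Share of the term's total score covered by words found in the OBJECT."""
    scores = {w: term_score(frequencies.get(w, 0)) for w in set(term_words)}
    sum_score = sum(scores.values())
    matched = sum(s for w, s in scores.items() if w in object_words)
    return matched / sum_score


def best_go_id(object_text, go_terms, frequencies, threshold=0.75):
    """Return the GO-ID with the top CollScore if it exceeds the threshold."""
    object_words = set(object_text.lower().split())
    best_id, best = None, 0.0
    for go_id, term in go_terms.items():
        score = coll_score(term.lower().split(), object_words, frequencies)
        if score > best:
            best_id, best = go_id, score
    return best_id if best > threshold else None


go_terms = {"GO:0008037": "cell recognition"}
frequencies = {"cell": 1000, "recognition": 10}  # hypothetical word counts
accepted = best_go_id("cell recognition of neural cell", go_terms, frequencies)
rejected = best_go_id("cell migration", go_terms, frequencies)
```

Because frequent words such as "cell" receive a low TermScore, matching only a common word is not enough to pass the 0.75 threshold, whereas matching the rarer, more informative words of a term is.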
Step 7) GO-ID assignment to genes
This step is the same as step 4.


Table 3: Example extraction patterns.
Patterns SentenceTypes Examples
Basic Type
A gene name and its interacting partner appear in different noun phrases connected by a verb phrase.
NP-VP-NP
(kinds of verbs are restricted in protein interaction extraction)
[Tumorigenic mutants of <gene>p53</gene>] bind to Daxx and inhibit [Daxx-dependent activation of the <gene>apoptosis signal-regulating kinase 1</gene> stress-inducible kinases and <gene>Jun NH(2)-terminal kinase</gene>].
NP-VP-PP [Tumorigenic mutants of <gene>p53</gene>] bind to [<gene>Daxx</gene>] and inhibit...
The structure suggests [that one face of <gene>Prp18</gene>] interacts with [the <gene>splicing factor Slu7</gene>]
NP-VP-NP
The verb is not specified, even in protein interaction extraction, when keywords exist in the object. Keywords: (affinity/activation/coprecipitation..) for/of/through...NP. The relative position of the keywords in the NP is restricted.
[<gene>Flibanserin</gene>] has [preferential affinity for <gene>serotonin 5-HT(1A)</gene>].
[<compound>ZK 91587</compound>] has been commercialized as [the 'ideal' ligand for the <gene>MCR</gene>].
In addition, chlamydocin induces apoptosis by [activating <gene>caspase-3</gene>, which in turn] leads to [the cleavage of <gene>p21</gene>]
NP-VP-NP PREP Verb-ing NP In addition, [<compound>chlamydocin</compound>] induces apoptosis by activating [<gene>caspase-3</gene>, which in turn] leads to the cleavage of p21.
NP-VP-NP (relative pronoun)-VP-NP [<gene>Rox1</gene>] is [an HMG-domain, DNA binding protein with a repression domain that] recruits [the <gene>Tup1</gene>/<gene>Ssn6</gene> general repression complex to achieve repression].
NP-VP-to-infinitive [<family>Lp (a)</family>] has been expected to bind [<compound>fibrin</compound>] by a competitive mechanism.
NP-VP-NP-PP NP [<gene>5-Hydroxytryptamine1A receptor/Gibetagamma</gene>] stimulates [<family>mitogen-activated protein kinase</family>] via [<family>NAD(P)H oxidase</family>].
Modification inside of NP NP-to infinitive [The ability of <gene>Interferon (IFN) alpha</gene>, <gene>beta</gene> and <gene>gamma</gene> to induce <gene>IgA</gene> production from IgA deficient patients lymphocytes] was tested "in vitro".
NP-verb/EN-PP [<gene>Bim-EL</gene> phosphorylated by <gene>Erk1/2</gene>] is rapidly degraded via the proteasome pathway.
NP-verb/EN-NP We demonstrated [complete inhibition of <family>RLX</family> induced <gene>NF-kappaB</gene> activation].
NP keywords (inhibitor, regulator, et al.) NP <gene>Xanthine oxidase</gene> inhibitor <compound>Allopurinol</compound> ....
NP by NP [<family>IIA PLA(2)</family> up-regulation] by [<gene>NF-kappaB</gene> inhibition].
Keywords (interaction/association et al.) between NP and NP In this regard, [the interaction between <gene>CCL5</gene> and <gene>CCR5</gene>] may be critical in regulating T cell functions, by mediating their recruitment and polarization, activation, and differentiation.
Keywords (role, et al.) of NP in NP [Most of the experimental evidence for a role of <gene>arginine-vasopressin (AVP)</gene> in <gene>adrenocorticotropic hormone (ACTH)</gene> release] comes from in vitro studies.
Keywords (complex/heterodimer, et al.) VPing NP and NP ... binds [a multimeric complex including <gene>Sp1</gene> and <gene>Sp3</gene> transcription factors].

[] represents noun phrase.
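One of the patterns in Table 3, "Keywords (interaction/association) between NP and NP", can be approximated with a regular expression over the tagged text. This is a deliberately simplified sketch: the real system works on parsed noun phrases with additional restrictions, and the helper names here are hypothetical.

```python
import re


def entity(n):
    """Regex for one tagged entity; the closing tag must match the opening tag."""
    return rf"<(?P<type{n}>gene|family|compound)>(?P<name{n}>.+?)</(?P=type{n})>"


BETWEEN = re.compile(
    rf"\b(?P<keyword>interaction|association)\s+between\s+"
    rf"{entity(1)}\s+and\s+{entity(2)}"
)


def extract_between(sentence):
    """Return (name1, keyword, name2) triples for the 'between A and B' pattern."""
    return [
        (m.group("name1"), m.group("keyword"), m.group("name2"))
        for m in BETWEEN.finditer(sentence)
    ]


sentence = (
    "In this regard, [the interaction between <gene>CCL5</gene> and "
    "<gene>CCR5</gene>] may be critical in regulating T cell functions."
)
triples = extract_between(sentence)
```

Applied to the Table 3 example sentence, the pattern yields the CCL5/CCR5 pair; sentences without tagged entities on both sides of "and" produce no match.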





Results

The preliminary experimental results for the extraction of the gene/protein's biological functions are summarized in the reference [9]. They demonstrated that our method has an estimated recall of 54 - 64%, with a precision of 91 - 94% for functions actually described in abstracts. When applied to all MEDLINE abstracts, it extracted over 224,000 gene-GO relationships and 132,000 family-GO relationships for major eukaryotes, which are summarized in Table 2.

Concerning gene/protein/family interaction extraction, both precision and recall were improved over the Kinase Pathway Database by improving the gene/protein/family name recognition and sentence structure analysis processes. The precision (true_positive/(true_positive+false_positive)) is summarized in Table 4. It was calculated using 100 randomly selected extracted interactions (excluding compound names). In an examination of 100 randomly selected human interactions with compound names, the precision was a little lower (by 2 - 3%) because of errors in compound name recognition. In this evaluation, the interaction type/kind, such as "activate", "inhibit", or "regulate", and whether or not it was a "physical interaction", was not judged. Our only objective was to decide whether the two gene/protein/family/compound-IDs and their interaction direction were appropriate since, in many cases, the interaction kind cannot be clearly recognized from the target sentence alone. Even when an interaction is written in one sentence, it is frequently described in a complex sentence using combinations of nouns/adverbs and verbs, where the meaning of each verb sometimes changes with the context. For example, "The anticancer potency of TRAIL is associated with the decreased expression of NF-kappaB and survivin and increased expression of Caspase3 of gastric cancer cells." In this sentence, "associate" is used with the same meaning as "relate". In many other sentences, however, "associate" is used to mean "bind". Thus, we cannot distinguish a "physical interaction" from an "indirect interaction" from the verb alone.

Table 4: Precision of gene/protein/family interaction (including GENA errors).
Organism Precision
S. cerevisiae 92 (93)%
C. elegans 94 (94)%
D. melanogaster 86 (94)%
M. musculus 89 (92)%
R. norvegicus 90 (94)%
H. sapiens 90 (93)%
The numbers in parentheses represent the precision, excluding errors of gene name recognition.

As shown in Table 4, the precision for D. melanogaster is a little lower than for the other organisms because D. melanogaster has many ambiguous gene names that are spelled the same as general nouns and verbs. Many of the false positives in all organisms are caused by shallow-parser errors. Confusions between "adverb/past participle" and "past tense" and between "verb" and "noun" are the main causes. Some false positives are in a gray zone, and some are errors of semantic interpretation. For example, "The full activity of a recombination initiation site located 5' of HIS4 requires the binding of the transcription factors RAP1, BAS1, and BAS2." The meaning of "binding" between RAP1 and BAS1 is not clear in this sentence. (Of course, the possibility of binding between them is high.) However, the "binding of protein-A and protein-B" is sometimes used with the same meaning as "binding between protein-A and protein-B". Removing this kind of error is difficult. Although some verbs with broad meanings, such as "related to" and "cause", are also used to increase the recall, their precision is not high. Further, some false positives are observed in the patterns whose verbs are not specified when the keywords are found in the OBJECT (Table 3: NP-VP-NP without verb specification). In this pattern, although many restrictions were imposed on the relative position between keywords and gene/protein/family/compound names, some extracted pairs are unrelated and some have the wrong interaction direction. Errors in the SUBJECT-OBJECT recognition part are infrequent.

The recall (true_positive/(true_positive+false_negative)) was calculated using manually extracted interactions from 370 abstracts for S. cerevisiae and from 250 abstracts for H. sapiens. The recall largely depends on how distant a relationship is counted as an extraction target. Since genetic interactions are also our extraction targets, the "sds3-swi6" relationship must be extracted from the following sentence: "We found that sds3 is synthetically lethal in combination with a deletion of the SWI6 (SDS11) gene." Further, various other relationships are also our extraction targets, and the glucose-HXK1, glucose-GLK1, and glucose-HXK2 relationships as well as the Hxk2p-HXK1, Hxk2p-GLK1, and Hxk2p-HXK2 relationships can be extracted from the following sentence: "Here we demonstrate the involvement of Hxk2p in the glucose-induced repression of the HXK1 and GLK1 genes and the glucose-induced expression of the HXK2 gene". However, implicit relationships are not our target. "In the absence of GCN4, BAS1, and BAS2, the RAP1 protein binds to the HIS4 promoter in vivo but cannot efficiently stimulate HIS4 transcription." In this sentence, although there may be relationships between GCN4, BAS1, BAS2, RAP1 and the HIS4 promoter, they are not clearly described, and so they are not extraction targets in our method. Under our criteria, the recall is 51% (including compounds) in S. cerevisiae (manually extracted interactions: 507) and 54% in H. sapiens (278). In this corpus, GENA-IDs/family-name IDs could be manually assigned for 95% of the gene/protein names and for 85% of the compound names. These numbers represent the upper limits of information extraction recall in the current dictionary-based approach. Most of the false negatives were due to anaphora (co-reference of an expression with its antecedent), either across multiple sentences or within one sentence, and anaphora resolution techniques were not included in our method.
Although various attempts at anaphora resolution have been reported so far, they were not accurate enough to apply to this system. Anaphora is observed more frequently in descriptions of genetic interactions or general relationships than in descriptions of physical interactions. In most cases, these problems are not resolved by using a full parser. In preliminary investigations, shallow parsing and rule-based SUBJECT-OBJECT recognition were sufficient to extract this kind of information.
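For reference, the two evaluation measures used in this section can be computed as in the following sketch. The true/false positive and negative counts are illustrative assumptions chosen to reproduce the reported rates (90% precision on a 100-interaction sample for H. sapiens, and about 51% recall against 507 manually extracted S. cerevisiae interactions); the paper reports only the rates and totals, not these raw counts.

```python
def precision(true_positive, false_positive):
    """Fraction of extracted interactions that are correct."""
    return true_positive / (true_positive + false_positive)


def recall(true_positive, false_negative):
    """Fraction of manually identified interactions that were extracted."""
    return true_positive / (true_positive + false_negative)


# Hypothetical counts consistent with the reported figures.
sample_precision = precision(true_positive=90, false_positive=10)
corpus_recall = recall(true_positive=259, false_negative=248)  # 259 + 248 = 507
```

Note that precision is estimated on a random sample of system output, while recall requires a manually annotated corpus, which is why the two measures are based on different data sets in this evaluation.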



Discussion

We have developed an integrated database, which contains automatically extracted protein interactions and functions. This database also provides graphic viewers and, using a combination of extracted text information and sequence information, simple pathway comparison, deduction, and filtering are possible.

Concerning protein function extraction, techniques are becoming more widely known as competitions such as BioCreAtIvE and TREC are held. The number of extracted function relationships in our system is about 1,100,000 (360,000 non-redundant). The estimated recall is 54 - 64% with a precision of 91 - 94% for functions actually described in abstracts. Although the recall of our method is not high, its precision seems to be at a practical level, and it may be as useful as the advanced information retrieval system discussed in the previous paper [9].

Concerning protein interactions, various techniques have been reported [1, 2, 3, 14]. However, the extraction targets of most of these methods are limited to physical interactions or other clear regulatory relationships, such as activate, inhibit, and regulate. In our method, the extraction targets also include genetic interactions such as synthetic lethality. Their descriptions are more complicated than those of physical interactions, which makes their extraction challenging, but they are important for understanding the relationship between two genes. Further, many of the reported methods recognize only gene/protein names, although network descriptions are often written using more ambiguous names and cell signaling is sometimes mediated by compounds. In our method, ambiguous names and compound names are recognized using the family name dictionary and GENA. Although the compound name content is currently not sufficient, an exhaustive collection of protein interaction data is becoming feasible. The number of extracted interactions in our system is about 2,280,000 (920,000 non-redundant). The precision is about 86 - 94% (92 - 94%, excluding GENA errors) and the recall is 51 - 54%. Since the main cause of the low recall is the anaphora problem, improving the recall without decreasing the precision may be difficult. However, this recall is calculated at the sentence level. Since interaction information is described repeatedly, the recall at the fact level is expected to be higher. Further, since the number of full papers registered in PubMed Central has been increasing, the recall at the fact level should become higher by using full papers.

There are many manually developed databases, such as DIP [15], BIND [16], KEGG [17], TRANSPATH [18], and Reactome [19]. DIP and BIND contain simple interaction data, and KEGG, TRANSPATH, and Reactome are metabolic network and/or signal transduction databases. These databases contain detailed information, but their coverage is not sufficient and they sometimes lack up-to-date information. At this stage, our automatic method can assist the former databases. However, automatic construction of the latter databases seems to be difficult using only abstract information because, in these databases, the curators select the appropriate interactions/networks, and not all interactions described in the papers are registered. Further, consensus networks are obtained from reliable reviews and textbooks. Nevertheless, the information extracted by automatic methods is definitely more abundant than that obtained by manual extraction. Considering this, this kind of automatically extracted database is appropriate for use in advanced information retrieval or data-mining tools, including implicit knowledge discovery and the interpretation of high-throughput data. In the near future, a data-mining tool for implicit knowledge discovery based on PRIME data and other NLP techniques will be made available to the public.



Acknowledgements

We wish to acknowledge Yo Shidahara, Koji Shintaku, and Kouichiro Yamada for their extensive readings of abstracts and assistance in the construction of the family name dictionary. We would like to thank Mr. K. Kodama at Hitachi ULSI Systems for helping us by programming the PRIME database. This work is supported in part by a grant from the Ministry of Education, Culture, Sports, Science, and Technology of Japan for scientific research in priority areas, such as genome information science.




References


  1. Koike, A., Kobayashi, Y. and Takagi, T. (2003). Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res. 13, 1231-1243.

  2. Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A. and Mazo, I. (2004). Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 20, 604-611.

  3. Temkin, J. M. and Gilder, M. R. (2003). Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics 19, 2046-2053.

  4. Perez-Iratxeta, C., Bork, P. and Andrade, M. A. (2002). Association of genes to genetically inherited diseases using data mining. Nat. Genet. 31, 316-319.

  5. Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A. and Mazo, I. (2004). Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 20, 604-611.

  6. Jenssen, T. K., Kuo, W. P., Stokke, T. and Hovig, E. (2002). Associations between gene expressions in breast cancer and patient survival. Hum. Genet. 111, 411-420.

  7. Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L. and Weinstein, J. N. (1999). MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques 27, 1210-1214, 1216-1217.

  8. Koike, A. and Takagi, T. (2004). Proceedings of HLT/NAACL BioLINK workshop, Boston, Massachusetts, pp. 9-16.

  9. Koike, A., Niwa, Y. and Takagi, T. (2004). Automatic extraction of gene/protein biological functions from biomedical text, Bioinformatics, in press.

  10. Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A. M., Cruciat, C. M., Remor, M., Hofert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M. A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G. and Superti-Furga, G. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141-147.

  11. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98, 4569-4574.

  12. Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 30, 264-267.

  13. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

  14. Friedman, C., Kra, P., Yu, H., Krauthammer, M. and Rzhetsky, A. (2001). GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17 Suppl. 1, S74-S82.

  15. Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. and Eisenberg, D. (2002). DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303-305.

  16. Bader, G. D., Betel, D. and Hogue, C. W. (2003). BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 31, 248-250.

  17. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. and Hattori, M. (2004). The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, D277-D280.

  18. Krull, M., Voss, N., Choi, C., Pistor, S., Potapov, A. and Wingender, E. (2003). TRANSPATH: an integrated database on signal transduction and a tool for array analysis. Nucleic Acids Res. 31, 97-100.

  19. Joshi-Tope, G., Vastrik, I., Gopinath, G. R., Matthews, L., Schmidt, E., Gillespie, M., D'Eustachio, P., Jassal, B., Lewis, S., Wu, G., Birney, E. and Stein, L. (2003). The Genome Knowledgebase: a resource for biologists and bioinformaticists. CSHL Symposium 2003 68, 237-243, CSHL Press, CSH, NY.