In Silico Biology 5, 0003 (2004); ©2004, Bioinformation Systems e.V.
Ontology Workshop Göttingen 2004
1 Dept. of Computational Biology, Graduate School of Frontier Science, The University of Tokyo, Kiban-3A1(CB01) 5-1-5, Kashiwanoha Kashiwa, Chiba, 277-8561, Japan
2 Central Research Laboratory, Hitachi Ltd. 1-280 Higashi-koigakubo Kokubunji city, Tokyo, 185-8601, Japan
* Corresponding author
Dept. of Computational Biology Graduate School of Frontier Science,
Univ. of Tokyo Kiban-3A1(CB01) 5-1-5 Kashiwanoha Kashiwa, Chiba, 277-8561, Japan
Phone: +81-4-7136-3982; Fax: +81-4-7136-3975
Email: akoike@hgc.jp
Edited by H. Michael; received September 23, 2004; revised and accepted November 25, 2004; published December 22, 2004
With the exponentially increasing amount of information in the biomedical field, the significance of advanced information retrieval and information extraction, as well as the role of databases, has been increasing. PRIME is an integrated gene/protein informatics database based on natural language processing. It provides automatically extracted protein/family/gene/compound interaction information including both physical and genetic interactions, gene ontology based functions, and graphic pathway viewers. Gene/protein/family names and functional terms are recognized based on dictionaries developed in our laboratory. The interaction and functional information are extracted by syntactic dependencies and various phrase patterns. We have included about 920,000 (non-redundant) protein interactions and 360,000 annotated gene-function relationships for major eukaryotes. By combining the sequence and text information, the pathway comparison between two organisms and simple pathway deduction based on other organism interaction data, and pathway filtering using tissue expression data, are also available. This database is accessible at http://prime.ontology.ims.u-tokyo.ac.jp:8081.
Keywords: protein interaction, biological process, pathway database, natural language processing
Due to the rapid progress of molecular biology and developments in high-throughput technologies, such as the two-hybrid system, DNA/protein microarrays, and mass spectrometry, an enormous amount of data has been created and is stored mainly as document files in individual databases. To interpret newly obtained high-throughput experimental data, we must first extract information from various scattered resources. However, manual information extraction is a labor-intensive task and is sometimes not feasible. Natural language processing (NLP) is a promising solution to this problem, and many NLP-based information retrieval and information extraction systems have been developed. Most of them use only MEDLINE abstracts because a large portion of useful information is described in scientific journal abstracts. There are several varieties of systems that provide information, such as protein-protein interactions [1, 2, 3], implicit relationship discovery between genes and diseases [4, 5], micro-array interpretations [6], and more general information retrieval systems such as MedMiner [7]. Since one of the major subjects in post-genomic biology is to clarify how genes, proteins, and compounds interact to form signaling/metabolic/regulatory networks, and to understand the roles these networks play in biological processes, protein network extraction and protein function extraction are crucial subjects.
In our laboratory, we developed a "Kinase Pathway Database", which contains protein-protein/gene/compound interactions that are automatically extracted using natural language processing techniques [1]. The target genes/proteins are not limited to the protein kinases, but include all genes/proteins because the proteins included in the kinases' protein network are not known in advance. Further, it contains classifications and orthologous definitions of protein kinases in major eukaryotes, and provides graphic viewers for protein networks.
We have since developed the "PRIME" database, which is an extended version of the "Kinase Pathway Database". Compared to the "Kinase Pathway Database", the PRIME database has been upgraded in at least four ways.
In this report, we will mainly summarize the new features of this database.
Figure 1 represents the overall system structure of the PRIME database. In this figure the hatched parts represent the new features of the PRIME database.
Figure 1: The overall structure of the PRIME database system. Hatched parts represent the new features of the PRIME database compared to the Kinase Pathway Database.
Information extraction of protein interactions and protein functions
Gene/protein/family/compound interactions and biological functions are automatically extracted using natural language processing techniques. The definition of biological function is based on Gene Ontology (GO), and GO-IDs were assigned to each gene/protein/family. The basic idea of protein interaction extraction is the same as in the "Kinase Pathway Database" [1]. The main changes are that the gene/protein name recognition process has been replaced by a newly developed method [8] and that family interactions were added to handle ambiguous names that cannot specify a gene locus. Because ambiguous names are often preferred when paralogous sequences exist (e.g. 14-3-3 instead of 14-3-3 alpha (YWHAA), gamma (YWHAG), epsilon (YWHAE) etc.), they are recognized based on the family name dictionary. In our interaction gathering policy, we include both physical and genetic interactions, such as synthetic lethality, and more general relationships. For example, the relationship between MEKK and MAPK is neither a physical nor a genetic interaction, but the two interact indirectly. The biological function extraction process is described in reference [9], while the extraction method and the extraction performance of gene/protein/family/compound interactions are described in the following section. The content numbers are summarized in Tables 1 and 2. In addition, some data are gathered from external databases, such as journal tables of tandem-affinity purification and mass spectrometry complex data [10] and yeast two-hybrid data [11]. Although PRIME original data and external data are distinguished, both are accessible in the PRIME database.
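The fallback from a locus-specific name to a family name can be illustrated with a minimal sketch. The two toy dictionaries, the identifier formats (`GENA:...`, `FAM:...`), and the function name are hypothetical placeholders. They are not PRIME's actual dictionaries or the recognition method of reference [8]; the sketch only shows a lookup that prefers a gene-level entry and otherwise falls back to a family-level entry.

```python
# Hypothetical toy dictionaries (assumption); the real dictionaries are far larger.
GENE_DICT = {
    "14-3-3 alpha": "GENA:YWHAA",
    "14-3-3 gamma": "GENA:YWHAG",
    "14-3-3 epsilon": "GENA:YWHAE",
}
FAMILY_DICT = {"14-3-3": "FAM:14-3-3"}


def recognize(name):
    """Return (kind, identifier) for a name, preferring the gene dictionary."""
    key = name.strip().lower()
    if key in GENE_DICT:
        return ("gene", GENE_DICT[key])
    if key in FAMILY_DICT:
        # Ambiguous name: it cannot specify a single gene locus.
        return ("family", FAMILY_DICT[key])
    return ("unknown", None)
```

With these toy entries, `recognize("14-3-3 gamma")` resolves to a gene-level identifier, while the bare name `recognize("14-3-3")` resolves only to the family-level identifier.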
| Table 1: | The number of abstracts used and extracted interactions. (Aug, 2004). |
| Organism | # of abstracts | # of protein/compound/family kinds | # of extracted interactions (non-redundant) |
| S. cerevisiae | 52,925 | 3,169/5,311/1,800 | 26,861 (18,139) + family 23,030 (13,481) * |
| C. elegans | 5,477 | 1,039/1,249/636 | 2,139 (1,822) + family 2,257 (1,757) |
| D. melanogaster | 19,802 | 2,048/574/820 | 7,288 (5,684) + family 5,178 (3,731) |
| M. musculus | 681,391 | 6,549/10,325/3,114 | 239,667 (143,878) + family 273,457 (102,515) |
| R. norvegicus | 1,049,615 | 3,585/10,333/2,859 | 224,774 (122,448) + family 390,515 (119,680) |
| H. sapiens | 8,386,525 | 8,704/10,913/3,382 | 449,664 (202,217) + family 633,859 (182,872) |
| * The former represents the number of protein/gene-protein/gene/compound interactions, and the latter represents the number of interactions including family names. The numbers in parentheses represent the non-redundant interaction counts. |
| Table 2: | Function extractions for each organism. (Aug, 2004) |
| Organism | Protein (family) kinds | Extracted function gene (non-redundant) + family (non-redundant) |
| S. cerevisiae | 2,607 (1,563) | 22,426 (13,975) + 16,147 (8,201) |
| C. elegans | 904 (535) | 3,511 (2,591) + 1,943 (1,365) |
| D. melanogaster | 1,695 (744) | 8,039 (5,739) + 4,108 (2,437) |
| M. musculus | 6,036 (2,725) | 160,794 (61,465) + 117,174 (31,183) |
| R. norvegicus | 3,128 (2,241) | 107,196 (47,221) + 130,641 (32,885) |
| H. sapiens | 7,155 (3,448) | 259,100 (93,030) + 271,777 (56,317) |
Sequence information and tissue information
Amino acid sequences were gathered from LocusLink (http://www.ncbi.nlm.nih.gov), WormBase (http://www.wormbase.org/), and FlyBase (http://flybase.bio.indiana.edu/). In addition to protein domain compositions based on InterPro (http://www.ebi.ac.uk/interpro/), tables of protein kinase orthologs, and structural information such as the SCOP classification ID [12] for each protein (calculated using PSI-BLAST [13], as in the Kinase Pathway Database), we added the sequence similarity between the organisms, which was calculated simply by BLAST [13]. All sequence data are stored under a unique GENA-ID, linked to literature information, and used in the advanced pathway search. Tissue information was imported from the tissue database (http://tissuedb.ontology.ims.u-tokyo.ac.jp:8082/tissuedb/) developed by Ogasawara et al.
Graphic viewer
In the Kinase Pathway Database, we can perform both simple protein network drawing and orthologous pathway drawing using orthologous information and protein interaction information. In PRIME, simple pathway deduction and interaction filtering are also possible (Figure 2). Snapshots of pathway drawing are shown in Figure 2d and e. In pathway deduction, the protein network of one organism is drawn based on the protein interactions of another organism, using domain compositions and/or sequence similarities. For example, a two-step D. melanogaster network can be drawn from the H. sapiens two-step interactions from TAK1 to JNK, under the condition that the corresponding mouse proteins are selected using the thresholds of the same domain composition and a sequence similarity E-value < 10^-2 (Figure 2e). When corresponding genes/proteins or interactions exist, the nodes and edges are drawn in red, as shown in Figure 2e. Interaction filtering is done using the contents of the tissue database, which include expression data for each organ and whose ontology (organ ontology) has a hierarchical structure. That is, the protein network is drawn using only proteins that are expressed in the specified organ.
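The pathway deduction step can be sketched as follows. This is a minimal illustration under stated assumptions, not PRIME's implementation: the function name, the data layout (`best_hits`, `domains_src`, `domains_tgt`), and the choice of a single best BLAST hit per protein are all hypothetical. Only the two selection criteria, identical domain composition and an E-value below 10^-2, come from the text.

```python
def deduce_network(source_edges, best_hits, domains_src, domains_tgt,
                   max_evalue=1e-2):
    """Map each source-organism interaction onto the target organism.

    source_edges: list of (protein_a, protein_b) in the source organism
    best_hits:    {source_protein: (target_protein, blast_evalue)}  (assumed layout)
    domains_*:    {protein: frozenset of domain IDs}
    """
    def counterpart(protein):
        hit = best_hits.get(protein)
        src_domains = domains_src.get(protein)
        if hit is None or src_domains is None:
            return None
        target, evalue = hit
        same_domains = src_domains == domains_tgt.get(target)
        return target if (evalue < max_evalue and same_domains) else None

    deduced = []
    for a, b in source_edges:
        ta, tb = counterpart(a), counterpart(b)
        # An edge is deduced only when both endpoints have a counterpart.
        if ta is not None and tb is not None:
            deduced.append((ta, tb))
    return deduced
```

An edge whose endpoint has no counterpart passing both thresholds is simply dropped, so the deduced network can be sparser than the source network.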
Figure 2: Snapshots of the PRIME database. a) top page, b) interaction information, c) text information, d) pathway drawing, e) pathway deduction.
Each network node and edge can be deleted interactively in the editing mode. Further, each node and edge is linked to the evidential text information.
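The tissue-based interaction filtering used by the viewer can be sketched as below. The data structures and function names are hypothetical, and the sketch assumes that expression recorded for a sub-organ also counts as expression in its parent organ; the text states only that the organ ontology is hierarchical, so this propagation rule is an assumption.

```python
def organ_with_descendants(organ, children):
    """Collect an organ and every organ below it in the hierarchy."""
    seen, stack = set(), [organ]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def filter_by_organ(edges, expressed_in, organ, children):
    """Keep edges whose two proteins are both expressed in the organ subtree.

    expressed_in: {protein: set of organ names}        (assumed layout)
    children:     {organ: list of direct sub-organs}   (assumed layout)
    """
    organs = organ_with_descendants(organ, children)

    def expressed(protein):
        return bool(expressed_in.get(protein, set()) & organs)

    return [(a, b) for a, b in edges if expressed(a) and expressed(b)]
```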
PRIME interface
As shown in Figure 2a, PRIME provides several search menus: (1) pathway searches, including pathway deductions and comparisons, (2) protein/gene/family/compound interaction data searches, (3) protein data searches (including the biological function of each protein), (4) orthologue data searches, (5) phylogenetic tree searches, and (6) protein structure searches. We use GENA searches to convert queries into GENA-IDs and accept broadly defined gene/protein/compound name queries. Further, we added many options to make it easy to retrieve the desired information. Accordingly, the interaction information is searchable with multiple options, such as "negative relation" and "protein-gene interaction". The automatically extracted information is searchable with options such as extraction reliability and is linked to evidential text information. Further, interaction information can also be filtered by specified verb kinds.
System software configurations
PRIME is implemented on PostgreSQL. The table structure of the relational database is summarized on the Web (http://prime.ontology.ims.u-tokyo.ac.jp:8081/comment/Table.html).
In Figure 3, we simply summarize the information extraction system. An explanation of each step is given in the following.
Figure 3: Overview of the extraction process of gene/protein/family/compound interactions and gene/protein/family functions.
| Table 3: | Example extraction patterns. |
| Patterns | SentenceTypes | Examples |
| Basic type: a gene name and its interacting partner appear in different noun phrases connected by a verb phrase. | NP-VP-NP (kinds of verbs are restricted in protein interaction extraction) | [Tumorigenic mutants of <gene>p53</gene>] bind to Daxx and inhibit [Daxx-dependent activation of the <gene>apoptosis signal-regulating kinase 1</gene> stress-inducible kinases and <gene>Jun NH(2)-terminal kinase</gene>]. |
| | NP-VP-PP | [Tumorigenic mutants of <gene>p53</gene>] bind to [<gene>Daxx</gene>] and inhibit... |
| | | The structure suggests [that one face of <gene>Prp18</gene>] interacts with [the <gene>splicing factor Slu7</gene>] |
| | NP-VP-NP. The verb is not specified even in protein interaction extraction when keywords exist in the object. Keywords: (affinity/activation/coprecipitation..) for/of/through...NP. The relative position of keywords in the NP is restricted. | [<gene>Flibanserin</gene>] has [preferential affinity for <gene>serotonin 5-HT(1A)</gene>]. |
| | | [<compound>ZK 91587</compound>] has been commercialized as [the 'ideal' ligand for the <gene>MCR</gene>]. |
| | | In addition, chlamydocin induces apoptosis by [activating <gene>caspase-3</gene>, which in turn] leads to [the cleavage of <gene>p21</gene>] |
| | NP-VP-NP PREP Verb-ing NP | In addition, [<compound>chlamydocin</compound>] induces apoptosis by activating [<gene>caspase-3</gene>, which in turn] leads to the cleavage of p21. |
| | NP-VP-NP (relative pronoun)-VP-NP | [<gene>Rox1</gene>] is [an HMG-domain, DNA binding protein with a repression domain that] recruits [the <gene>Tup1</gene>/<gene>Ssn6</gene> general repression complex to achieve repression]. |
| | NP-VP-to-infinitive | [<family>Lp (a)</family>] has been expected to bind [<compound>fibrin</compound>] by a competitive mechanism. |
| | NP-VP-NP-PP NP | [<gene>5-Hydroxytryptamine1A receptor/Gibetagamma</gene>] stimulates [<family>mitogen-activated protein kinase</family>] via [<family>NAD(P)H oxidase</family>]. |
| Modification inside of NP | NP-to infinitive | [The ability of <gene>Interferon (IFN) alpha</gene>, <gene>beta</gene> and <gene>gamma</gene> to induce <gene>IgA</gene> production from IgA deficient patients lymphocytes] was tested "in vitro". |
| | NP-verb/EN-PP | [<gene>Bim-EL</gene> phosphorylated by <gene>Erk1/2</gene>] is rapidly degraded via the proteasome pathway. |
| | NP-verb/EN-NP | We demonstrated [complete inhibition of <family>RLX</family> induced <gene>NF-kappaB</gene> activation]. |
| | NP keywords (inhibitor, regulator, etc.) NP | <gene>Xanthine oxidase</gene> inhibitor <compound>Allopurinol</compound> .... |
| | NP by NP | [<family>IIA PLA(2)</family> up-regulation] by [<gene>NF-kappaB</gene> inhibition]. |
| | Keywords (interaction/association etc.) between NP and NP | In this regard, [the interaction between <gene>CCL5</gene> and <gene>CCR5</gene>] may be critical in regulating T cell functions, by mediating their recruitment and polarization, activation, and differentiation. |
| | Keywords (role, etc.) of NP in NP | [Most of the experimental evidence for a role of <gene>arginine-vasopressin (AVP)</gene> in <gene>adrenocorticotropic hormone (ACTH)</gene> release] comes from in vitro studies. |
| | Keywords (complex/heterodimer, etc.) VPing NP and NP | ... binds [a multimeric complex including <gene>Sp1</gene> and <gene>Sp3</gene> transcription factors]. |
[] represents noun phrase.
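The basic NP-VP-NP pattern in Table 3 can be illustrated with a small sketch over shallow-parser output. Everything here is an assumption made for illustration: the chunk representation as `(label, text)` pairs, the short verb list, and the function names. PRIME's actual rules, verb restrictions, and SUBJECT-OBJECT recognition are more elaborate and are not reproduced here.

```python
import re

# Illustrative verb list only (assumption); the real restricted list is not given.
INTERACTION_VERBS = {"bind", "binds", "interacts", "inhibit", "inhibits",
                     "stimulates", "recruits"}
# Matches <gene>..</gene>, <family>..</family>, <compound>..</compound>.
TAG = re.compile(r"<(gene|family|compound)>(.*?)</\1>")


def entities(text):
    """Return the tagged names found in a chunk."""
    return [m.group(2).strip() for m in TAG.finditer(text)]


def extract_np_vp_np(chunks):
    """chunks: list of (label, text) pairs from a shallow parser."""
    found = []
    for i in range(len(chunks) - 2):
        (l1, t1), (l2, t2), (l3, t3) = chunks[i:i + 3]
        if (l1, l2, l3) != ("NP", "VP", "NP"):
            continue
        verbs = [w for w in t2.lower().split() if w in INTERACTION_VERBS]
        if not verbs:
            continue  # verb kinds are restricted in this pattern
        for subj in entities(t1):
            for obj in entities(t3):
                found.append((subj, verbs[0], obj))
    return found
```

Applied to the chunked NP-VP-NP-PP NP example from Table 3, the sketch yields the subject-verb-object triple for the first two noun phrases and ignores the trailing "via" phrase, which a separate pattern would have to handle.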
The preliminary experimental results for the extraction of the gene/protein's biological functions are summarized in the reference [9]. They demonstrated that our method has an estimated recall of 54 - 64%, with a precision of 91 - 94% for functions actually described in abstracts. When applied to all MEDLINE abstracts, it extracted over 224,000 gene-GO relationships and 132,000 family-GO relationships for major eukaryotes, which are summarized in Table 2.
Concerning gene/protein/family interaction extraction, the precision and recall were increased compared to the Kinase Pathway Database by improving the gene/protein/family name recognition and sentence structure analysis processes. The precision (true_positive/(true_positive+false_positive)) is summarized in Table 4. The precision was calculated using 100 randomly extracted interactions (excluding compound names). In an investigation of 100 randomly extracted human interactions with compound names, the precision was a little lower (by 2 - 3%) due to wrong compound name recognition. In this evaluation, the interaction type/kind, such as "activate", "inhibit", or "regulate", and whether or not it is a "physical interaction", was not judged. Our only objective was to decide whether the two gene/protein/family/compound-IDs and their interaction direction were appropriate since, in many cases, the interaction kind cannot be clearly recognized from the target sentence alone. Even if they are written in one sentence, they are frequently described in complex sentences using combinations of nouns/adverbs and verbs, where the meaning of each verb sometimes changes with the context. For example, "The anticancer potency of TRAIL is associated with the decreased expression of NF-kappaB and survivin and increased expression of Caspase3 of gastric cancer cells." In this sentence, "associate" is used with the same meaning as "relate". But, in many sentences, "associate" is also used to mean "bind". Thus, we cannot distinguish a "physical interaction" from an "indirect interaction" from the verb alone.
| Table 4: | Precision of gene/protein/family interaction (including GENA errors). |
| Organism | Precision |
| S. cerevisiae | 92 (93)% |
| C. elegans | 94 (94)% |
| D. melanogaster | 86 (94)% |
| M. musculus | 89 (92)% |
| R. norvegicus | 90 (94)% |
| H. sapiens | 90 (93)% |
| The numbers in parentheses represent the precision, excluding errors of gene name recognition. |
As shown in Table 4, the precision for D. melanogaster is a little lower than for other organisms because it has many ambiguous gene names that are spelled the same as general nouns and verbs. Many of the false positives in all organisms are caused by shallow-parser errors. Confusions between "adverb/past participle" and "past tense" and between "verb" and "noun" are the main causes. Some false positives are in the gray zone, and some are errors due to differences in semantic meaning. For example, "The full activity of a recombination initiation site located 5' of HIS4 requires the binding of the transcription factors RAP1, BAS1, and BAS2." The meaning of "binding" between RAP1 and BAS1 is not clear in this sentence (although the possibility of binding between them is high). However, the "binding of protein-A and protein-B" is sometimes used with the same meaning as "binding between protein-A and protein-B". Removing this kind of error is difficult. Although some verbs with broad meanings, such as "related to" and "cause", are also used to increase the recall, their precision is not high. Further, some false positives are observed in the patterns whose verbs are not specified when the keywords are found in the OBJECT (Table 3: NP-VP-NP without verb specification). In this pattern, although many restrictions were imposed on the relative position between keywords and gene/protein/family/compound names, some extracted pairs are unrelated and some have the wrong interaction direction. Errors in the SUBJECT-OBJECT recognition part are infrequent.
The recall (true_positive/(true_positive+false_negative)) was calculated using manually extracted interactions from 370 abstracts for S. cerevisiae and from 250 abstracts for H. sapiens. The recall largely depends on how broadly the extraction target is defined. Since genetic interactions are also our extraction target, the "sds3-swi6" relationship must be extracted from the following sentence: "We found that sds3 is synthetically lethal in combination with a deletion of the SWI6 (SDS11) gene." Further, various other relationships are also our extraction targets, and the glucose-HXK1, glucose-GLK1, and glucose-HXK2 relationships as well as the Hxk2p-HXK1, Hxk2p-GLK1, and Hxk2p-HXK2 relationships can be extracted from the following sentence: "Here we demonstrate the involvement of Hxk2p in the glucose-induced repression of the HXK1 and GLK1 genes and the glucose-induced expression of the HXK2 gene". However, implicit relationships are not our target. "In the absence of GCN4, BAS1, and BAS2, the RAP1 protein binds to the HIS4 promoter in vivo but cannot efficiently stimulate HIS4 transcription." In this sentence, although there is a possibility of a relationship between GCN4, BAS1, BAS2, RAP1 and the HIS4 promoter, it is not clearly described, and so these relationships are not extraction targets in our method. Under our criteria, the recall is 51% (including compounds) in S. cerevisiae (manually extracted interactions: 507) and 54% in H. sapiens (278). In this corpus, GENA-IDs/Family-name IDs are manually assigned for 95% of the gene/protein names and for 85% of the compound names. These numbers represent the upper limits of information extraction recall in the current dictionary-based approach. Most of the false negatives were due to anaphora (co-reference of an expression with its antecedent), either spanning multiple sentences or within a single sentence, and anaphora resolution techniques were not included in our methods.
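The two evaluation measures defined above can be written out as a short sketch. The function name and the representation of interactions as directed `(source, target)` pairs are assumptions. Note also that the evaluation described in the text used a random sample of extracted interactions for precision and a separate manually annotated corpus for recall, whereas this sketch computes both against a single gold set for brevity.

```python
def evaluate(extracted, gold):
    """Return (precision, recall) for two sets of directed interaction pairs."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)   # correct extractions
    fp = len(extracted - gold)   # extracted but not in the gold set
    fn = len(gold - extracted)   # in the gold set but missed
    precision = tp / (tp + fp) if extracted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall
```

Because pairs are treated as directed, an extraction with the two participants reversed counts as both a false positive and a false negative, which mirrors the requirement that the interaction direction be appropriate.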
Although various trials for anaphora resolution have been reported so far, they were not accurate enough to apply to this system. The use of anaphora is more frequently observed in the description of genetic interactions or general relationships than in physical interactions. These problems are not resolved using a full parser in most cases. In preliminary investigations, shallow parsing and rule-based SUBJECT-OBJECT recognition are enough to extract this kind of information.
We have developed an integrated database, which contains automatically extracted protein interactions and functions. This database also provides graphic viewers and, using a combination of extracted text information and sequence information, simple pathway comparison, deduction, and filtering are possible.
Concerning protein function extraction, the techniques are becoming more widely known as competitions such as BioCreAtIvE and TREC are held. The number of extracted functional relationships in our system is about 360,000 non-redundant (1,100,000 including redundant ones). The estimated recall is 54 - 64% with a precision of 91 - 94% for functions actually described in abstracts. Although our method's recall is not high, its precision seems to be at a practical level, and it may be as useful as the advanced information retrieval system discussed in the previous paper [9].
Concerning protein interactions, various techniques have been reported [1, 2, 3, 14]. However, the extraction targets of most of these methods are limited to physical interactions or other clear regulation relationships, such as activate, inhibit, regulate and so on. In our methods, the extraction targets also include genetic interactions such as synthetic lethality. Their descriptions are more complicated than those of physical interactions, which makes extraction a challenging task, but they are important for understanding the relationship between two genes. Further, many of the reported methods use only gene/protein name recognition, although network descriptions are often written with more ambiguous names, and cell signaling is sometimes mediated by compounds. In our methods, ambiguous names and compound names are recognized based on family name dictionaries and GENA. Although the compound name content is currently not sufficient, an exhaustive collection of protein interaction data is becoming feasible. The number of extracted interactions in our system is about 920,000 non-redundant (2,280,000 including redundant ones). The precision is about 86 - 94% (92 - 94%, excluding GENA errors) and the recall is 51 - 54%. Since the main cause of the low recall is the anaphora problem, improving the recall without decreasing precision may be difficult. However, this recall is calculated at the sentence level. Since interaction information is described repeatedly, the recall at the fact level is expected to be higher. Further, since the number of full papers registered in PubMed Central has been increasing, the recall at the fact level will become higher by using full papers.
There are many manually developed databases, such as DIP [15], BIND [16], KEGG [17], TRANSPATH [18], and Reactome [19]. DIP and BIND contain simple interaction data, and KEGG, TRANSPATH, and Reactome are metabolic network and/or signal transduction databases. These databases contain detailed information, but their coverage is not sufficient and they sometimes lack up-to-date information. At this stage, our automatic method can assist the former databases. However, automatic construction of the latter databases seems to be difficult using only abstract information because, in these databases, the curators select the appropriate interactions/networks, and not all interactions described in the papers are registered. Further, consensus networks are obtained from reliable reviews and textbooks. However, the information extracted by the automatic method is definitely more abundant than that obtained by manual extraction. Considering this, this kind of automatically extracted database is appropriate for use in advanced information retrieval or data-mining tools, including implicit knowledge discovery and the interpretation of high-throughput data. In the near future, a data-mining tool for implicit knowledge discovery based on PRIME data and other NLP techniques will be available to the public.
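The expectation that fact-level recall exceeds sentence-level recall can be made concrete with a back-of-the-envelope sketch. It rests on a simplifying assumption that is ours and not a result from this work: each of the k sentences describing a fact is extracted independently with the same probability r. Under that assumption a fact is missed only if all k mentions are missed.

```python
def fact_level_recall(sentence_recall, mentions):
    """Probability that at least one of `mentions` sentences is extracted.

    Assumes independent, identically likely extraction per sentence
    (an illustrative assumption, not a measured property of the system).
    """
    if not 0.0 <= sentence_recall <= 1.0:
        raise ValueError("sentence_recall must be between 0 and 1")
    if mentions < 1:
        raise ValueError("mentions must be at least 1")
    return 1.0 - (1.0 - sentence_recall) ** mentions
```

For instance, with a sentence-level recall of 0.5, a fact mentioned in two sentences would be recovered with probability 0.75 under this model. In practice, mentions of the same fact are unlikely to be independent (for example, anaphora-heavy descriptions may recur), so the real gain is probably smaller.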
We wish to acknowledge Yo Shidahara, Koji Shintaku, and Kouichiro Yamada for their extensive readings of abstracts and assistance in the construction of the family name dictionary. We would like to thank Mr. K. Kodama at Hitachi ULSI Systems for helping us by programming the PRIME database. This work is supported in part by a grant from the Ministry of Education, Culture, Sports, Science, and Technology of Japan for scientific research in priority areas, such as genome information science.