GeneWays: A System for Mining Text and for Integrating Data on Molecular Pathways

Andrey Rzhetsky1, 2, Michael Krauthammer1, Tomohiro Koike2, 4, Pauline Kra1, Shawn M. Gomez2, Hong Yu1, Pablo Duboue3, Wubin Weng3, Steven B. Johnson1, Vasileios Hatzivassiloglou3, and Carol Friedman1, 5




1Department of Medical Informatics
2Columbia Genome Center
3Department of Computer Science,
Columbia University, New York, NY 10032, USA
4Hitachi Software Engineering Co., Ltd.,
Yokohama, Japan
5Department of Computer Science, Queens College CUNY,
Flushing, NY 11367, USA







INTRODUCTION

Imagine a group of ignorant yet bright cavepeople who are trying to understand the operation of a modern car by analyzing several damaged cars that were produced by various makers. After many hours of hard manual labor, the cavepeople manage to disassemble the cars into myriad parts. Some parts are damaged and some are intact; some pairs of parts interact with each other whereas others do not; some parts are different in different cars yet apparently have the same function. The leap from knowing the parts to understanding the whole requires reducing redundant or conflicting pieces of information to a consistent consensus model that can be used for analysis of dynamics. Researchers in post-genome-era molecular biology are in precisely the same situation as the cavepeople: They are contemplating a collection of diverse parts of cellular machinery, and are attempting to understand the properties of the whole cell. To make their problem more complex, a given part of the cellular machinery can play different roles in different cells of the same organism, or even within the same cell but under different environmental conditions. The number of nodes in human molecular networks is on the order of hundreds of thousands when all substances (genes, RNAs, proteins, and other molecules) are considered together. These numerous substances can, in turn, be present or absent in dozens of cell types in humans. Evidently, the problem is far too complex to be analyzed manually.

With the aim of relieving the information overload that is currently assaulting our fellow scientists, we, in parallel with a few other groups, have developed a computer system, GeneWays, that comprises a battery of tools for automatic gathering and processing of knowledge about molecular pathways. Because of space limitations, we cannot provide a comprehensive overview of related work here; instead, we reference related work as we describe GeneWays.



GENEWAYS

The general architecture of GeneWays comprises 10 major components that are being developed at Columbia University (see Figure 1). (The eleventh component, Bio/Spice, is being developed independently by a team led by Dr. Adam P. Arkin at the University of California, Berkeley.) All except the relationship-learning and simulation modules are linked dynamically into a pipeline for processing hypertext markup language (HTML) documents. The two remaining modules do not participate in dynamic processing of a document at production time; rather, the relationship-learning module is used for statistical learning of terms and relationships, with the goal of improving the accuracy of the other natural-language-processing modules, and the simulation module is used for understanding modes of dynamic behavior of the system.


Figure 1: Major modules of GeneWays. Unshaded boxes denote modules for which we have developed functional prototypes. Solid arrows indicate dynamic interactions between modules when a new document arrives in production mode. Dashed arrows indicate information flow during cycles of system improvement. The document download/sorting module handles retrieval of all hypertext markup language (HTML) documents available at remote web sites, such as the web site of the journal Cell. It also separates documents that are relevant to signal transduction from those that are not. The tagger module identifies, and tags with extensible markup language (XML) labels, nouns and noun phrases corresponding to substances (genes, proteins, mRNAs, small molecules, and so on). The disambiguation module resolves ambiguity for substance names that, depending on context, can point to genes, proteins, or mRNAs. The synonym/homonym resolution module converts synonymous names of substances with multiple names to a canonical form and disambiguates homonymous names corresponding to distinct substances. The output of the disambiguation module, in combination with part-of-speech annotation, allows another module to learn relationships. GENIES parses sentences in the document, extracts the relevant information, and stores the latter in structured templates or frames. The interpreter module analyzes complex nested structures produced by GENIES and reduces them to a collection of binary statements. The statement-filtering module converts a set of redundant and potentially contradictory binary statements to a set of unique, consistent statements about pathway links. The visualization/editing (CUtenet) module performs visualization of data produced by other modules and has an integration function. The simulation module (Bio/Spice) is a UC Berkeley tool (Adam Arkin's laboratory) that performs analysis of the dynamic behavior of a regulatory system. Finally, the statistical data-integration module carries out statistically efficient combining of heterogeneous data, such as network-connection data, output of natural-language processing, and protein-sequence data. Module interactions are shown by arrows that indicate the direction of information flow.


Document-collection/document-sorting module. Using on-line access to various scientific journals, we downloaded thousands of papers related to regulatory pathways implicated in apoptosis and cell-cycle control. (In the fully implemented GeneWays, the process of downloading and filtering articles related to a process of interest will be completely automated.)

The problem of distinguishing relevant from non-relevant documents, like the more general problem of arranging fully automated subject access, is far from trivial. The main difficulties are the ambiguity and diversity of human languages and the shared-content terms that are used in related topics (e.g., different biology subfields).

State-of-the-art solutions to the relevancy problem include a collection of machine-learning algorithms, such as the Bayes optimal classifier, the naïve Bayes classifier, support vector machines, and a few others, including decision trees and neural networks. The document-sorting module of GeneWays is under development. We intend to implement a toolbox of machine-learning methods that will be used not only in the document-sorting module, but also in several other modules in our system.
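
As an illustration of one of the methods listed above, the following minimal sketch shows how a naïve Bayes classifier could separate relevant from non-relevant documents. It is not the GeneWays document-sorting module, which is still under development; the class name, the whitespace tokenizer, the add-one smoothing, and the four toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesRelevance:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, text, relevant):
        words = tokenize(text)
        self.word_counts[relevant].update(words)
        self.doc_counts[relevant] += 1
        self.vocab.update(words)

    def log_score(self, text, relevant):
        # log P(class) + sum over known words of log P(word | class).
        # Assumes at least one training document per class.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[relevant] / total_docs)
        total_words = sum(self.word_counts[relevant].values())
        for word in tokenize(text):
            if word not in self.vocab:
                continue  # words never seen in training carry no evidence
            count = self.word_counts[relevant][word]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def is_relevant(self, text):
        return self.log_score(text, True) > self.log_score(text, False)


# Toy training data, invented for this example.
clf = NaiveBayesRelevance()
clf.train("caspase activates apoptosis signaling pathway", True)
clf.train("kinase phosphorylates receptor in signal transduction", True)
clf.train("fossil record of jurassic sediments", False)
clf.train("climate model of ocean sediments", False)

print(clf.is_relevant("receptor kinase signaling"))
print(clf.is_relevant("ocean sediments"))
```

A production classifier would need a far larger training set and a better tokenizer; the sketch only shows the shape of the decision rule.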


Preprocessor/tagger. The preprocessing phase converts a document from an initial representation intended for reading by humans to a representation suitable for computer analysis. This conversion is from primarily graphic-oriented formats (such as HTML or PostScript) to an internal, XML-based representation that contains only content-bearing tags. We have implemented a prototype that handles HTML files; we will extend the coverage of HTML constructs, and will include the ability to recognize different file formats, such as PostScript and Adobe PDF.
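
The conversion described above can be sketched with the Python standard library. The sketch below is not the GeneWays preprocessor: the mapping from HTML elements to XML element names (`heading`, `paragraph`, and so on) and the `document` root element are hypothetical, and the internal XML vocabulary used by GeneWays is not specified here. The idea is simply to keep text inside content-bearing elements and to discard presentation markup and scripts.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical mapping from HTML elements to content-bearing XML element names.
CONTENT_TAGS = {"title": "title", "h1": "heading", "h2": "heading", "p": "paragraph"}
# Elements whose text is never document content.
SKIPPED = {"script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text inside content-bearing elements; drops all other markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.open_tag = None  # content-bearing element currently open, if any
        self.skip_depth = 0   # greater than zero while inside script/style
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CONTENT_TAGS and self.open_tag is None:
            self.open_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.open_tag:
            # Collapse runs of whitespace left behind by removed markup.
            text = " ".join("".join(self.buffer).split())
            if text:
                name = CONTENT_TAGS[tag]
                self.parts.append(f"<{name}>{escape(text)}</{name}>")
            self.open_tag = None

    def handle_data(self, data):
        if self.open_tag is not None and self.skip_depth == 0:
            self.buffer.append(data)


def html_to_xml(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "<document>" + "".join(parser.parts) + "</document>"


sample = ('<html><body><h1>Results</h1><p>The <b>p53</b> protein '
          '<font color="red">binds</font> MDM2.</p>'
          '<script>var x=1;</script></body></html>')
print(html_to_xml(sample))
```

Presentation elements such as `b` and `font` disappear while their text is retained, which matches the goal of keeping only content-bearing tags.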


Disambiguation module. After terms in a document have been identified automatically, the next stage of semantic tagging is determining the class to which the term belongs (e.g., protein, gene, mRNA). Establishing the class of a term allows us to generalize observed patterns (for example, from a given protein to a more general class) and consequently to infer relationships between classes of biological terms. The semantic class assigned to a word or phrase also greatly simplifies the lexicon used by the subsequent parsing system (the GENIES module), which does not need to list every gene or protein separately and can handle new terms that it has not seen previously.


Synonym/homonym resolution module. The next module converts synonymous names of substances to a canonical form, and disambiguates homonymous names that correspond to distinct substances. We distinguish two types of synonyms. The first type consists of pairs of a full name and its abbreviated name; for example, the protein name lymphocyte associated receptor of death is the definition of the acronym LARD. The second type includes multiple unrelated aliases for the same gene or protein; for example, Apo3, DR3, TRAMP, and LARD are synonyms of this type.
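
The mapping of synonyms to a canonical form can be illustrated with a small lookup table. This is a minimal sketch under stated assumptions, not the GeneWays implementation: the table layout, the function names, and the normalization rule are hypothetical, and DR3 is chosen as the canonical form arbitrarily for the example. Only the synonym group itself comes from the text above.

```python
# Hypothetical alias table: canonical name -> known synonyms.
# The choice of "DR3" as the canonical form is arbitrary.
ALIASES = {
    "DR3": ["Apo3", "TRAMP", "LARD", "lymphocyte associated receptor of death"],
}


def normalize(name):
    """Case-fold, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def build_index(alias_table):
    """Invert the alias table into a normalized-name -> canonical-name index."""
    index = {}
    for canonical, aliases in alias_table.items():
        for name in [canonical, *aliases]:
            key = normalize(name)
            if key in index and index[key] != canonical:
                # The same surface form names two substances (a homonym);
                # a plain lookup cannot resolve it without context.
                raise ValueError(f"homonym needs context: {name!r}")
            index[key] = canonical
    return index


def canonicalize(name, index):
    """Return the canonical form, or the name unchanged if it is unknown."""
    return index.get(normalize(name), name)


INDEX = build_index(ALIASES)
for term in ["Apo3", "Lymphocyte-associated receptor of death", "p53"]:
    print(term, "->", canonicalize(term, INDEX))
```

Homonyms are only detected here, not resolved; resolving them requires the surrounding context of each occurrence, which a static table does not capture.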


Parsing module (GENIES). GENIES uses a semantic grammar and includes substantial syntactic knowledge interleaved with semantic and syntactic constraints; it works with original, complex (rather than simplified) sentences. It always attempts to obtain a complete parse so as to achieve high precision; however, if a sentence cannot be parsed exactly according to the grammar rules, GENIES uses alternative strategies, such as segmenting and partial parsing, to improve recall.


Interpreter module. The next module analyzes complex nested structures produced by GENIES and transforms them to a collection of binary statements.
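
One way to picture this reduction is sketched below. The frame format produced by GENIES is not described here, so the sketch assumes a hypothetical representation in which a frame is a tuple `(action, agent, target)` whose arguments are either substance names or nested frames. Each frame becomes one flat statement, and a nested frame is replaced by a reference to its statement identifier. The substance names A, B, and C are placeholders.

```python
def to_binary_statements(frame):
    """Flatten a nested (action, agent, target) frame into flat statements.

    Each statement is (statement_id, action, agent, target), where agent and
    target are substance names or identifiers of earlier statements.
    """
    statements = []

    def visit(node):
        if isinstance(node, str):
            return node  # a plain substance name needs no statement
        action, agent, target = node
        agent_ref = visit(agent)
        target_ref = visit(target)
        statement_id = f"S{len(statements) + 1}"
        statements.append((statement_id, action, agent_ref, target_ref))
        return statement_id

    visit(frame)
    return statements


# Placeholder example: "A activates the binding of B to C".
nested = ("activate", "A", ("bind", "B", "C"))
for statement in to_binary_statements(nested):
    print(statement)
```

Inner frames are emitted before the frames that refer to them, so every reference points to a statement that already exists in the list.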


Visualization/editing module. The visualization/editing module is central to GeneWays in that it integrates all other modules of the system.


Statistical data integration module. It is convenient to consider analysis of data in a statistical framework. Following a statistical philosophy, we can imagine that, for each species, there is a "true" molecular network that is completely hidden from direct observation, but that can be inferred from its indirect manifestations, such as from the results of two-hybrid experiments.
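
To make the idea of inferring a hidden network from indirect manifestations concrete, the sketch below computes the posterior probability that a single link exists, given several noisy observations. It is an illustration under simplifying assumptions and not the GeneWays data-integration method: the observations are treated as conditionally independent, and the prior, the sensitivities, and the false-positive rates are hypothetical numbers chosen for the example.

```python
import math


def edge_posterior(prior, observations):
    """Posterior probability that a link exists in the hidden network.

    prior: assumed prior probability of the link (strictly between 0 and 1).
    observations: list of (positive, sensitivity, false_positive_rate), one
        per experiment or text-derived statement, assumed independent given
        the true state of the link.
    """
    log_odds = math.log(prior / (1 - prior))
    for positive, sensitivity, false_positive_rate in observations:
        if positive:
            # Likelihood ratio of a positive report: P(+ | link) / P(+ | no link)
            log_odds += math.log(sensitivity / false_positive_rate)
        else:
            # Likelihood ratio of a negative report
            log_odds += math.log((1 - sensitivity) / (1 - false_positive_rate))
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical numbers: a rare link (prior 0.01) reported by one or two
# independent sources, each with sensitivity 0.8 and false-positive rate 0.1.
one_report = edge_posterior(0.01, [(True, 0.8, 0.1)])
two_reports = edge_posterior(0.01, [(True, 0.8, 0.1), (True, 0.8, 0.1)])
print(round(one_report, 4), round(two_reports, 4))
```

Each additional concordant observation multiplies the odds by its likelihood ratio, which is how heterogeneous sources with different error rates can be combined on a common scale.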


Relationship-learning module. A major focus of our work in natural-language processing is automatic extraction of patterns that involve protein and gene names, as well as other biological terms (identified by the tagger and labeled by the disambiguation module), and the application of those patterns to new text to extract knowledge that is present in the text but is not listed explicitly in existing databases.

Simulation module. We are integrating our system with an independent dynamics-simulation system, Bio/Spice, developed by the laboratory of Dr. Arkin at the University of California, Berkeley. We are working on a set of formats that will allow seamless data conversion between the Bio/Spice and GeneWays systems.


AI curator module and two-level database-pathway data management. We are planning to develop a set of machine-learning algorithms (labeled "AI curator") that will emulate human curators in resolving inconsistencies between statements.



CONCLUSION

We are developing a computational tool that will facilitate integration of the signal-transduction community, based on the collaborative model pioneered by developers of Open Source Software, such as Linux. In the long run, we intend to provide, for a broad spectrum of researchers, a user-friendly computational tool that will allow them to skim quickly through the numerous research papers that report accumulating data on regulatory events related to a species and a tissue of interest. Furthermore, we will develop tools for efficient curation of signal-transduction data by human experts, which will allow us to create a system that maintains the evolving consensus of the signal-transduction field. The system will be useful for hypothesis generation and as an educational and reference tool. Our technology is likely to have a significant influence both within and outside the field of signal transduction. In particular, the technology will be useful immediately in functional genomics for identifying candidate genes that might be involved in human hereditary maladies. Advances in automated knowledge discovery made as part of this project, particularly in the automated detection of relationships between terms, will be applicable to other tasks that rely on, or could be enhanced by, text processing.