In Silico Biology 4, 0026 (2004); ©2004, Bioinformation Systems e.V.
Department of Molecular Sciences
Center of Genomics and Bioinformatics
College of Medicine
University of Tennessee Health Science Center
Email: ycui2@utmem.edu
*corresponding author
Edited by E. Wingender; received January 27, 2003; revised and accepted March 27, 2004; published April 23, 2004
Large amounts of knowledge about genes have been stored in public databases. One of the most challenging problems in bioinformatics is to determine, given all the information about the genes in the databases, the relationships between the genes. For example, how can we determine whether genes are related, and how closely they are related, based on existing knowledge about their biological roles? We developed GeneInfoViz, a web tool for batch retrieval of gene information and for construction and visualization of gene relation networks. We created a database containing compiled Gene Ontology information for the genes of several model organisms. Users can batch search for a group of genes and get the Gene Ontology terms that are associated with the genes. Directed acyclic graphs are generated to show the hierarchical structure of the Gene Ontology tree. GeneInfoViz calculates an adjacency matrix to determine whether the genes are related and, if so, how closely they are related based on the biological processes, molecular functions, or cellular components they are associated with. It then displays a dynamic graph layout of the network among the selected genes.
Availability: http://genenet.org/
Key words: gene network, Gene Ontology, dynamic visualization
With the development of high-throughput genomic technologies, the number of genes that can be studied at the same time is increasing rapidly. A frequently occurring scenario is that researchers select a number of genes through a genome-wide survey, e.g., potential marker genes for a certain disease. Many of these genes may have multiple biological roles and may be related because of the biological roles they share. It is important to find out, given all the knowledge about these genes in the public databases, what we can tell about the relationships between them.
Literature mining approaches have been used to address this problem [Jenssen et al., 2001; Tao et al., 2002; Chaussabel et al., 2002; Masys et al., 2001; Tanabe et al., 1999; Hirschman et al., 2002]. They can also be used for automated database curation and ontology development [Hirschman et al., 2002; Yeh et al., 2003]. However, since many major databases (see Footnote 1) have used Gene Ontology (GO) to annotate genes, it might be more direct and precise to define the relationships between genes based on the GO terms they are associated with.
GeneInfoViz provides users with a tool to view a group of genes in the GO directed acyclic graphs (DAG). It also enables users to analyze the relation among these genes in a network at a specific level of the DAG.
Creation and automatic update of the GeneInfo database
GeneInfoViz includes parser programs that download and parse gene information from public databases including UniGene [Schuler et al., 1996], LocusLink [Pruitt et al., 2001] and Gene Ontology [The Gene Ontology Consortium 2000; 2001]. The compiled gene information is then saved in our local database, called GeneInfo. Currently, GeneInfo contains functional information for human, mouse, rat, fruit fly, and zebrafish genes. More species will be added in the future. The parser programs update the GeneInfo database automatically at regular intervals.
Batch retrieval of gene information
Users can batch search through the web interface of GeneInfoViz. The input is a list of query genes. Searches are permitted by GenBank accession numbers, LocusLink IDs, UniGene IDs, and official gene symbols. The search results include UniGene IDs (with a hypertext link to the corresponding UniGene web page), LocusLink IDs (with a hypertext link to the corresponding LocusLink web page), official gene symbols, gene names, and Gene Ontology terms. There are three types of GO terms: biological process [P], molecular function [F], and cellular component [C]. Associations of genes with GO terms are tagged with an evidence code that categorizes the quality of the association, ranging from more error-prone electronic annotation to experimental evidence. Users can choose to exclude the associations between genes and GO terms that are tagged with undesired evidence codes. By default, associations with any evidence code are included.
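The evidence-code filter described above can be illustrated with a minimal sketch. The gene symbols, GO identifiers, and the function name `batch_lookup` are hypothetical placeholders and do not reflect the GeneInfo schema; IEA, TAS, and IDA are standard GO evidence codes used here only as examples.

```python
# Hypothetical gene-to-GO associations: (gene symbol, GO term, aspect, evidence code).
# These records are illustrative placeholders, not rows from the GeneInfo database.
ASSOCIATIONS = [
    ("GENE_A", "GO:EXAMPLE1", "P", "IEA"),  # electronic annotation
    ("GENE_A", "GO:EXAMPLE2", "F", "TAS"),
    ("GENE_B", "GO:EXAMPLE1", "P", "IDA"),
    ("GENE_C", "GO:EXAMPLE3", "C", "IEA"),
]


def batch_lookup(genes, associations, excluded_codes=frozenset()):
    """Return {gene: [(go_term, aspect, evidence), ...]} for the query genes.

    Associations whose evidence code is in `excluded_codes` are dropped.
    With the default empty set, every association is kept.
    """
    wanted = set(genes)
    result = {gene: [] for gene in genes}
    for gene, term, aspect, evidence in associations:
        if gene in wanted and evidence not in excluded_codes:
            result[gene].append((term, aspect, evidence))
    return result


# Default behaviour: keep associations with any evidence code.
all_hits = batch_lookup(["GENE_A", "GENE_B"], ASSOCIATIONS)
# Exclude electronically inferred annotations.
curated_hits = batch_lookup(["GENE_A", "GENE_B"], ASSOCIATIONS, excluded_codes={"IEA"})
```

Keeping the exclusion set empty by default mirrors the behaviour stated in the text, where all associations are included unless the user opts out.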
The directed acyclic graphs of Gene Ontology
The GO system has a hierarchical structure: broad biological roles are at the higher levels, and more specific roles are at the lower levels. In this paper, we define the root, "Gene Ontology, GO:0003673", as level 0 and its three child nodes, "biological_process, GO:0008150", "molecular_function, GO:0003674", and "cellular_component, GO:0005575", as level 1. A DAG is a directed graph in which no path starts and ends at the same node (vertex). DAGs can be used to show the hierarchical structure of Gene Ontology [The Gene Ontology Consortium 2000; 2001]. In the DAG, a Gene Ontology term is represented by a node; a parent node and a child node are linked by a directed edge (arrow) starting from the parent node and ending at the child node. However, the DAG of all known Gene Ontology terms includes many terms that are irrelevant to a given group of genes. GeneInfoViz therefore filters the Gene Ontology DAG and shows only the part that is associated with the query genes. It first finds all the GO terms that are assigned to the query genes in NCBI’s LocusLink database [Pruitt et al., 2001], and then traces their ancestral GO terms by querying the Gene Ontology database. Graphviz, an open source graph drawing package, is used to create the graph description file of the DAG. WebDot, a CGI program, is used to convert the graph description file into an image that can be included on a web page.
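The filtering step can be sketched as follows, assuming the ontology is available as a child-to-parents map. The term names, the `PARENTS` map, and both function names are hypothetical; the output is a plain DOT string of the kind Graphviz reads, and this is not the GeneInfoViz implementation.

```python
# Hypothetical child -> parents map for a tiny GO-like DAG (term names are placeholders).
PARENTS = {
    "root": [],
    "process": ["root"],
    "development": ["process"],
    "cell_growth": ["process"],
    "skeletal_development": ["development"],
    "unrelated_term": ["process"],
}


def ancestors_closure(terms, parents):
    """Return the query terms plus every ancestor reachable through `parents`."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents[term])
    return seen


def to_dot(terms, parents, highlighted):
    """Build a DOT description with edges pointing from parent to child."""
    lines = ["digraph go_subgraph {"]
    for term in sorted(terms):
        # Terms directly assigned to query genes are drawn as blue boxes.
        style = " [shape=box, color=blue]" if term in highlighted else ""
        lines.append(f'  "{term}"{style};')
    for child in sorted(terms):
        for parent in parents[child]:
            if parent in terms:
                lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


query_terms = {"skeletal_development", "cell_growth"}
kept = ancestors_closure(query_terms, PARENTS)
dot_text = to_dot(kept, PARENTS, highlighted=query_terms)
```

Terms that are neither assigned to a query gene nor an ancestor of such a term (here `unrelated_term`) never enter `kept`, so they do not appear in the rendered graph.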
Constructing a gene relation network
The results of the batch query include the GO terms that are associated with each gene. GeneInfoViz starts from these GO terms and traces the paths up to a higher level (determined by the user) in the GO DAG. An indicator table is used to code the genes’ biological roles. Only 1 and 0 occur in the indicator table: "1" means the gene is associated with the biological role, "0" means it is not. Based on the indicator table, we can calculate the adjacency matrix (see Footnote 2)

$$A = T^{\mathsf{T}} T \qquad (1)$$

where $A$ is the adjacency matrix (an $n \times n$ matrix, where $n$ is the number of genes), $T$ is the indicator table (an $m \times n$ matrix, where $m$ is the number of biological processes that the genes are involved in), and $T^{\mathsf{T}}$ is the transpose of $T$. Let the network $G = (V, E)$, where $V$ is the set of vertices (in our case, genes) and $E$ is the set of edges. The structure of network $G$ is determined by the adjacency matrix $A$. Two vertices $v_i$ and $v_j$ are connected by edge $e_{ij}$ if and only if $A_{ij} > 0$. The length of $e_{ij}$ is defined as

$$L(e_{ij}) = L_0\,(M - A_{ij} + 1) \qquad (2)$$

where $M = \max(A_{ij})$ is the maximum of the numbers in matrix $A$ and $L_0$ is a constant. The more biological roles two genes share, the closer they are in the graph.
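The two calculations can be sketched in a few lines of plain Python. This sketch rests on two assumptions: the adjacency matrix is taken as the product of the transposed indicator table with itself, with the diagonal set to zero as in Table 2, and the edge length is taken as L0 (M - Aij + 1), a form that reproduces Table 3 from Table 2 when L0 = 50. The function names and the toy table are hypothetical.

```python
def adjacency_from_indicator(table):
    """Count shared roles per gene pair for an m x n 0/1 indicator table.

    table[k][i] is 1 if gene i is associated with biological role k, else 0.
    The diagonal is left at zero (an assumption matching Table 2).
    """
    n = len(table[0])
    adjacency = [[0] * n for _ in range(n)]
    for row in table:
        for i in range(n):
            for j in range(n):
                if i != j:
                    adjacency[i][j] += row[i] * row[j]
    return adjacency


def edge_lengths(adjacency, base_length=50):
    """Map shared-role counts to target edge lengths; None marks 'no edge'."""
    largest = max(max(row) for row in adjacency)
    return [
        [base_length * (largest - a + 1) if a > 0 else None for a in row]
        for row in adjacency
    ]


# Hypothetical indicator table: 3 biological roles (rows) x 3 genes (columns).
T = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
A = adjacency_from_indicator(T)  # [[0, 2, 1], [2, 0, 2], [1, 2, 0]]
L = edge_lengths(A)              # pairs sharing 2 roles get 50, pairs sharing 1 get 100
```

Unconnected pairs are represented by `None` here, whereas Table 3 prints 0 for them.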
Dynamic visualization of the gene relation network
GeneInfoViz dynamically displays the gene relation network G using a Java applet developed by Sun Microsystems. It makes a planar embedding of G in two-dimensional space. A planar embedding of an abstract graph G is an isomorphism between G and a plane graph, and that plane graph is a layout of G. Note that the graph is determined only by its adjacency matrix, so the layout of G is not unique. Two layouts are equivalent if both of them reflect the gene relations that are determined by the adjacency matrix. The program allows users to change the graph layout by clicking and dragging the vertices (genes). The program moves the vertices to keep the lengths of the edges in the layout as close to L(eij) as possible. In this way, GeneInfoViz allows users to select a graph layout from the equivalent graph layouts of G.
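One simple way to keep edge lengths near their targets is an iterative spring-style relaxation, sketched below. This is an illustrative stand-in and not the algorithm of the Java applet used by GeneInfoViz; the function name, the parameters `steps` and `rate`, and the gene names are assumptions.

```python
import math


def relax_layout(positions, target_lengths, steps=200, rate=0.1):
    """Nudge 2-D vertex positions so connected pairs approach their target distance.

    positions: {vertex: (x, y)}; target_lengths: {(u, v): desired edge length}.
    """
    pos = {v: list(xy) for v, xy in positions.items()}
    for _ in range(steps):
        for (u, v), target in target_lengths.items():
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Positive when the edge is too long, negative when it is too short.
            shift = rate * (dist - target) / dist
            pos[u][0] += shift * dx
            pos[u][1] += shift * dy
            pos[v][0] -= shift * dx
            pos[v][1] -= shift * dy
    return {v: tuple(xy) for v, xy in pos.items()}


# Hypothetical three-gene chain with two edges and their target lengths.
start = {"gene_a": (0.0, 0.0), "gene_b": (300.0, 0.0), "gene_c": (300.0, 300.0)}
targets = {("gene_a", "gene_b"): 50.0, ("gene_b", "gene_c"): 100.0}
layout = relax_layout(start, targets)
```

Dragging a vertex corresponds to overwriting its entry in `positions` and running the relaxation again, which yields another layout that respects the same target lengths.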
We used a group of genes as input to illustrate our method. Bertucci et al. [2002] performed large-scale microarray experiments and identified a predictor set of 23 genes whose expression patterns differentiated two groups of breast cancer patients with different survival after adjuvant chemotherapy. The 23 genes were selected because they are differentially expressed in the two groups of samples; no further functional connections between these genes can be found from the microarray data. Here we show that GeneInfoViz can be used to construct and visualize the relationships between these potential marker genes for breast cancer prognosis based on their biological roles. This is a novel way of using database information to annotate the gene selection results from genomic surveys such as microarray analysis.
First, we used GeneInfoViz to search the Gene Ontology terms associated with the 23 genes. The result was that 21 of the 23 genes were assigned to at least one biological process (Figure 1).
A Gene Ontology DAG of the biological process terms with which these 23 genes are associated is shown in Figure 2. A rectangular node in blue indicates that this specific GO term is associated with one or more query genes. The names of the genes involved are listed below the GO term, each followed by its evidence code in parentheses.
There were 108 nodes at the 10 levels of the DAG. Many genes were assigned to multiple nodes located in different parts of the DAG. For example, insulin-like growth factor 2 (IGF2) is involved in eight biological processes: growth pattern, insulin receptor signaling pathway, imprinting, skeletal development, development, physiological processes, cell proliferation, and regulation of cell cycle. These eight biological processes are offspring of three broad biological processes at level 2: development, cellular process, and physiological processes (Figure 2). In the tree-like DAG, we easily identified genes that belong to the same GO term, as well as genes that are in the same branch of the tree. But since many genes are involved in more than one biological process at different levels, it was difficult to tell the functional relations between genes from the Gene Ontology DAG alone. In order to quantify and visualize the functional relations between genes, GeneInfoViz first created an indicator table (Table 1). It started from the base biological processes the genes are involved in (the blue rectangular nodes in Figure 2), then traced the Gene Ontology terms up to a selected level (defined by the user) in the Gene Ontology DAG and assigned the genes to all the ancestor biological processes along the paths to the selected level.
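The construction of the indicator table can be sketched as below. The toy ontology, the gene names, and the function names are hypothetical, and the level of a term is taken to be the length of its shortest path from the root, which is an assumption because a term in a DAG can be reached by paths of different lengths.

```python
def term_levels(parents, root):
    """Level of each term = length of its shortest path from the root (an assumption)."""
    children = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(child)
    levels = {root: 0}
    frontier = [root]
    while frontier:
        next_frontier = []
        for term in frontier:
            for child in children.get(term, []):
                if child not in levels:
                    levels[child] = levels[term] + 1
                    next_frontier.append(child)
        frontier = next_frontier
    return levels


def indicator_table(gene_terms, parents, levels, min_level):
    """Rows are terms at `min_level` or deeper, columns are genes, entries are 0/1."""
    genes = sorted(gene_terms)
    roles = {gene: set() for gene in genes}
    for gene in genes:
        visited = set()
        stack = list(gene_terms[gene])
        while stack:
            term = stack.pop()
            if term in visited:
                continue
            visited.add(term)
            if levels[term] >= min_level:
                roles[gene].add(term)
            stack.extend(parents[term])  # keep tracing towards the root
    rows = sorted(set().union(*roles.values()))
    table = [[1 if term in roles[gene] else 0 for gene in genes] for term in rows]
    return rows, genes, table


# Hypothetical toy ontology (child -> parents) and gene annotations.
PARENTS = {"root": [], "bp": ["root"], "dev": ["bp"], "cell": ["bp"],
           "skel": ["dev"], "prolif": ["cell"]}
GENE_TERMS = {"g1": {"skel", "prolif"}, "g2": {"skel"}, "g3": {"prolif"}}

LEVELS = term_levels(PARENTS, "root")
rows, genes, table = indicator_table(GENE_TERMS, PARENTS, LEVELS, min_level=2)
```

With `min_level=2`, each gene is credited with its directly assigned terms and with their level-2 ancestors, so `g1` shares two rows with `g2` and two rows with `g3`.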
| Table 1: | The indicator table coding the biological processes the genes are involved in. The first column is Gene Ontology categories in biological process. "1" means the gene belongs to the category, "0" means it does not. |
Two genes were considered connected if they were associated with at least one common GO term in Table 1. The more GO terms they had in common, the closer their connection. The number of GO terms shared by each pair of genes is shown in an adjacency matrix (Table 2).
| Table 2: | The adjacency matrix. The numbers in this matrix are the number of co-occurrences of the two genes (in the corresponding row and column) in the same biological processes. |
| | ANG | CRABP2 | CSF1 | EGFR | ERBB2 | GATA3 | GZMB | IGF2 | MST1 | MYBL2 | MYC | PLAT | SOX4 | SOX9 | SRF | TOP2B | VIL2 | XBP1 | KIAA0427 | SUI1 |
| ANG | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CRABP2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 4 | 4 | 0 | 4 | 4 | 4 | 0 | 0 | 0 | 0 | 0 |
| CSF1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| EGFR | 0 | 0 | 0 | 0 | 7 | 0 | 1 | 3 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| ERBB2 | 0 | 0 | 0 | 7 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| GATA3 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 2 | 3 | 3 | 0 | 0 | 1 | 0 | 0 |
| GZMB | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 4 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| IGF2 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| MST1 | 1 | 0 | 0 | 1 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| MYBL2 | 0 | 4 | 0 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 7 | 0 | 4 | 5 | 5 | 1 | 0 | 0 | 0 | 0 |
| MYC | 0 | 4 | 0 | 0 | 0 | 3 | 0 | 2 | 0 | 7 | 0 | 0 | 4 | 6 | 6 | 1 | 0 | 0 | 0 | 0 |
| PLAT | 1 | 0 | 0 | 2 | 2 | 0 | 4 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| SOX4 | 0 | 4 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 4 | 4 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 0 | 0 |
| SOX9 | 0 | 4 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 5 | 6 | 0 | 4 | 0 | 6 | 0 | 0 | 0 | 0 | 0 |
| SRF | 0 | 4 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 5 | 6 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| TOP2B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| VIL2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| XBP1 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| KIAA0427 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| SUI1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
We constructed a gene relation network based on biological processes at level 5 and lower (Figure 3). The length of the edge between each pair of genes in the gene relation network was determined according to Equation (2) and is shown in the distance matrix (Table 3).
Figure 3: The gene function relation network constructed by GeneInfoViz based on biological processes at level 5 and lower levels.
| Table 3: | Distance matrix. |
| | ANG | CRABP2 | CSF1 | EGFR | ERBB2 | GATA3 | GZMB | IGF2 | MST1 | MYBL2 | MYC | PLAT | SOX4 | SOX9 | SRF | TOP2B | VIL2 | XBP1 | KIAA0427 | SUI1 |
| ANG | 0 | 0 | 0 | 0 | 0 | 0 | 350 | 0 | 350 | 0 | 0 | 350 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CRABP2 | 0 | 0 | 0 | 0 | 0 | 300 | 0 | 0 | 0 | 200 | 200 | 0 | 200 | 200 | 200 | 0 | 0 | 0 | 0 | 0 |
| CSF1 | 0 | 0 | 0 | 0 | 0 | 350 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 300 | 0 | 0 |
| EGFR | 0 | 0 | 0 | 0 | 50 | 0 | 350 | 250 | 350 | 0 | 0 | 300 | 0 | 0 | 0 | 0 | 0 | 0 | 350 | 350 |
| ERBB2 | 0 | 0 | 0 | 50 | 0 | 0 | 350 | 300 | 350 | 0 | 0 | 300 | 0 | 0 | 0 | 0 | 0 | 0 | 350 | 350 |
| GATA3 | 0 | 300 | 350 | 0 | 0 | 0 | 0 | 0 | 0 | 250 | 250 | 0 | 300 | 250 | 250 | 0 | 0 | 350 | 0 | 0 |
| GZMB | 350 | 0 | 0 | 350 | 350 | 0 | 0 | 0 | 200 | 300 | 0 | 200 | 0 | 0 | 0 | 0 | 0 | 0 | 350 | 350 |
| IGF2 | 0 | 0 | 0 | 250 | 300 | 0 | 0 | 0 | 0 | 300 | 300 | 0 | 0 | 350 | 0 | 350 | 0 | 0 | 0 | 0 |
| MST1 | 350 | 0 | 0 | 350 | 350 | 0 | 200 | 0 | 0 | 0 | 0 | 200 | 0 | 0 | 0 | 0 | 0 | 0 | 350 | 350 |
| MYBL2 | 0 | 200 | 0 | 0 | 0 | 250 | 300 | 300 | 0 | 0 | 50 | 0 | 200 | 150 | 150 | 350 | 0 | 0 | 0 | 0 |
| MYC | 0 | 200 | 0 | 0 | 0 | 250 | 0 | 300 | 0 | 50 | 0 | 0 | 200 | 100 | 100 | 350 | 0 | 0 | 0 | 0 |
| PLAT | 350 | 0 | 0 | 300 | 300 | 0 | 200 | 0 | 200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 350 | 350 |
| SOX4 | 0 | 200 | 0 | 0 | 0 | 300 | 0 | 0 | 0 | 200 | 200 | 0 | 0 | 200 | 200 | 0 | 0 | 0 | 0 | 0 |
| SOX9 | 0 | 200 | 0 | 0 | 0 | 250 | 0 | 350 | 0 | 150 | 100 | 0 | 200 | 0 | 100 | 0 | 0 | 0 | 0 | 0 |
| SRF | 0 | 200 | 0 | 0 | 0 | 250 | 0 | 0 | 0 | 150 | 100 | 0 | 200 | 100 | 0 | 0 | 0 | 0 | 0 | 0 |
| TOP2B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 350 | 0 | 350 | 350 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| VIL2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| XBP1 | 0 | 0 | 300 | 0 | 0 | 350 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| KIAA0427 | 0 | 0 | 0 | 350 | 350 | 0 | 350 | 0 | 350 | 0 | 0 | 350 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 250 |
| SUI1 | 0 | 0 | 0 | 350 | 350 | 0 | 350 | 0 | 350 | 0 | 0 | 350 | 0 | 0 | 0 | 0 | 0 | 0 | 250 | 0 |
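As a consistency check, a few entries of Table 3 can be recomputed from Table 2. The calculation below assumes the edge length has the form $L(e_{ij}) = L_0 (M - A_{ij} + 1)$ with $L_0 = 50$; both the form and the constant are inferred from the two tables rather than stated in the text. $M = 7$ is the largest entry in Table 2.

```latex
\begin{align*}
L(e_{\mathrm{EGFR},\mathrm{ERBB2}}) &= 50\,(7 - 7 + 1) = 50,\\
L(e_{\mathrm{MYC},\mathrm{SOX9}})   &= 50\,(7 - 6 + 1) = 100,\\
L(e_{\mathrm{IGF2},\mathrm{EGFR}})  &= 50\,(7 - 3 + 1) = 250,\\
L(e_{\mathrm{ANG},\mathrm{GZMB}})   &= 50\,(7 - 1 + 1) = 350.
\end{align*}
```

These values agree with the corresponding entries of Table 3. A zero in Table 3 marks a pair of genes with no shared biological process, that is, no edge, rather than an edge of length zero.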
For example, IGF2 in Figure 3 is connected to six genes: EGFR, ERBB2, MYBL2, MYC, SOX9, and TOP2B. These connections are also shown in the DAG (Figure 2).
Users can determine the specific levels at which they want to view the gene relation network. The higher the level selected, the more links there are, because broader processes are taken into account. Figure 4 shows the gene relation network based on biological processes at level 6 and lower. This graph contains fewer links than the one in Figure 3. For example, IGF2 is connected to four genes: EGFR, ERBB2, MYBL2, and MYC. It loses its links with SOX9 and TOP2B because the common ancestor of SOX9 and IGF2, skeletal development, and the common ancestor of TOP2B and IGF2, cell cycle, are both at level 5, and therefore these connections are eliminated from the graph.
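The effect of the level cutoff on the links can be sketched with a toy example. The gene names, term names, and levels below are hypothetical, and `links_at_level` is an illustrative helper, not part of GeneInfoViz.

```python
from itertools import combinations

# Hypothetical annotations after ancestor tracing: gene -> {term: level in the DAG}.
GENE_TERM_LEVELS = {
    "gene_1": {"term_x": 5, "term_y": 6},
    "gene_2": {"term_x": 5},
    "gene_3": {"term_y": 6},
}


def links_at_level(gene_term_levels, min_level):
    """Return gene pairs that share at least one term at `min_level` or deeper."""
    links = set()
    for g1, g2 in combinations(sorted(gene_term_levels), 2):
        shared = set(gene_term_levels[g1]) & set(gene_term_levels[g2])
        if any(gene_term_levels[g1][term] >= min_level for term in shared):
            links.add((g1, g2))
    return links


links_level_5 = links_at_level(GENE_TERM_LEVELS, min_level=5)
links_level_6 = links_at_level(GENE_TERM_LEVELS, min_level=6)
```

Here the link between `gene_1` and `gene_2` exists only through a level-5 term, so it is present at the level-5 cutoff and disappears at the level-6 cutoff, in the same way that IGF2 loses its links in Figure 4.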
Figure 4: The gene function relation network constructed by GeneInfoViz based on biological processes at level 6 and lower levels.
GeneInfoViz is not a traditional gene-oriented, one-gene-per-page database. It is a relation-oriented (see Footnote 3) bioinformatics system for batch information retrieval and for gene relation network construction and visualization. GeneInfoViz quantifies the relationships between genes by summarizing the biological processes that the genes are involved in and the relationships among the biological processes defined by the Gene Ontology system. It can also construct gene relation networks based on the other types of Gene Ontology information: molecular functions or cellular components. The method can also be used to construct gene relation networks based on other gene function information systems such as the MIPS Functional Catalogue [Mewes 1991; Mewes et al., 2002] and KEGG Ontology [Kanehisa et al., 1996; 2000].
We would like to thank Dr. David Armbruster for his help in preparing the manuscript.
Footnote 1: A complete list can be found at Gene Ontology website http://www.geneontology.org.
Footnote 2: Usually, the adjacency matrix contains only 1 and 0 to define whether the vertices are connected. But the adjacency matrix we define here is different. The nondiagonal elements are the frequencies that the two genes (at the corresponding row and column) are involved in the same biological processes (or have the same molecular functions, or are associated with the same cellular components).
Footnote 3: We define a "relation-oriented Bioinformatics system" as a Bioinformatics system that not only provides information about individual biological objects (e.g., genes) but also provides information about the relationships among the biological objects.