In Silico Biology 4, 0026 (2004); ©2004, Bioinformation Systems e.V.  

GeneInfoViz: Constructing and visualizing gene relation networks

Mi Zhou and Yan Cui*




Department of Molecular Sciences
Center of Genomics and Bioinformatics
College of Medicine
University of Tennessee Health Science Center
Email: ycui2@utmem.edu

*corresponding author





Edited by E. Wingender; received January 27, 2003; revised and accepted March 27, 2004; published April 23, 2004



Abstract

Large amounts of knowledge about genes have been stored in public databases. One of the most challenging problems in Bioinformatics is to determine, given all the information about the genes in these databases, the relationships between the genes. For example, how can we determine whether genes are related, and how closely, based on existing knowledge about their biological roles? We developed GeneInfoViz, a web tool for batch retrieval of gene information and for construction and visualization of gene relation networks. We created a database containing compiled Gene Ontology information for the genes of several model organisms. Users can batch search for a group of genes and get the Gene Ontology terms that are associated with the genes. Directed acyclic graphs are generated to show the hierarchical structure of the Gene Ontology tree. GeneInfoViz calculates an adjacency matrix to determine whether the genes are related and, if so, how closely they are related based on the biological processes, molecular functions, or cellular components they are associated with. It then displays a dynamic graph layout of the network among the selected genes.

Availability: http://genenet.org/

Key words: gene network, Gene Ontology, dynamic visualization



Introduction

With the development of high throughput genomic technologies, the number of genes that people can study at the same time is increasing rapidly. A frequently occurring scenario is that researchers select a number of genes through a genome-wide survey, e.g., potential marker genes for a certain disease. Many of these genes may have multiple biological roles and may be related because of the common biological roles they share. It is important to find out what, given all the knowledge on these genes in the public databases, we can tell about the relationships between them.

Literature mining approaches have been used to address this problem [Jenssen et al., 2001; Tao et al., 2002; Chaussabel et al., 2002; Masys et al., 2001; Tanabe et al., 1999; Hirschman et al., 2002]. They can also be used for automated database curation and ontology development [Hirschman et al., 2002; Yeh et al., 2003]. However, since many major databases (see Footnote 1) have used Gene Ontology (GO) to annotate genes, it might be more direct and precise to define the relationships between genes based on the GO terms they are associated with.

GeneInfoViz provides users with a tool to view a group of genes in the GO directed acyclic graphs (DAG). It also enables users to analyze the relation among these genes in a network at a specific level of the DAG.



Methods


Creation and automatic update of the GeneInfo database

GeneInfoViz includes parser programs that download and parse gene information from public databases including UniGene [Schuler et al., 1996], LocusLink [Pruitt et al., 2001] and Gene Ontology [The Gene Ontology Consortium 2000; 2001]. The compiled gene information is then saved in our local database, called GeneInfo. Currently, GeneInfo contains functional information for genes of human, mouse, rat, fruit fly, and zebrafish. More species will be added in the future. The parser programs update the GeneInfo database automatically at regular intervals.


Batch retrieval of gene information

Users can batch search through the web interface of GeneInfoViz. The input is a list of query genes. It permits searches by GenBank accession numbers, LocusLink IDs, UniGene IDs, and official gene symbols. The search results include UniGene IDs (with a hypertext link to the corresponding UniGene web page), LocusLink IDs (with a hypertext link to the corresponding LocusLink web page), official gene symbols, gene names, and Gene Ontology terms. There are three types of GO terms: biological process [P], molecular function [F], and cellular component [C]. Associations of genes with GO terms are tagged with an evidence code that categorizes the quality of the association, ranging from more error-prone electronic annotation to experimental evidence. Users can choose to exclude the associations between genes and GO terms that are tagged with undesired evidence codes. By default, the associations with any evidence code are included.
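The evidence-code filter described above can be sketched as follows. This is a minimal illustration rather than the GeneInfoViz implementation: the record layout, the gene and term names, and the function `filter_associations` are hypothetical, while IEA, TAS and IDA are standard GO evidence codes.

```python
# Hypothetical records: (gene symbol, GO term, GO type, evidence code).
# The rows are made-up illustrations, not contents of the GeneInfo database.
ASSOCIATIONS = [
    ("GENE_A", "term one", "P", "IEA"),  # IEA: inferred from electronic annotation
    ("GENE_A", "term two", "F", "TAS"),  # TAS: traceable author statement
    ("GENE_B", "term one", "P", "IDA"),  # IDA: inferred from direct assay
]


def filter_associations(associations, excluded_codes=()):
    """Drop associations whose evidence code is in excluded_codes.

    With no excluded codes every association is kept, which mirrors
    the default behaviour described in the text.
    """
    excluded = set(excluded_codes)
    return [record for record in associations if record[3] not in excluded]


everything = filter_associations(ASSOCIATIONS)
without_electronic = filter_associations(ASSOCIATIONS, excluded_codes={"IEA"})
```

Excluding only IEA is a common choice when the more error-prone electronic annotations should not contribute links to the network.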


The directed acyclic graphs of Gene Ontology

The GO system has a hierarchical structure - broad biological roles are at the higher levels, and more specific roles are at the lower levels. In this paper, we define the root - "Gene Ontology, GO:0003673" - at level 0 and its three child nodes - "biological_process, GO:0008150", "molecular_function, GO:0003674", and "cellular_component, GO:0005575" - at level 1. A DAG is a directed graph where no path starts and ends at the same node (vertex). DAGs can be used to show the hierarchical structure of Gene Ontology [The Gene Ontology Consortium 2000; 2001]. In the DAG, a Gene Ontology term is represented by a node; a parent node and a child node are linked by a directed edge (arrow) starting from the parent node and ending at the child node. However, the DAG of all known Gene Ontology terms includes too many terms that are irrelevant to a given group of genes. GeneInfoViz filters the Gene Ontology DAG and shows only the part that is associated with the query genes. It first finds all the GO terms that are assigned to the query genes in NCBI’s LocusLink database [Pruitt et al., 2001], and then traces their ancestral GO terms by querying the Gene Ontology database. Graphviz, an open source graph drawing package, is used to create the graph description file of the DAG. WebDot, a CGI program, is used to convert the graph description file into an image that can be included on a web page.
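The filtering step, tracing the ancestors of the query terms and writing a graph description, might look like the sketch below. It is an illustration under stated assumptions, not the GeneInfoViz code: the parent map is a toy structure, `term_x`, `term_y` and `term_z` are hypothetical placeholders, and the function names are invented here. The output uses ordinary Graphviz DOT syntax.

```python
def trace_ancestors(parents, query_terms):
    """Collect the query terms, all their ancestors, and the connecting edges."""
    nodes, edges, stack = set(), set(), list(query_terms)
    while stack:
        term = stack.pop()
        if term in nodes:
            continue
        nodes.add(term)
        for parent in parents.get(term, []):
            edges.add((parent, term))  # edges point from parent to child
            stack.append(parent)
    return nodes, edges


def to_dot(nodes, edges, query_terms):
    """Render the filtered DAG as DOT text, with query terms filled in blue."""
    lines = ["digraph GO {", "  node [shape=box];"]
    for node in sorted(nodes):
        if node in query_terms:
            lines.append(f'  "{node}" [style=filled, fillcolor=lightblue];')
        else:
            lines.append(f'  "{node}";')
    for parent, child in sorted(edges):
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Toy parent map (child -> parents). The two top terms are named in the text;
# term_x, term_y and term_z are hypothetical placeholders.
PARENTS = {
    "biological_process": ["Gene Ontology"],
    "term_x": ["biological_process"],
    "term_y": ["term_x", "biological_process"],
    "term_z": ["biological_process"],
}

nodes, edges = trace_ancestors(PARENTS, ["term_y"])
dot_text = to_dot(nodes, edges, {"term_y"})
```

Because `term_z` is not an ancestor of any query term, it is left out of the description, which is the point of the filtering.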


Constructing a gene relation network

The results of a batch query include the GO terms that are associated with each gene. GeneInfoViz starts from these GO terms and traces the paths up to a higher level (determined by the user) in the GO DAG. An indicator table is used to code the genes’ biological roles. Only 1 and 0 occur in the indicator table: "1" means the gene is associated with the biological role, "0" means it is not. Based on the indicator table, we can calculate the adjacency matrix (see Footnote 2)

$$A = T^{\mathsf{T}} T \tag{1}$$

where A is the adjacency matrix (an n × n matrix, n being the number of genes), T is the indicator table (an m × n matrix, m being the number of biological processes that the genes are involved in), and $T^{\mathsf{T}}$ is the transpose of T. Let the network G = (V, E), where V is the set of vertices (in our case, genes) and E is the set of edges. The structure of network G is determined by the adjacency matrix A. The two vertices $v_i$ and $v_j$ are connected by edge $e_{ij}$ if and only if $A_{ij} > 0$. The length of $e_{ij}$ is defined as

$$L(e_{ij}) = L_0 \, (M - A_{ij} + 1) \tag{2}$$

where $M = \max(A_{ij})$ is the maximum of the numbers in matrix A and $L_0$ is a constant. The more biological roles the two genes share, the closer they are in the graph.
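The construction can be sketched in a few lines of plain Python. The sketch assumes that Equation (1) is the product of the transpose of T with T and that Equation (2) is the linear rule L0 (M - Aij + 1), a form that reproduces Tables 2 and 3 with L0 = 50. Setting the diagonal to 0 (as in Table 2) and returning 0 for unconnected pairs (as in Table 3) are further assumptions, and the function names are hypothetical.

```python
def adjacency(T):
    """Shared-term counts between genes from an m x n 0/1 indicator table.

    Rows of T are GO terms, columns are genes. Off-diagonal entries equal
    those of (T transposed) times T; the diagonal is set to 0 (assumption).
    """
    m, n = len(T), len(T[0])
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i][j] = sum(T[k][i] * T[k][j] for k in range(m))
    return A


def edge_lengths(A, L0=50):
    """Edge length L0 * (M - A_ij + 1) for connected pairs, 0 where no edge exists."""
    M = max(max(row) for row in A)
    return [[L0 * (M - a + 1) if a > 0 else 0 for a in row] for row in A]


# Toy indicator table: 3 terms (rows) x 3 genes (columns).
T = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]
A = adjacency(T)
lengths = edge_lengths(A)
```

In this toy case the first two genes share two terms and receive the shortest edge, the second and third genes share one term and receive a longer edge, and the first and third genes are not connected.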


Dynamic visualization of the gene relation network

GeneInfoViz dynamically displays the gene relation network G using a Java applet developed by Sun Microsystems. It makes a planar embedding of G in two-dimensional space. The planar embedding of an abstract graph G is an isomorphism between G and a plane graph, and this plane graph is a layout of G. Note that the graph is determined only up to the adjacency matrix, so the layout of a graph G is not unique. Two layouts are equivalent if both of them reflect the gene relations that are determined by the adjacency matrix. The program allows users to change the graph layout by clicking and dragging the vertices (genes). The program moves the vertices to keep the lengths of the edges in the layout as close to $L(e_{ij})$ as possible. In this way, GeneInfoViz allows users to select a graph layout from the equivalent graph layouts of G.
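The idea of moving vertices so that edge lengths approach their targets can be illustrated with a simple relaxation loop. This is a generic sketch and not the algorithm of the Java applet used by GeneInfoViz; the function `relax`, its parameters and the step rule are assumptions made for illustration.

```python
import math


def relax(positions, targets, steps=200, rate=0.1):
    """Nudge vertices so that each edge approaches its target length.

    positions: {vertex: [x, y]}, modified in place and returned.
    targets:   {(vertex_a, vertex_b): desired edge length}.
    """
    for _ in range(steps):
        for (a, b), want in targets.items():
            dx = positions[b][0] - positions[a][0]
            dy = positions[b][1] - positions[a][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Move both endpoints along the edge by a fraction of the error:
            # together if the edge is too long, apart if it is too short.
            shift = rate * (dist - want) / dist
            positions[a][0] += shift * dx
            positions[a][1] += shift * dy
            positions[b][0] -= shift * dx
            positions[b][1] -= shift * dy
    return positions


layout = relax({"gene_a": [0.0, 0.0], "gene_b": [10.0, 0.0]},
               {("gene_a", "gene_b"): 50.0})
final_distance = math.hypot(layout["gene_b"][0] - layout["gene_a"][0],
                            layout["gene_b"][1] - layout["gene_a"][1])
```

Dragging a vertex in an interactive viewer amounts to changing one entry of `positions` and letting a loop of this kind settle again, which is why many equivalent layouts can be reached from the same adjacency matrix.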



Results and discussion

We used a group of genes as input to illustrate our method. Bertucci et al. [2002] performed large-scale microarray experiments and identified a predictor set of 23 genes whose expression patterns differentiated two groups of breast cancer patients with different survival after adjuvant chemotherapy. The 23 genes were selected because they are differentially expressed in the two groups of samples; no further functional connections between these genes can be found from the microarray data. Here we show that GeneInfoViz can be used to construct and visualize the relationships between these potential marker genes for breast cancer prognosis based on their biological roles. This is a novel way of using database information to annotate the gene selection results from genomic surveys like microarray analysis.

First, we used GeneInfoViz to search the Gene Ontology terms associated with the 23 genes. The result was that 21 of the 23 genes were assigned to at least one biological process (Figure 1).



Figure 1: Batch retrieval of gene function information including the three types of Gene Ontology terms that the genes are associated with (according to NCBI’s LocusLink database); [P] denotes biological process; [F] denotes molecular function; [C] denotes cellular component.


A Gene Ontology DAG of the biological process terms with which these 23 genes are associated is shown in Figure 2. A rectangular node in blue indicates that this specific GO term is associated with one or more query genes. The names of the associated genes are listed below the GO term, each followed by its evidence code in parentheses.



Figure 2: The Gene Ontology DAG including all the biological processes that the 23 query genes are involved in. The blue rectangular nodes represent the biological processes (GO terms) that at least one of the 23 query genes is assigned to (according to NCBI’s LocusLink database). The names of the genes are also shown in the node.


There were 108 nodes at the 10 levels of the DAG. Many genes were assigned to multiple nodes located in different parts of the DAG. For example, insulin-like growth factor 2 (IGF2) is involved in eight biological processes: growth pattern, insulin receptor signaling pathway, imprinting, skeletal development, development, physiological processes, cell proliferation, and regulation of cell cycle. These eight biological processes are offspring of three broad biological processes at level 2: development, cellular process, and physiological processes (Figure 2). In the tree-like DAG, we easily identified genes that belong to the same GO term, as well as genes that are in the same branch of the tree. But since many genes are involved in more than one biological process at different levels, it was difficult to tell the functional relations between genes from the Gene Ontology DAG alone. In order to quantify and visualize the functional relations between genes, GeneInfoViz first created an indicator table (Table 1). It started from the base biological processes the genes are involved in (the blue rectangular nodes in Figure 2), then traced the Gene Ontology terms up to a selected level (defined by the user) in the Gene Ontology DAG and assigned the genes to all the ancestor biological processes along the paths to the selected level.


Table 1: The indicator table coding the biological processes the genes are involved in. The first column is Gene Ontology categories in biological process. "1" means the gene belongs to the category, "0" means it does not.

Two genes were considered connected if they were associated with at least one common GO term in Table 1. The more GO terms they had in common, the closer their connection. The number of GO terms shared by each pair of genes is shown in an adjacency matrix (Table 2).


Table 2: The adjacency matrix. The numbers in this matrix are the number of co-occurrences of the two genes (in the corresponding row and column) in the same biological processes.
  ANG CRABP2 CSF1 EGFR ERBB2 GATA3 GZMB IGF2 MST1 MYBL2 MYC PLAT SOX4 SOX9 SRF TOP2B VIL2 XBP1 KIAA0427 SUI1
ANG 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0
CRABP2 0 0 0 0 0 2 0 0 0 4 4 0 4 4 4 0 0 0 0 0
CSF1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0
EGFR 0 0 0 0 7 0 1 3 1 0 0 2 0 0 0 0 0 0 1 1
ERBB2 0 0 0 7 0 0 1 2 1 0 0 2 0 0 0 0 0 0 1 1
GATA3 0 2 1 0 0 0 0 0 0 3 3 0 2 3 3 0 0 1 0 0
GZMB 1 0 0 1 1 0 0 0 4 2 0 4 0 0 0 0 0 0 1 1
IGF2 0 0 0 3 2 0 0 0 0 2 2 0 0 1 0 1 0 0 0 0
MST1 1 0 0 1 1 0 4 0 0 0 0 4 0 0 0 0 0 0 1 1
MYBL2 0 4 0 0 0 3 2 2 0 0 7 0 4 5 5 1 0 0 0 0
MYC 0 4 0 0 0 3 0 2 0 7 0 0 4 6 6 1 0 0 0 0
PLAT 1 0 0 2 2 0 4 0 4 0 0 0 0 0 0 0 0 0 1 1
SOX4 0 4 0 0 0 2 0 0 0 4 4 0 0 4 4 0 0 0 0 0
SOX9 0 4 0 0 0 3 0 1 0 5 6 0 4 0 6 0 0 0 0 0
SRF 0 4 0 0 0 3 0 0 0 5 6 0 4 6 0 0 0 0 0 0
TOP2B 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0
VIL2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
XBP1 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
KIAA0427 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 3
SUI1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 3 0


We constructed a gene relation network based on biological processes at level 5 and lower (Figure 3). The length of the edge between each pair of genes in the gene relation network was determined according to Equation (2) and is shown in the distance matrix (Table 3).



Figure 3: The gene function relation network constructed by GeneInfoViz based on biological processes at level 5 and lower levels.


Table 3: Distance matrix.
  ANG CRABP2 CSF1 EGFR ERBB2 GATA3 GZMB IGF2 MST1 MYBL2 MYC PLAT SOX4 SOX9 SRF TOP2B VIL2 XBP1 KIAA0427 SUI1
ANG 0 0 0 0 0 0 350 0 350 0 0 350 0 0 0 0 0 0 0 0
CRABP2 0 0 0 0 0 300 0 0 0 200 200 0 200 200 200 0 0 0 0 0
CSF1 0 0 0 0 0 350 0 0 0 0 0 0 0 0 0 0 0 300 0 0
EGFR 0 0 0 0 50 0 350 250 350 0 0 300 0 0 0 0 0 0 350 350
ERBB2 0 0 0 50 0 0 350 300 350 0 0 300 0 0 0 0 0 0 350 350
GATA3 0 300 350 0 0 0 0 0 0 250 250 0 300 250 250 0 0 350 0 0
GZMB 350 0 0 350 350 0 0 0 200 300 0 200 0 0 0 0 0 0 350 350
IGF2 0 0 0 250 300 0 0 0 0 300 300 0 0 350 0 350 0 0 0 0
MST1 350 0 0 350 350 0 200 0 0 0 0 200 0 0 0 0 0 0 350 350
MYBL2 0 200 0 0 0 250 300 300 0 0 50 0 200 150 150 350 0 0 0 0
MYC 0 200 0 0 0 250 0 300 0 50 0 0 200 100 100 350 0 0 0 0
PLAT 350 0 0 300 300 0 200 0 200 0 0 0 0 0 0 0 0 0 350 350
SOX4 0 200 0 0 0 300 0 0 0 200 200 0 0 200 200 0 0 0 0 0
SOX9 0 200 0 0 0 250 0 350 0 150 100 0 200 0 100 0 0 0 0 0
SRF 0 200 0 0 0 250 0 0 0 150 100 0 200 100 0 0 0 0 0 0
TOP2B 0 0 0 0 0 0 0 350 0 350 350 0 0 0 0 0 0 0 0 0
VIL2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
XBP1 0 0 300 0 0 350 0 0 0 0 0 0 0 0 0 0 0 0 0 0
KIAA0427 0 0 0 350 350 0 350 0 350 0 0 350 0 0 0 0 0 0 0 250
SUI1 0 0 0 350 350 0 350 0 350 0 0 350 0 0 0 0 0 0 250 0
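
A quick consistency check on the two tables, offered as a reading of the data rather than a statement from the text: the nonzero entries of Table 3 appear to follow from those of Table 2 under a linear rule with M = 7 (the largest entry of Table 2) and L0 = 50, while a 0 in Table 3 marks a pair with no shared term and hence no edge. Four pairs worked through:

```latex
\begin{align*}
L(e_{\mathrm{EGFR},\mathrm{ERBB2}}) &= 50\,(7 - 7 + 1) = 50,\\
L(e_{\mathrm{MYC},\mathrm{SOX9}})   &= 50\,(7 - 6 + 1) = 100,\\
L(e_{\mathrm{EGFR},\mathrm{IGF2}})  &= 50\,(7 - 3 + 1) = 250,\\
L(e_{\mathrm{ANG},\mathrm{GZMB}})   &= 50\,(7 - 1 + 1) = 350.
\end{align*}
```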


For example, IGF2 in Figure 3 is connected to six genes: EGFR, ERBB2, MYBL2, MYC, SOX9, and TOP2B. These connections are also shown in the DAG (Figure 2):

  1. IGF2 and EGFR have a common ancestor node, transmembrane receptor protein tyrosine kinase signaling pathway (at level 7)
  2. The biological process that ERBB2 is involved in, enzyme linked receptor protein signaling pathway (at level 6), is an ancestor of insulin receptor signaling pathway, the base biological process of IGF2 (at level 8)
  3. IGF2 and MYBL2 are both involved in regulation of cell cycle (at level 6), and MYC can also trace up to this biological process
  4. SOX9 is involved in cartilage condensation, which can trace up to skeletal development (at level 5), a base biological process of IGF2
  5. TOP2B is involved in DNA topological change (at level 10), which can trace up to cell cycle (at level 5); regulation of cell cycle, in which IGF2 is involved (at level 6), can also trace up to cell cycle at level 5

Users can choose the level at which they want to view the gene relation network. The closer the selected level is to the root (that is, the smaller the level number), the more links there are, because broader processes are taken into account. Figure 4 shows the gene relation network based on biological processes at level 6 and lower. This graph contains fewer links than Figure 3. For example, IGF2 is connected to four genes: EGFR, ERBB2, MYBL2, and MYC. It loses its links with SOX9 and TOP2B because the common ancestor of SOX9 and IGF2, skeletal development, and the common ancestor of TOP2B and IGF2, cell cycle, are both at level 5, and therefore these connections are eliminated from this graph.
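The effect of the level cutoff can be reproduced on a toy fragment of the DAG. The term names and the levels of skeletal development, cell cycle and regulation of cell cycle follow the text; the level assigned to cartilage condensation, the direct parent links, and the function names are simplifying assumptions made for this sketch.

```python
# Levels of the terms (root = level 0). The value for
# "cartilage condensation" is an assumption.
LEVEL = {
    "skeletal development": 5,
    "cartilage condensation": 6,
    "cell cycle": 5,
    "regulation of cell cycle": 6,
}
# Child -> parents; direct links are a simplification of the real DAG.
PARENTS = {
    "cartilage condensation": ["skeletal development"],
    "regulation of cell cycle": ["cell cycle"],
}
# Base terms per gene, restricted to the terms used in this example.
BASE_TERMS = {
    "IGF2": ["skeletal development", "regulation of cell cycle"],
    "SOX9": ["cartilage condensation"],
    "MYBL2": ["regulation of cell cycle"],
}


def terms_at_or_below(gene, cutoff):
    """Base terms plus ancestors, keeping only terms whose level is >= cutoff."""
    seen, stack = set(), list(BASE_TERMS[gene])
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(PARENTS.get(term, []))
    return {term for term in seen if LEVEL[term] >= cutoff}


def linked(gene_a, gene_b, cutoff):
    """Two genes are linked if they share at least one term at the cutoff level or lower."""
    return bool(terms_at_or_below(gene_a, cutoff) & terms_at_or_below(gene_b, cutoff))


igf2_sox9_at_5 = linked("IGF2", "SOX9", cutoff=5)
igf2_sox9_at_6 = linked("IGF2", "SOX9", cutoff=6)
igf2_mybl2_at_6 = linked("IGF2", "MYBL2", cutoff=6)
```

With the cutoff at level 5 the shared ancestor skeletal development links IGF2 and SOX9; with the cutoff at level 6 that term is excluded and the link disappears, while IGF2 and MYBL2 remain linked through regulation of cell cycle.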



Figure 4: The gene function relation network constructed by GeneInfoViz based on biological processes at level 6 and lower levels.


GeneInfoViz is not a traditional gene-oriented, one-gene-per-page database. It is a relation-oriented (see Footnote 3) Bioinformatics system for batch information retrieval and for gene relation network construction and visualization. GeneInfoViz quantifies the relationships between genes by summarizing the biological processes that the genes are involved in and the relationships among the biological processes defined by the Gene Ontology system. It can also construct gene relation networks based on the other types of Gene Ontology information: molecular functions or cellular components. The method can also be used to construct gene relation networks based on other gene function information systems like the MIPS Functional Catalogue [Mewes 1991; Mewes et al., 2002] and KEGG Ontology [Kanehisa et al., 1996; 2000].



Acknowledgement

We would like to thank Dr. David Armbruster for his help in preparing the manuscript.



References





Footnotes

Footnote 1: A complete list can be found at the Gene Ontology website, http://www.geneontology.org.

Footnote 2: Usually, the adjacency matrix contains only 1 and 0 to indicate whether the vertices are connected. The adjacency matrix we define here is different: its nondiagonal elements are the numbers of biological processes (or molecular functions, or cellular components) that the two genes at the corresponding row and column share.

Footnote 3: We define a "relation-oriented Bioinformatics system" as a Bioinformatics system that provides information not only about individual biological objects (e.g., genes) but also about the relationships among them.