In Silico Biology 4, 0003 (2003); ©2003, Bioinformation Systems e.V.
Ontology Workshop Tokyo 2003
1 Network Service Solution Business Group, Content Sharing Service Business Unit,
NTT Software Corporation, 223-1 Yamashita-Cho, Naka-ku, Yokohama, Kanagawa
231-8554, Japan
2 Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC),
1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
3 Genome Science Laboratory, Discovery and Research Institute, RIKEN Wako
Main Campus, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
4 Laboratory for Genome Information Engineering, Department of Bioinformatic
Engineering, Graduate School of Information Science and Technology, Osaka University, 1-3
Machikaneyama, Toyonaka, Osaka, 560-8531 Japan
* corresponding author
Email: matsuda@ist.osaka-u.ac.jp
Edited by E. Wingender; received August 15, 2003; revised December 18, 2003; accepted December 22, 2003; published December 26, 2003
Although the sequencing of the genomes of human and several model organisms is almost complete, the number of human genes is still under debate. cDNA (complementary DNA) is generated from mRNA that is transcribed from the genome and can be regarded as a gene itself; therefore, decoding cDNA sequences is important in characterizing genes. Biologists now attach a growing body of knowledge to genes in order to characterize them, and this information is generally called 'annotation.' Annotation is also important for understanding the systems of organisms in various fields of research. We therefore constructed the MaXML (Mouse annotation XML) format, with which mouse cDNA annotation data can be exchanged and shared between laboratories more efficiently. Defining strict data types for annotations is difficult, but we consider XML a feasible format for describing them. We have used the MaXML format to express mouse annotation data in FANTOM DB. We have also developed tools and systems that use these MaXML data, including a parser and a server that provides data on the fly.
Key words: functional annotation, transcriptome, computational annotation, mouse cDNA
In biology, many research institutes have been decoding huge numbers of DNA sequences from various organisms. In particular, large-scale genome projects such as the human genome project have determined whole genome sequences [1].
DNA sequences alone do not carry explicit biological information; therefore, additional knowledge is attached to the sequences as biological facts about them are discovered. Such information includes the locations of exons and introns, corresponding protein entries, and descriptions of biological functions. This type of information is called 'annotation'. Because annotations are important for researchers in biology who seek to understand molecular systems, handling annotation data is a crucial issue. Sequences and annotations can be determined in a high-throughput manner [2, 3], and such data should be manipulated automatically by computer programs.
Traditionally, three international DNA data banks, namely DDBJ (DNA Data Bank of Japan; http://www.ddbj.nig.ac.jp/), EMBL (European Molecular Biology Laboratory; http://www.ebi.ac.uk/), and GenBank (http://www.ncbi.nih.gov/), have collected such annotations. They have also distributed nucleotide data to the world. However, these annotation data are difficult to manipulate automatically using computers. One reason for this is that annotation data are not always consistent: they are written by the submitters who determined the sequences, and therefore tend to depend on the individual interests of those researchers. For example, some sequence entries have information about coding sequences on mRNA, but some do not, even if the proteins are coded. Another reason is that the annotation formats are based on flat text and are designed for human readers rather than for computers. Recently, however, DDBJ, EMBL, and GenBank have started to release their data in XML formats. Although these efforts have solved difficulties in parsing annotation data in computers, some semantic issues remain; for example, retrieving the functions of products derived from a specific mRNA is not easy. Such information may be stored in a definition line, as a product name, in a "misc_feature," or even in a note.
The DAS (Distributed Annotation System) [4] is one implementation used to express and exchange annotation for genome sequences. In this system, annotation data are written in XML (Extensible Markup Language) and are exchanged and manipulated computationally. While DAS helps to manage genomic sequences and their information, some researchers are interested in cDNA (complementary DNA) sequences that are generated from mRNA. Such researchers need information about cDNA clones. To handle cDNA data in the DAS system, sequences of clones must be mapped to reference genomic sequences. However, organisms whose whole genome sequences have been determined are limited. Furthermore, some cDNA clones are difficult to map to genome sequences (e.g. immunoglobulin and T-cell receptor sequences).
Thus, we have designed a new annotation format based on XML, called MaXML (mouse annotation XML). This format specifically describes functional annotations about cDNA clones and sequences.
The original MaXML format was initially developed to express functional annotation data determined during two FANTOM (Functional ANnoTation of Mouse) meetings [2, 5]. These meetings were held between 2000 and 2002, when collaborative experts in biology and bioinformatics gathered together to annotate 21,076 (FANTOM 1 meeting) and finally 60,770 (FANTOM 2 meeting) mouse full-length clones. Annotation data include various types of descriptions, such as brief definitions of the clones ("gene name"), associations of Gene Ontology [6] terms, information on motif containing sequences, and references to public sequence databases. The FANTOM annotation data described in the MaXML format is available at ftp://fantom2.gsc.riken.go.jp/fantomdb/.
Although the original MaXML format was designed to express the functional annotation of each cDNA clone, we found that classification of these clones using a specific method is also essential for functional analysis. For example, classification of sequences as "known in mouse" or "novel" can be used to determine which cDNA clones will be spotted on cDNA microarrays. Another example of necessary classification is the grouping of cDNA clones by protein motif to help select proteins for protein interaction experiments. Therefore, we have extended the MaXML format in order to manipulate such types of data.
In this paper, we report on the new MaXML format and on systems developed with the aim of handling the extended format.
In annotation data used in the MaXML format, a single annotation record is created for each cDNA clone and each annotation record contains its identifier (clone ID, accession number, etc.), the last time the record was modified, and the annotation fields. In order to express various types of descriptions in annotation records, multiple annotation fields can be included. Each annotation field has four items: qualifier, annotation text, data source, and evidence.
A qualifier indicates the type of each annotation field. The current implementation of MaXML (available at ftp://fantom2.gsc.riken.go.jp/) has several types of annotation fields such as the following:
An annotation text is used to express a brief annotation description along with a qualifier for each annotation field. Annotation texts are often derived from public database entries (we have termed these entries 'source entries'). A data source shows the database names and identifiers (accession numbers) of source entries from which the annotation text is derived.
The 'evidence' field shows how each annotation field was determined, for example, sequence similarity scores, program names, etc.
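The record structure described above can be sketched as a small data model. This is only an illustrative sketch in Python, not the authors' implementation: the class names, attribute names, and all sample values (clone ID, date, qualifier, source identifier) are placeholders introduced here, and the actual element names are those defined in the annotation DTD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnotationField:
    """One annotation field, holding the four items named in the text."""
    qualifier: str                      # type of this annotation field
    text: str                           # brief annotation description
    data_source: Optional[str] = None   # database name and accession of the source entry
    evidence: Optional[str] = None      # how the field was determined


@dataclass
class AnnotationRecord:
    """One annotation record; the format holds a single record per cDNA clone."""
    clone_id: str
    last_modified: str
    fields: List[AnnotationField] = field(default_factory=list)

    def by_qualifier(self, qualifier: str) -> List[AnnotationField]:
        """Return all annotation fields of the given type."""
        return [f for f in self.fields if f.qualifier == qualifier]


# All values below are placeholders, not real FANTOM data.
record = AnnotationRecord(clone_id="CLONE0001", last_modified="2003-01-01")
record.fields.append(
    AnnotationField(
        qualifier="gene_name",
        text="hypothetical protein",
        data_source="ExampleDB:X00000",
        evidence="sequence similarity (placeholder)",
    )
)
```

Keeping the fields as a list rather than a dictionary reflects the statement that a record may contain multiple annotation fields, possibly several with the same qualifier.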
MaXML format
The MaXML format is defined as an "annotation DTD" (Figure 1). Based on this format, three types of XML instances are constructed: annotation MaXML data, method MaXML data, and category MaXML data. These three kinds of data are written in 'sequence', 'method', and 'category' elements, respectively. Figure 2 is a data model of the MaXML format. Figure 3 is a graphical representation of the MaXML format.
Annotation MaXML data has elements for describing the functions of cDNA clones. Figure 4 shows an example of annotation MaXML data. The element set for annotation data reflects the structure of annotation data. This set also has elements for describing a list of classifications in which a cDNA is grouped. This information is written in "method_name" and "category_name" elements, whose values are defined in method and category MaXML data explained below.
In addition to annotation data itself, the classification of cDNA clones with their annotations is also important. For example, a list of clones grouped because they contain a specific motif is directly used as a target set in wet experiments. In order to describe this information, two data types are required: one datum defines a classifying method and all its categories, the second datum shows a list of sequence entries that belong to a specific category. Therefore, the following two element sets in the MaXML format are designed for describing such information:
The element set for method MaXML data is used to define a method and all categories in order to classify MaXML annotation entries by the method. For example, a method is the type of cDNA annotation, and categories are 'known mouse gene', 'homolog to a known protein', 'similar to a known protein', 'weakly similar to a known protein', 'motif containing protein', 'hypothetical protein', and 'none.' Another example is a specific motif contained by cDNA sequences. The element set for category MaXML data is described by referring to all annotation records grouped in a category by a specific method.
The method MaXML data has three main elements: "method_name", "method_desc", and "method_category". A "method_name" element represents a method for classifying cDNAs: 'annotation_type' and 'InterPro', for example. A reference to a public resource is also written in XLink attributes if it exists. A description of the method is expressed in a "method_desc". A "method_category" element defines one category classified by the method. 'IPR001107' is one example of a value in an 'InterPro' method. Each "method_category" element has an identifier ("category_name"), a description ("category_desc") of a category, and a reference ("category_ref") to one category MaXML data. The category element may have an XLink attribute, for example, when a method is 'InterPro', and the category is a motif identifier. Figure 5 is an example of method MaXML data whose method is 'annotation_type.'
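As a rough illustration of the method MaXML data described above, the following Python sketch builds such a document with the standard library. The element names are taken from the text, but their exact nesting, the use of element text for "category_ref", the category descriptions, and the file names are assumptions made here; the authoritative structure is the annotation DTD (Figure 1) and the example in Figure 5.

```python
import xml.etree.ElementTree as ET


def build_method(name, desc, categories):
    """Build a 'method' element from (category_name, category_desc, category_ref) tuples.

    The nesting used here is an assumed layout, not the published DTD.
    """
    root = ET.Element("method")
    ET.SubElement(root, "method_name").text = name
    ET.SubElement(root, "method_desc").text = desc
    for cat_name, cat_desc, cat_ref in categories:
        cat = ET.SubElement(root, "method_category")
        ET.SubElement(cat, "category_name").text = cat_name
        ET.SubElement(cat, "category_desc").text = cat_desc
        # Reference to the category MaXML data; the file names are placeholders.
        ET.SubElement(cat, "category_ref").text = cat_ref
    return root


method = build_method(
    "annotation_type",
    "Type of cDNA annotation",  # placeholder description
    [
        ("known mouse gene", "placeholder description", "category_known.xml"),
        ("hypothetical protein", "placeholder description", "category_hypothetical.xml"),
    ],
)
xml_text = ET.tostring(method, encoding="unicode")
```

The optional XLink attributes mentioned in the text are omitted from this sketch for brevity.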
The category MaXML data has three kinds of elements: method, category, and sequence. Method and category elements represent values defined in method MaXML data. A "sequence" element includes an XLink reference to an annotation record in annotation MaXML data. Figure 6 is an example of category MaXML data whose method is 'InterPro.'
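A client could read category MaXML data along the following lines. This Python sketch is based on assumptions: the wrapper element `category_data`, the `href` targets, and the clone identifiers are hypothetical, and only the three element kinds (method, category, sequence) and the use of XLink on "sequence" come from the text. Figure 6 shows the actual layout.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute as ElementTree exposes it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample; element layout and href values are assumptions.
SAMPLE = """<category_data xmlns:xlink="http://www.w3.org/1999/xlink">
  <method>InterPro</method>
  <category>IPR001107</category>
  <sequence xlink:href="annotation.xml#CLONE0001"/>
  <sequence xlink:href="annotation.xml#CLONE0002"/>
</category_data>"""


def members(xml_text):
    """Return (method, category, list of XLink targets of the member sequences)."""
    root = ET.fromstring(xml_text)
    method = root.findtext("method")
    category = root.findtext("category")
    hrefs = [s.get(XLINK_HREF) for s in root.iter("sequence") if s.get(XLINK_HREF)]
    return method, category, hrefs


method, category, hrefs = members(SAMPLE)
```

Each returned target would then be resolved against the annotation MaXML data to obtain the full record for that clone.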
Current vocabularies of methods and categories used in these MaXML data are summarized as Supplementary Material.
MaXML system
We have developed a system to distribute FANTOM annotation data in MaXML. This system is implemented as a subsystem of the FANTOM DB system [7]. It receives queries via HTTP and returns data expressed in MaXML. The current implementation supports only annotation MaXML data; we will extend the system to support method and category MaXML data.
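The query-and-response pattern can be sketched from the client side as follows. The text does not give the server's query interface, so everything specific here is hypothetical: the base URL `http://example.org/maxml`, the parameter name `id`, and the function names are placeholders, not the FANTOM DB interface.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://example.org/maxml"  # hypothetical endpoint, not the real server


def build_query_url(clone_id, base_url=BASE_URL):
    """Encode a clone identifier as an HTTP GET query (parameter name is assumed)."""
    return f"{base_url}?{urlencode({'id': clone_id})}"


def fetch_annotation(clone_id, base_url=BASE_URL, timeout=30):
    """Send the query and parse the returned MaXML document into an element tree."""
    with urlopen(build_query_url(clone_id, base_url), timeout=timeout) as resp:
        return ET.fromstring(resp.read())


url = build_query_url("CLONE 0001")
```

`fetch_annotation` is not called here because the endpoint is a placeholder; against a real server it would return the root element of the annotation MaXML data.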
We have also developed a parser to transmit annotation data from a server to client applications. This parser was written as a Perl module, and users can use the module in their Perl scripts. The parser is available from http://fantom2.gsc.riken.go.jp/maxml/.
Large-scale functional genome analysis requires the automatic manipulation of functional information by computers. Therefore, we must respect both the structure of the annotation data and the rules of their vocabulary. Many formats have been proposed for expressing annotation data. Many of them are field/value-based flat files, for example, DDBJ/EMBL/GenBank feature tables and SWISS-PROT entries. These types of annotation data have some weaknesses; for example, processing them in computers is difficult because users must implement parsers for every type of annotation data from scratch. Furthermore, the use of fields in the formats is sometimes ambiguous. For example, the SWISS-PROT keywords have mixed sets of descriptions. Some of the descriptions are related to functional annotations, but others are not (e.g., 3D-STRUCTURE). To solve these problems, the Gene Ontology Consortium is developing a dynamically controlled vocabulary that can be applied to all organisms even while knowledge of gene and protein roles in cells is accumulating and changing. The FANTOM Consortium [2, 5] has also proposed a functional annotation rule and has released annotation data that obey the rule. Because MaXML was designed to manipulate these Gene Ontology terms and FANTOM annotations, our format is suitable for describing data used in computational functional analysis.
Some database systems provide annotation data in HTML. Although these types of annotation data can be shown in WWW browsers, processing them computationally is difficult because HTML tags indicate only how to represent the text in browsers, but not what the text means.
Annotation data written in XML including MaXML can be shown in recent WWW browsers and can be processed easily by computers with various XML libraries and utilities. This increased efficiency reduces the time and labor needed for developing client systems to manipulate annotation data. For this reason we selected XML as a format to describe our annotation data.
DDBJ, EMBL, and GenBank each release sequence entries in their own XML format. Although these three formats are based on the original flat file formats, they have different design strategies. The GenBank XML format has more elements than the others do, and its data are more finely granulated. In DDBJ and EMBL XML, some descriptions are kept as they were written in their original flat files. The MaXML format is similar to the DDBJ and EMBL formats in that less time is required to retrieve annotation data.
The structures of our format and the structures of other formats are slightly different. In particular, in our format, the functional information of a sequence is explicitly written in "annotation" elements, while this information is found in any one of many elements in the other XML formats, as was explained in the introduction. This is because we specifically intend to exchange functional information with this format. This difference makes converting data among the MaXML and the DDBJ/EMBL/GenBank XML formats somewhat complicated.
In the future, we plan to produce a client application that communicates with the annotation server and users through the MaXML format. We also plan to integrate the system in order to manipulate DAS data to combine genome data and cDNA data.
We would like to thank Julian Gough for his comments on this paper. We also wish to thank the members of the FANTOM Consortium. This study was supported by a Research Grant to the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT) to Y.H. This research was also supported by ACT-JST (Research and Development for Applying Advanced Computational Science and Technology) of the Japan Science and Technology Corporation (JST) to H.M. This study was also supported by Special Coordination Funds for Promoting Science and Technology from MEXT to Y.O.