A tool for automatic genome annotation

A. T. Vasconcelos1, L. Gonzaga1, R. C. de Souza1, D. Q. Mendes1, R. F. C. Paixão1, R. Kersanach1, M. Oliveira1, S. M. Martins1 and A. J. G. Simpson2




1Bioinformatics Laboratory/LNCC
Av. Getúlio Vargas 333, Quitandinha. 25651-070 Petrópolis
Rio de Janeiro, Brazil
2Laboratory of Cancer Genetics, Ludwig Institute for Cancer Research,
Rua Prof. Antonio Prudente, 109 - 4th floor, 01509-010,
São Paulo, SP, Brazil.






As the number of prokaryotic genome sequencing projects increases, effective and efficient new tools for genome annotation are required. Annotation takes as input the raw DNA sequence produced by genome sequencing projects. Several layers of analysis and interpretation must then be added to extract biologically meaningful information. Genome annotation comprises analysis at the nucleotide, protein and functional levels.

Chromobacterium violaceum is being sequenced by a consortium of 25 sequencing groups distributed throughout Brazil. The raw data are being processed by our bioinformatics laboratory. C. violaceum is a Gram-negative bacterium with a genome size of 4.2 million base pairs, and it possesses interesting features from the pharmacological and biotechnological points of view. For the annotation of this genome, a new tool is being developed that integrates several public domain and newly developed software programs capable of dealing with several types of databases. During the sequencing phase, long contigs are constantly and automatically annotated to indicate possible open reading frames (ORFs), stop codons, promoters, terminators and Shine-Dalgarno patterns, as well as other genomic features.
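As an illustration of the kind of first-pass ORF marking described above, the following is a minimal sketch that scans all six reading frames of a contig for stretches running from a start codon to the next in-frame stop codon. It is not the tool's actual implementation: the function names, the set of start codons and the minimum length threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: the usual prokaryotic start codons; the real tool may differ.
START_CODONS = {"ATG", "GTG", "TTG"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 90) -> List[Tuple[str, int, int]]:
    """Return (strand, start, end) for candidate ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and always given on the
    forward strand. Each ORF runs from the first start codon seen in a
    frame to the next in-frame stop codon (stop included).
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_len:
                        if strand == "+":
                            orfs.append((strand, start, end))
                        else:
                            # Map reverse-strand coordinates back to the forward strand.
                            orfs.append((strand, n - end, n - start))
                    start = None
    return orfs


if __name__ == "__main__":
    contig = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
    print(find_orfs(contig, min_len=15))
```

A scan of this kind only proposes candidates; distinguishing true genes from spurious ORFs requires a statistical gene finder.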

The starting point of this tool is the identification of genome landmarks: tRNAscan-SE [1] searches for and identifies tRNA genes, BLASTN [2] searches for and identifies rRNA genes, and RepeatMasker [3] identifies and maps repetitive elements and excludes repetitive regions during the genome assembly process.

The GLIMMER program [4] (Gene Locator and Interpolated Markov Modeler) is then used to identify open reading frames within the microbial DNA. GLIMMER uses Interpolated Markov Models (IMMs) to identify coding regions and distinguish them from noncoding DNA. For likely functional attribution, the BLAST family of programs is used to search for homologous sequences in the main biological sequence databases (GenBank, SWISS-PROT, etc.), and these results are used to identify the metabolic pathways. KEGG (Kyoto Encyclopedia of Genes and Genomes) [5] and ECOCYC (Encyclopedia of E. coli Genes and Metabolism) [6] are used to provide a framework describing molecular and cellular biology and biochemical machinery.
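To make the idea of separating coding from noncoding DNA with interpolated Markov models more concrete, here is a deliberately simplified sketch. It is not GLIMMER's algorithm: the class name, the count-based interpolation weight and the pseudocount are assumptions for illustration, and the model ignores codon position. It blends predictions from context lengths 0 up to `max_order`, trusting longer contexts more when they were seen more often in training, and scores a sequence by the log-odds between a coding and a noncoding model.

```python
import math
from collections import defaultdict


class SimpleIMM:
    """Toy interpolated Markov model over the alphabet A, C, G, T."""

    def __init__(self, max_order: int = 2, pseudocount: float = 4.0):
        self.max_order = max_order
        self.pseudocount = pseudocount  # assumption: controls trust in sparse contexts
        # counts[k][context][base] = occurrences of base after a k-mer context
        self.counts = [defaultdict(lambda: defaultdict(float))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            seq = seq.upper()
            for i, base in enumerate(seq):
                for k in range(self.max_order + 1):
                    if i >= k:
                        self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context: str, base: str) -> float:
        """Blend estimates from order 0 upward, starting from a uniform prior."""
        p = 0.25
        for k in range(min(self.max_order, len(context)) + 1):
            ctx = context[len(context) - k:]
            table = self.counts[k].get(ctx)
            if not table:
                break  # unseen context: keep the lower-order estimate
            total = sum(table.values())
            weight = total / (total + self.pseudocount)
            p = weight * (table.get(base, 0.0) / total) + (1.0 - weight) * p
        return p


def log_odds(seq: str, coding: SimpleIMM, noncoding: SimpleIMM) -> float:
    """Positive scores favour the coding model, negative the noncoding one."""
    seq = seq.upper()
    score = 0.0
    for i, base in enumerate(seq):
        ctx = seq[max(0, i - coding.max_order):i]
        score += math.log(coding.prob(ctx, base) / noncoding.prob(ctx, base))
    return score


if __name__ == "__main__":
    coding, noncoding = SimpleIMM(), SimpleIMM()
    coding.train(["GCG" * 20])      # artificial "coding-like" training data
    noncoding.train(["AT" * 30])    # artificial "noncoding-like" training data
    print(log_odds("GCGGCG", coding, noncoding))
    print(log_odds("ATATAT", coding, noncoding))
```

The training strings here are artificial; in practice the coding model would be trained on long ORFs or known genes from the genome being annotated.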

Since the comparison of protein sequences between species is a rich source of functional annotation, three tools are used for this process. Clusters of Orthologous Groups of proteins (COGs) [7] is used to identify and cluster groups of orthologous proteins and to classify predicted proteins on the basis of functional domains, folds and motifs. INTERPRO (Integrated Resource of Protein Domains and Functional Sites) [8] is also used; it functions as a cross-referencing system that provides an integrated view of the commonly used signature databases, such as PROSITE (patterns and profiles). In addition, PRINTS, Pfam, ProDom and SMART are utilized for identifying distant relationships in novel sequences, and hence for predicting protein function and structure. Lastly, PSORT [9] is accessed for the prediction of protein localization sites in cells.

Noncoding regions will be annotated using software, currently under development, that searches for ribosome binding sites (RBS), promoters, operators [10] and repetitive sequences [11]. These programs will be integrated with the tools discussed here.
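Since the noncoding-region software is still under development, the following is only a hypothetical sketch of one such search: locating a Shine-Dalgarno-like ribosome binding site upstream of a predicted start codon. The consensus motif `AGGAGG`, the 20-nucleotide upstream window, the mismatch tolerance and the function name are all assumptions for illustration, not parameters taken from the tool.

```python
from typing import Optional, Tuple


def best_sd_match(seq: str,
                  start_codon_pos: int,
                  motif: str = "AGGAGG",
                  upstream: int = 20,
                  max_mismatches: int = 1) -> Optional[Tuple[int, int]]:
    """Find the best motif match ending before the start codon.

    Returns (position, mismatches) for the leftmost window with the
    fewest mismatches, or None if no window in the upstream region is
    within max_mismatches of the motif. Positions are 0-based.
    """
    seq = seq.upper()
    lo = max(0, start_codon_pos - upstream)
    hi = start_codon_pos - len(motif)  # last start at which the motif still fits
    best = None
    for i in range(lo, hi + 1):
        window = seq[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (i, mismatches)
    return best


if __name__ == "__main__":
    # Hypothetical upstream region followed by an ATG start codon at index 17.
    region = "TTTTAGGAGGTTTTTTTATGAAA"
    print(best_sd_match(region, start_codon_pos=17))
```

A Hamming-distance scan like this is the simplest option; a position weight matrix or a free-energy model of 16S rRNA pairing would be more discriminating, at the cost of needing training data or thermodynamic parameters.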


REFERENCES

  1. Lowe, T. & Eddy, S. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-10.
  3. Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 3:175-185.
  4. Salzberg SL, Pertea M, Delcher AL, Gardner MJ, Tettelin H. 1999. Interpolated Markov models for eukaryotic gene finding. Genomics 1:24-31.
  5. Kanehisa, M. and Goto, S. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 29-34.
  6. Karp PD, Riley M, Paley SM, Pellegrini-Toole A, Krummenacker M. 1999. EcoCyc: encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res 27:55-58.
  7. Tatusov, R., Galperin, M., Natale, D. & Koonin, E. 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33-36.
  8. Apweiler, R. et al. 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37-40.
  9. Nakai, K. 2000. Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem. 54, 277-344.
  10. Lomba M, Vasconcelos AT, Pacheco AB, Almeida DF. 1997. Identification of yebG as a DNA damage inducible Escherichia coli gene. FEMS Microbiology Letters 156(1):119-122.
  11. Vasconcelos AT, Maia MAGM, Almeida DF. 2000. Short interrupted palindromes on the extragenic DNA of Escherichia coli K-12, Haemophilus influenzae Rd and Neisseria meningitidis Z249. Bioinformatics 16(11):968-977.