1Bioinformatics Laboratory/LNCC
Av. Getúlio Vargas 333, Quitandinha. 25651-070 Petrópolis
Rio de Janeiro, Brazil
2Laboratory of Cancer Genetics, Ludwig Institute for Cancer Research,
Rua Prof. Antonio Prudente, 109 - 4th floor, 01509-010,
São Paulo, SP, Brazil.
As the number of prokaryotic genome sequencing projects increases, effective and efficient new tools for genome annotation are required. Annotation takes as input the raw DNA sequence produced by these projects; successive layers of analysis and interpretation are then needed to extract biologically meaningful information. Genome annotation comprises analysis at the nucleotide, protein and functional levels.
Chromobacterium violaceum is being sequenced by a consortium of 25 sequencing groups distributed throughout Brazil, and the raw data are being processed by our bioinformatics laboratory. C. violaceum is a Gram-negative bacterium with a genome of 4.2 million base pairs and features of pharmacological and biotechnological interest. For the annotation of this genome, a new tool is being developed that integrates several public domain and newly developed software programs capable of dealing with several types of databases. During the sequencing phase, long contigs are continuously and automatically annotated to indicate possible open reading frames (ORFs), stop codons, promoters, terminators, Shine-Dalgarno patterns and other genomic features.
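To make the notion of automatically flagging candidate ORFs on a contig concrete, the following is a minimal sketch of a six-frame scan for stretches running from an ATG to the first in-frame stop codon. It is an illustration only and not the tool described here; the function names, the `min_codons` threshold and the restriction to ATG as the only start codon are assumptions made for brevity.

```python
# Illustrative six-frame ORF scan; not the consortium's annotation tool.
START = "ATG"                     # assumption: only ATG is treated as a start
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) tuples for candidate ORFs.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned. The length threshold counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # close it at the stop
    return orfs
```

A real contig would call for a much larger `min_codons` value, and candidates found this way would still need to be scored by a gene finder before being reported.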
The starting point of this tool is the identification of genome landmarks: a tRNA program [1] searches for and identifies tRNA genes, BLASTN [2] searches for and identifies rRNA genes, and RepeatMasker [3] identifies and maps repetitive elements and excludes repetitive regions during the genome assembly process.
The GLIMMER program [4] (Gene Locator and Interpolated Markov Modeler) is then used to identify open reading frames within the microbial DNA. GLIMMER uses Interpolated Markov Models (IMMs) to identify coding regions and distinguish them from noncoding DNA. For likely functional attribution, the BLAST family of programs is used to search for homology in the main biological sequence databases (GenBank, SWISS-PROT, etc.), and these results are used to identify the metabolic pathways. KEGG (Kyoto Encyclopedia of Genes and Genomes) [5] and ECOCYC (Encyclopedia of E. coli Genes and Metabolism) [6] are used to provide a framework of molecular and cellular biology and biochemical machinery.
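The idea of separating coding from noncoding DNA with Markov models can be illustrated with a deliberately simplified sketch. GLIMMER's IMMs combine predictions from several context lengths; the code below uses a single fixed-order model instead and is therefore not GLIMMER's algorithm. The function names, the order of 2 and the add-one pseudocount are all assumptions chosen for illustration.

```python
# Simplified fixed-order Markov scoring; NOT GLIMMER's interpolated model.
import math
from collections import defaultdict


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). Smoothing assumes a
    four-letter alphabet (A, C, G, T).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log(P_coding / P_noncoding) over positions; > 0 favors coding."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score
```

In practice the coding model would be trained on long, confidently identified ORFs from the genome itself and the noncoding model on intergenic sequence; the toy training strings used to exercise this sketch are far too short to be meaningful.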
Since the comparison of protein sequences between species is a rich source of functional annotation, three tools are used for this process. First, Clusters of Orthologous Groups of proteins (COGs) [7] is used to identify and cluster groups of orthologous proteins and to classify predicted proteins on the basis of functional domains, folds and motifs. Second, INTERPRO (Integrated Resource of Protein Domains and Functional Sites) [8] functions as a cross-referencing system that provides an integrated view of the commonly used signature databases PROSITE (patterns and profiles), PRINTS, Pfam, ProDom and SMART, which are used to identify distant relationships in novel sequences and hence to predict protein function and structure. Lastly, PSORT [9] is used to predict protein localization sites in cells.
Noncoding regions will be annotated using software that searches for ribosome binding sites (RBS), promoters, operators [10] and repetitive sequences [11]. These programs are under development and will be integrated with the tools discussed here.
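As a rough illustration of what an RBS search involves, the sketch below looks for Shine-Dalgarno-like motifs in a short window upstream of a predicted start codon. It does not represent the software under development; the motif variants in `SD_PATTERN`, the default `window` of 20 bases and the function name are assumptions, and a realistic search would also score mismatches and the spacing to the start codon.

```python
# Hypothetical Shine-Dalgarno motif scan; not the software under development.
import re

# Assumed motif variants derived from the AGGAGG consensus, longest first.
SD_PATTERN = re.compile(r"AGGAGG|GGAGG|AGGAG|GGAG|AGGA")


def find_rbs(seq, start, window=20):
    """Return (position, motif) hits in the `window` bases upstream of `start`.

    `start` is the 0-based index of the first base of the start codon;
    reported positions are 0-based indices into `seq`.
    """
    seq = seq.upper()
    lo = max(0, start - window)
    upstream = seq[lo:start]
    return [(lo + m.start(), m.group()) for m in SD_PATTERN.finditer(upstream)]
```

For example, calling `find_rbs` with the start coordinate of each ORF on the forward strand gives a first pass over candidate sites, which could then be filtered by distance to the start codon.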