In Silico Biology 4, 0019 (2004); ©2004, Bioinformation Systems e.V.  


PPD - Proteome Profile Database

Kishore R. Sakharkar1 and Vincent T. K. Chow2*




1 BioInformatics Institute, Singapore
2 Human Genome Laboratory, Department of Microbiology,
  Faculty of Medicine, National University of Singapore,
  Kent Ridge, Singapore 117597
  Phone: +65-6874 6200
  Fax: +65-6776 6872
  Email: micctk@nus.edu.sg

* corresponding author





Edited by E. Wingender; received December 24, 2003; revised and accepted February 23, 2004; published March 02, 2004



Abstract

With the complete sequencing of multiple genomes, methods of sequence analysis have been extended from single genes or proteins to the simultaneous analysis of many genes and proteins. Consequently, there is a demand for user-friendly software tools that allow mining of these enormous datasets. PPD is a WWW-based database for comparative analysis of protein lengths in completely sequenced prokaryotic and eukaryotic genomes. PPD's core objective is to create protein classification tables based on protein length for a user-specified set of organisms and parameters. The interface can also report changes in the number of proteins within specific length ranges. This feature is important when the user's interest is focused on evolutionarily related organisms, or on organisms with similar or related tissue specificity or life-style. PPD is available at http://www.bii.a-star.edu.sg/~kishore/PPD/PPD.html

Key words: protein length distribution, proteome, bioinformatics, genomics, obligatory intracellular parasites, genome, size, reduction, database, distribution, profile, classification, COG, KEGG, amino acid, composition, evolution, gene, organism life-style




Introduction

Comparative genomics is a powerful approach for identifying genetic variation that could explain differences in the anatomy, physiology and biochemistry of the organisms compared, and the factors responsible for their life-styles in general. It has given us the ability to scrutinize genome/proteome data and to derive knowledge from these data in a biologically meaningful context. Comparative genomics utilizes the large number of sequences in databases not only to elucidate what is common to all of life, but also to understand the evolutionary diversity within various groups and the evolutionary processes or mechanisms that produce such diversity. To date, ~200 bacterial and nine eukaryotic genomes have been completely sequenced (list available online). These complete genome/proteome sequences provide a platform for understanding biological systems at a whole new level of complexity and provide insights into evolution.

Protein lengths in prokaryotes and eukaryotes vary considerably, from a few to thousands of amino acids, and length variations are documented to have many effects. In general, the mean protein lengths of eukaryotes are reported to be 40-60% greater than those of prokaryotes [Zhang, 2000]. Orthologous proteins are reported to be longer in eukaryotes than in prokaryotes. Eukaryote-specific proteins are also longer than prokaryote-specific proteins. Multicellular eukaryotes are described as having longer proteins than unicellular eukaryotes [Chervitz et al., 1998]. Increased protein lengths add functional motifs to proteins, which in turn are associated with novel gene interactions and sophisticated gene regulation networks in eukaryotes [Zhang, 2000]. A relationship between protein conservation and sequence length was also reported by Lipman et al., who found that the less conserved (less important) proteins are on average smaller than the more conserved (more important) proteins [Lipman et al., 2002]. An interesting example was reported in Mycoplasma hyorhinis, where elongated versions of Vlp surface lipoproteins protect escape variants from growth-inhibiting host antibodies [Citti et al., 1997]. There are reports that longer proteins are more important than shorter ones in yeast [Zhang, 2000]. Positive correlations between protein lengths and expression levels have been described [Duret et al., 1999]. However, the biological meaning and evolutionary mechanisms responsible for differences in protein lengths among the three domains of life (i.e. eukaryotes, archaea and bacteria) remain obscure. The recent determination of the complete genome sequences of representative organisms, i.e. bacteria with different life-styles, physiology, genotypic and phenotypic characteristics, and nine eukaryotes (four unicellular and five multicellular), makes it possible for the first time to study the distribution of the lengths of all proteins encoded in a genome, and to compare these distributions across species in detail. This will help to elucidate commonalities and contrasts among these genomes (e.g. genome reduction in all the obligatory intracellular parasites compared to E. coli, a free-living bacterium), and enhance understanding of their general evolution and behavior at the sequence level. Recently, Pruess and colleagues generated the Proteome Analysis Database, which allows one to interrogate and compare entire proteomes of organisms by domain and/or protein family distributions [Pruess et al., 2003]. However, there is currently no comprehensive database that facilitates comparison of protein length distribution profiles for completely sequenced genomes and that presents the results online in graphical and tabular form.

PPD is a new WWW-based database on the protein length distribution of genomes. The key components of PPD include (i) an algorithm that classifies proteins into groups based on their lengths, (ii) a user interface that calculates the percentage change (reduction or increase) in proteins of a specific length among the chosen organisms, (iii) a user interface designed for exploring the resulting classification in detail, (iv) online viewing of length distribution profiles, (v) online viewing of amino acid composition profiles, and (vi) links to COG and KEGG that provide further information on the orthology and metabolic role of the protein product, respectively. With this approach, users can obtain the classification results they need using the latest data available for the organisms of interest. However, users should examine their classification results carefully before interpreting them.
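The length-based grouping named in component (i) can be illustrated with a minimal sketch. It assumes fixed-width length classes of 100 amino acids; the function names, the bin width and the example lengths are hypothetical and are not taken from the PPD implementation, whose actual parameters are chosen by the user.

```python
from collections import Counter


def length_class(length, bin_width=100):
    """Return the inclusive (low, high) bounds of the class containing `length`.

    With bin_width=100 the classes are 1-100, 101-200, 201-300, and so on.
    """
    if length < 1:
        raise ValueError("protein length must be a positive integer")
    low = ((length - 1) // bin_width) * bin_width + 1
    return (low, low + bin_width - 1)


def classify(lengths, bin_width=100):
    """Count the proteins in each length class, ordered by class."""
    counts = Counter(length_class(n, bin_width) for n in lengths)
    return dict(sorted(counts.items()))


# Hypothetical protein lengths (in amino acids) for a single genome.
example_lengths = [45, 100, 101, 250, 299, 906]
table = classify(example_lengths)

for (low, high), count in table.items():
    print(f"{low}-{high} aa: {count} protein(s)")
```

Using inclusive bounds that start at 1 keeps a protein of exactly 100 residues in the 1-100 class rather than pushing it into the next one.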




System and methods

The annotated genome sequences, with the accompanying information on the positions of all protein-coding genes, were retrieved from the GenBank FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes). All bacterial, archaeal and eukaryotic genome sequences available as of 15 December 2003 were included. Coding region information was extracted and the length of each protein was calculated. In PPD, a protein length classification table is created dynamically from a user-specified set of organisms and parameters. The created table is cached in the database, and multiple genomes can be compared. Clustering is performed dynamically for proteins of similar size, and bar charts are created online according to the user's selection. Information on proteins of specific length groups is made available, together with descriptions of the protein products, the orthologous group category (based on COG) [Tatusov et al., 1997] and the amino acid compositions of the proteins. The protein products are linked to KEGG [Kanehisa et al., 2002] to provide detailed information on the metabolic roles of the proteins.
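As an illustration of how a protein length can be derived from coding region coordinates, the sketch below parses a GenBank-style CDS location string and converts the nucleotide span to a length in amino acids. It is a sketch under stated assumptions and not the procedure used to build PPD: it assumes that the annotated CDS is complete and includes the stop codon, and the function names and example locations are hypothetical.

```python
import re


def cds_nucleotide_length(location):
    """Sum the start..end spans in a GenBank-style location string.

    Handles simple forms such as '190..255', 'complement(190..255)' and
    'join(10..99,150..212)'. Partial-end markers ('<' and '>') are ignored.
    """
    spans = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    if not spans:
        raise ValueError(f"no start..end span found in {location!r}")
    return sum(int(end) - int(start) + 1 for start, end in spans)


def protein_length(location):
    """Protein length in amino acids for a complete CDS.

    Assumption: the CDS span includes the stop codon, so one codon is
    subtracted from the total.
    """
    nucleotides = cds_nucleotide_length(location)
    if nucleotides % 3 != 0:
        raise ValueError(f"CDS length {nucleotides} is not a multiple of 3")
    return nucleotides // 3 - 1


# Hypothetical CDS locations.
print(protein_length("190..255"))                           # 66 nt, 22 codons
print(protein_length("complement(join(10..99,150..212))"))  # 153 nt, 51 codons
```

When a translated protein sequence is supplied with the annotation, taking the length of that sequence directly avoids the stop-codon assumption.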




Implementation

WWW Interface

A basic Internet-based query interface provides access to PPD. The main PPD page (http://www.bii.a-star.edu.sg/~kishore/PPD/PPD.html) lists all the available genomes. After the user selects the organisms of interest, the protein length ranges and the output options, the interface displays classification/cluster tables with the genome size of each organism, the original GenBank file, the total number of proteins for the organism, and hyperlinks for retrieving proteins of specific length groups together with the amino acid composition of each protein, according to the genome annotation available from GenBank. The retrieved sequences can be further subjected to BLAST at NCBI. To obtain information on genome reduction or increase, the interface also allows selection of a reference organism. A reduction in the genome sizes of obligatory intracellular parasites is presented online as an example. The database is freely accessible to all through the WWW.
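The comparison against a reference organism amounts to a percentage change per length class. The following sketch shows one way to compute it, assuming that each organism is summarized as a mapping from length class to protein count; the function names and all counts are hypothetical and do not reproduce PPD output.

```python
def percent_change(reference_count, target_count):
    """Percentage change of the target relative to the reference.

    Negative values indicate a reduction, positive values an increase.
    Returns None when the reference has no proteins in the class, because
    the change is then undefined.
    """
    if reference_count == 0:
        return None
    return 100.0 * (target_count - reference_count) / reference_count


def compare_to_reference(reference_table, target_table):
    """Percentage change for every length class present in either table."""
    classes = sorted(set(reference_table) | set(target_table))
    return {
        c: percent_change(reference_table.get(c, 0), target_table.get(c, 0))
        for c in classes
    }


# Hypothetical protein counts per length class (bounds in amino acids).
reference = {(1, 100): 400, (101, 200): 1000, (901, 1000): 20}
target = {(1, 100): 100, (101, 200): 250, (901, 1000): 15}

changes = compare_to_reference(reference, target)
for (low, high), change in changes.items():
    print(f"{low}-{high} aa: {change:+.1f}%")
```

In this made-up example the two shorter classes shrink by the same proportion, while the longest class is reduced less, which is the kind of pattern the reference-organism option is meant to expose.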

Example

A large number of obligatory intracellular parasites have genomes that are considerably smaller than other completely sequenced genomes. There have been many reports on the evolution of these bacteria from larger genomes by genome deterioration. Comparison of the genome sequences of completely sequenced obligatory intracellular parasites - Chlamydia pneumoniae [Kalman et al., 1999], Chlamydia trachomatis [Stephens et al., 1998], Rickettsia conorii [Ogata et al., 2001] and Rickettsia prowazekii [Andersson et al., 1998] - shows considerably reduced genomes, indicating continual selective pressure for a minimal genome. Genome reduction in all the obligatory intracellular parasites versus E. coli (K-12) [Blattner et al., 1997], a free-living bacterium, is shown as an example. The flux, streamlining and elimination of genes in the genomes of obligate intracellular parasitic species represent an ongoing process, and could be a function of bacterial life-style and of the coding capacity of the genomes in terms of compactness. The genomes display marked similarities in patterns of protein length and frequency distribution. Interestingly, despite the reduction of genome size in obligatory intracellular parasites, long house-keeping proteins such as DNA gyrase (906 amino acids) are maintained and selected. Bearing in mind that the genome sequence data are mostly incomplete and limited to a few species, this analysis initiates a broad-scale survey involving obligatory intracellular parasites. The survey also helps to ask whether any generalizations about the life-style of these prokaryotes distinguish them from free-living bacteria. In general, the length of a protein sequence is determined by its function, and the wide variance in the lengths of an organism's proteins reflects the diversity of the specific functional roles of these proteins. However, additional evolutionary forces that affect the length of a protein may be revealed by studying the proteins in detail.
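To compare the shape of length distributions between genomes of very different size, counts per length class can first be converted to relative frequencies. The sketch below does this and then summarizes the difference between two profiles with the total variation distance. That distance is chosen here purely for illustration and is not described as part of PPD; the function names and all counts are hypothetical.

```python
def relative_frequencies(class_counts):
    """Convert protein counts per length class into fractions of the proteome."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("the table contains no proteins")
    return {c: n / total for c, n in class_counts.items()}


def profile_distance(freq_a, freq_b):
    """Total variation distance between two frequency profiles.

    0 means identical profiles; 1 means the profiles share no length class.
    """
    classes = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in classes)


# Hypothetical counts: a large genome and a reduced genome with the same shape.
large_genome = {(1, 100): 400, (101, 200): 1200, (201, 300): 400}
small_genome = {(1, 100): 100, (101, 200): 300, (201, 300): 100}

distance = profile_distance(
    relative_frequencies(large_genome),
    relative_frequencies(small_genome),
)
print(f"profile distance: {distance:.3f}")
```

A distance near zero for two genomes that differ severalfold in protein number would correspond to the observation that reduced genomes can retain a similar length and frequency pattern.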




Caveats

As annotation artefacts of genomic sequences are likely to be more frequent among shorter proteins than longer proteins [Das et al., 1997; Skovgaard et al., 2001], the authors advise caution while making inferences based on protein lengths and their compositions.




Conclusion

Whole-genome sequences of organisms are beginning to provide an opportunity for computer-based analysis that will allow us to highlight interesting features present in the genomic sequences. There is an unexpected level of structural plasticity in genomes in terms of genome size, gene number, number of proteins and protein length. Analysis of these factors in different genera and species will pave the way to develop hypotheses concerning the functional relevance of these features and then test these predictions in the laboratory. This will provide unique opportunities for comparative analyses between these organisms in order to identify important primary sequence features that translate into important phenotypic features and differences between organisms.




References