In Silico Biology 4, 0048 (2004); ©2004, Bioinformation Systems e.V.
1 G.N. Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Mall Road, Delhi 110 007, India.
2 School of Biotechnology, GGS Indraprastha University, Delhi 110 006, India.
3 Department of Biotechnology, JayPee Institute of Information Technology, Noida (UP) 201307, India.
* Corresponding author
Phone: +91-120-2400973-06 ext 261; Fax: +91-120-2400986; Email: tannistha_cbt@yahoo.com
Edited by E. Wingender; received August 01, 2004; revised October 25, 2004; accepted October 29, 2004; published November 02, 2004
In silico proteomics complements computational genomics in characterizing genome evolution. Here we examine cluster patterns in archaeal and bacterial proteomes using compositional properties of protein sequences, in contrast to the traditionally used sequence alignment procedures. Application of standard Principal Component Analysis (PCA) to the multi-dimensional data identified cluster patterns. Two types of cluster patterns exist in bacterial proteomes. Proteomes of type I have one major cluster with a few isolated points in space, revealing a largely homogeneous underlying compositional structure. In type II proteomes two clusters of protein distribution were discernible. The two clusters differ in size and are separated from each other, although the boundary is somewhat fuzzy. Proteins falling in the major cluster were labeled 'typical' and proteins of the minor cluster 'atypical'. The atypical proteins were mapped to the Clusters of Orthologous Groups (COGs). Species distribution in the COGs maps with respect to atypical proteins illuminated the biological relationships of extreme diversity among the archaeal members and of diversity among bacteria in relation to their niche. Amino acids that were over-represented in the atypical proteins had a higher biosynthetic cost than those of the 'typical' ribosomal proteins. However, archaea and bacteria economize by preferring the less costly amino acid to others closely related in chemical structure. Further, over-representation of serine in atypical proteins of archaeal members suggests re-examining these proteomes for the presence of serine/threonine phosphatases and kinases. Our computational procedure can serve as a useful addition to the existing tools for carrying out in silico proteomics.
Key words: proteomics, composition, clustering, PCA
While large-scale genome sequencing has provided us with the building blocks or 'bags of genes' of living organisms, higher order comparative analyses are required to obtain insights into physiological and biochemical processes. Recently such analyses of genomes have revealed important genomic patterns of predictive use with respect to gene expression (high and low expression), horizontal gene transfer (genetic exchange), phylogenetic relationships, intra-genomic cross-correlations (prediction of restriction sites for restriction enzymes) and networks of gene links (society of genes) [1-7].
Most biological reactions and processes are either catalyzed or aided by proteins. Computational analysis of proteomes (the set of encoded proteins in an organism) has recently emerged as a complementary activity to the analysis of genomic patterns. Proteomic patterns have been described with respect to gene duplication, domain shuffling, phylogenetic analysis, amino acid clustering, amino acid runs in protein families, structural and functional features [8-14]. In these cases, a large body of data was generated primarily through assessment of sequence similarities, feature identification and standard compositional analysis. Although the usual approach followed for sequence analysis and classification of proteins is based on sequence alignment [15], several approaches have been developed using amino acid composition. While classification of proteins based on sequence alignment constitutes micro-level analysis, the corresponding exercise based on compositional attributes offers analysis at macro-level that holds the promise to explore novel relationships and provide information for physiological and biochemical inferences. Indeed, the software PropSearch was developed for analyzing protein sequences using 144 compositional properties where conventional alignment tools fail to identify significantly similar sequences [16].
Amino acid compositional analysis has been shown to be informative in a number of studies. Composition has significant correlation to the biological characteristics of a protein such as its location (intracellular or extracellular), function (enzymes or non-enzymes), structural features, the presence of disulfide bonds and the folding type [17-20]. Peptide chains could be assigned as either cytoplasmic or extra-cellular, solely from the analysis of sequence composition [21]. Compositional differences between cytoplasmic and secretory proteins have been used to develop software for predicting secreted proteins by training artificial neural networks [22]. Similarly, methods for predicting transmembrane helices in integral membrane proteins use analysis of charge bias and hydrophobicity [23, 24].
Biological characteristics of proteins such as in vivo stability and structural features have also been found to correlate with distinct patterns of amino acid composition. It was observed that the occurrence of certain dipeptides was significantly different in the unstable proteins compared to the stable ones [25]. Low complexity protein sequences with lower proportion of distinct dipeptides have non-globular shapes compared to sequences of high complexity [26]. In a comparative analysis of the proteomes of enteric pathogens, sequence complexity analysis showed that species patterns exist in a limited set of low complexity proteins that parallels taxonomic classification [26, 27]. Species and strain specific differences in low complexity proteins could be correlated with their unique biological properties [28]. These reports demonstrate the importance and usefulness of compositional analysis of proteins.
In general, the methods used in the foregoing studies consisted of computing measures such as amino acid frequencies and residue pair frequencies, followed by applying scoring schemes and assessing statistical significance. While analyzing proteome data we can search for cluster patterns that are likely to reveal the underlying relationships between the encoded proteins. In this exercise a distance metric for each protein reflects its characteristics with respect to a given reference point. Proteins with similar characteristics are likely to cluster together. The most common set of distance metrics (known as Minkowski metrics) comprises the Euclidean distance, the Manhattan distance and the "sup" distance. These distance measures are used by clustering tools such as K-means clustering to decipher useful patterns in large volumes of data [29].
Earlier, application of Principal Component Analysis to proteome data of one representative each from the three kingdoms of life, archaea, bacteria and eukarya, revealed cluster patterns in these proteomes [30]. In this work, we present a more comprehensive, extended study comprising 23 bacterial and archaeal proteomes. The results show that the majority of the proteins in all proteomes cluster into a large dense group. Detailed inspection of features of protein distribution revealed cluster patterns that allowed us to classify proteomes into 2 basic types: I (and IA) and II. Type I (and IA) proteomes are characterized by a large dense cluster surrounded by a diffuse set of points, whereas proteomes of type II have two nearly distinct clusters and are distributed over a wide phylogenetic range encompassing archaea and bacteria. Our computational procedure is likely to generate interest in exploring new relationships among archaea and bacteria.
Protein attributes
A protein sequence can be characterized by a large number of attributes. However, most attributes are related to one another and are related to the fundamental attributes used in this work. Traditionally, proteins were characterized experimentally by pI, molecular mass, and their solubility.
The following attributes were used to characterize proteins in this study: %charge (at pH 7.2), %hydrophobicity, distinct dipeptide and different types of compositional distances. The attribute %charge relates to the pI of a protein. The %hydrophobic amino acid residues in a protein guide its sub-cellular location (cytoplasmic versus membrane compartments). The Euclidean compositional distance Dconstant provides a measure of compositional bias of a protein from a reference point of uniform amino acid composition. The Euclidean compositional distance Dphobic enables resolution of the proteins in terms of the type of hydrophobic amino acids in contrast to %hydrophobicity. The distinct dipeptide composition relates to the dipeptide character of a protein, which in turn has been found to be correlated with its in vivo stability and shape [23, 26]. These attributes are time-honored methods of revealing the basic characteristics of proteins although they have been used in different forms in the literature. The application of Minkowski metrics provides a sound mathematical framework for computational analysis described in this work.
Subsequently, cluster patterns were inferred using Principal Component Analysis (PCA) by treating these attributes as variates.
Variate 1 is the % of charged amino acids aspartic acid (D), glutamic acid (E), lysine (K) and arginine (R) in a given protein considering the ionization properties of their side chains at pH 7.2.
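% of Charge in a given protein is given by

$$\%\,\mathrm{Charge} = \frac{n_D + n_E + n_K + n_R}{L} \times 100 \qquad (1)$$

where n_D, n_E, n_K and n_R are the numbers of D, E, K and R residues and L is the number of amino acids in a given protein.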
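Variate 1 can be computed directly from residue counts. The following is a minimal Python sketch, assuming a plain count of D, E, K and R residues divided by L with no further pH-dependent weighting; the function name `percent_charge` is ours and is not taken from the original PERL programs.

```python
def percent_charge(sequence: str) -> float:
    """Percentage of D, E, K and R residues in a protein sequence.

    Assumption: every D, E, K and R side chain is counted as charged
    at pH 7.2; no fractional ionization is applied.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    charged = sum(seq.count(aa) for aa in "DEKR")
    return 100.0 * charged / len(seq)


# Toy peptide: K, D, E and R make up 4 of the 8 residues.
example = percent_charge("MKDERAAL")
```

Here `example` evaluates to 50.0, since half of the residues in the toy peptide are charged.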
Variate 2 is the % hydrophobicity of a given protein. Although there are more than 30 hydrophobic scales, we have used the following 4: Fauchere and Pliska [31], Hopp and Woods [32], Kyte and Doolittle [33] and Rose [34]. Both the Kyte and Doolittle and the Hopp and Woods scales are widely used. The Fauchere and Pliska scale was chosen because it is reported to be a direct method of classifying amino acids [35]. The Rose scale was found to be appropriate for assessing the degree of residue burial in a protein [36]. Each scale was used one at a time.
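% Hydrophobicity of a protein is given by

$$\%\,\mathrm{Hydrophobicity} = \frac{\sum_{x=1}^{z} O_x}{L} \times 100 \qquad (2)$$

where O_x is the observed number of the xth hydrophobic amino acid, z is the number of amino acid types classed as hydrophobic by the chosen scale, and L is the number of amino acids in the protein.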
Variate 3 is the compositional distance of a protein sequence. The distance is measured according to the formula (Minkowski metric r = 2, Euclidean distance):
$$D_{\mathrm{constant}} = \sqrt{\sum_{x=1}^{20} (O_x - E_x)^2} \qquad (3)$$
Ox is the observed number of xth amino acid in the protein and Ex is the expected number of xth amino acid in the same protein. In this case Ex was taken as L/20 considering all amino acids to be uniformly distributed. Dconstant/L is a normalized measure of distance for the protein.
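The compositional distance of Variate 3 can be sketched in a few lines of Python. This is an illustration under the definitions given above (Euclidean distance, Ex = L/20); the names `d_constant` and `AMINO_ACIDS` are ours, and the handling of non-standard residue codes is an assumption.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def d_constant(sequence: str, normalize: bool = True) -> float:
    """Euclidean distance of a protein's composition from a uniform one.

    Ex is taken as L/20 for every amino acid. Assumption: non-standard
    codes (e.g. X) contribute to L but not to any of the 20 counts.
    """
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    expected = length / 20.0
    distance = math.sqrt(
        sum((counts.get(aa, 0) - expected) ** 2 for aa in AMINO_ACIDS)
    )
    # Dconstant/L is the normalized measure used as the variate.
    return distance / length if normalize else distance
```

A sequence containing each of the 20 amino acids exactly once has a distance of zero, while a homopolymer lies at the far end of the scale.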
Variate 4 is the distance (analogous to Manhattan distance) of distinct dipeptides of a protein with respect to the maximum possible for a given length of the protein. The measure 'C' is:
|
(4) |
| (5) |
Nexp = Expected number of distinct dipeptides; Nobs = Observed number of distinct dipeptides. Nobs is computed in 2 overlapping frames and averaged. When L = 800, Nexp cannot exceed 400 because the total number of theoretically possible distinct dimers is 20 × 20 = 400. C/L is a normalized measure of the distance in distinct dipeptides.
Variate 5 is the hydrophobic distance (Euclidean distance) of a protein given by:
$$D_{\mathrm{phobic}} = \sqrt{\sum_{x=1}^{z} (O_x - E_x)^2} \qquad (6)$$
Ox is the observed number of xth hydrophobic amino acid in the protein and Ex is the expected number of xth hydrophobic amino acid in the same protein. In this case,
|
(7) |
The computation of Ex assumes uniform distribution of the different hydrophobic amino acid types; z is the number of different hydrophobic amino acids identified according to a particular hydrophobic scale. z will vary according to the hydrophobic scale used. In the Kyte and Doolittle and Rose scales z is 13, in Hopp and Woods and Fauchere and Pliska scales z is 11. Dphobic/L is a normalized measure of hydrophobic distance of a protein.
Statistics
Principal Component Analysis using correlation coefficients between the variates was carried out using the SAS package (SAS Institute Inc., USA). Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. This can also be viewed as a reduction in the dimensionality of the system.
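For readers without access to SAS, PCA on the correlation matrix can be sketched with NumPy. This is an illustrative stand-in, not the SAS procedure used in this work; the function name `pca_correlation` and the layout of the input (one row per protein, one column per variate) are assumptions.

```python
import numpy as np


def pca_correlation(data):
    """PCA based on the correlation matrix of the variates.

    data: array-like of shape (n_proteins, n_variates).
    Returns (scores, eigenvalues, cumulative proportion of variance).
    """
    X = np.asarray(data, dtype=float)
    # Standardizing each variate makes its covariance matrix equal to
    # the correlation matrix of the original variates.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # ascending order
    order = np.argsort(eigvals)[::-1]         # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                      # principal component scores
    cumulative = np.cumsum(eigvals / eigvals.sum())
    return scores, eigvals, cumulative
```

Plotting the first three columns of `scores` against one another gives one point per protein in the space of the first three principal components, and `cumulative[:3]` gives the proportion of variance they explain together.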
Sequences
Sequences were retrieved from the National Center for Biotechnology Information by anonymous ftp transfer from the subdirectory /genbank/genomes/. The sequences of the following 23 genomes were analyzed: Archaeal members - Methanococcus jannaschii; Pyrococcus abyssi; Pyrococcus horikoshii; Thermoplasma acidophilum; Bacterial members - Bacillus subtilis; Borrelia burgdorferi B31; Campylobacter jejuni NCTC 11168; Escherichia coli K12; Escherichia coli O157; Haemophilus influenzae RdKW20; Helicobacter pylori 26695; Mycoplasma genitalium; Mycoplasma pneumoniae; Mycobacterium tuberculosis cdc1551; Mycobacterium tuberculosis H37Rv; Neisseria meningitidis MC58; Pseudomonas aeruginosa PAO1; Thermotoga maritima; Treponema pallidum; Synechocystis sp. PCC 6803; Ureaplasma urealyticum; Vibrio cholerae; Xylella fastidiosa 9a5c.
Physical proximity analysis
Neighboring gene pairs of atypical protein coding genes were identified from the *.gbk files, which describe the genes in a set order usually numbered from the origin of replication. Annotation revisions were carefully examined using other complementary information from the *.gbk files.
Identification of paralogs
Pair-wise alignments of the atypical protein coding genes were carried out using CLUSTALW [37] with default parameters. Protein pairs with scores greater than 62 were examined further for paralog relationships by re-aligning the sequences using BLASTP [38] with the low complexity filter 'off' as suggested previously [39]. If the pair had >60% identity, they were considered paralogs.
Sequence analysis
Sequence analysis was carried out using the Wisconsin Package Version 10.1, Genetics Computer Group (GCG), Madison, Wisconsin [40].
Software
Software programs were written in PERL (Practical Extraction and Report Language) and run on a Silicon Graphics Origin 200 using the IRIX 6.5 operating system. Each record in the output data carries the 'gi number' of the protein as its identifier, and the fields in each record carry numerical data on the attributes of the protein.
Two Basic Cluster patterns
A standard approach to look for patterns in large volumes of data is the application of clustering tools. The numerical data on compositional attributes of protein sequences from complete genomes were subjected to Principal Component Analysis to infer cluster patterns. The SAS (Statistical Analysis System) software generates volcano plots in which each protein is represented as a single point in space. Examination of these plots from proteomes of 23 organisms revealed that each proteome could be classified into one of two basic types (Figure 1). We selected these 23 organisms on the basis of their biological characteristics summarized in Table 1. The cumulative proportions of variance in the data explained by the first one, two and three principal components (P1, P2 and P3) were 65-70%, 85-90% and greater than 90%, respectively. The type I cluster pattern consists of a dense area surrounded by a diffuse set of points in almost all directions in space. The type IA cluster is similar to type I except that the dense area is highly intensified. The type II cluster pattern consists of two nearly distinct clusters separated in space from each other, indicating the presence of multiple protein families with distinct compositional characteristics. Strain differences within the pair E. coli K12: E. coli O157 and the pair M. tuberculosis H37Rv: M. tuberculosis cdc1551 were not discernible in the cluster patterns.
Table 1: Completely sequenced microbial proteomes whose protein sequences were retrieved.
| Species | Strain | Biological importance |
| Borrelia burgdorferi | B31 | the aetiologic agent of Lyme disease |
| Bacillus subtilis | | produces an antibiotic, iturin, that is active against fungi |
| Campylobacter jejuni | NCTC 11168 | cause of bacterial food-borne diarrhoeal disease |
| Escherichia coli | K12 | model organism of bacteria |
| Escherichia coli | O157:H7 | causes haemorrhagic colitis |
| Haemophilus influenzae | Rd KW20 | causes pneumonia and meningitis |
| Helicobacter pylori | 26695 | causes peptic ulcer |
| Mycoplasma genitalium | | causes pelvic inflammatory disease |
| Methanococcus jannaschii | | methane-producing thermophile |
| Mycoplasma pneumoniae | | human pathogen, causing 'atypical pneumonia' |
| Mycobacterium tuberculosis | cdc 1551 | clinical strain |
| Mycobacterium tuberculosis | H37Rv | causes tuberculosis |
| Neisseria meningitidis | MC58 | causes meningitis and septicemia |
| Pyrococcus abyssi | | hyperthermophilic archaeon |
| Pseudomonas aeruginosa | PAO1 | opportunistic human pathogen causing otitis externa, septic arthritis, endocarditis, conjunctivitis, iridocyclitis, keratitis and iritis, corneal ulcer, panophthalmitis, etc. |
| Pyrococcus horikoshii | | hyperthermophilic archaeon |
| Synechocystis | PCC6803 | unicellular photosynthetic cyanobacterium |
| Thermoplasma acidophilum | | thermoacidophilic archaeon, low pH growth (pH 1-2) |
| Thermotoga maritima | | grows at 80 °C and metabolizes many simple and complex carbohydrates |
| Treponema pallidum | Nichols | the syphilis spirochete |
| Ureaplasma urealyticum | | causes infection of the lower respiratory tract |
| Vibrio cholerae | N16961 | causes cholera, an acute diarrheal illness |
| Xylella fastidiosa | 9a5c | causes economically important plant diseases such as Pierce's disease of grapevines |
Type I and Type IA proteomes
The mollicutes (mycoplasma and ureaplasma) and spirochaetes (Borrelia and Treponema) proteomes display the type I cluster pattern. Independent observations on overall protein distribution made using self organizing maps (SOM) on the basis of amino acid composition had revealed that B. burgdorferi, M. genitalium and M. pneumoniae show a close relationship [22]. However, in our method, the proteome of H. influenzae Rd exhibited characteristics of the type II cluster pattern and therefore was not grouped with other mesophiles (E. coli or Synechocystis sp.). Species with larger proteome sizes (greater than 1400 proteins) fall into either type IA or type II patterns (Figure 1). The similarity in cluster patterns of the mollicutes, the spirochaetes, the proteobacteria E. coli and X. fastidiosa, and the cyanobacterium Synechocystis shows high compositional homogeneity among the majority of proteins in their proteomes. The appearance of a few proteins separated from the dense cluster shows that their compositional properties are distinct.
Although observation of such a common feature at macro-level across phylogenetically distant taxa need not imply direct similarity in their biological or metabolic characteristics, some points of similarity between them at this level are worth noting.
Among the mollicutes, the genomes of mycoplasma and ureaplasma share common features such as small genome size, high A+T content of the genomic DNA (in the range of 59% to 74%) and conservation of gene order with functional roles in core biological processes (ribosomal proteins & oligopeptide transporters) [41, 42].
The two spirochaetes B. burgdorferi and T. pallidum share similarities in biological characteristics such as growth, similar genome size, clinical features and core biological functions such as genome replication and expression [43, 44]. The %A+T richness of the genomic DNA however varies widely between the two; Borrelia is 72% in A+T whereas Treponema is 47% in A+T.
Although the mollicutes are distantly related to the spirochaetes, gross similarities exist between them in the metabolic pathways [45, 46]. Of the genes on the linear chromosome of B. burgdorferi, 66% are transcribed away from the center of the chromosome and this transcriptional bias appears similar to that in M. genitalium and M. pneumoniae. Similarity between B. burgdorferi, T. pallidum and M. genitalium is apparent in the complement of the genes involved in DNA replication. In DNA repair mechanisms, B. burgdorferi and M. genitalium are similar.
Similarities between the mollicutes and spirochaetes have also been reported on transport capacity, the limited metabolic capacity and the lack of respiratory electron transport chain. T. pallidum is unable to synthesize enzyme co-factors, fatty acids and nucleotides de novo similar to B. burgdorferi and M. genitalium [46]. Some of the transport systems of T. pallidum are of similar specificity to those found in B. burgdorferi and M. genitalium. The absence of TCA cycle and oxidative phosphorylation, generation of reducing power through the oxidative branch of the pentose phosphate pathway of T. pallidum are similar to that in B. burgdorferi and M. genitalium. As in the case of mollicutes and spirochaetes, the gross level similarities in biological functions are mirrored in type IA proteomes. For instance, the transcriptional and translational machinery of X. fastidiosa is similar to that of E. coli [46]. These observations illuminate the underlying gross similarities in functional characteristics between species displaying similarity in their proteome patterns.
Type II proteomes: General features
Type II proteomes are widely distributed in the prokaryotes including archaea, proteobacteria, Bacillus sp. and actinomycetes (Figure 1). These proteomes display two nearly distinct clusters of protein distribution. To assess whether the cluster patterns of type II proteomes were sensitive to the hydrophobic scale used, we re-examined them by varying the hydrophobic scale. Sample results for H. influenzae Rd are shown in Figure 2. It is apparent that the cluster pattern is invariant in all 4 hydrophobic scales examined. Similar observations were made with other proteomes (data not shown). Further analysis was carried out using the Kyte and Doolittle scale, as it is a widely used scale of hydrophobicity. We refer to the major cluster, which contains the larger number of proteins, as 'typical' and to the minor cluster as 'atypical'.
Figure 2: Cluster patterns are invariant in all 4 hydrophobic scales examined. The whole proteome of H. influenzae Rd [53] was processed as described in methods. The projection of volcanic plot is shown in 4 scales labeled below the corresponding plot. The angular orientation of the cube has been varied slightly from one scale to another to maximize clarity for visual inspection. Arrows point to the atypical cluster of proteins.
A summary of the number of proteins in the atypical cluster and their overall functional distribution in different proteomes is shown in Table 2. Although larger proteomes tend to have a higher number of atypical proteins, this relationship does not hold in a few cases. M. tuberculosis strains are particularly unusual in having a low number of atypical proteins (1.3%-1.7% of the proteome). P. horikoshii has a high number of atypical proteins (15.5% of the proteome) for its proteome size.
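The percentages quoted above follow from the counts in Table 2. A short Python check, using only the three rows mentioned in the text (the dictionary and function names are ours):

```python
# (number of atypical proteins, proteome size), copied from Table 2
TABLE2_COUNTS = {
    "M. tuberculosis H37Rv": (73, 4186),
    "M. tuberculosis cdc1551": (52, 3926),
    "P. horikoshii": (279, 1800),
}


def atypical_percent(n_atypical: int, proteome_size: int) -> float:
    """Atypical proteins as a percentage of the whole proteome."""
    return 100.0 * n_atypical / proteome_size


for species, (n_atypical, size) in TABLE2_COUNTS.items():
    print(f"{species}: {atypical_percent(n_atypical, size):.1f}%")
```

The same calculation can be applied to the remaining rows of Table 2 to compare proteomes on a common scale.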
It is evident that a sizable fraction of the atypical proteins belongs to the transport and membrane associated (TM) functional class and to proteins of unknown function (hypothetical proteins) in all the 14 proteomes examined except the mycobacteria. The representation of the cellular processes (CP) and characteristic (CH) classes is low. The uniformity in overall functional composition of the atypical proteins among representatives from widely separated phylogenetic taxa with widely varying genomic base composition indicates that the transport and membrane proteins have distinct compositional characteristics compared with the rest of the proteins. These observations re-confirm and extend the inferences drawn from previous studies using smaller data sets [17-21].
Table 2: Size of 'atypical' cluster and its overall functional composition in different species (a).
| Species | CP (b) | TM (b) | CH (b) | H | Number of atypical proteins | Proteome size (c) |
| B. subtilis | 10.7 | 30.9 | 9.9 | 48.5 | 625 | 4111 |
| C. jejuni | 18.9 | 67.4 | 10.9 | 2.9 | 175 | 1633 |
| H. influenzae RdKW20 | 13.4 | 44.1 | 7.5 | 34.9 | 186 | 1713 |
| H. pylori | 13.8 | 52.6 | 11.2 | 22.4 | 116 | 1575 |
| M. jannaschii | 12.4 | 17.4 | 4.5 | 65.7 | 178 | 1728 |
| M. tuberculosis H37Rv | 0.0 | 0.0 | 100.0 | 0.0 | 73 | 4186 |
| M. tuberculosis cdc1551 | 0.0 | 0.0 | 100.0 | 0.0 | 52 | 3926 |
| N. meningitidis MC58 | 17.0 | 26.1 | 10.2 | 46.6 | 176 | 3441 |
| P. abyssi | 8.5 | 18.6 | 7.4 | 65.4 | 188 | 1768 |
| P. horikoshii | 3.2 | 10.8 | 4.3 | 81.7 | 279 | 1800 |
| P. aeruginosa PA01 | 9.0 | 44.6 | 4.6 | 41.8 | 612 | 5566 |
| T. acidophilum | 6.1 | 76.1 | 7.4 | 10.4 | 163 | 1481 |
| T. maritima | 9.5 | 34.3 | 7.0 | 49.3 | 201 | 1857 |
| V. cholerae N16961 | 15.6 | 34.6 | 9.8 | 40.0 | 315 | 2741 |
(a) The percentage fraction of atypical proteins in each functional class is shown. The atypical proteins could be clearly identified in 14 out of 23 proteomes analyzed in this work.
(b) Functional class codes: CP denotes CELLULAR PROCESSES, TM denotes TRANSPORT AND MEMBRANE ASSOCIATED, CH denotes CHARACTERISTIC, H denotes 'HYPOTHETICAL' according to a modified classification scheme reported previously [27]. In this method, the CELLULAR PROCESSES superclass has proteins of the INFORMATION class (comprising replication, transcription and translation) and of the metabolism class. Proteins of the Transport and Membrane associated classes were combined into the TRANSPORT AND MEMBRANE ASSOCIATED superclass. After classifying the atypical proteins into the CP and TM superclasses, the remainder consisted of several proteins that are characteristic of the unique properties of a given species. These proteins were collectively placed in the CHARACTERISTIC superclass. This scheme aids in rapid comparative analysis, especially when the numbers of proteins in the individual functional classes defined by Riley [54] are small. The annotation provided with the *.faa file from the NCBI was used for classification.
(c) Number of encoded proteins in the respective genomes.
Functional composition of 'atypical' proteins: The COGs maps
The cluster of orthologous groups (COGs) database offers a systematic route to examine the functional characteristics of proteins across a wide phylogenetic spectrum [46]. Examination of the distribution of atypical cluster proteins in the 20 COGs classes (excluding function unknown, general function prediction, not in COGs) revealed 13 functional classes with sizable representation (see legend to Figure 4 for functional class codes). In the rest of the classes the representation was either absent or less than 1% in most species.
For each of the 14 species, the fraction of proteins of a given COGs class appearing as atypical was plotted against the fraction of atypical proteins appearing in the same COGs for 13 COGs. Sample plots for energy production and conversion genes (C) and Carbohydrate transport and metabolism genes (G) are shown in Figure 3. The former measure estimates the proportion of proteins in COGs with atypical amino acid composition whereas the latter measure estimates the tendency of proteins with atypical amino acid composition to appear in the same COGs.
Species distribution patterns of the atypical proteins in the COGs indicate that the members of archaea exhibit greater diversity whereas the members of the bacteria show more cohesion. The two chromosomes (I and II) of Vibrio cholerae have very different distribution patterns supporting the notion that Chromosome I genes mainly adapt the organism for growth in the intestine whereas Chromosome II genes are essential within environmental niches [48]. None of the atypical proteins of mycobacteria appear in any of the COGs and were therefore excluded and treated as a special case. For quantitative comparisons we examined the proportions of atypical proteins in each class of COGs for all species by computing their Euclidean radial distance from the origin in all 13 classes of COGs maps. Subsequently, comparison of species distributions between the different classes of COGs was carried out by estimating the variance of the distributions. High variance among archaea and bacteria was observed in G (Carbohydrate transport and metabolism), P (Inorganic ion transport and metabolism), and V (Defense mechanisms) (Figure 4a and b). Also N (Cell motility) COGs showed high variance amongst archaea.
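The quantitative comparison described above can be sketched as follows. This is an illustration only: the function names are ours, the use of the population variance is an assumption (the text does not say which estimator was used), and the coordinates in the usage lines are invented placeholders rather than values from this study.

```python
import math
from statistics import pvariance


def radial_distance(x: float, y: float) -> float:
    """Euclidean distance of a point (x, y) on a COGs map from the origin."""
    return math.hypot(x, y)


def class_variances(points_by_class):
    """Variance of species radial distances within each COGs class.

    points_by_class maps a COGs class code to {species: (x, y)}, where
    x and y are the two fractions plotted for that species.
    Assumption: population variance is used.
    """
    return {
        cog: pvariance([radial_distance(x, y) for x, y in species.values()])
        for cog, species in points_by_class.items()
    }


# Placeholder coordinates for two hypothetical species in class G.
demo = {"G": {"species_1": (3.0, 4.0), "species_2": (6.0, 8.0)}}
demo_variances = class_variances(demo)
```

A COGs class in which species lie at very different distances from the origin yields a large variance, which is the sense in which classes G, P and V are described as highly variable.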
Figure 4: Species Distribution of Archaeal (a) and Bacterial (b) members with respect to atypical proteins (measured by Euclidean radial distance from the origin) in 13 classes of COGs. Species abbreviations same as in Figure 3. Note the difference in variance between the species distributions in the COGs. The COGs classes are C (Energy production and conversion), D (Cell division and chromosome partitioning), E (Amino acid transport and metabolism), F (Nucleotide transport and metabolism), G (Carbohydrate transport and metabolism), H (Coenzyme metabolism), I (Lipid metabolism), M (Cell wall/membrane biogenesis), N (Cell motility), O (Posttranslational modification, protein turnover, chaperones), P (Inorganic ion transport and metabolism), U (Intracellular trafficking and secretion), V (Defense mechanisms).
Since TM proteins constitute a sizable share of the atypical proteins, differences in the proportion of atypical proteins among various species in the different COGs are in part due to the varying proportion of atypical TM proteins. TM proteins are at the interface between the cell and its natural niche, and the diversity exhibited by the various species perhaps relates to their life styles [48]. The high variance in species distributions with respect to atypical proteins observed in the carbohydrate and inorganic ion functional groups may relate to the diverse substrate utilization capabilities of Bacteria and Archaea. High variance in species distributions in the defense mechanisms class is indicative of the diverse niches occupied by bacterial and archaeal members. Taken together, these observations mirror the biodiversity of these microbes.
The special case of mycobacteria
The atypical proteins of the mycobacteria all belong to the PE_PGRS and PPE families of proteins. These proteins are rich in glycine and are composed of reiterated sequences [11]. In addition, the PE_PGRS proteins of mycobacteria are unique and do not have homologs in other species. These results contrast with those observed in other bacteria and indicate an underlying compositional homogeneity among the rest of the proteins in M. tuberculosis. Alternatively, it is possible that the PE_PGRS proteins, with their very unusual compositional characteristics, dominate the analysis and obscure the general compositional differences that separate transporters, membrane proteins and hypothetical proteins from the typical proteins. To test this possibility, we removed the PE_PGRS and PPE family proteins from the data file and re-examined the protein distribution through PCA. The results are shown in Figure 5. It is apparent that the two-cluster pattern vanishes; instead, the in silico modified M. tuberculosis proteome has a homogeneous compositional structure analogous to type IA proteomes. This trend appears unique to mycobacteria. Thus the special features of mycobacterial proteomes arise from the inherent compositional homogeneity of the majority of their proteins.
The over-represented amino acids in 'atypical proteins' and their cost of biosynthesis
Assessment of over-representation of amino acids in atypical proteins relative to typical proteins is a complicated task because the typical cluster is large and is likely to embody enormous variability. Therefore we need a good reference point within the typical cluster. The ribosomal proteins of each species serve as a good reference point for comparative analysis of the 'atypical' proteins since they are universal, appear as typical proteins in all proteomes, are part of core biological functions of the cell and are also highly expressed [1].
Amino acids that were over-represented in the atypical proteins relative to the ribosomal proteins in the 14 species are shown in Table 3. The amino acids are arranged in order of decreasing metabolic cost of biosynthesis in high-energy phosphate units [39]. Since a significant proportion of the atypical proteins are transporters and membrane proteins, the over-representation of the hydrophobic amino acids phenylalanine and leucine is characteristic of the transmembrane regions in these proteins. The small hydrophobic and sulphur-containing hydrophobic amino acids are not over-represented in the atypical proteins.
Among the aliphatic amino acids, leucine appears to be used in preference to isoleucine, indicating a bias towards low cost of biosynthesis. Similarly, phenylalanine is preferred to tryptophan. A balance between selective over-representation of certain amino acids in the 'atypical' proteins and the biochemical cost is evident in bacterial physiology. It is known that the abundantly present ribosomal proteins use a higher frequency of less costly amino acids [39]. From this standpoint, it would appear that the 'atypical' proteins might not be present in abundance since they use more costly amino acids.
The over-representation of serine in the atypical proteins of archaea and of some bacteria (B. subtilis, T. maritima and M. tuberculosis) suggests that some of these proteins could be involved in signaling through phosphorylation/dephosphorylation of serine/threonine residues. Indeed, membrane-bound serine/threonine kinases have been identified and characterized in B. subtilis and M. tuberculosis [49-51]. Serine/threonine phosphatases and kinases have so far not been described in archaea and are worth exploring.
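The cost argument can be checked directly against the metabolic cost row of Table 3. The sketch below is only an arithmetic illustration: the `COST` dictionary copies the values listed in Table 3 (in ~P units), while the helper name `cheaper` is hypothetical.

```python
# Costs in high-energy phosphate units (~P), copied from the last row of Table 3.
COST = {"W": 74.3, "F": 52.0, "Y": 50.0, "M": 34.3, "I": 32.3, "L": 27.3,
        "P": 20.3, "T": 18.7, "Q": 16.3, "N": 14.7, "S": 11.7, "A": 11.7,
        "G": 11.7}

def cheaper(a, b):
    """Return the less costly of two residues and the saving in ~P."""
    lo, hi = sorted((a, b), key=COST.get)
    return lo, round(COST[hi] - COST[lo], 1)

print(cheaper("L", "I"))  # ('L', 5.0)
print(cheaper("F", "W"))  # ('F', 22.3)
```

On these figures, using leucine instead of isoleucine saves about 5 ~P per residue, and using phenylalanine instead of tryptophan saves about 22 ~P per residue.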
| Table 3: Amino acids over-represented in the atypical cluster relative to ribosomal proteins of the respective proteomes (a). |
| Species | W | F | Y | M | I | L | P | T | Q | N | S | A | G |
| ARCHAEA | |||||||||||||
| P. abyssi | + | + | + | + | |||||||||
| P. horikoshii | + | + | |||||||||||
| T. acidophilum | + | + | + | + | + | + | + | + | |||||
| M. jannaschii | + | + | + | ||||||||||
| BACTERIA | |||||||||||||
| B. subtilis | + | + | + | + | + | + | |||||||
| T. maritima | + | + | + | + | + | ||||||||
| M. tuberculosis cdc | + | + | + | + | + | + | + | + | + | ||||
| M. tuberculosis H37Rv | + | + | + | + | + | + | + | ||||||
| H. influenzae RdKW20 | + | + | + | + | + | + | |||||||
| N. meningitidis MC58 | + | + | + | + | |||||||||
| V. cholerae N16961 | + | + | |||||||||||
| C. jejuni | + | + | |||||||||||
| H. pylori | + | + | + | ||||||||||
| Metabolic cost (b) | 74.3 | 52 | 50 | 34.3 | 32.3 | 27.3 | 20.3 | 18.7 | 16.3 | 14.7 | 11.7 | 11.7 | 11.7 |
(a) Amino acids that were statistically over-represented in the atypical proteins relative to the ribosomal (typical) proteins were identified according to the rule Mai > Mri + 3SD, where Mai is the mean composition of a given amino acid 'i' in the atypical cluster, Mri is the mean composition of the same amino acid 'i' in the ribosomal proteins, and SD is the standard deviation. '+' indicates a significantly over-represented amino acid in a given proteome. Shaded areas are not statistically significant.
(b) The cost of synthesizing an amino acid from a precursor in high-energy phosphate units (~P) as computed previously [39]. These cost calculations are with reference to the same precursors used for amino acid biosynthesis in E. coli and B. subtilis.
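The rule Mai > Mri + 3SD can be written out as a short sketch. The text does not say over which set of proteins SD is computed; the code below assumes it is the sample standard deviation of each amino acid's composition across the ribosomal proteins, and this assumption should be checked against the Methods. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def over_represented(atypical, ribosomal, amino_acids, k=3.0):
    """Apply Mai > Mri + k*SD column by column.

    atypical, ribosomal: arrays of shape (n_proteins, n_amino_acids)
    holding per-protein amino acid compositions.
    Assumption: SD is taken across the ribosomal proteins.
    """
    m_a = atypical.mean(axis=0)          # Mai for every amino acid i
    m_r = ribosomal.mean(axis=0)         # Mri for every amino acid i
    sd = ribosomal.std(axis=0, ddof=1)   # sample standard deviation
    flags = m_a > m_r + k * sd
    return [aa for aa, flag in zip(amino_acids, flags) if flag]

# Toy compositions for three amino acids (invented, not data from Table 3).
amino_acids = ["G", "L", "S"]
ribosomal = np.array([[0.08, 0.09, 0.05],
                      [0.09, 0.10, 0.06],
                      [0.10, 0.11, 0.07]])
atypical = np.array([[0.30, 0.10, 0.06],
                     [0.28, 0.11, 0.07]])

hits = over_represented(atypical, ribosomal, amino_acids)
```

In this toy case only glycine clears the threshold, so only glycine would receive a '+' in a table laid out like Table 3.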
Physical proximity analysis
The number of neighboring pairs of atypical protein coding genes bears a linear relationship to the total number of proteins in the atypical cluster across various species of archaea and bacteria (squared Pearson correlation coefficient, r² = 0.96) (Figure 6). However, no relationship is evident between the number of paralogs and the number of neighboring gene pairs. These observations show that the atypical protein coding genes tend to cluster in the 'genomescape' of different species but are not paralogs, suggesting that the neighbouring gene pairs coding for atypical proteins have not arisen from gene duplication. It is known that genes coding for proteins that assemble into a multi-subunit complex or catalyze sequential biochemical transformations in the same metabolic pathway are generally present as neighbours as part of operons under common regulation [52]. It is therefore possible that the neighbouring atypical proteins have either similar expression profiles or close links in functional pathways.
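The proximity count and its correlation with cluster size can be sketched as follows. The criterion used to call two genes neighbours is not restated in this section, so the code assumes the simplest one: two genes are neighbours if they are immediately adjacent in genome order, ignoring strand and the circularity of the chromosome. Function names, gene identifiers and the per-species numbers are hypothetical.

```python
import numpy as np

def count_neighbor_pairs(gene_order, atypical_ids):
    """Count adjacent gene pairs in which both genes encode atypical proteins."""
    flags = [g in atypical_ids for g in gene_order]
    return sum(a and b for a, b in zip(flags, flags[1:]))

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy genome: g2-g3 and g3-g4 are the only adjacent atypical pairs.
gene_order = ["g1", "g2", "g3", "g4", "g5", "g6"]
atypical_ids = {"g2", "g3", "g4", "g6"}
pairs = count_neighbor_pairs(gene_order, atypical_ids)

# Toy per-species values (invented): atypical cluster size vs. neighbour pairs.
cluster_sizes = [100, 200, 300, 400]
neighbor_pairs = [12, 22, 35, 41]
r = pearson_r(cluster_sizes, neighbor_pairs)
r_squared = r ** 2
```

Applying the same count to the paralog pairs of each species, and correlating that against `neighbor_pairs`, would give the second comparison described above.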
Comparative genomic analysis at the macro-level using dinucleotide abundance values had earlier indicated compositional homogeneity within a species whereas significant differences exist between species [2]. Indeed, this observation has enabled the construction of phylogenetic relationships using whole genome information. In the present report, we describe a macro-level analysis of the proteomes that reveals clear heterogeneity in the compositional structure of proteomes of size greater than 1400 proteins.
The most common structure observed in large proteomes is the presence of typical and atypical proteins. A sizable number of atypical proteins are transporters and membrane proteins. Species vary with respect to the numbers of atypical proteins in the COGs, perhaps reflecting the diversity of their lifestyles. The typical and atypical proteins differ in their compositional characteristics, that is, in the proportions of the different types of amino acids. Our observations on complete proteomes can be used to rapidly identify and delineate the 'typical' and 'atypical' compartments of proteins and the genes coding for these proteins. The typical compartment is most likely important for cell growth and physiology, whereas the atypical compartment caters to the interactions between the cell and its environment.
There are, however, some common principles of bacterial physiology that apply uniformly to all compartments of proteins of bacterial genomes. Bacteria tend to economize by optimizing the cost of biosynthesis of 'typical' and 'atypical' proteins. Gene neighbours that code for atypical proteins and are not paralogs may therefore be similarly expressed or linked in functional pathways. They may even have been acquired through horizontal transfer, since horizontally transferred genes usually occur as neighbours. In this context, it would be of interest to elucidate the role of several hypothetical atypical proteins.
The biological significance of this work becomes apparent as several proteins of medical importance have distinct amino acid composition [55, 56]. At present there is no general approach to examine proteomes to identify proteins with distinct (or unique) amino acid composition. Earlier, evidence of horizontal gene transfer in E. coli speciation was shown by a multivariate analysis method (factorial correspondence analysis), which is closely related to PCA, for comparative analysis [57-59]. That method was used to classify 780 genes of E. coli by their codon usage pattern. In this work we have developed an application of a multivariate analysis method (PCA) for proteome analysis that has the potential to identify proteins whose compositional characteristics are distinct from those of the bulk of proteins within an organism. Our method is based on standard formulae for computing amino acid distances and representing hydrophobic characteristics, with subsequent processing by Principal Component Analysis, and is applicable equally well to any organism. Compared with traditional approaches, which are laborious, time consuming and expensive, this method has the clear advantage of rapidly pointing out a subset of the total proteins in an organism. The method uses real protein data and standard mathematical formulae to narrow down the search space, so that investigators can extract potentially useful leads from complete proteome data.
Our procedure can be used to rapidly view complete proteomes for comparative analysis. The methods described in this work are standard and can be readily reproduced. Since the boundary between the atypical cluster and the typical cluster in type II proteomes is somewhat fuzzy, minor differences may arise on reproduction. However, the majority of the proteins in the two clusters can be consistently and easily identified. Our procedure can complement the existing tools for computational proteomics.
TN is a recipient of a fellowship from the Council of Scientific and Industrial Research. TN thanks Dr. C.B-Rao for help with the SAS software and advice on Principal Component Analysis.