In Silico Biology 4, 0048 (2004); ©2004, Bioinformation Systems e.V.  


Clusters of proteins in archaeal and bacterial proteomes using compositional analysis

Tannistha Nandi1,2,3*, Samir K. Brahmachari1, Krishnamoorthy Kannan2 and Srinivasan Ramachandran1




1 G.N. Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Mall Road, Delhi 110 007, India.
2 School of Biotechnology, GGS Indraprastha University, Delhi 110 006, India.
3 Department of Biotechnology, JayPee Institute of Information Technology, Noida (UP) 201307, India.



*  Corresponding author
   Phone: +91-120-2400973-06 ext 261; Fax: +91-120-2400986; Email: tannistha_cbt@yahoo.com





Edited by E. Wingender; received August 01, 2004; revised October 25, 2004; accepted October 29, 2004; published November 02, 2004



Abstract

In silico proteomics complements computational genomics in characterizing genome evolution. Here we examine cluster patterns in archaeal and bacterial proteomes using compositional properties of protein sequences, in contrast to the traditionally used sequence alignment procedures. Application of standard Principal Component Analysis to the multi-dimensional data identified cluster patterns. Two types of cluster patterns exist in bacterial proteomes. Proteomes of type I have one major cluster with a few isolated points in space, revealing an underlying largely homogeneous compositional structure. In type II proteomes two clusters of protein distribution were discernible. The two clusters differ in size and were separated from each other, although the boundary was somewhat fuzzy. Proteins falling in the major cluster were labeled as 'typical' and proteins of the minor cluster were called 'atypical'. The atypical proteins were mapped to Clusters of Orthologous Groups (COGs). Species distribution in COGs maps with respect to atypical proteins illuminated the biological relationships of extreme diversity among the archaeal members and of diversity among bacteria in relation to their niche. Amino acids that were over-represented in the atypical proteins had a higher biosynthetic cost compared to 'typical' ribosomal proteins. However, archaea and bacteria economize by preferring the less costly amino acid to others closely related in chemical structure. Further, over-representation of serine in atypical proteins of archaeal members suggests re-examining these proteomes for the presence of Serine/Threonine phosphatases and kinases in Archaea. Our computational procedure can serve as a useful addition to the existing tools for carrying out in silico proteomics.

Key words: proteomics, composition, clustering, PCA



Introduction

While large-scale genome sequencing has provided us with the building blocks or 'bags of genes' of living organisms, higher order comparative analyses are required to obtain insights into physiological and biochemical processes. Recently such analyses of genomes have revealed important genomic patterns of predictive use with respect to gene expression (high and low expression), horizontal gene transfer (genetic exchange), phylogenetic relationships, intra-genomic cross-correlations (prediction of restriction sites for restriction enzymes) and networks of gene links (society of genes) [1-7].

Most biological reactions and processes are either catalyzed or aided by proteins. Computational analysis of proteomes (the set of encoded proteins in an organism) has recently emerged as a complementary activity to the analysis of genomic patterns. Proteomic patterns have been described with respect to gene duplication, domain shuffling, phylogenetic analysis, amino acid clustering, amino acid runs in protein families, structural and functional features [8-14]. In these cases, a large body of data was generated primarily through assessment of sequence similarities, feature identification and standard compositional analysis. Although the usual approach followed for sequence analysis and classification of proteins is based on sequence alignment [15], several approaches have been developed using amino acid composition. While classification of proteins based on sequence alignment constitutes micro-level analysis, the corresponding exercise based on compositional attributes offers analysis at macro-level that holds the promise to explore novel relationships and provide information for physiological and biochemical inferences. Indeed, the software PropSearch was developed for analyzing protein sequences using 144 compositional properties where conventional alignment tools fail to identify significantly similar sequences [16].

Amino acid compositional analysis has been shown to be informative in a number of studies. Composition has significant correlation to the biological characteristics of a protein such as its location (intracellular or extracellular), function (enzymes or non-enzymes), structural features, the presence of disulfide bonds and the folding type [17-20]. Peptide chains could be assigned as either cytoplasmic or extra-cellular, solely from the analysis of sequence composition [21]. Compositional differences between cytoplasmic and secretory proteins have been used to develop software for predicting secreted proteins by training artificial neural networks [22]. Similarly, methods for predicting transmembrane helices in integral membrane proteins use analysis of charge bias and hydrophobicity [23, 24].

Biological characteristics of proteins such as in vivo stability and structural features have also been found to correlate with distinct patterns of amino acid composition. It was observed that the occurrence of certain dipeptides was significantly different in the unstable proteins compared to the stable ones [25]. Low complexity protein sequences with lower proportion of distinct dipeptides have non-globular shapes compared to sequences of high complexity [26]. In a comparative analysis of the proteomes of enteric pathogens, sequence complexity analysis showed that species patterns exist in a limited set of low complexity proteins that parallels taxonomic classification [26, 27]. Species and strain specific differences in low complexity proteins could be correlated with their unique biological properties [28]. These reports demonstrate the importance and usefulness of compositional analysis of proteins.

In general, the methods used in the foregoing studies consisted of computing measures such as amino acid frequencies and residue pair frequencies, followed by applying scoring schemes and assessing statistical significance. While analyzing proteome data we can search for cluster patterns that are likely to reveal the underlying relationships between the encoded proteins. In this exercise a distance metric for each protein reflects its characteristics with respect to a given reference point. Proteins with similar characteristics are likely to cluster together. The most common set of distance metrics (known as Minkowski metrics) comprises the Euclidean distance, the Manhattan distance and the "sup" distance. These distance measures are used by clustering tools such as K-means clustering to decipher useful patterns in large volumes of data [29].
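The three Minkowski metrics named above can be illustrated with a minimal Python sketch. The function name and the example vectors are hypothetical and are not taken from the authors' programs, which were written in Perl.

```python
from typing import Sequence


def minkowski_distance(a: Sequence[float], b: Sequence[float], r: float) -> float:
    """Minkowski distance of order r between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if r == float("inf"):
        # "sup" distance: the largest absolute coordinate difference
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


# Hypothetical attribute vectors for a protein and a reference point.
point = [1.0, 2.0, 3.0]
reference = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(point, reference, 1)        # r = 1
euclidean = minkowski_distance(point, reference, 2)        # r = 2
sup = minkowski_distance(point, reference, float("inf"))   # limit of large r
```

The order r = 2 is the case used for the compositional distances in this work; r = 1 corresponds to the Manhattan-like dipeptide measure.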

Earlier, application of Principal Component Analysis to proteome data of one representative each from the three kingdoms of life, archaea, bacteria and eukarya, revealed cluster patterns in these proteomes [30]. In this work, we present a more comprehensive study comprising 23 bacterial and archaeal proteomes. The results show that the majority of the proteins in all proteomes cluster into a large dense group. Detailed inspection of features of protein distribution revealed cluster patterns that allowed us to classify proteomes into two basic types: I (and IA) and II. Type I (and IA) proteomes are characterized by a large dense cluster surrounded by a diffuse set of points, whereas proteomes of type II have two nearly distinct clusters and are distributed over a wide phylogenetic range encompassing archaea and bacteria. Our computational procedure is likely to generate interest in exploring new relationships among archaea and bacteria.



Materials and methods


Protein attributes

A protein sequence can be characterized by a large number of attributes. However, most attributes are related to one another and are related to the fundamental attributes used in this work. Traditionally, proteins were characterized experimentally by pI, molecular mass, and their solubility.

The following attributes were used to characterize proteins in this study: %charge (at pH 7.2), %hydrophobicity, distinct dipeptide composition and different types of compositional distances. The attribute %charge relates to the pI of a protein. The %hydrophobic amino acid residues in a protein guide its sub-cellular location (cytoplasmic versus membrane compartments). The Euclidean compositional distance Dconstant provides a measure of the compositional bias of a protein from a reference point of uniform amino acid composition. The Euclidean compositional distance Dphobic enables resolution of the proteins in terms of the type of hydrophobic amino acids, in contrast to %hydrophobicity. The distinct dipeptide composition relates to the dipeptide character of a protein, which in turn has been found to be correlated with its in vivo stability and shape [23, 26]. These attributes are time-honored means of revealing the basic characteristics of proteins, although they have been used in different forms in the literature. The application of Minkowski metrics provides a sound mathematical framework for the computational analysis described in this work.

Subsequently, cluster patterns were inferred using Principal Component Analysis (PCA) by treating these attributes as variates.

Variate 1 is the % of charged amino acids aspartic acid (D), glutamic acid (E), lysine (K) and arginine (R) in a given protein considering the ionization properties of their side chains at pH 7.2.

% of Charge in a given protein is given by

\[ \%\,\mathrm{Charge} = \frac{N_D + N_E + N_K + N_R}{L} \times 100 \qquad (1) \]

where L is the number of amino acids in a given protein.
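A minimal Python sketch of variate 1 follows. It assumes that the percentage is simply the count of D, E, K and R residues divided by L and multiplied by 100; the function name and the example peptide are hypothetical.

```python
CHARGED_RESIDUES = frozenset("DEKR")  # Asp, Glu, Lys, Arg, as listed for variate 1


def percent_charge(sequence: str) -> float:
    """Percentage of residues in `sequence` that are D, E, K or R.

    ASSUMPTION: %charge = 100 * (count of D, E, K, R) / L.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    charged = sum(1 for residue in seq if residue in CHARGED_RESIDUES)
    return 100.0 * charged / len(seq)


example = "MKDELRAAGG"  # hypothetical 10-residue peptide with K, D, E and R
example_charge = percent_charge(example)
```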

Variate 2 is the % hydrophobicity of a given protein. Although there are more than 30 hydrophobic scales, we have used the following 4: Fauchere and Pliska [31], Hopp and Woods [32], Kyte and Doolittle [33] and Rose [34]. Both the Kyte and Doolittle and the Hopp and Woods scales are widely used. The Fauchere and Pliska scale was chosen because it is reported to be a direct method of classifying amino acids [35]. The Rose scale was found to be appropriate for assessing the degree of residue burial in proteins [36]. Each scale was used one at a time.

% Hydrophobicity of a protein is given by

(2)

Variate 3 is the compositional distance of a protein sequence. The distance is measured according to the formula (Minkowski metric r = 2, Euclidean distance):

\[ D_{constant} = \sqrt{\sum_{x=1}^{20} (O_x - E_x)^2} \qquad (3) \]

Ox is the observed number of the xth amino acid in the protein and Ex is the expected number of the xth amino acid in the same protein. In this case Ex was taken as L/20, considering all amino acids to be uniformly distributed. Dconstant/L is a normalized measure of distance for the protein.
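The compositional distance of variate 3 can be sketched in Python as below, taking Ex = L/20 as stated. The function names are hypothetical, and the handling of non-standard residue codes (counted in L but otherwise ignored) is an assumption, since the text does not say how they were treated.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def d_constant(sequence: str) -> float:
    """Euclidean distance of the observed composition from a uniform one."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    expected = length / 20.0  # Ex = L/20 for every amino acid
    # ASSUMPTION: characters outside the 20 standard residues add to L only.
    return math.sqrt(sum((counts.get(aa, 0) - expected) ** 2 for aa in AMINO_ACIDS))


def d_constant_normalized(sequence: str) -> float:
    """Dconstant / L, the length-normalized form of the distance."""
    return d_constant(sequence) / len(sequence)
```

A sequence containing each residue exactly once gives a distance of zero, while a homopolymer gives the largest distance for its length.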

Variate 4 is the distance (analogous to Manhattan distance) of distinct dipeptides of a protein with respect to the maximum possible for a given length of the protein. The measure 'C' is:

(4)
(5)

Nexp = Expected number of distinct dipeptides; Nobs = Observed number of distinct dipeptides. Nobs is computed in 2 overlapping frames and averaged. When L = 800, Nexp cannot exceed 400 because the total number of theoretically possible distinct dimers is 20 × 20 = 400. C/L is a normalized measure of the distance in distinct dipeptides.
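One possible reading of variate 4 is sketched below in Python. Several points are assumptions rather than statements from the text: that C is the absolute difference between Nexp and Nobs, that Nexp is L/2 capped at 400, and that the two frames are non-overlapping dipeptide readings starting at the first and second residue respectively. All names are hypothetical.

```python
MAX_DISTINCT_DIPEPTIDES = 20 * 20  # 400 possible ordered residue pairs


def distinct_dipeptides_in_frame(sequence: str, offset: int) -> int:
    """Count distinct dipeptides read in steps of two, starting at `offset`."""
    pairs = {sequence[i:i + 2] for i in range(offset, len(sequence) - 1, 2)}
    return len(pairs)


def dipeptide_distance(sequence: str) -> float:
    """Distance C between expected and observed distinct dipeptide counts."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("need at least two residues")
    # ASSUMPTION: Nobs is the mean over the two reading frames.
    n_obs = (distinct_dipeptides_in_frame(seq, 0)
             + distinct_dipeptides_in_frame(seq, 1)) / 2.0
    # ASSUMPTION: Nexp = L/2, but never more than 400.
    n_exp = min(len(seq) / 2.0, MAX_DISTINCT_DIPEPTIDES)
    return abs(n_exp - n_obs)
```

Under these assumptions a low complexity sequence such as a homopolymer gives a large C, while a sequence with no repeated dipeptide gives a value close to zero.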

Variate 5 is the hydrophobic distance (Euclidean distance) of a protein given by:

\[ D_{phobic} = \sqrt{\sum_{x=1}^{z} (O_x - E_x)^2} \qquad (6) \]

Ox is the observed number of the xth hydrophobic amino acid in the protein and Ex is the expected number of the xth hydrophobic amino acid in the same protein. In this case,

(7)

The computation of Ex assumes uniform distribution of the different hydrophobic amino acid types; z is the number of different hydrophobic amino acids identified according to a particular hydrophobic scale. z will vary according to the hydrophobic scale used. In the Kyte and Doolittle and Rose scales z is 13, in Hopp and Woods and Fauchere and Pliska scales z is 11. Dphobic/L is a normalized measure of hydrophobic distance of a protein.
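The hydrophobic distance of variate 5 can be sketched in Python as follows. The sketch assumes that Ex is the total number of hydrophobic residues in the protein divided by z, which is one reading of "uniform distribution of the different hydrophobic amino acid types". The set of hydrophobic residues is passed in by the caller because it depends on the scale; the two-letter set used in the test values is purely illustrative and does not correspond to any of the four scales.

```python
import math
from collections import Counter


def d_phobic(sequence: str, hydrophobic_residues: str) -> float:
    """Euclidean distance of hydrophobic residue counts from a uniform spread.

    `hydrophobic_residues` lists the z distinct residue letters classed as
    hydrophobic under the chosen scale (supplied by the caller).
    """
    counts = Counter(sequence.upper())
    observed = [counts.get(aa, 0) for aa in hydrophobic_residues]
    z = len(observed)
    if z == 0:
        raise ValueError("no hydrophobic residue types given")
    # ASSUMPTION: Ex = (total hydrophobic residues in the protein) / z.
    expected = sum(observed) / z
    return math.sqrt(sum((o - expected) ** 2 for o in observed))


def d_phobic_normalized(sequence: str, hydrophobic_residues: str) -> float:
    """Dphobic / L, the length-normalized form of the distance."""
    return d_phobic(sequence, hydrophobic_residues) / len(sequence)
```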


Statistics

Principal Component Analysis using correlation coefficients between the variates was carried out using the SAS package (SAS Institute Inc., USA). Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of non-correlated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. This can also be viewed as a reduction in the dimensionality of the system.
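The idea of PCA on a correlation matrix can be illustrated with a small pure-Python sketch that extracts only the first principal component by power iteration. This is not the SAS procedure used in the study; the function names and the attribute table are hypothetical, and the sketch assumes that no variate is constant (a zero standard deviation would cause a division by zero).

```python
import math
from typing import List, Tuple


def correlation_matrix(rows: List[List[float]]) -> List[List[float]]:
    """Pearson correlation matrix of the columns (variates) of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(p)]
            for a in range(p)]


def leading_component(corr: List[List[float]],
                      iterations: int = 200) -> Tuple[float, List[float]]:
    """Largest eigenvalue and unit eigenvector of `corr` by power iteration."""
    p = len(corr)
    vec = [float(j + 1) for j in range(p)]  # arbitrary non-uniform start
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[a][b] * vec[b] for b in range(p)) for a in range(p)]
        eigenvalue = math.sqrt(sum(v * v for v in product))
        if eigenvalue == 0.0:
            raise ValueError("start vector lies in the null space of the matrix")
        vec = [v / eigenvalue for v in product]
    return eigenvalue, vec


# Hypothetical attribute table: one row per protein, one column per variate.
data = [
    [12.0, 40.0, 0.20],
    [15.0, 42.0, 0.25],
    [30.0, 20.0, 0.60],
    [28.0, 22.0, 0.55],
    [14.0, 41.0, 0.22],
]
corr = correlation_matrix(data)
eigenvalue, loadings = leading_component(corr)
# The trace of a correlation matrix equals the number of variates, so this
# ratio is the proportion of variance explained by the first component.
explained = eigenvalue / len(corr)
```

Further components would be obtained from a full eigen-decomposition; a statistics package or linear algebra library is the practical choice for that step.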


Sequences

Sequences were retrieved from the National Center for Biotechnology Information by anonymous ftp transfer from the subdirectory /genbank/genomes/. The sequences of the following 23 genomes were analyzed: Archaeal members - Methanococcus jannaschii; Pyrococcus abyssi; Pyrococcus horikoshii; Thermoplasma acidophilum; Bacterial members - Bacillus subtilis; Borrelia burgdorferi B31; Campylobacter jejuni NCTC 11168; Escherichia coli K12; Escherichia coli O157; Haemophilus influenzae RdKW20; Helicobacter pylori 26695; Mycoplasma genitalium; Mycoplasma pneumoniae; Mycobacterium tuberculosis cdc1551; Mycobacterium tuberculosis H37Rv; Neisseria meningitidis MC58; Pseudomonas aeruginosa PA01; Thermotoga maritima; Treponema pallidum; Synechocystis sp. PCC 6803; Ureaplasma urealyticum; Vibrio cholerae; Xylella fastidiosa 9a5c.

Physical proximity analysis

Neighboring gene pairs of atypical protein coding genes were identified from the *.gbk files, which describe the genes in a set order, usually numbered from the origin of replication. Annotation revisions were carefully examined using other complementary information from the *.gbk files.

Identification of paralogs

Pair-wise alignments of the atypical protein coding genes were carried out with CLUSTALW [37] using default parameters. Protein pairs with scores greater than 62 were examined further for paralog relationships by re-aligning the sequences using BLASTP [38] with the low complexity filter 'off', as suggested previously [39]. If a pair had >60% identity, its members were considered paralogs.
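The two-stage filter described above can be sketched in Python. The sketch does not run CLUSTALW or BLASTP; it assumes that the pairwise scores and percent identities have already been parsed into dictionaries keyed by protein pair. The function name, the dictionary layout and the example values are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

CLUSTALW_SCORE_CUTOFF = 62      # pairs scoring above this go on to BLASTP
BLASTP_IDENTITY_CUTOFF = 60.0   # percent identity above which a pair is a paralog


def call_paralogs(clustalw_scores: Dict[Pair, float],
                  blastp_identity: Dict[Pair, float]) -> List[Pair]:
    """Return pairs passing both the CLUSTALW score and BLASTP identity cutoffs."""
    candidates = [pair for pair, score in clustalw_scores.items()
                  if score > CLUSTALW_SCORE_CUTOFF]
    # Pairs with no BLASTP result are treated as 0% identical (an assumption).
    return [pair for pair in candidates
            if blastp_identity.get(pair, 0.0) > BLASTP_IDENTITY_CUTOFF]


# Hypothetical parsed results for three protein pairs.
scores = {("p1", "p2"): 80, ("p1", "p3"): 50, ("p2", "p3"): 70}
identities = {("p1", "p2"): 75.0, ("p2", "p3"): 40.0}
paralogs = call_paralogs(scores, identities)
```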

Sequence analysis

Sequence analysis was carried out using the Wisconsin Package Version 10.1, Genetics Computer Group (GCG), Madison, Wisconsin [40].

Software

Software programs were written in Perl (Practical Extraction and Report Language) and run on a Silicon Graphics Origin 200 under the IRIX 6.5 operating system. Each record in the output data carries the 'gi number' of the protein as its identifier, and the fields in each record carry numerical data on the attributes of the protein.



Results and discussion


Two Basic Cluster patterns

A standard approach to searching for patterns in large volumes of data is the application of clustering tools. The numerical data on compositional attributes of protein sequences from complete genomes were subjected to Principal Component Analysis to infer cluster patterns. The SAS (Statistical Analysis System) software generates volcano plots in which each protein is represented as a single point in space. Examination of these plots from proteomes of 23 organisms revealed that each proteome could be classified into one of two basic types (Figure 1). We selected these 23 organisms on the basis of their biological characteristics summarized in Table 1. The cumulative proportions of variance in the data explained by the first one, two and three principal components, P1 (principal component 1), P2 (principal component 2) and P3 (principal component 3), were 65-70%, 85-90% and greater than 90%, respectively. The type I cluster pattern consists of a dense area surrounded by a diffuse set of points in almost all directions in space. The type IA cluster is similar to type I except that the dense area is highly intensified. The type II cluster pattern consists of two nearly distinct clusters separated in space from each other, indicating the presence of multiple protein families with distinct compositional characteristics. Strain differences within the pair E. coli K12: E. coli O157 and the pair M. tuberculosis H37Rv: M. tuberculosis cdc1551 were not discernible while examining the cluster patterns.



Figure 1: Two basic cluster patterns. The whole proteomes of 23 species were processed as described in methods. Note that the type IA proteome is similar in structure to the type I proteome but is denser. Type II proteomes have a distinct structure characterized by the presence of two clusters of proteins, 'atypical' (minor fraction) and 'typical' (major fraction). The typical proteins are closer to the origin whereas the atypical proteins lie farther away from the axis. The magnification and point size are the same in all three examples shown here. Plots were prepared using the Kyte and Doolittle hydrophobic scale. The orientation of the cube has been chosen to maximize clarity. The arrow points to the atypical cluster of proteins.


Table 1: Completely sequenced microbial proteomes whose protein sequences were retrieved.

| Species | Strain | Biological importance |
|---|---|---|
| Borrelia burgdorferi | B31 | the aetiologic agent of Lyme disease |
| Bacillus subtilis | | produces an antibiotic called iturin that is active against fungi |
| Campylobacter jejuni | NCTC 11168 | cause of bacterial food-borne diarrhoeal disease |
| Escherichia coli | K12 | model organism of bacteria |
| Escherichia coli | O157:H7 | causes haemorrhagic colitis |
| Haemophilus influenzae | Rd KW20 | causes pneumonia and meningitis |
| Helicobacter pylori | 26695 | causes peptic ulcer |
| Mycoplasma genitalium | | causes pelvic inflammatory disease |
| Methanococcus jannaschii | | methane-producing thermophile |
| Mycoplasma pneumoniae | | human pathogen, causing 'atypical pneumonia' |
| Mycobacterium tuberculosis | cdc1551 | clinical strain |
| Mycobacterium tuberculosis | H37Rv | causes tuberculosis |
| Neisseria meningitidis | MC58 | causes meningitis and septicemia |
| Pyrococcus abyssi | | hyperthermophilic archaeon |
| Pseudomonas aeruginosa | PA01 | opportunistic human pathogen causing otitis externa, septic arthritis, endocarditis, conjunctivitis, iridocyclitis, keratitis and iritis, corneal ulcer, panophthalmitis, etc. |
| Pyrococcus horikoshii | | hyperthermophilic archaeon |
| Synechocystis | PCC6803 | unicellular photosynthetic cyanobacterium |
| Thermoplasma acidophilum | | thermoacidophilic archaeon, low pH growth (pH 1-2) |
| Thermotoga maritima | | grows at 80 °C and metabolizes many simple and complex carbohydrates |
| Treponema pallidum | Nichols | the syphilis spirochete |
| Ureaplasma urealyticum | | causes infection of the lower respiratory tract |
| Vibrio cholerae | N16961 | causes cholera, an acute diarrheal illness |
| Xylella fastidiosa | 9a5c | causes economically important plant diseases like Pierce's disease of grapevines |



Type I and Type IA proteomes

The mollicutes (mycoplasma and ureaplasma) and spirochaetes (Borrelia and Treponema) proteomes display the type I cluster pattern. Independent observations on overall protein distribution made using self-organizing maps (SOM) on the basis of amino acid composition had revealed that B. burgdorferi, M. genitalium and M. pneumoniae show a close relationship [22]. However, in our method, the proteome of H. influenzae Rd exhibited characteristics of the type II cluster pattern and therefore was not grouped with other mesophiles (E. coli or Synechocystis sp.). Species with larger proteome sizes (greater than 1400 proteins) fall into either type IA or type II patterns (Figure 1). The similarity in cluster patterns of the mollicutes, the spirochaetes, the proteobacteria E. coli and X. fastidiosa, and the cyanobacterium Synechocystis shows high compositional homogeneity among the majority of proteins in their proteomes. The appearance of a few proteins separated from the dense cluster shows that their compositional properties are distinct.

Although observation of such a common feature at macro-level across phylogenetically distant taxa need not imply direct similarity in their biological or metabolic characteristics, some points of similarity between them at this level are worth noting.

Among the mollicutes, the genomes of mycoplasma and ureaplasma share common features such as small genome size, high A+T content of the genomic DNA (in the range of 59% to 74%) and conservation of gene order with functional roles in core biological processes (ribosomal proteins & oligopeptide transporters) [41, 42].

The two spirochaetes B. burgdorferi and T. pallidum share similarities in biological characteristics such as growth, similar genome size, clinical features and core biological functions such as genome replication and expression [43, 44]. The %A+T richness of the genomic DNA however varies widely between the two; Borrelia is 72% in A+T whereas Treponema is 47% in A+T.

Although the mollicutes are distantly related to the spirochaetes, gross similarities exist between them in the metabolic pathways [45, 46]. Of the genes on the linear chromosome of B. burgdorferi, 66% are transcribed away from the center of the chromosome and this transcriptional bias appears similar to that in M. genitalium and M. pneumoniae. Similarity between B. burgdorferi, T. pallidum and M. genitalium is apparent in the complement of the genes involved in DNA replication. In DNA repair mechanisms, B. burgdorferi and M. genitalium are similar.

Similarities between the mollicutes and spirochaetes have also been reported in transport capacity, limited metabolic capacity and the lack of a respiratory electron transport chain. T. pallidum is unable to synthesize enzyme co-factors, fatty acids and nucleotides de novo, similar to B. burgdorferi and M. genitalium [46]. Some of the transport systems of T. pallidum are of similar specificity to those found in B. burgdorferi and M. genitalium. The absence of the TCA cycle and oxidative phosphorylation, and the generation of reducing power through the oxidative branch of the pentose phosphate pathway, in T. pallidum are similar to the situation in B. burgdorferi and M. genitalium. As in the case of mollicutes and spirochaetes, the gross level similarities in biological functions are mirrored in type IA proteomes. For instance, the transcriptional and translational machinery of X. fastidiosa is similar to that of E. coli [46]. These observations illuminate the underlying gross similarities in functional characteristics between species displaying similarity in their proteome patterns.

Type II proteomes: General features

Type II proteomes are widely distributed in the prokaryotes including archaea, proteobacteria, Bacillus sp. and actinomycetes (Figure 1). These proteomes display two nearly distinct clusters of protein distribution. To assess whether the cluster patterns of type II proteomes were sensitive to the hydrophobic scale used, we re-examined them by varying the hydrophobic scale. Sample results for H. influenzae Rd are shown in Figure 2. It is apparent that the cluster pattern is invariant in all 4 hydrophobic scales examined. Similar observations were made with other proteomes (data not shown). Further analysis was carried out using the Kyte and Doolittle scale, as it is a widely used scale of hydrophobicity. We refer to the major cluster, which contains a large number of proteins, as 'typical' and the minor cluster as 'atypical'.


Figure 2: Cluster patterns are invariant in all 4 hydrophobic scales examined. The whole proteome of H. influenzae Rd [53] was processed as described in methods. The projection of volcanic plot is shown in 4 scales labeled below the corresponding plot. The angular orientation of the cube has been varied slightly from one scale to another to maximize clarity for visual inspection. Arrows point to the atypical cluster of proteins.


A summary of the number of proteins in the atypical cluster and their overall functional distribution in different proteomes is shown in Table 2. Although larger proteomes tend to have a higher number of atypical proteins, this relationship does not hold in a few cases. M. tuberculosis strains are particularly unusual in having a low number of atypical proteins (1.3%-1.7% of the proteome). P. horikoshii has a higher number of atypical proteins (15.5% of the proteome) for its proteome size.

It is evident that a sizable fraction of the atypical proteins belongs to the transport and membrane associated (TM) functional class and to proteins of unknown function (hypothetical proteins) in all the 14 proteomes examined except in mycobacteria. The representation from the cellular processes (CP) and the characteristic (CH) classes is low. The uniformity in overall functional composition of the atypical proteins among representatives from widely separated phylogenetic taxa with widely varying genomic base composition indicates that the transport and membrane proteins have distinct compositional characteristics compared with the rest of the proteins. These observations re-confirm and extend the inferences drawn from previous studies using smaller data sets [17-21].

Table 2: Size of 'atypical' cluster and its overall functional composition in different species (a).

| Species | CP (b) | TM (b) | CH (b) | H | Number of atypical proteins | Proteome size (c) |
|---|---|---|---|---|---|---|
| B. subtilis | 10.7 | 30.9 | 9.9 | 48.5 | 625 | 4111 |
| C. jejuni | 18.9 | 67.4 | 10.9 | 2.9 | 175 | 1633 |
| H. influenzae RdKW20 | 13.4 | 44.1 | 7.5 | 34.9 | 186 | 1713 |
| H. pylori | 13.8 | 52.6 | 11.2 | 22.4 | 116 | 1575 |
| M. jannaschii | 12.4 | 17.4 | 4.5 | 65.7 | 178 | 1728 |
| M. tuberculosis H37Rv | 0.0 | 0.0 | 100.0 | 0.0 | 73 | 4186 |
| M. tuberculosis cdc1551 | 0.0 | 0.0 | 100.0 | 0.0 | 52 | 3926 |
| N. meningitidis MC58 | 17.0 | 26.1 | 10.2 | 46.6 | 176 | 3441 |
| P. abyssi | 8.5 | 18.6 | 7.4 | 65.4 | 188 | 1768 |
| P. horikoshii | 3.2 | 10.8 | 4.3 | 81.7 | 279 | 1800 |
| P. aeruginosa PA01 | 9.0 | 44.6 | 4.6 | 41.8 | 612 | 5566 |
| T. acidophilum | 6.1 | 76.1 | 7.4 | 10.4 | 163 | 1481 |
| T. maritima | 9.5 | 34.3 | 7.0 | 49.3 | 201 | 1857 |
| V. cholerae N16961 | 15.6 | 34.6 | 9.8 | 40.0 | 315 | 2741 |

(a) The percentage fraction of atypical proteins in each functional class is shown. The atypical proteins could be clearly identified in 14 out of 23 proteomes analyzed in this work.
(b) Functional class codes: CP denotes CELLULAR PROCESSES, TM denotes TRANSPORT AND MEMBRANE ASSOCIATED, CH denotes CHARACTERISTIC, H denotes 'HYPOTHETICAL' according to a modified classification scheme reported previously [27]. In this method, the CELLULAR PROCESSES superclass has proteins of the INFORMATION class (comprising replication, transcription and translation) and of the metabolism class. Proteins of the Transport and Membrane associated classes were combined into the TRANSPORT AND MEMBRANE ASSOCIATED superclass. After classifying the atypical proteins into the CP and TM superclasses, the remainder consisted of several proteins that are characteristic of the unique properties of a given species. These proteins were collectively placed in the CHARACTERISTIC superclass. This scheme aids in rapid comparative analysis, especially when the numbers of proteins in the individual functional classes defined by Riley [54] are small. The annotation provided with the *.faa file from the NCBI was used for classification.
(c) Number of encoded proteins in the respective genomes.



Functional composition of 'atypical' proteins: The COGs maps

The Clusters of Orthologous Groups (COGs) database offers a systematic route to examine the functional characteristics of proteins across a wide phylogenetic spectrum [46]. Examination of the distribution of atypical cluster proteins in the 20 COGs classes (excluding function unknown, general function prediction, not in COGs) revealed 13 functional classes with sizable representation (see legend to Figure 4 for functional class codes). In the rest of the classes the representation was either absent or less than 1% in most species.

For each of the 14 species, the fraction of proteins of a given COGs class appearing as atypical was plotted against the fraction of atypical proteins appearing in the same COGs for 13 COGs. Sample plots for energy production and conversion genes (C) and Carbohydrate transport and metabolism genes (G) are shown in Figure 3. The former measure estimates the proportion of proteins in COGs with atypical amino acid composition whereas the latter measure estimates the tendency of proteins with atypical amino acid composition to appear in the same COGs.
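The two measures plotted for each species and COGs class can be sketched in Python as below. The function name, argument names and the example counts are hypothetical; the returned pair follows the x and y axes described for Figure 3.

```python
from typing import Tuple


def cog_fractions(n_atypical_in_class: int,
                  n_class_total: int,
                  n_atypical_total: int) -> Tuple[float, float]:
    """Return (x, y) for one species and one COGs class.

    x: fraction of the species' atypical proteins that fall in this class.
    y: fraction of the proteins in this class that are atypical.
    """
    if n_class_total <= 0 or n_atypical_total <= 0:
        raise ValueError("totals must be positive")
    x = n_atypical_in_class / n_atypical_total
    y = n_atypical_in_class / n_class_total
    return x, y


# Hypothetical counts: 10 atypical proteins in a class of 50 proteins,
# in a species with 200 atypical proteins overall.
x, y = cog_fractions(10, 50, 200)
```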



Figure 3: Sample COGs diagram of atypical proteins. X-axis represents the fraction of atypical proteins appearing in a given COGs, Y-axis represents the fraction of the COGs proteins appearing as atypical. The Kyte and Doolittle scale was used. Each point corresponds to a given species. The radial distance of each species from the origin is the Euclidean distance $\sqrt{x^2 + y^2}$, where x, y are the coordinates for each species. Species abbreviations are as follows: Methanococcus jannaschii (MJAN), Pyrococcus abyssi (PA), Pyrococcus horikoshii (PH), Thermoplasma acidophilum (TACID), Bacillus subtilis (BSUB), Campylobacter jejuni NCTC 11168 (CJEJ), Haemophilus influenzae Rd (HI), Helicobacter pylori 26695 (HPYL), Mycobacterium tuberculosis cdc1551 (MTUBcdc), Mycobacterium tuberculosis H37Rv (MTUBRv), Neisseria meningitidis MC58 (NM), Pseudomonas aeruginosa PA01 (PAER), Thermotoga maritima (TMAR), Vibrio cholerae chromosome I (VCI), Vibrio cholerae chromosome II (VCII).


Species distribution patterns of the atypical proteins in the COGs indicate that the members of archaea exhibit greater diversity whereas the members of the bacteria show more cohesion. The two chromosomes (I and II) of Vibrio cholerae have very different distribution patterns supporting the notion that Chromosome I genes mainly adapt the organism for growth in the intestine whereas Chromosome II genes are essential within environmental niches [48]. None of the atypical proteins of mycobacteria appear in any of the COGs and were therefore excluded and treated as a special case. For quantitative comparisons we examined the proportions of atypical proteins in each class of COGs for all species by computing their Euclidean radial distance from the origin in all 13 classes of COGs maps. Subsequently, comparison of species distributions between the different classes of COGs was carried out by estimating the variance of the distributions. High variance among archaea and bacteria was observed in G (Carbohydrate transport and metabolism), P (Inorganic ion transport and metabolism), and V (Defense mechanisms) (Figure 4a and b). Also N (Cell motility) COGs showed high variance amongst archaea.
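The quantitative comparison described above can be sketched in Python: each species' (x, y) point in a COGs map is reduced to its radial distance from the origin, and the spread of those distances within one COGs class is summarized by a variance. The use of the sample variance (n - 1 denominator) is an assumption, since the text does not specify the estimator; the species labels and coordinates are hypothetical.

```python
import math
from statistics import variance
from typing import Dict, Tuple


def radial_distance(x: float, y: float) -> float:
    """Euclidean distance of the point (x, y) from the origin."""
    return math.hypot(x, y)


def species_variance(coords: Dict[str, Tuple[float, float]]) -> float:
    """Variance of radial distances across species for one COGs class."""
    radii = [radial_distance(x, y) for x, y in coords.values()]
    # ASSUMPTION: sample variance; at least two species are required.
    return variance(radii)


# Hypothetical coordinates of three species in one COGs map.
example_coords = {"SP1": (3.0, 4.0), "SP2": (6.0, 8.0), "SP3": (0.0, 0.0)}
example_variance = species_variance(example_coords)
```

Repeating the calculation for each of the 13 COGs classes would give one variance per class, which is the quantity compared between classes in the text.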


Figure 4: Species Distribution of Archaeal (a) and Bacterial (b) members with respect to atypical proteins (measured by Euclidean radial distance from the origin) in 13 classes of COGs. Species abbreviations same as in Figure 3. Note the difference in variance between the species distributions in the COGs. The COGs classes are C (Energy production and conversion), D (Cell division and chromosome partitioning), E (Amino acid transport and metabolism), F (Nucleotide transport and metabolism), G (Carbohydrate transport and metabolism), H (Coenzyme metabolism), I (Lipid metabolism), M (Cell wall/membrane biogenesis), N (cell motility), O (Posttranslational modification, protein turnover, chaperones), P (Inorganic ion transport and metabolism), U (Intracellular trafficking and secretion), V (Defense mechanism).


Since TM proteins constitute a sizable fraction of the atypical proteins, differences in the proportion of atypical proteins among various species in the different COGs are in part due to the varying proportion of atypical TM proteins. TM proteins are at the interface between the cell and its natural niche, and the diversity exhibited by the various species perhaps relates to their life styles [48]. The high variance in species distributions with respect to atypical proteins observed in the carbohydrate and inorganic ion functional groups may relate to the diverse substrate utilization capabilities of Bacteria and Archaea. High variance in species distributions in the defense mechanisms class is indicative of the diverse niches occupied by bacterial and archaeal members. Taken together these observations mirror the biodiversity of these microbes.


The special case of mycobacteria

The atypical proteins of the mycobacteria all belong to the PE_PGRS and PPE families of proteins. These proteins are rich in glycine and are composed of reiterated sequences [11]. In addition, the PE_PGRS proteins of mycobacteria are unique and do not have homologs in other species. These results contrast with those observed in other bacteria and indicate an underlying compositional homogeneity among the rest of the proteins in M. tuberculosis. Alternatively, it is possible that the PE_PGRS proteins, with their very unusual compositional characteristics, obscure the general compositional differences that separate transporters, membrane proteins and other hypothetical proteins from the typical proteins in the analysis. To test this possibility, we removed the PE_PGRS and PPE families of proteins from the data file and re-examined the protein distribution through PCA. The results are shown in Figure 5. It is apparent that the two-cluster pattern vanishes; instead, the in silico modified M. tuberculosis proteome has a homogeneous compositional structure analogous to type IA proteomes. This trend appears unique to mycobacteria. Thus the special features of mycobacterial proteomes arise from the inherent compositional homogeneity of the majority of their proteins.
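The in silico removal experiment can be sketched as follows. This is a simplified illustration, not the procedure used in this work: it assumes plain 20-dimensional amino acid composition as the only feature, extracts just the first principal component by power iteration, and uses hypothetical toy sequences and annotations. The names `composition`, `pc1_scores` and `spread` are introduced here for illustration only.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fractional amino acid composition of one protein (20 values)."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def pc1_scores(rows, iters=200, seed=0):
    """Scores of each row on the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Random start: a uniform vector would be orthogonal to centred compositions.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centred]
        # w is proportional to (covariance matrix) times v.
        w = [sum(centred[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return [sum(x[j] * v[j] for j in range(d)) for x in centred]

def spread(scores):
    """Range of the scores along the first principal component."""
    return max(scores) - min(scores)

# Hypothetical toy proteome: (annotation, sequence).
proteins = [
    ("hypothetical protein 1", "AKLVEDRTSI"),
    ("hypothetical protein 2", "AKLVEDRTSA"),
    ("hypothetical protein 3", "AKLVEDRTLI"),
    ("PE_PGRS family protein", "GGAGGNGGAG"),
    ("PPE family protein", "GGNGGAGGNG"),
]

# Remove the PE_PGRS and PPE families, then repeat the projection.
kept = [(desc, seq) for desc, seq in proteins
        if "PE_PGRS" not in desc and "PPE" not in desc]

full_spread = spread(pc1_scores([composition(s) for _, s in proteins]))
kept_spread = spread(pc1_scores([composition(s) for _, s in kept]))
```

With the glycine-rich entries present, the first component is dominated by the separation between the two groups; once they are removed, the remaining points lie close together, which is the qualitative behaviour shown in Figure 5.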


Figure 5: The special case of mycobacteria. (a) Cluster patterns in M. tuberculosis H37Rv proteome. All the proteins of the minor atypical cluster belong to PE_PGRS and PPE families. (b) Cluster pattern in M. tuberculosis H37Rv proteome after removing the PE_PGRS and PPE families. Note the disappearance of the second atypical cluster (shown by arrow) indicating homogeneity in compositional structure in the rest of the proteome.


The over-represented amino acids in 'atypical proteins' and their cost of biosynthesis

Assessment of over-representation of amino acids in atypical proteins relative to typical proteins is a complicated task because the typical cluster is large and is likely to embody enormous variability. Therefore we need a good reference point within the typical cluster. The ribosomal proteins of each species serve as a good reference point for comparative analysis of the 'atypical' proteins since they are universal, appear as typical proteins in all proteomes, are part of core biological functions of the cell and are also highly expressed [1].

Amino acids that were over-represented in the atypical proteins relative to the ribosomal proteins in the 14 species are shown in Table 3. The amino acids are arranged in order of decreasing metabolic cost of biosynthesis in high-energy phosphate units [39]. Since a significant proportion of the atypical proteins are transporters and membrane proteins, the over-representation of the hydrophobic amino acids phenylalanine and leucine is characteristic of the transmembrane regions in these proteins. The small hydrophobic and the sulphur-containing hydrophobic amino acids are not over-represented in the atypical proteins.

Among the aliphatic amino acids, leucine appears to be used in preference to isoleucine, indicating a bias towards low cost of biosynthesis. Similarly, phenylalanine is preferred to tryptophan. A balance between selective over-representation of certain amino acids in the 'atypical' proteins and the biochemical cost is evident in bacterial physiology. It is known that the abundantly present ribosomal proteins use a higher frequency of less costly amino acids [39]. From this standpoint, it would appear that the 'atypical' proteins might not be present in abundance since they use more costly amino acids.

The over-representation of serine in the atypical proteins of archaea and of some bacteria (B. subtilis, T. maritima and M. tuberculosis) suggests that some of these proteins could be involved in signaling through phosphorylation/dephosphorylation of serine/threonine residues. Indeed, membrane-bound serine/threonine kinases have been identified and characterized in B. subtilis and M. tuberculosis [49-51]. Serine/threonine phosphatases and kinases have so far not been described in archaea and are worth exploring.

Table 3: Amino acids over-represented in the atypical cluster relative to ribosomal proteins of the respective proteomes (a).
Species                 W     F     Y     M     I     L     P     T     Q     N     S     A     G
ARCHAEA
P. abyssi               -     +     -     -     +     +     -     -     -     -     +     -     -
P. horikoshii           -     +     -     -     -     -     -     -     -     -     +     -     -
T. acidophilum          +     +     +     +     +     +     -     -     -     -     +     +     -
M. jannaschii           -     +     -     -     +     -     -     -     -     -     +     -     -
BACTERIA
B. subtilis             +     +     -     +     +     +     -     -     -     -     +     -     -
T. maritima             -     +     -     +     +     +     -     -     -     -     +     -     -
M. tuberculosis cdc     -     +     -     -     -     +     +     +     +     +     +     +     +
M. tuberculosis H37Rv   -     +     -     -     -     +     -     +     -     +     +     +     +
H. influenzae RdKW20    +     +     +     +     +     +     -     -     -     -     -     -     -
N. meningitidis MC58    +     +     -     +     -     +     -     -     -     -     -     -     -
V. cholerae N16961      -     +     -     -     -     +     -     -     -     -     -     -     -
C. jejuni               -     +     -     -     -     +     -     -     -     -     -     -     -
H. pylori               +     +     -     -     -     +     -     -     -     -     -     -     -
Metabolic cost (b)      74.3  52    50    34.3  32.3  27.3  20.3  18.7  16.3  14.7  11.7  11.7  11.7
a, amino acids that were statistically over-represented in the atypical proteins relative to the ribosomal (typical) proteins were identified according to the rule M_ai > M_ri + 3SD, where M_ai is the mean composition of a given amino acid 'i' in the atypical cluster, M_ri is the mean composition of the same amino acid 'i' in the ribosomal proteins, and SD is the standard deviation. '+' indicates a significantly over-represented amino acid in a given proteome; '-' indicates an amino acid that is not significantly over-represented (shaded in the original table).
b, the cost of synthesizing an amino acid from a precursor in high energy phosphate units (~P) as computed previously [39]. These cost calculations are with reference to the same precursors used for amino acid biosynthesis in E. coli and B. subtilis.
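The selection rule in footnote a can be sketched in a few lines. This is a minimal illustration under stated assumptions: composition is expressed as percent of residues per protein, SD is taken to be the sample standard deviation of that amino acid's composition among the ribosomal proteins (the footnote does not specify which standard deviation is meant), and the sequences are hypothetical toy strings.

```python
from statistics import mean, stdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Percent composition of each of the 20 amino acids in one sequence."""
    n = len(seq)
    return {aa: 100.0 * seq.count(aa) / n for aa in AMINO_ACIDS}

def over_represented(atypical_seqs, ribosomal_seqs, k=3.0):
    """Amino acids whose mean atypical composition exceeds M_ri + k * SD."""
    atyp = [composition(s) for s in atypical_seqs]
    ribo = [composition(s) for s in ribosomal_seqs]
    hits = []
    for aa in AMINO_ACIDS:
        m_ai = mean(c[aa] for c in atyp)   # mean in the atypical cluster
        ribo_vals = [c[aa] for c in ribo]
        m_ri = mean(ribo_vals)             # mean in the ribosomal proteins
        sd = stdev(ribo_vals)              # assumed: SD among ribosomal proteins
        if m_ai > m_ri + k * sd:
            hits.append(aa)
    return hits

# Hypothetical toy sequences (not real proteins).
ribosomal = ["AKRKAGKRVA", "KRAKGAKRAV", "RKAKAGRKVS"]
atypical = ["FLLFSLFLAL", "LFFLLSFLLA"]
hits = over_represented(atypical, ribosomal)
```

In this toy input phenylalanine and leucine pass the threshold, whereas serine does not, because its composition varies among the reference proteins and the 3SD margin is therefore wide.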


Physical proximity analysis

The number of neighbouring pairs of atypical protein coding genes bears a linear relationship to the total number of proteins in the atypical cluster across various species of archaea and bacteria (Karl Pearson's correlation coefficient, r² = 0.96) (Figure 6). However, no relationship is evident between the number of paralogs and the number of neighbouring gene pairs. These observations show that the atypical protein coding genes have a tendency to cluster in the 'genomescape' of different species but are not paralogs, suggesting that the neighbouring gene pairs coding for atypical proteins have not arisen from gene duplication. It is known that genes coding for proteins that assemble into a multi-subunit complex or catalyze sequential biochemical transformations in the same metabolic pathway are generally present as neighbours, as part of operons under common regulation [52]. It is therefore possible that the neighbouring atypical proteins have either similar expression profiles or a close link in functional pathways.
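The proximity count and the correlation can be sketched as follows. The sketch assumes that the gene order has already been extracted from the annotation file as a list, and it treats the chromosome as linear, so the last and first genes are not counted as neighbours. The gene identifiers and the per-species counts are hypothetical.

```python
import math

def count_neighbour_pairs(gene_order, atypical):
    """Number of adjacent gene pairs in which both genes code for atypical proteins."""
    return sum(
        1
        for left, right in zip(gene_order, gene_order[1:])
        if left in atypical and right in atypical
    )

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical gene order for one species and its set of atypical genes.
gene_order = ["g1", "g2", "g3", "g4", "g5", "g6"]
atypical = {"g2", "g3", "g4", "g6"}
pairs = count_neighbour_pairs(gene_order, atypical)  # (g2, g3) and (g3, g4)

# Hypothetical per-species values: total atypical proteins vs. neighbouring pairs.
totals = [10, 20, 30, 40]
pair_counts = [2, 5, 6, 9]
r_squared = pearson_r(totals, pair_counts) ** 2
```

The same `pearson_r` function can be applied to the number of neighbouring pairs and the number of paralogs to check for the absence of a relationship.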


Figure 6a: Physical proximity of atypical protein coding genes. X-axis represents the total number of atypical proteins in each species. Y-axis represents the number of neighbouring gene pairs coding for atypical proteins. Neighbour relationship was inferred from *.gbk files, which list the genes in a set order, usually starting from the origin of replication. Each point represents a species. The Karl Pearson's correlation coefficient is also shown.
Figure 6b: Lack of correlation between physical proximity of neighbouring atypical gene pairs and paralogous relationship between the atypical proteins. X-axis represents the number of neighbouring atypical protein coding gene pairs, Y-axis: number of paralogs of atypical proteins. Each point represents a species. The total number of paralogs was identified as described in the methods.




Conclusions

Comparative genomic analysis at the macro-level using dinucleotide abundance values had earlier indicated compositional homogeneity within a species whereas significant differences exist between species [2]. Indeed, this observation has enabled the construction of phylogenetic relationships using whole genome information. In the present report, we describe a macro-level analysis of the proteomes that reveals clear heterogeneity in the compositional structure of proteomes of size greater than 1400 proteins.

The most common structure observed in large proteomes is the presence of typical and atypical proteins. A sizable number of atypical proteins are transporters and membrane proteins. Species vary with respect to the numbers of atypical proteins in the COGs perhaps reflecting the diversity of their lifestyles. The typical and atypical proteins differ in the compositional characteristics with respect to the proportion of the types of amino acids. Our observations on complete proteomes can be used to rapidly identify and delineate the 'typical' and 'atypical' compartments of proteins and the genes coding for these proteins. The typical compartment is most likely important for the cell growth and physiology whereas the atypical compartment caters to the interactions between the cell and its environment.

There are, however, some common principles of bacterial physiology that apply uniformly to all compartments of proteins of bacterial genomes. Bacteria tend to economize by optimizing the cost of biosynthesis of 'typical' and 'atypical' proteins. Gene neighbours that code for atypical proteins but are not paralogs may be similarly expressed or linked in functional pathways. They may even have been acquired through horizontal transfer, since horizontally transferred genes usually occur as neighbours. In this context, it would be of interest to elucidate the role of several hypothetical atypical proteins.

The biological significance of this work becomes apparent as several proteins of medical importance have distinct amino acid composition [55, 56]. At present there is no general approach for examining proteomes to identify proteins with distinct (or unique) amino acid composition. Earlier, evidence of horizontal gene transfer in E. coli speciation was obtained by a multivariate analysis method (factorial correspondence analysis), which is closely related to PCA [57-59]. In that work, 780 genes of E. coli were classified using their codon usage pattern. In this work we have developed an application of a multivariate analysis method (PCA) for proteome analysis that has the potential to identify proteins whose compositional characteristics are distinct from those of the bulk of proteins within an organism. Our method is based on standard formulae for computing amino acid distances and for representing hydrophobic characteristics, followed by processing with the multivariate technique Principal Component Analysis, and is equally applicable to any organism. Compared with traditional approaches, which are laborious, time-consuming and expensive, this method has the clear advantage of rapidly pointing out a subset of the total proteins in an organism. The method uses real protein data and standard mathematical formulae to narrow down the search space, so that investigators can extract potentially useful leads from complete proteome data.

Our procedure can be used to rapidly view complete proteomes for comparative analysis. The methods described in this work are standard and can be readily reproduced. Since the boundary between the atypical cluster and the typical cluster in type II proteomes is somewhat fuzzy, minor differences may arise when reproducing the results. However, the majority of the proteins in the two clusters can be consistently and easily identified. Our procedure can complement the existing tools for computational proteomics.



Acknowledgements

TN is a recipient of a fellowship from the Council of Scientific and Industrial Research. TN thanks Dr. C.B-Rao for help with the SAS software and advice on Principal Component Analysis.




References


  1. Karlin, S. and Mrazek, J. (2000). Predicted highly expressed genes of diverse prokaryotic genomes. J. Bacteriol. 182, 5238-5250.

  2. Karlin, S., Campbell, M. A. and Mrazek, J. (1998). Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. 32, 185-225.

  3. Lawrence, J. G. and Ochman, H. (1998). Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95, 9413-9417.

  4. Koski, L. B., Morton, R. A. and Golding, G. B. (2001). Codon bias and base composition are poor indicators of horizontally transferred genes. Mol. Biol. Evol. 18, 404-412.

  5. Yuan, Y. P., Eulenstein, O., Vingron, M. and Bork, P. (1998). Towards detection of orthologues in sequence databases. Bioinformatics 14, 285-289.

  6. Yanai, I. and DeLisi, C. (2002). The society of genes: networks of functional links between genes from comparative genomics. Genome Biol. 3, research0064.1-0064.12.

  7. Gelfand, M. S. and Koonin, E. V. (1997). Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes. Nucleic Acids Res. 25, 2430-2439.

  8. Apweiler, R., Biswas, M., Fleischmann, W., Kanapin, A., Karavidopoulou, Y., Keresey, P., Kriventseva, E. V., Mittard, V., Mulder, N., Phan, I. and Zdobnov, E. (2001). Proteome analysis database: online application of InterPro and CluSTr for functional classification of proteins in whole genomes. Nucleic Acids Res. 29, 44-48.

  9. Ling, L., Wang, J., Cui, Y., Li, W. and Chen, R. (2002). Proteome-wide analysis of protein function composition reveals the clustering and phylogenetic properties of organisms. Mol. Phylogenet. Evol. 25, 101-111.

  10. Rosato, V., Pucello, N. and Giuliano, G. (2002). Evidence for cysteine clustering in thermophilic proteomes. Trends Genet. 18, 278-281.

  11. Tekaia, F., Gordon, S. V., Garnier, T., Brosch, R., Barrell, B. G. and Cole, S. T. (1999). Analysis of the proteome of Mycobacterium tuberculosis in silico. Tuber. Lung Dis. 79, 329-342.

  12. Scharfe, C., Zaccaria, P., Hoertnagel, K., Jaksch, M., Klopstock, T., Dembowski, M., Lill, R., Prokisch, H., Gerbitz, K. D., Neupert, W., Mewes, H. W. and Meitinger, T. (2000). MITOP, the mitochondrial proteome database: 2000 update. Nucleic Acids Res. 28, 155-158.

  13. Carter, P., Liu, J. and Rost, B. (2003). PEP: Predictions for Entire Proteomes. Nucleic Acids Res. 31, 410-413.

  14. Karlin, S., Brocchieri, L., Bergman, A., Mrazek, J. and Gentles, A. J. (2002). Amino acid runs in eukaryotic proteomes and disease associations. Proc. Natl. Acad. Sci. USA 99, 333-338.

  15. Miller, C. J. and Attwood, T. K. (2003). Bioinformatics goes back to the future. Nat. Rev. Mol. Cell. Biol. 4, 157-162.

  16. Hobohm, U. and Sander, C. (1995). A sequence property approach to searching protein databases. J. Mol. Biol. 251, 390-399.

  17. Nishikawa, K., Kubota, Y. and Ooi, T. (1983). Classification of proteins into groups based on amino acid composition and other characters. II. Grouping into four types. J. Biochem. (Tokyo) 94, 997-1007.

  18. Nishikawa, K., Kubota, Y. and Ooi, T. (1983). Classification of proteins into groups based on amino acid composition and other characters. I. Angular distribution. J. Biochem. (Tokyo) 94, 981-995.

  19. Nakashima, H., Nishikawa, K. and Ooi, T. (1986). The folding type of a protein is relevant to the amino acid composition. J. Biochem. (Tokyo) 99, 153-162.

  20. Nakashima, H. and Nishikawa, K. (1992). The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Lett. 303, 141-146.

  21. Nakashima, H. and Nishikawa, K. (1994). Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 238, 54-61.

  22. Schneider, G. (1999). How many potentially secreted proteins are contained in a bacterial genome? Gene 237, 113-121.

  23. Guruprasad, K., Reddy, B. V. and Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 4, 155-161.

  24. von Heijne, G. (1992). Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol. 225, 487-494.

  25. Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E. L. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567-580.

  26. Nandi, T., Dash, D., Ghai, R., B-Rao, C., Kannan, K., Brahmachari, S. K., Ramakrishnan, C. and Ramachandran, S. (2003). A novel complexity measure for comparative analysis of protein sequences from complete genomes. J. Biomol. Struct. Dyn. 20, 657-668.

  27. Nandi, T., Kannan, K. and Ramachandran, S. (2003). The low complexity proteins from enteric pathogenic bacteria: Taxonomic parallels embedded in diversity. In Silico Biol. 3, 0024.

  28. Nandi, T., Kannan, K. and Ramachandran, S. (2003). Species and strain-specific patterns of low-complexity proteins in Escherichia and Mycobacteria. Current Science 85, 185-187.

  29. Brazma, A. and Vilo, J. (2000). Gene expression data analysis. FEBS Lett. 480, 17-24.

  30. Nandi, T., B-Rao, C. and Ramachandran, S. (2002). Comparative genomics using data mining tools. J. Biosci. 27, 15-25.

  31. Fauchere, J. L. and Pliska, V. (1983). Hydrophobic parameters π of amino-acid side chains from the partitioning of N-acetyl-amino-acid amides. Eur. J. Med. Chem. - Chim. Ther. 18, 369-375.

  32. Hopp, T. P. and Woods, K. R. (1981). Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. USA 78, 3824-3828.

  33. Kyte, J. and Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105-132.

  34. Rose, G. D., Geselowitz, A. R., Lesser, G. J., Lee, R. H. and Zehfus, M. H. (1985). Hydrophobicity of amino acid residues in globular proteins. Science 229, 834-838.

  35. Roland, L. and Eisenberg, D. (1992). Protein. In: Sequence Analysis Primer, Gribskov, M. and Devereux, J. (eds.), Oxford University Press, Oxford, pp. 61-87.

  36. Varadarajan, R., Nagarajaram, H. A. and Ramakrishnan, C. (1996). A procedure for the prediction of temperature-sensitive mutants of a globular protein based solely on the amino acid sequence. Proc. Natl. Acad. Sci. USA 93, 13908-13913.

  37. Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.

  38. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

  39. Wisconsin Package Version 10.0, Genetics Computer Group GCG., Madison, Wisconsin, USA.

  40. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A., Fleischmann, R. D., Bult, C. J., Kerlavage, A. R., Sutton, G., Kelley, J. M., Fritchman, J. L., Weidman, J. F., Small, K. V., Sandusky, M., Fuhrmann, J., Nguyen, D., Utterback, T. R., Saudek, D. M., Phillips, C. A., Merrick, J. M., Tomb, J.-F., Dougherty, B. A., Bott, K. F., Hu, P.-C. and Lucier, T. S. (1995). The minimal gene complement of Mycoplasma genitalium. Science 270, 397-403.

  41. Glass, J. I., Lefkowitz, E. J., Glass, J. S., Heiner, C. R., Chen, E. Y. and Cassell, G. H. (2000). The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature 407, 757-762.

  42. Fraser, C. M., Casjens, S., Huang, W. M., Sutton, G. G., Clayton, R., Lathigra, R., White, O., Ketchum, K. A., Dodson, R., Hickey, E. K., Gwinn, M., Dougherty, B., Tomb, J. F., Fleischmann, R. D., Richardson, D., Peterson, J., Kerlavage, A. R., Quackenbush, J., Salzberg, S., Hanson, M., van Vugt, R., Palmer, N., Adams, M. D., Gocayne, J., Weidman, J., Utterback, T., Watthey, L., Mcdonald, L., Artiach, P., Bowman, C., Garland, S., Fujii, C., Cotton, M. D., Horst, K., Roberts, K., Hatch, B., Smith, H. O. and Venter, J. C. (1997). Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580-586.

  43. Subramanian, G., Koonin, E. V. and Aravind, L. (2000). Comparative genome analysis of the pathogenic spirochetes Borrelia burgdorferi and Treponema pallidum. Infect. Immun. 68, 1633-1648.

  44. Fraser, C. M., Norris, S. J., Weinstock, G. M., White, O., Sutton, G. G., Dodson, R., Gwinn, M., Hickey, E. K., Clayton, R., Ketchum, K. A., Sodergren, E., Hardham, J. M., McLeod, M. P., Salzberg, S., Peterson, J., Khalak, H., Richardson, D., Howell, J. K., Chidambaram, M., Utterback, T., McDonald, L., Artiach, P., Bowman, C., Cotton, M. D., Fujii, C., Garland, S., Hatch, B., Horst, K., Roberts, K., Sandusky, M., Weidman, J., Smith, H. O. and Venter, J. C. (1998). Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 281, 375-388.

  45. Simpson, A. J. et al. (2000). The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature 406, 151-157.

  46. Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., Tatusova, T. A., Shankavaram, U. T., Rao, B. S., Kiryutin, B., Galperin, M. Y., Fedorova, N. D. and Koonin, E. V. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29, 22-28.

  47. Schoolnik, G. K. and Yildiz, F. H. (2000). The complete genome sequence of Vibrio cholerae: a tale of two chromosomes and of two lifestyles. Genome Biol. 1, reviews1016.1-1016.3.

  48. Fraser, C. M., Eisen, J., Fleischmann, R. D., Ketchum, K. A. and Peterson, S. (2000). Comparative genomics and understanding of microbial biology. Emerg. Infect. Dis. 6, 505-512.

  49. Akashi, H. and Gojobori, T. (2002). Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc. Natl. Acad. Sci. USA 99, 3695-3700.

  50. Madec, E., Laszkiewicz, A., Iwanicki, A., Obuchowski, M. and Seror, S. (2002). Characterization of a membrane-linked Ser/Thr protein kinase in Bacillus subtilis, implicated in developmental processes. Mol. Microbiol. 46, 571-586.

  51. Koul, A., Choidas, A., Tyagi, A. K., Drlica, K., Singh, Y. and Ullrich, A. (2001). Serine/threonine protein kinases PknF and PknG of Mycobacterium tuberculosis: characterization and localization. Microbiol. 147, 2307-2314.

  52. Young, T. A., Delagoutte, B., Endrizzi, J. A., Falick, A. M. and Alber, T. (2003). Structure of Mycobacterium tuberculosis PknB supports a universal activation mechanism for Ser/Thr protein kinases. Nat. Struct. Biol. 10, 168-174.

  53. von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. and Snel, B. (2003). STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 31, 258-261.

  54. Riley, M. (1993). Functions of the gene products of Escherichia coli. Microbiol. Rev. 57, 862-952.

  55. Ramachandran, S., Thompson, R. W., Gam, A. A. and Neva, F. A. (1998). Recombinant cDNA clones for immunodiagnosis of strongyloidiasis. J. Infect. Dis. 177, 196-203.

  56. De Arruda, M. E., Collins, K. M., Hochberg, L. P., Ryan, P. R., Wirtz, R. A. and Ryan, J. R. (2004). Quantitative determination of sporozoites and circumsporozoite antigen in mosquitoes infected with P. falciparum & P. vivax. Ann. Trop. Med. Parasitol. 98, 121-127.

  57. Medigue, C., Rouxel, T., Vigier, P., Henaut, A. and Danchin, A. (1991). Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222, 851-856.

  58. Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. Academic Press, London.

  59. McInerney, J. O. (1998). GCUA: General Codon Usage Analysis. Bioinformatics 14, 372-373.