| In Silico Biology 3, 0024 (2003); ©2003, Bioinformation Systems e.V. |
G.N. Ramachandran Knowledge Centre for Genome Informatics
Institute of Genomics and Integrative Biology, Mall Road, Delhi 110 007 India
Tel: +91-11-2766-6156, x169;
+91-11-2766-6157, x169
Fax: +91-11-2766-7471
1School of Biotechnology, Guru Govind Singh Indraparastha University, Kashmere Gate, Delhi 110 006
Email: ramu@cbt.res.in; ramucbt@yahoo.com
* corresponding author
Edited by E. Wingender; received December 12, 2002; revised and accepted March 01, 2003; published March 23, 2003
The number and functions of the low complexity (LC) proteins from four enteric bacterial pathogens Escherichia coli O157, Vibrio cholerae, Helicobacter pylori and Campylobacter jejuni were compared. For this purpose the LC proteins were grouped into 3 categories for pairwise comparisons. These were COMMON, VARIANT and LC proteins with No Homologues (LCNH). Homologous LC proteins in both species in a given pairwise comparison were grouped as COMMON. Homologous proteins of the same function that are of low complexity in only one of the two species in a given pair were grouped as VARIANT. LC proteins without any homologues in either species were grouped as LCNH. Conservation patterns were inferred by comparing them under 3 functional classes: CELLULAR PROCESSES (CP), TRANSPORT & MEMBRANE ASSOCIATED (TM) and CHARACTERISTIC (CH). In the COMMON category, the highest similarity was found between E. coli O157 and V. cholerae on the one hand and H. pylori and C. jejuni on the other under the functional class CP. This parallels taxonomic classification in that E. coli and V. cholerae are classified under the gamma subdivision of proteobacteria whereas H. pylori and C. jejuni are classified under the epsilon subdivision. The data from the LCNH group, although more diffuse, were complementary to the pattern drawn from the COMMON category in that the numbers of LCNH in the pair {E. coli O157, V. cholerae} and in {H. pylori, C. jejuni} were lowest. No consistent patterns were observed in the VARIANT category. These observations indicate that although low complexity segments are thought to undergo variations, species patterns do exist in a limited set of low complexity proteins that parallel taxonomic classification.
Key words: sequence complexity, low complexity proteins, microbial pathogen, enteric bacteria, colonization factor, cag pathogenicity island, PGRS proteins, functional classification, taxonomic relationships, pairwise comparisons, comparative genomics, sequence analysis, functional superclass, bacterial genomes, cellular processes, transport proteins, membrane associated proteins, characteristic proteins, conservation pattern, genomics
The vast majority of proteins identified in different bacteria through genome sequencing are of high sequence complexity. A minor fraction of the proteins (about 3-7% of the total, depending on the species) have a high proportion of low complexity sequences [1]. We call these low complexity (LC) proteins. These observations indicate that LC proteins are generally selected against in bacterial evolution. Earlier work by Wootton [2] indicated that they have either structural or transmembrane characteristics. Recently, the analysis of low complexity sequences has received attention from the perspectives of structure and sequence variation [3, 4, 5, 6]. We have been analyzing the LC proteins from different bacteria whose complete genome sequences are available. The results showed that the LC proteins are very few (less than 10 in number) in the INFORMATION functional superclass [7] comprising the replication, transcription and translation classes. The number of LC proteins is higher in other functional classes such as metabolism, and transport and membrane associated. A sizable number of LC proteins in bacteria have no known function.
In order to identify species and strain patterns in the LC proteins we have carried out systematic groupings of the LC proteins into different classes based on their functional annotation. The procedure was simplified by combining the information superclass and the metabolic class into a superclass called CELLULAR PROCESSES (CP). The transport and membrane associated classes were combined into the TRANSPORT AND MEMBRANE ASSOCIATED (TM) superclass. After the LC proteins were placed into the CP and TM superclasses, the remainder consisted of several proteins whose functional roles correlate with the characteristic biology of a given species or strain. Therefore these proteins were placed into the CHARACTERISTIC (CH) superclass. This procedure, a slight modification of Riley's method [8], enables a rapid comparative analysis of the genomic information on LC proteins. Comparisons of the number and the function of LC proteins in each of these classes in Escherichia coli K12 and O157 and Mycobacteria revealed species and strain specific patterns that could be correlated with the pathogenic and nonpathogenic characteristics of these species [9]. In this work, we report the comparative analysis of the LC proteins from 4 enteric pathogenic bacteria: E. coli O157, Helicobacter pylori, Campylobacter jejuni and Vibrio cholerae.
Sequences
The complete genome sequences of Campylobacter jejuni, Escherichia coli O157, Helicobacter pylori and Vibrio cholerae [10, 11, 12, 13], the proteins encoded and the annotation files were downloaded from the NCBI web site (http://www.ncbi.nlm.nih.gov) ENTREZ genome server or through anonymous ftp.
Computer programs
The ScanCom program [1] was run through the entire protein sequence file in FASTA format (*.faa files) and proteins were classified into either low or high sequence complexity based on the amount of low complexity sequence present using a sliding window of 45 amino acid residues. Proteins smaller than 45 amino acid residues were not scanned. Thus 94% to 99% of the total proteins were analyzed through this procedure depending on the species.
After assessing sequence complexity, the program reports the numerical value of Fc which is an indicator of the amount of low complexity sequence present in a protein. Fc, expressed as percentage units, is defined as
$$F_c = \frac{\text{No. of window-size fragments of low complexity}}{\text{Total number of window-size fragments in the protein}} \qquad (1)$$

where low complexity is judged with the complexity measure [14]. Proteins with Fc ≥ 15% were classified as low complexity whereas those with Fc < 15% were classified as high complexity.

We define the complexity of a sequence as

[Equation (2): formula not recoverable from the source]

Complexity ranges from 0 (for homopolymers, skew = maximum skew and Nobs = Nmax) to a maximum of 1 (for the most complex sequence with no repetitions and uniform amino acid distribution, skew / maximum skew = 0 and Nobs = Nmax), where

Nobs = observed number of distinct dimers
Nmax = maximal number of distinct dimers

The measure of skew is defined as

[Equation (3): formula not recoverable from the source]

where Ox is the observed number of the xth amino acid and L is the sequence length. The skew is maximum for a homopolymeric sequence, where it takes the value

[Equation (4): formula not recoverable from the source]

The ratio (skew / maximum skew) is a normalized measure of skew resulting from amino acid composition.
Proteins identified as low complexity using our method were also re-examined using the SEG program available from NCBI. Except for a few proteins (3-7 in number depending on the species), the SEG results (W = 45, K1 = 3.4, K2 = 3.75) were in agreement with our predictions.
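The window-based classification described in this section can be sketched roughly as follows. This is an illustration only and not the ScanCom implementation. The 45-residue window, the exclusion of shorter proteins and the 15% Fc threshold are taken from the text. The per-window score (`dimer_fraction`, the fraction of distinct dimers with Nmax assumed to be min(window length − 1, 400)) and the per-window cutoff `WINDOW_CUTOFF` are hypothetical stand-ins chosen so that the example runs.

```python
WINDOW = 45          # sliding window size (from the text)
FC_THRESHOLD = 15.0  # percent; Fc >= 15% means low complexity (from the text)
WINDOW_CUTOFF = 0.5  # ASSUMPTION: hypothetical per-window cutoff, not from the paper


def dimer_fraction(fragment):
    """ASSUMPTION: stand-in complexity score = distinct dimers / maximum possible."""
    n_obs = len({fragment[i:i + 2] for i in range(len(fragment) - 1)})
    n_max = min(len(fragment) - 1, 400)  # at most 20 x 20 distinct dimers
    return n_obs / n_max


def fc_percent(seq, score=dimer_fraction, window=WINDOW, cutoff=WINDOW_CUTOFF):
    """Percentage of window-size fragments scoring at or below the cutoff.

    Returns None for proteins shorter than the window, which are not scanned.
    """
    if len(seq) < window:
        return None
    n_windows = len(seq) - window + 1
    n_low = sum(1 for i in range(n_windows)
                if score(seq[i:i + window]) <= cutoff)
    return 100.0 * n_low / n_windows


def classify(seq):
    """Label a protein from its Fc value using the 15% threshold."""
    fc = fc_percent(seq)
    if fc is None:
        return "not scanned"
    return "low complexity" if fc >= FC_THRESHOLD else "high complexity"
```

Passing a different `score` function and `cutoff` to `fc_percent` would let the same loop be used with the actual complexity measure once its definition is at hand.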
Pairwise comparisons
We investigated the similarities and the differences in LC proteins between the different species using the genome BLAST analysis available at the NCBI. Homologous proteins with the same functional role that are of low complexity in both species in a given pairwise comparison were grouped into a category called COMMON. Homologous proteins with the same functional role that are of low complexity in only one of the two species in a given pairwise comparison were classified as VARIANT. LC proteins without any homologues (LCNH) in the other species were classified separately.
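The grouping rules above can be sketched as a small function. It is a minimal illustration under stated assumptions: the homologue relationships are assumed to be already available as a dictionary built from the BLAST results, sharing of functional role is assumed to have been checked when that dictionary was built, and the function name, argument names and protein identifiers are all hypothetical.

```python
def categorize(lc_a, lc_b, homologs_a_to_b):
    """Group the LC proteins of base species A relative to species B.

    lc_a, lc_b       : sets of protein IDs classified as low complexity in A and B
    homologs_a_to_b  : dict mapping an A protein ID to its homologue ID in B;
                       a missing key means no homologue was found
    All inputs are hypothetical stand-ins for the BLAST output.
    """
    groups = {"COMMON": set(), "VARIANT": set(), "LCNH": set()}
    for pid in lc_a:
        partner = homologs_a_to_b.get(pid)
        if partner is None:
            groups["LCNH"].add(pid)     # no homologue in B
        elif partner in lc_b:
            groups["COMMON"].add(pid)   # low complexity in both species
        else:
            groups["VARIANT"].add(pid)  # homologue exists but is not LC in B
    return groups


# Hypothetical example: a1 is COMMON, a2 is VARIANT, a3 is LCNH.
example = categorize({"a1", "a2", "a3"}, {"b1"}, {"a1": "b1", "a2": "b2"})
```

Running the function a second time with the roles of A and B exchanged would pick up the VARIANT and LCNH proteins seen from the side of species B.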
Repeats analysis
Analysis of repeats was carried out with the REPEAT program in the Wisconsin Package version 10.0 [14] using default parameters.
A comparative summary of the low complexity proteins in the different enteric bacterial pathogens is shown in Table 1. The number of LC proteins in each functional class was normalized to the total number of proteins scanned in the respective species (those not shorter than 45 amino acid residues) and scaled up by a factor of 1000, so the values are numbers per thousand proteins. In the class of cellular processes (CP) the normalized number of LC proteins in E. coli O157 (5.15) is similar to that of V. cholerae (6.40). Likewise, the normalized numbers of LC proteins in H. pylori (9.97) and C. jejuni (8.65) are similar. No such pattern is discernible in the transport and membrane associated class (TM). In the CHARACTERISTIC (CH) class the normalized numbers for E. coli O157 (9.93) and H. pylori (9.97) are similar.
| Table 1: | Distribution of low complexity proteins in the functional superclasses CP, TM, CHARACTERISTIC (CH) and hypothetical (H) in C. jejuni, E. coli O157, H. pylori and V. cholerae. Values are the normalized number of LC proteins in each functional class (a). |
| Species | CP | TM | CH | H |
| C. jejuni | 8.65 | 21.63 | 12.36 | 9.89 |
| E. coli O157 | 5.15 | 10.31 | 9.93 | 9.73 |
| H. pylori | 9.97 | 8.64 | 9.97 | 38.54 |
| V. cholerae | 6.40 | 4.73 | 4.17 | 12.80 |
| a: The number of proteins in each functional class was normalized to the total number of proteins scanned using ScanCom [1] and up-scaled by a factor of 1000. |
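The normalization used for Table 1 amounts to a single ratio scaled to a per-thousand value. The short sketch below shows that arithmetic; the function name and the counts in the example are hypothetical and are not taken from the paper.

```python
def per_thousand(n_lc_in_class, n_scanned):
    """LC proteins in one functional class per 1000 scanned proteins."""
    return 1000.0 * n_lc_in_class / n_scanned


# Hypothetical counts: 20 LC proteins in a class out of 4000 scanned proteins.
example_value = per_thousand(20, 4000)
```

With these hypothetical counts the result is 5.0 per thousand, which corresponds to 0.5% and not 5% of the scanned proteins.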
The normalized number of COMMON LC proteins between different pairs of species in the three functional superclasses is shown in Figure 1. It is apparent that the proportion of COMMON LC proteins in all the three classes is highest between E. coli O157 and V. cholerae (56%, 53% and 33%). Among the rest of the comparisons, the number of COMMON LC proteins in the Cellular Processes class is highest between C. jejuni and H. pylori (43%). The proportion of COMMON LC proteins in other pairwise comparisons is either very low or absent.
The normalized proportion of LCNH proteins (low complexity proteins with no homologues) between different pairs of species in the CELLULAR PROCESSES class is shown in Figure 2. The proportion of LCNH proteins is low in two pairwise comparisons, namely E. coli O157 vs V. cholerae and C. jejuni vs H. pylori. Further, in these pairwise comparisons the numbers are close when the species pair is compared both ways. For example, in E. coli O157 vs V. cholerae, 22% of the total LC proteins in the CP class are LCNH proteins. Conversely, in V. cholerae vs E. coli O157, 17% of the total LC proteins in the CP class are LCNH proteins. The corresponding values in the H. pylori and C. jejuni comparisons are 13% and 14%. In all other comparisons, the numbers are 2-4 fold higher and differ widely when compared both ways. The proportions of LCNH proteins in the TM and CHARACTERISTIC classes were very high and no consistent pattern was discernible. Likewise, no consistent pattern was discernible from pairwise comparisons of VARIANT LC proteins.
Figure 2: Complementary evidence for taxonomic parallels observed in Figure 1. LC proteins with No Homologues (LCNH) in functional class of CELLULAR PROCESSES in pairwise comparisons between different organisms. To enable comparative analysis, the actual numbers of LCNH proteins were normalized to the total number of LC proteins in the CP functional class of the base organism in a given pair. For example, in the pair 'A'-'B' between species 'A' and 'B', the base organism is 'A'. Note that the matrix is not symmetric. This is because the number of LCNH proteins in a given species such as E. coli O157 with no homologues in V. cholerae is different from those of V. cholerae with no homologues in E. coli O157. When read from left of the matrix, the base organism is the current one. This order is reversed when read from top.
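The asymmetry of the LCNH matrix follows directly from normalizing to the base organism. The sketch below illustrates this with hypothetical protein identifiers; the function name and the assumption that homologue presence is supplied as a set of base-organism IDs are illustrative choices, not the authors' implementation.

```python
def lcnh_percent(lc_base, with_homolog_in_other):
    """Percentage of the base organism's LC proteins with no homologue in the other.

    lc_base               : set of LC protein IDs in the base organism (one class)
    with_homolog_in_other : set of base-organism IDs that have a homologue
                            in the other organism
    """
    if not lc_base:
        return 0.0
    return 100.0 * len(lc_base - with_homolog_in_other) / len(lc_base)


# Hypothetical pair: 'A' as base has 4 LC proteins, 1 without a homologue in 'B';
# 'B' as base has 5 LC proteins, 2 without a homologue in 'A'.
a_vs_b = lcnh_percent({"a1", "a2", "a3", "a4"}, {"a1", "a2", "a3"})
b_vs_a = lcnh_percent({"b1", "b2", "b3", "b4", "b5"}, {"b1", "b2", "b3"})
```

In this hypothetical pair the two directions give 25% and 40%, which is why the two halves of the matrix need not match.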
Low complexity proteins constitute a minor fraction (about 3-7%) in bacterial genomes, indicating that they are generally selected against in bacterial evolution. The vast number of proteins identified through genome sequencing indicates that low complexity proteins span a wide range of functions including translation, metabolism, transport and membrane associated, adhesins, cell division proteins and a few proteins with functional roles that can be correlated with the biological characteristics of a given species. Examples of the latter are the Type III secretion apparatus and other secreted proteins of the enteropathogenic E. coli O157, the PGRS proteins of M. tuberculosis, the colonization factor of V. cholerae, the cag pathogenicity island protein of H. pylori and the sporulation proteins of Bacillus subtilis.
The number and the exact functional role of LC proteins vary from species to species. We therefore adopted a higher-order classification scheme to enable comparative studies. The scheme proposed here is a slight modification of that proposed by Riley [8] and used in genomic studies. The differences in the number of LC proteins in different functional classes arise from compositional biases and sequence variations in different species. For instance, a protein that is LC in one species is often LC in several other species, but not in all. One example is the ribosomal protein rpL7/L12, which occurs as an LC protein in many bacteria but not all [15]. In addition, we have observed that the acquisition of genes due to phage invasion contributes to these differences (see Table 3 below).
In the present work, we have analyzed the patterns of the LC proteins in four enteric pathogenic bacteria. The results shown in Figure 1 indicate that there is a close relationship in the COMMON LC proteins between E. coli O157 and V. cholerae in all the three functional classes. Indeed, E. coli O157 and V. cholerae are classified under the same taxon of gamma sub-division of proteobacteria and therefore the close relationship observed in the COMMON LC proteins parallels the taxonomic classification. Likewise, the results from Figure 1 indicate a close relationship between H. pylori and C. jejuni in the representation of COMMON LC proteins. Both H. pylori and C. jejuni are classified under the epsilon subdivision of proteobacteria. Thus, the close relationship observed in the COMMON LC proteins parallels taxonomic classification in these cases. The results from Figure 2 provide complementary support to the conclusions derived from COMMON LC proteins. Close relationship between a given species pair should indicate that the number of LC proteins with no homologues must be low in that pair. This prediction is well met in the CELLULAR PROCESSES class.
We carried out a comparative compositional and identity analysis of the repeats of COMMON LC proteins of the CELLULAR PROCESSES class (Table 2). It is evident that the repeats of the homologues vary. The highest identity in repeats between the homologues is shown by the low complexity ribosomal proteins perhaps because these are highly conserved in evolution. Among the top ranking amino acid frequencies in the repeats, low complexity ribosomal proteins show identical ranking of amino acids in the E. coli O157 vs V. cholerae comparison. In the case of H. pylori vs C. jejuni comparison, there appears to be more variability in that the rank positions are switched, although the similarity in the overall pattern is maintained. Whether these differences are due to lineage specific rates needs further investigation. Low complexity proteins of functions other than ribosomal proteins show slight variability in the rank order of the individual amino acids while exhibiting similarity.
| Table 2: Compositional and identity analysis of repeats of COMMON LC proteins of CELLULAR PROCESSES class. |
| Species pair | COMMON LC proteins and their functions | % Identity in repeats (a) | Top ranking amino acids in the repeats (b) |
| E. coli O157 and V. cholerae | putative transferase | 33% | (V, N, A) / (I, N, G) |
| | 50S ribosomal subunit protein L7/L12 | 9.1% | (A, E) / (A, E) |
| | pyruvate dehydrogenase | 34% | (A, V, P) / (A, G, K) |
| | 50S ribosomal subunit protein L15 | 54% | (G, R) / (G, K) |
| | 50S ribosomal subunit protein L18 | 50% | (A, K) / (A, K) |
| | 50S ribosomal subunit protein A | 33% | (K) / (K) |
| | membrane-bound ATP synthase, F0 sector | 33% | (A, E, K) / (A, E, Q) |
| | acetylCoA carboxylase, BCCP subunit | 50% | (A, E, P) / (A, P, E) |
| | FKBP-type peptidyl-prolyl cis-trans isomerase | 36% | (H, G, D) / (G, H, D) |
| | ssDNA-binding protein | 38% | (G, Q, P) / (Q, G, P) |
| | RNase E, membrane attachment | 31% | (R, E, V) / (E, P, V) |
| | inducible ATP-independent RNA helicase | 19% | (R, E, G) / (R, G, E) |
| | ATP-dependent dsDNA exonuclease | 17% | (Q, E, L) / (Q, L, E) |
| H. pylori and C. jejuni | ribosomal protein L29 | No identity | (K) / (K) |
| | ribosomal protein L24 | 17% | (K, V, I) / (K, I, A) |
| | ribosomal protein S21 | 67% | (R, K, F) / (K, R, F) |
| | ATP synthase F0, subunit b | 13% | (K, L, F) / (L, K, E) |
| | serine acetyltransferase | 23% | (G, I, K) / (G, I, V) |
| | ATP synthase F0, subunit c | 37% | (G, A, L) / (A, G, L) |
a: The number of identical repeats were scored and normalized to the total number of maximum possible identical repeats as in Figure 1.
b: The amino acid frequencies were computed from observed occurrences and ranked in descending order. We scored the top distinctly ranking amino acids. The descending order of ranks are shown horizontally read from left to right for both the COMMON LC proteins in a species pair demarcated by a '/'.
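The ranking of amino acids reported in the last column of Table 2 can be illustrated with a short frequency count. This is a sketch under assumptions: the function name and the example repeat string are hypothetical, and breaking ties alphabetically is a choice made here only to keep the output deterministic; the text does not state how ties were handled.

```python
from collections import Counter


def top_ranked(repeat_seq, n=3):
    """Return the n most frequent amino acids in a repeat, most frequent first."""
    counts = Counter(repeat_seq)
    # Sort by descending count; ASSUMPTION: ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [aa for aa, _ in ranked[:n]]


# Hypothetical repeat with A x5, E x3, K x1.
example_ranking = top_ranked("AAEAEKAEA")
```

Applying the function to the repeats of both homologues in a pair and writing the two lists side by side would give entries of the form used in the table, such as (A, E, K) / (A, E, Q).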
These observations indicate that conservation patterns exist among the LC proteins that parallel taxonomic classification, most clearly in the CELLULAR PROCESSES class of functions. Perhaps this is due to the higher levels of conservation observed in proteins with functional roles in basic processes of the cell. The transport and membrane associated class exhibits more diversity than the CELLULAR PROCESSES class. This is presumably because these proteins are located at the interface between the local environment of the bacterium (niche), which varies from species to species, and the internal cellular compartment [16].
The CHARACTERISTIC class also exhibits substantial diversity. The functional details of the LC proteins from a given species with no homologues in any of the other three species are shown in Table 3. It is in this class that many species-specific proteins, and other proteins such as those acquired through phage invasion, appear. The numbers are high in the TM and CHARACTERISTIC classes. It is noteworthy that many of the proteins in the CHARACTERISTIC class correlate with the characteristic biology of the species. The similarity in the normalized numbers of LC proteins between E. coli O157 and H. pylori in the CHARACTERISTIC class (noted in Table 1) needs further investigation.
| Table 3: | Functional details of the LC proteins from a given species with no homologues in the other three speciesa. |
| V. cholerae | |
| CP | Cytochrome c554, oxaloacetate decarboxylase beta subunit |
| TM | - |
| CH | Colonization factor, negative regulator of flagellin synthesis FlgM |
| E. coli O157 | |
| CP | Positive regulator of sigma heat shock protein, putative homeobox protein, ATP synthase Fo sector subunit b, putative phosphotransferase system enzyme subunit, putative acyl carrier protein |
| TM | Membrane protein (4), lipoprotein (2), transport (2), aquaporin, Na+/H+ antiporter |
| CH | Prophage (20), bacteriophage (4), curlin major subunit, acid shock protein, flagellar biosynthesis, sepZ, secreted protein, detox protein, PTS system glucitol/sorbitol specific protein, type III secretion apparatus, translocated intimin receptor |
| C. jejuni | |
| CP | - |
| TM | Membrane protein (10), lipoprotein |
| CH | Periplasmic protein (7), highly acidic protein, small hydrophobic protein (2), putative coiled coil protein |
| H. pylori | |
| CP | NADH ubiquinone oxidoreductase, fucosyltransferase |
| TM | Membrane protein (3) |
| CH | Cag14, tetracycline resistance protein, secreted protein involved in flagellar motility, histidine rich metal binding polypeptide, histidine glutamine rich protein |
| a: If there is more than one protein with very similar annotation the numbers are indicated in brackets. |
In summary, we report that conservation patterns that parallel taxonomic classification exist in the low complexity proteins belonging to cellular processes in the 4 enteric pathogenic species E. coli O157, V. cholerae, H. pylori and C. jejuni. There is more diversity in the transport and membrane associated and characteristic classes. These observations on conservation patterns related to taxonomic classification are of interest since low complexity sequences are generally thought to be prone to sequence variations and recombination [18] and constitute a minor fraction of the genome.
TN is the recipient of a fellowship from the Council of Scientific and Industrial Research (CSIR). SR is the recipient of an NMITLI grant from the CSIR. We thank the anonymous referee for comments.