In Silico Biology 8, 0019 (2008); ©2008, Bioinformation Systems e.V.
Bioinformatics sub-centre, School of Biotechnology, Devi Ahilya University, Khandwa Road, Indore-452001, India
URL: http://www.davvbiotech.res.in
* Corresponding author
Email: ak_sbt@yahoo.com
Phone: +91-731-2470373; Fax: +91-731-2470372
Edited by H. Michael; received September 08, 2007; revised and accepted March 19, 2008; published April 07, 2008
Pathogenicity islands (PAIs) are a subset of genomic islands (GIs) that are acquired by horizontal gene transfer (HGT) and generally show a significant deviation in G+C content, dinucleotide frequency or codon frequency from the core genome. The major approaches to PAI identification rely on composition bias and/or similarity to known PAIs. These approaches either stop at the detection of GIs or restrict the search to regions similar to previously annotated PAIs. PredictBias is a web application for the identification of genomic and pathogenicity islands in prokaryotes based on composition bias, presence of insertion elements, proximity to virulence-associated genes and absence in related non-pathogenic species. A profile database of virulence factors (VFPD) has been developed from 213 virulence-associated protein families retrieved from the Pfam and PRINTS databases. PredictBias performs an RPSBLAST search against VFPD for regions with significant composition bias. A region that encodes at least one virulence-related protein is marked as a potential PAI (biased-composition); otherwise it is marked as a GI. Regions that are involved in virulence but show no conspicuous composition bias because of ancient HGT are identified by scanning genome segments (8 ORFs) for more than four significant hits to VFPD and are marked as potential PAI (unbiased-composition). The relative absence of potential PAIs in related non-pathogenic species can be investigated using the 'compare genome feature' of PredictBias, which further aids in validating the results and defining the boundaries of PAIs. Performance analysis showed that the output of PredictBias is in agreement with known findings. PredictBias is available at www.davvbiotech.res.in/PredictBias.
Keywords: pathogenicity islands, genomic islands, web server
Pathogenicity islands are distinct chromosomal regions of pathogenic bacteria that contain genes encoding virulence factors, viz. adhesins, toxins and invasins, and contribute to the virulence of the respective pathogen. Although the major portion of a bacterial genome (70%-80%) has a homogeneous G+C content, some portions (20%-30%) carry segments of DNA with a distinct G+C content and include large unstable regions carrying insertion sequences (ISs), transposons, prophages and other elements; these are known as genomic islands (GIs) [1]. Pathogenicity islands (PAIs) form a distinct sub-class of genomic islands that are acquired by horizontal transfer and usually occupy relatively large genomic regions ranging from 10 to 200 kb. PAIs were first described in uropathogenic Escherichia coli and were afterwards found in many pathogens of humans, animals and plants. At least ten pathogenicity islands have been identified in Salmonella typhimurium alone [2]. Given the increase in bacterial resistance to conventional antibiotics, there is an urgent need to identify novel drug targets. Because they are vital for bacterial pathogenesis, PAI-encoded genes may play a significant role in identifying potential drug targets.
Common genetic features associated with pathogenicity islands are the presence of one or more virulence genes, a significant composition bias relative to the core genome (%GC bias, dinucleotide bias and codon bias), presence in pathogenic species but absence in benign relatives, a tRNA gene acting as the insertion site, and proximity to mobile genetic elements such as integrases, transposases and insertion sequence (IS) elements [3, 4]. Several tools have been developed for in silico detection of genomic and pathogenicity islands, mostly based on measurement of composition bias and on the proximity of tRNA genes and mobile genetic elements [5, 6]. Although efficient in the detection of GIs, these tools yield many false positives for PAIs, because a region with distinct nucleotide content may be alien to the host genome without being involved in pathogenicity. Applications like PAIDB overcome this shortcoming by integrating a composition-based search with a similarity search against published PAIs. The limitation of this approach is that the PAIs it can detect are restricted by the dataset of known PAIs [7, 8].
Here we present a web server application, PredictBias, that integrates all the features associated with PAIs and predicts potential genomic and pathogenicity islands in a prokaryotic genome.
PredictBias currently presents pre-computed bias results for all the completed microbial genomes in RefSeq [9], and the dataset is updated regularly. As input, PredictBias takes a genome file in GenBank format, performs the analysis and presents the results in tabular format. A general flow diagram of the steps involved in an analysis with PredictBias is presented in Fig. 1 and described below.
Figure 1: Flow diagram showing various steps involved in analysis with PredictBias.
Composition bias analysis (sliding window approach)
PredictBias uses clusters of six ORFs to calculate %GC bias, dinucleotide bias and codon bias (the nucleotide composition measures). ORF clusters are taken consecutively along the entire genome with a sliding window that shifts by one ORF at a time. A cluster of six ORFs was chosen because previous codon-based analysis has shown that a minimum of 1,500 codons, or 4.5 kb (corresponding to about 6-8 ORFs), is necessary for a reliable estimate of bias [10]. The %GC bias for each ORF cluster is calculated as
%GC Bias (Cluster) = %GC (Cluster) − %GC (Genome)
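The sliding-window %GC bias calculation can be sketched as follows. This is an illustrative sketch only, not the PredictBias implementation: the function names are hypothetical, and it assumes that the cluster %GC is computed over the concatenated ORF sequences and the genome %GC over the full genome sequence, which the text does not specify.

```python
from typing import List


def gc_percent(seq: str) -> float:
    """Return the %G+C of a nucleotide sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_bias_per_cluster(orfs: List[str], genome: str, window: int = 6) -> List[float]:
    """%GC bias for each cluster of `window` consecutive ORFs.

    The window shifts by one ORF at a time, so a genome with n ORFs
    yields n - window + 1 clusters (none if n < window).
    """
    genome_gc = gc_percent(genome)
    biases = []
    for start in range(len(orfs) - window + 1):
        # Assumption: the cluster is treated as the concatenation of its ORFs.
        cluster = "".join(orfs[start:start + window])
        biases.append(gc_percent(cluster) - genome_gc)
    return biases
```

A positive value indicates a cluster richer in G+C than the genome, a negative value a cluster poorer in G+C.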
The dinucleotide and codon bias analyses are based on the algorithm developed by Karlin [11]. The dinucleotide bias, or average absolute dinucleotide relative abundance difference (δ*(f, g)), for each ORF cluster is calculated according to the following formula:
$$\delta^*(f, g) = \frac{1}{16} \sum_{xy} \left| \rho^*_{xy}(f) - \rho^*_{xy}(g) \right|$$
where the sum extends over all 16 dinucleotides. ρ*xy(f) and ρ*xy(g) denote the dinucleotide relative abundance values, computed over the ORFs and their reverse complements, for the cluster and the genome, respectively. ρ*xy is calculated as ρ*xy = f*xy / (f*x f*y), where f*x, f*y and f*xy denote the frequencies of mononucleotides x and y and of dinucleotide xy, respectively, in the ORF cluster or the genome. The codon bias for each ORF cluster is calculated by the following formula:
$$B(F|G) = \sum_{a} p_a(F) \left[ \sum_{(x,y,z) = a} \left| f(x,y,z) - g(x,y,z) \right| \right]$$
where pa(F) is the average frequency of amino acid a in the ORF cluster F, and f(x,y,z) and g(x,y,z) are the average codon frequencies for the ORF cluster F and the genome G, normalized to 1 within each synonymous codon set.
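The two Karlin-style measures can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the PredictBias code: it takes the dinucleotide measure as the mean, over the 16 dinucleotides, of the absolute difference in symmetrized relative abundance, and the codon bias as the amino-acid-weighted sum of absolute differences in synonymous codon frequencies. The function names, the use of the standard genetic code, and the decision to skip amino acids absent from either set are all assumptions. Values are returned unscaled; any scaling applied by PredictBias is not reproduced here.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Standard genetic code (NCBI translation table 1), bases in TCAG order.
_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: _AAS[16 * i + 4 * j + k]
               for i, a in enumerate("TCAG")
               for j, b in enumerate("TCAG")
               for k, c in enumerate("TCAG")}


def rho_star(seqs):
    """Symmetrized relative abundances rho*_xy = f*_xy / (f*_x f*_y)."""
    mono, di = Counter(), Counter()
    for seq in seqs:
        seq = seq.upper()
        # Count on both strands: the sequence and its reverse complement.
        for strand in (seq, seq.translate(_COMPLEMENT)[::-1]):
            mono.update(c for c in strand if c in BASES)
            di.update(strand[i:i + 2] for i in range(len(strand) - 1)
                      if strand[i] in BASES and strand[i + 1] in BASES)
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for x, y in product(BASES, repeat=2):
        denom = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        rho[x + y] = (di[x + y] / n_di) / denom if denom and n_di else 0.0
    return rho


def delta_star(cluster_orfs, genome_seqs):
    """Mean absolute difference of rho* over the 16 dinucleotides."""
    rf, rg = rho_star(cluster_orfs), rho_star(genome_seqs)
    return sum(abs(rf[k] - rg[k]) for k in rf) / 16.0


def codon_counts(orfs):
    """Count sense codons in frame 0 of each ORF (stops and ambiguous codons skipped)."""
    counts = Counter()
    for orf in orfs:
        orf = orf.upper()
        for i in range(0, len(orf) - 2, 3):
            codon = orf[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def codon_bias(cluster_orfs, genome_orfs):
    """B(F|G): amino-acid-weighted difference in synonymous codon usage."""
    fc, gc = codon_counts(cluster_orfs), codon_counts(genome_orfs)
    total = sum(fc.values())
    bias = 0.0
    for aa in set(CODON_TABLE.values()) - {"*"}:
        codons = [c for c, a in CODON_TABLE.items() if a == aa]
        n_f, n_g = sum(fc[c] for c in codons), sum(gc[c] for c in codons)
        if n_f == 0 or n_g == 0:
            continue  # Assumption: amino acids missing from either set contribute nothing.
        p_a = n_f / total  # frequency of amino acid a in the cluster
        bias += p_a * sum(abs(fc[c] / n_f - gc[c] / n_g) for c in codons)
    return bias
```

Both functions return 0.0 when a cluster is compared with itself, which gives a quick sanity check before they are applied to real ORF clusters.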
Identification of regions with significant bias
PredictBias examines the query genome for runs of consecutive ORF clusters (≥6) in which the codon bias deviation and either the %GC bias or the dinucleotide bias deviation exceed the threshold value, and marks the first ORF of each cluster as part of a genomic island in the output. To determine the threshold values, seventy-three GIs listed in the Islander database were analyzed. Islander is a comprehensive database containing genomic islands identified in the completely sequenced genomes of 52 bacterial organisms [12].
Out of the 52 bacterial genomes in the Islander database, one representative genome from each genus was selected for analysis. After selecting one genome from homologous genera, 73 GIs were found distributed across 29 bacterial organisms. The genomes analyzed here belong to diverse phylogenetic groups and have genomic G+C contents ranging from 32% for Staphylococcus epidermidis to 72% for Streptomyces coelicolor. The dinucleotide bias deviation, codon bias deviation and %GC bias of all the GIs present in a single bacterial organism were averaged and are listed in Tab. 1. Individual bias values for each GI and the corresponding coordinates in the host genome are given in Supplementary Tab. S1. Based on these analyses, the threshold values were set to 2.0, 3.0 and 4.0 for dinucleotide bias deviation, %GC bias and codon bias deviation, respectively. Although most of the GIs in the studied organisms have bias values above the thresholds, the bias values of some GIs were well below them. This may be because these GIs were integrated much earlier in evolutionary time and their sequence has since adjusted to the base composition of the host genome through the process of amelioration [10]. Furthermore, almost none of the studied GIs had bias values below the threshold for all three nucleotide composition measures, suggesting that no single parameter is sufficient on its own and that the three should be combined for the efficient identification of potential islands in a genome sequence.
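The thresholding rule described above can be sketched as a scan for runs of flagged clusters. The sketch below is an assumption-laden illustration, not the PredictBias code: the function name is hypothetical, "above the threshold" is read as strictly greater, and the %GC bias is compared by absolute value so that both GC-rich and GC-poor clusters can be flagged, which the text does not state.

```python
from typing import List, Tuple


def significant_runs(codon_dev: List[float],
                     dinuc_dev: List[float],
                     gc_bias: List[float],
                     codon_thr: float = 4.0,
                     dinuc_thr: float = 2.0,
                     gc_thr: float = 3.0,
                     min_run: int = 6) -> List[Tuple[int, int]]:
    """Return (first, last) cluster indices of runs of >= min_run flagged clusters.

    A cluster is flagged when its codon bias deviation exceeds its threshold
    AND either the %GC bias or the dinucleotide bias deviation exceeds its own.
    """
    flagged = [c > codon_thr and (abs(g) > gc_thr or d > dinuc_thr)
               for c, d, g in zip(codon_dev, dinuc_dev, gc_bias)]
    runs = []
    start = None
    # A trailing False closes any run that reaches the end of the genome.
    for i, flag in enumerate(flagged + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs
```

Because each cluster is identified by its first ORF, the returned indices map directly onto the ORFs that would be marked as part of a potential island.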
Table 1: Mean bias for seventy-three genomic islands studied in 29 bacterial genomes.
| Organism | Genome %GC | Group | Mean B(F\|G) deviation (a, d) | Mean δ*(f, g) deviation (b, d) | Mean %GC bias (c, d) |
| Staphylococcus epidermidis | 32 | Firmicutes | 3.31 | 2.53 | 6.49 |
| Lactococcus lactis | 35.3 | Firmicutes | 3.85 | 1.56 | 1.02 |
| Streptococcus mutans | 36.8 | Firmicutes | 11.89 | 2.25 | 4.74 |
| Enterococcus faecalis | 37.4 | Firmicutes | 3.08 | 1.09 | 4.27 |
| Listeria innocua | 37.4 | Firmicutes | 3.87 | 2.21 | 1.71 |
| Haemophilus influenzae | 38.1 | Gammaproteobacteria | 4.47 | 0.53 | 0.31 |
| Nostoc sp. PCC 7120 | 41.3 | Cyanobacteria | 3.21 | 2.18 | 2.98 |
| Bacteroides thetaiotaomicron | 42.9 | Bacteroidetes | 8.05 | 2.64 | 4.31 |
| Bacillus subtilis | 43.5 | Firmicutes | 10.63 | 2.04 | 8.26 |
| Lactobacillus plantarum | 44.4 | Firmicutes | 6.01 | 2.07 | 4.45 |
| Vibrio parahaemolyticus | 45.4 | Gammaproteobacteria | 6.46 | 3.88 | 3.28 |
| Shewanella oneidensis | 45.9 | Gammaproteobacteria | 3.95 | 1.64 | 3.3 |
| Yersinia pestis KIM | 47.7 | Gammaproteobacteria | 15.16 | 2.95 | 5.75 |
| Escherichia coli CFT073 | 50.5 | Gammaproteobacteria | 8.71 | 3.4 | 2.95 |
| Shigella flexneri 2a str. 2457T | 50.9 | Gammaproteobacteria | 9.73 | 3.15 | 3.35 |
| Salmonella enterica Typhi Ty2 | 52.1 | Gammaproteobacteria | 9.86 | 4.16 | 3.19 |
| Xylella fastidiosa | 52.6 | Gammaproteobacteria | 29.64 | 5.07 | 12.18 |
| Brucella melitensis | 57.2 | Alphaproteobacteria | 18.15 | 3.32 | 5.68 |
| Agrobacterium tumefaciens | 59 | Alphaproteobacteria | 15.87 | 4.1 | 4.31 |
| Bifidobacterium longum | 60.1 | Actinobacteria | 7.12 | 2.02 | 3.26 |
| Sinorhizobium meliloti | 62.2 | Alphaproteobacteria | 14.13 | 3.53 | 4.06 |
| Mesorhizobium loti | 62.5 | Alphaproteobacteria | 11.07 | 2.62 | 4.06 |
| Corynebacterium efficiens | 63.1 | Actinobacteria | 13.11 | 5.98 | 3.27 |
| Bradyrhizobium japonicum | 64.1 | Alphaproteobacteria | 17.58 | 2.82 | 6.28 |
| Xanthomonas campestris | 65.1 | Gammaproteobacteria | 11.63 | 2.87 | 3.97 |
| Pseudomonas aeruginosa | 66.6 | Gammaproteobacteria | 20.07 | 3.47 | 5.71 |
| Deinococcus radiodurans | 66.6 | Deinococcus-Thermus | 18.67 | 7.71 | 2.51 |
| Ralstonia solanacearum | 67 | Betaproteobacteria | 8.42 | 1.79 | 2.59 |
| Streptomyces coelicolor | 72 | Actinobacteria | 16.05 | 1.58 | 4.18 |
a Mean codon bias deviation of a GI is calculated as the mean of the difference between the codon bias (ORF clusters in GI) and the mean codon bias (genome).
b Mean dinucleotide bias deviation of a GI is calculated as the mean of the difference between the dinucleotide bias (ORF clusters in GI) and the mean dinucleotide bias (genome).
c Mean bias in G+C frequency of a GI is calculated as the mean of the difference between the G+C frequency (ORF clusters in GI) and the G+C frequency (genome).
d Bias values shown are averaged over all the GIs in a bacterial organism.
Virulence factor profile database
A profile database of virulence factors (VFPD) has been developed to investigate the role of potential GIs in pathogenicity. Protein families related to virulence were retrieved from the Pfam [13] and PRINTS [14] databases by searching for the keywords 'Virulence', 'Adhesin', 'Siderophore', 'Invasin', 'Endotoxin' and 'Exotoxin'. The results were then curated manually and, after removal of families with only a putative or potential role in virulence, a total of 213 protein families with a well-established role in virulence were obtained (Supplementary Tab. S2). For each family, four iterations of PSI-BLAST [15] search were carried out at a cut-off E-value of 0.001 with the -C and -u 1 parameters to create an output profile in ASCII format. The profiles thus created were used to build VFPD with the formatrpsdb program available in the NCBI toolkit (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML). PredictBias performs an RPSBLAST search for each potential GI against VFPD at a cut-off E-value of 0.00001, with low-complexity regions filtered. GIs with at least one significant hit to VFPD are marked as potential PAI (biased-composition) in the output of PredictBias.
Earlier studies have shown that many regions involved in virulence show no conspicuous composition bias, either because the transferred region has adjusted to the host composition by amelioration or because it was acquired by HGT from a species with similar nucleotide composition [10]; such regions are therefore undetectable by composition-based analysis. To identify them, PredictBias scans for regions (8 ORFs) containing ≥4 ORFs with significant hits to VFPD. Overlapping regions are merged, and the resulting continuous ORF stretch is marked as potential PAI (unbiased-composition) in the output of PredictBias.
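The two VFPD-based labelling rules described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the PredictBias implementation: the function names and the input representation (one best RPSBLAST E-value per ORF, or None when there is no hit) are hypothetical, a hit is treated as significant when its E-value is at or below the 0.00001 cut-off, and merged regions are reported as the full span of the overlapping 8-ORF windows.

```python
from typing import List, Optional, Tuple

E_CUTOFF = 1e-5  # RPSBLAST cut-off E-value quoted in the text


def is_hit(evalue: Optional[float], cutoff: float = E_CUTOFF) -> bool:
    """True if an ORF has a significant VFPD hit (None means no hit at all)."""
    return evalue is not None and evalue <= cutoff


def classify_island(orf_evalues: List[Optional[float]]) -> str:
    """Label a composition-biased region: PAI if any ORF hits VFPD, else GI."""
    if any(is_hit(e) for e in orf_evalues):
        return "potential PAI (biased-composition)"
    return "GI"


def unbiased_pai_regions(orf_evalues: List[Optional[float]],
                         window: int = 8,
                         min_hits: int = 4) -> List[Tuple[int, int]]:
    """Scan 8-ORF windows for >= min_hits ORFs with VFPD hits; merge overlaps.

    Returns (first, last) ORF indices of each merged region.
    """
    hits = [is_hit(e) for e in orf_evalues]
    regions: List[Tuple[int, int]] = []
    for start in range(len(hits) - window + 1):
        if sum(hits[start:start + window]) >= min_hits:
            end = start + window - 1
            if regions and start <= regions[-1][1]:
                # Window overlaps the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

In practice the E-values would be parsed from the RPSBLAST output for each ORF; that parsing step is left out because the output format used by PredictBias is not described.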
Compare genome feature
The 'compare genome feature' of PredictBias is provided as a last step to aid in defining the start and end of PAIs and in examining the role of a potential PAI in pathogenicity. With this feature, potential PAI regions can be examined for their relative absence in non-pathogenic species; if a region is absent, this further supports its role in pathogenicity. The 'compare genome feature' is based on the blastp program of the BLAST [16] family. It compares the protein sequences in potential PAI regions against a local BLAST database of microbial genomes built using the NCBI toolkit. The local BLAST database was indexed so that a BLAST parser written in Perl can identify the relative position of an ORF within the host genome. The program arranges the significant BLAST hits in continuous stretches, as they occur in the host genome, and displays the results in tabular format, with each column holding a continuous ORF stretch significantly similar to the potential PAI. The longest ORF stretch is displayed first, followed by shorter ones. A limitation of the 'compare genome feature' is that it can be used only for microorganisms for which complete genome sequences of related non-pathogenic species are available. However, with more than 450 microbial genomes available and the sequencing of more than 700 in progress, genome comparison may be crucial in PAI identification.
PredictBias is implemented in ASP.Net, Perl and C. It runs on an IIS web server with an SQL Server database and is accessible at http://www.davvbiotech.res.in/PredictBias.
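The stretch-ordering step of the 'compare genome feature' can be illustrated with a short sketch. It is written in Python for brevity, whereas the parser described above is in Perl, and it is not the PredictBias code: the function name is hypothetical, and it assumes that each significant blastp hit has already been reduced to the ordinal position of the matching ORF in the comparison genome.

```python
from typing import Iterable, List


def continuous_stretches(hit_positions: Iterable[int]) -> List[List[int]]:
    """Group ORF positions with significant hits into runs of adjacent ORFs.

    Positions are ORF ordinals in the comparison genome. Runs are returned
    longest first, matching the display order described in the text.
    """
    stretches: List[List[int]] = []
    for pos in sorted(set(hit_positions)):
        if stretches and pos == stretches[-1][-1] + 1:
            stretches[-1].append(pos)  # adjacent to the previous ORF: extend the run
        else:
            stretches.append([pos])    # gap in the genome: start a new run
    return sorted(stretches, key=len, reverse=True)
```

For example, hits at positions 10, 11, 12, 40, 41 and 7 would be reported as the stretches 10-12, 40-41 and 7, in that order.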
Web interface
PredictBias provides a user-friendly interface that allows researchers to easily locate potential genomic and pathogenicity islands in a prokaryotic genome. Bias results are displayed in tabular format, with each potential island's start region, end region, bias values, similarity to virulence profiles, presence of insertion elements and prediction result shown in adjacent columns (Fig. 2). The user can change the threshold values for %GC, dinucleotide or codon bias to make the search more stringent. In addition, the 'compare genome feature' allows a potential island to be selected and compared against a related non-pathogenic species. Note that the 'compare genome feature' is meant for investigating the relative arrangement of a region in non-pathogenic species and should not be taken as a tool for whole-genome comparison. To assist in the identification of related non-pathogenic species, a phylogenetic tree of bacterial species is also available. Although PredictBias is updated regularly, if the bias analysis for a microbial genome is not yet available, the user can upload a GenBank genome file for real-time analysis of the genome of interest.
Performance evaluation
To evaluate the performance of PredictBias, previously predicted and annotated sets of genomic and pathogenicity islands in B. subtilis and E. coli were compared with the results of PredictBias.
Genomic Island identification
Nicolas et al. [17] reported the identification of nine out of ten prophages integrated in the Bacillus genome and fourteen DNA segments potentially arising from HGT. The PredictBias results are largely in accordance with the findings of Nicolas et al., as shown in Tab. 2. In addition, PredictBias detected an island that their HMM analysis did not report, corresponding to the long repeats at 3608-3634 kb. PredictBias sub-divides this island into two patches, BSU35150 - BSU35210 and BSU35310 - BSU35360, corresponding to coordinates of 3608 - 3619 kb and 3630 - 3635 kb, respectively. Furthermore, a potential PAI comprising 24 ORFs (BSU06840 - BSU07070) is identified by PredictBias. This region encodes various hypothetical proteins and three proteins (cotJA, cotJB, cotJC) previously established to be part of the cotJ operon [18]. The cotJ operon plays a significant role in the formation of a proteinaceous shell, the coat, that is essential for the survival of Bacillus spores under extreme conditions such as starvation and high temperature. Because they are present in a non-pathogenic species, these potential islands should be considered 'fitness islands' rather than islands involved in pathogenicity.
Table 2: Performance of PredictBias in the detection of potential GIs integrated into the genome of B. subtilis.
| Island | Genes | Nicolas et al.: Repeats (kb) | Nicolas et al.: HMM (kb) | PredictBias: Position (kb) | PredictBias: Position (ORF) |
| P1 'prophage' | 19 | 202 - 213 | 202 - 220 | 202 - 219 | BSU01790 - BSU01970 |
| P2 'prophage' | 39 | 555 - 567 | 529 - 570 | 528 - 564 | BSU04790 - BSU05170 |
| -- | 31 | -- | 570 - 600 | -- | -- |
| P3 'prophage' | 12 | -- | 651 - 664 | 647 - 660 | BSU05980 - BSU06090 |
| Site-specific recombinase | 9 | -- | 738 - 747 | -- | -- |
| Multidrug-efflux | 4 | -- | 818 - 822 | -- | -- |
| -- | 5 | -- | 1124 - 1130 | -- | -- |
| cotJ operon | 24 | -- | -- | 752 - 822 | BSU06840 - BSU07070 |
| P4 'prophage' | 10 | -- | 1262 - 1270 | -- | -- |
| PBSX 'prophage' | 34 | -- | -- | -- | -- |
| -- | 4 | 1385 - 1424 | 1397 - 1399 | -- | -- |
| -- | 5 | -- | 1442 - 1447 | -- | -- |
| -- | 4 | -- | 1478 - 1482 | -- | -- |
| P5 'prophage' | 25 | -- | 1879 - 1891 | 1877 - 1903 | BSU17450 - BSU17690 |
| -- | 4 | -- | 2038 - 2041 | -- | -- |
| P6 'prophage' | 33 | 2050 - 2060 | 2046 - 2073 | 2036 - 2070 | BSU18660 - BSU18980 |
| SPβ prophage | 184 | -- | 2151 - 2286 | 2151 - 2283 | BSU19810 - BSU21640 |
| Skin prophage | 53 | 2654 - 2701 | 2652 - 2701 | 2652 - 2666 2672 - 2700 | BSU25750 - BSU25930 BSU26020 - BSU26350 |
| P7 'prophage' | 42 | 2725 - 2735 | 2707 - 2756 | 2706 - 2745 | BSU26450 - BSU26860 |
| Competence | 5 | -- | 3253 - 3257 | -- | -- |
| Arsenic resistance regul. | 6 | 3462 - 3469 | 3463 - 3467 | -- | -- |
| -- | 13 | 3608 - 3634 | -- | 3608 - 3619 3630 - 3635 | BSU35150 - BSU35210 BSU35310 - BSU35360 |
| Cell wall synthesis | 10 | 3665 - 3672 | 3658 - 3685 | 3660 - 3677 | BSU35630 - BSU35720 |
| ABC transporter | 9 | -- | 4123 - 4134 | 4122 - 4131 | BSU40120 - BSU40200 |
| ABC transporter | 5 | 4170 - 4176 | 4171 - 4176 | -- | -- |
| Streptothricin regul. | 6 | 4189 - 4190 | 4184 - 4190 | 4182 - 4186 | BSU40710 - BSU40760 |
Two prophages, PBSX and P4, and eleven potential GIs were not identified by PredictBias, either because their composition bias is insignificant, as was also the case for PBSX in the study of Nicolas et al. [17], or because the GI is shorter than 6 ORFs, as for Competence, ABC transporter and six other islands. The inability to identify GIs shorter than 6 ORFs may be regarded as a limitation of PredictBias. However, earlier studies have shown that focusing on GIs rather than on individual putative alien genes reduces false positive results without missing relevant HGT events [6]; genome segments with fewer than 6 ORFs showing significant composition bias are therefore intentionally left out of the final prediction results. To aid in the detection of small islands such as the P4 prophage, PredictBias provides a bar plot of the composition bias (y-axis) for each ORF cluster (x-axis) along the genome. The bar plot helps to distinguish regions with significant bias from insignificant ones, which is crucial when detecting small islands like the P4 'prophage' (Fig. 3).
Pathogenicity island identification
The performance of PredictBias in the identification of PAIs was analyzed by comparing its results with the findings of Brzuszkiewicz et al. [19]. They carried out a comparative genome analysis of uropathogenic E. coli against the non-pathogenic E. coli strain K12 and reported many gene clusters that may contribute to virulence or fitness. Here, we compared each reported PAI (> 6 ORFs) with the results of PredictBias and compiled the comparison in Tab. 3. While the sliding window approach significantly aids in the detection of PAIs, it is not sufficient on its own for defining PAI boundaries, as is apparent from the discrepancies between the PredictBias composition analysis results and the published findings. The inability to define boundaries for genomic and pathogenicity islands is one of the major limitations of the sliding window approach [7].
To overcome this limitation, the 'compare genome feature' was used: for each potential PAI, the flanking genome segments in E. coli K-12 that are significantly similar to the regions upstream and downstream of the predicted PAI were determined. In the case of PAI I, the upstream segment was found to extend up to ECP_3763 and the downstream segment to start from ECP_3865; therefore, ECP_3765 and ECP_3864 were defined as the boundaries of PAI I. The start and end regions of the other islands were determined similarly and, as is evident from Tab. 3, are in agreement with the findings of Brzuszkiewicz et al. [19]. In the 'compare genome' analysis, some islands, such as ECP_0142 - ECP_0147, showed perfect similarity to a corresponding genome segment in E. coli K12 and are therefore unlikely to have a role in pathogenicity. Because these islands have significant composition bias and are therefore likely to have been acquired by HGT, the eighteen such regions present in the PredictBias results were designated as potential GIs (Supplementary Tab. S3).
Eight of the potential islands predicted by Brzuszkiewicz et al. [19] were not identified by PredictBias because they contain fewer than 6 ORFs with significant composition bias. Although undetectable by composition-based measures, these regions were easily identified using the 'compare genome feature' of PredictBias, as shown in Tab. 3. All eight islands come from table 5 of Brzuszkiewicz et al. [19], which lists the GIs that are present in all the UPEC strains but absent from E. coli K-12. In this context, these eight islands might have been deleted from the E. coli K-12 genome during evolution because they conferred no selective advantage on the host organism.
Furthermore, a potential PAI comprising 5 ORFs (ECP_1342 - ECP_1347) was identified by PredictBias. It encodes a transcriptional regulator and four probable multi-drug efflux proteins, of which ECP_1345 has significant similarity to the 'Acriflavin resistant protein family signature' at an E-value of 0.0. Acriflavin resistance proteins are believed to protect the bacterium from hydrophobic inhibitors, and mutation of these genes increases the susceptibility of E. coli to small inhibitor molecules such as cephalothin and cephaloridine [20]. Interestingly, this potential PAI region has insignificant composition bias and is thus undetectable by composition-based parameters. This further underlines the importance of integrating multiple lines of evidence in PAI detection.
Table 3: Performance of PredictBias in the detection of potential GIs and PAIs integrated into the genome of E. coli strain 536.
| Island | Genes | Brzuszkiewicz et al. | PredictBias (composition analysis) | PredictBias (compare genome analysis) | Flanking ORF in E. coli K12 (upstream) | Flanking ORF in E. coli K12 (downstream) |
| Colicin | 6 | ECP_0113 -ECP_0118 | ECP_0113 - ECP_0121 | ECP_0113 -ECP_0118 | b0112 | b0113 |
| IAHP gene cluster | 24 | ECP_0239 -ECP_0248 | ECP_0224 - ECP_0230 ECP_0235 - ECP_0244 | ECP_0224 -ECP_0248 | b0217 | b0219 |
| PAI III | 75 | ECP_0274 -ECP_0342 | ECP_0274 - ECP_0297 ECP_0307 - ECP_0346 | ECP_0274 -ECP_0349 | b0243 | b0287 |
| Putative membrane proteins | 8 | ECP_0692 -ECP_0699 | -- | ECP_0692 -ECP_0699 | b0679 | b0680 |
| Prophage | 66 | ECP_1134 -ECP_1200 | ECP_1132 - ECP_1144 ECP_1147 - ECP_1161 ECP_1166 - ECP_1180 ECP_1190 - ECP_1215 | ECP_1132 -ECP_1197 | b1136 | b1161 |
| Acriflavin resistance | 5 | -- | ECP_1342 - ECP_1347 | ECP_1343 -ECP_1347 | b1288 | b1290 |
| Rhs/Vgr-family protein | 5 | ECP_1457 -ECP_1460 | ECP_1455 - ECP_1461 | ECP_1457 -ECP_1461 | b1454 | b1460 |
| EmrE protein | 5 | ECP_1866 -ECP_1874 | ECP_1867 - ECP_1884 | ECP_1866 -ECP_1870 | b1931 | b1937 |
| PAI-IV | 39 | ECP_1913 -ECP_1955 | ECP_1896 - ECP_1930 ECP_1938 - ECP_1943 ECP_1948 - ECP_1958 | ECP_1908 -ECP_1947 | b1973 | b1978 |
| PAI-VI | 83 | ECP_1965 -ECP_2038 | ECP_1980 - ECP_2027 ECP_2037 - ECP_2045 | ECP_1962 -ECP_2044 | b1985 | b2002 |
| O-antigen syn. | 9 | ECP_2076 -ECP_2084 | ECP_2071 - ECP_2080 | ECP_2073 -ECP_2081 | b2029 | b2042 |
| Hypothetical proteins, IS elements | 14 | ECP_2702 -ECP_2714 | ECP_2695 - ECP_2705 | ECP_2697 -ECP_2710 | b2732 | b2733 |
| PTS, sucrose utilization | 5 | ECP_2754 -ECP_2758 | ECP_2749 - ECP_2754 | ECP_2750 -ECP_2754 | b2776 | b2777 |
| metV island, IHAP-like gene cluster | 29 | ECP_2804 -ECP_2832 | -- | ECP_2800 -ECP_2828 | b2813 | b2817 |
| PAI-V, K15 capsule | 75 | ECP_2962 -ECP_3024 | ECP_2960 - ECP_2972 ECP_2975 - ECP_3014 ECP_3023 - ECP_3034 | ECP_2962 -ECP_3036 | b2966 | b2968 |
| Dehydrogenases, putative allantoin degradation | 7 | ECP_3103 -ECP_3109 | -- | ECP_3099 -ECP_3105 | b4469 | b3017 |
| Putative galactitol | 8 | ECP_3346 -ECP_3353 | -- | ECP_3342 -ECP_3349 | b3256 | b3257 |
| Fimbrial proteins | 10 | ECP_3513 -ECP_3522 | ECP_3511 - ECP_3517 | ECP_3512 -ECP_3521 | b3426 | b3428 |
| PTS-dependent fructose utilization | 8 | ECP_3753 -ECP_3760 | -- | ECP_3754 -ECP_3761 | b3655 | b3656 |
| PAI-I | 100 | ECP_3765 -ECP_3862 | ECP_3763 - ECP_3858 | ECP_3765 -ECP_3864 | b3657 | b3660 |
| Hemolysin-coregulated (Hpc) | 23 | ECP_4024 -ECP_4046 | -- | ECP_4024 -ECP_4046 | b3829 | b3832 |
| Sugar utilization | 4 | ECP_4087 -ECP_4093 | -- | ECP_4090 -ECP_4093 | b3881 | b3885 |
| 2-oxoglutarate utilization system | 9 | ECP_4275 -ECP_4282 | -- | ECP_4274 -ECP_4282 | b4054 | b4055 |
| Oxidoreductases, regulators | 10 | ECP_4448 -ECP_4459 | ECP_4444 - ECP_4449 | ECP_4449 -ECP_4458 | b4203 | b4205 |
| PAI-II, Fimbrial proteins | 121 | ECP_4521 -ECP_4641 | ECP_4521 - ECP_4573 ECP_4610 - ECP_4650 | ECP_4521 -ECP_4641 | b4267 | b4309 |
An important observation made during the analyses of PAI I, PAI II, PAI III and PAI V was the presence of a conserved region of 2 kb in all four PAIs (Fig. 4). It encodes three hypothetical proteins and two proteins involved in DNA repair. Because this region is conserved in four of the well-established PAIs of E. coli strain 536, it may have a vital role in pathogenicity.
Any automated application should help users to assess the significance of its prediction results. For PAIs, PredictBias allows the threshold values of the composition bias parameters to be changed, which helps to distinguish regions with significant bias from less significant ones. In addition, significant similarity of a potential island to a virulence factor profile provides another line of evidence for its role in pathogenicity, and comparative analysis of an island in related non-pathogenic species further aids in validating the results. We have described PredictBias, an online application that allows researchers to visualize composition bias in its genomic context quickly and simply. PredictBias will be updated automatically as more genomes become available and will continue to support researchers in their search for genomic and pathogenicity islands.
This work was supported by grants received from the Department of Biotechnology, Ministry of Science and Technology, Government of India, New Delhi under the Bioinformatics Sub-Centre.