In Silico Biology 8, 0019 (2008); ©2008, Bioinformation Systems e.V.  


PredictBias: a server for the identification of genomic and pathogenicity islands in prokaryotes


Sachin Pundhir, Hemant Vijayvargiya and Anil Kumar*




Bioinformatics sub-centre, School of Biotechnology, Devi Ahilya University, Khandwa Road, Indore-452001, India
URL: http://www.davvbiotech.res.in



* Corresponding author

   Email: ak_sbt@yahoo.com
   Phone: +91-731-2470373;  Fax: +91-731-2470372





Edited by H. Michael; received September 08, 2007; revised and accepted March 19, 2008; published April 07, 2008



Abstract

Pathogenicity islands (PAIs) are a subset of genomic islands (GIs) that are acquired by horizontal gene transfer (HGT) and generally show a significant deviation in G+C content, dinucleotide frequency or codon usage from the core genome. The major approaches used for PAI identification are based on composition bias and/or similarity to known PAIs. These approaches limit the search either to GIs or to regions similar to previously annotated PAIs. PredictBias is a web application for the identification of genomic and pathogenicity islands in prokaryotes based on composition bias, presence of insertion elements, proximity to virulence-associated genes and absence in related non-pathogenic species. A profile database of virulence factors (VFPD) has been developed using 213 protein families associated with virulence, retrieved from the Pfam and PRINTS databases. PredictBias performs an RPSBLAST search against VFPD for regions with significant composition bias. If a region encodes at least one protein related to virulence, it is marked as a potential PAI (biased-composition); otherwise it is marked as a GI. Regions involved in virulence that show no significant composition bias, for example because of ancient HGT, are identified by scanning genome segments (8 ORFs) for at least four significant hits to VFPD and are marked as potential PAI (unbiased-composition). The relative absence of potential PAIs in related non-pathogenic species can be investigated using the 'compare genome feature' of PredictBias, which further aids in validating the results and defining boundaries for PAIs. Performance analysis showed that the output of PredictBias is in agreement with known findings. PredictBias is available at www.davvbiotech.res.in/PredictBias.

Keywords: pathogenicity islands, genomic islands, web server



Introduction

Pathogenicity islands are distinct chromosomal regions of pathogenic bacteria that contain genes encoding virulence factors, viz. adhesins, toxins and invasins, and contribute to the virulence of the respective pathogen. Although a major portion of a bacterial genome (70%-80%) has homogeneous G+C content, some portions of the genome (20%-30%) carry segments of DNA with distinct G+C content. These include large unstable regions carrying insertion sequences (ISs), transposons, prophages and other elements, and are known as genomic islands (GIs) [1]. Pathogenicity islands (PAIs) form a distinct sub-class of genomic islands that are acquired by horizontal transfer and usually occupy relatively large genomic regions ranging from 10 to 200 kb. PAIs were first described in uropathogenic Escherichia coli and were afterwards found in many pathogens of humans, animals and plants. At least ten pathogenicity islands have been identified in Salmonella typhimurium alone [2]. Owing to the increase in bacterial resistance to conventional antibiotics, there is an urgent need to identify novel drug targets. Being vital for bacterial pathogenesis, PAI-encoded genes may play a significant role in identifying potential drug targets.

Common genetic features associated with pathogenicity islands are the presence of one or more virulence genes, significant composition bias relative to the core genome (%GC bias, dinucleotide bias and codon bias), presence in pathogenic species but absence in benign relatives, a tRNA gene acting as the insertion site, and proximity to mobile genetic elements such as integrases, transposases and insertion sequence (IS) elements [3, 4]. Several tools have been developed for in silico detection of genomic and pathogenicity islands, mostly based on measurement of composition bias and on the proximity of tRNA genes and mobile genetic elements [5, 6]. Although efficient in the detection of GIs, these tools give many false positive results for PAIs. This is because a region showing distinct nucleotide content may be alien to the host genome without necessarily being involved in pathogenicity. Applications like PAIDB overcome this shortcoming by integrating a composition-based search with a similarity search against published PAIs. The limitation of this approach is that the detected PAIs are restricted by the dataset of known PAIs [7, 8].

Here we present a web server application, PredictBias, which integrates all the features associated with PAIs and predicts potential genomic and pathogenicity islands in a prokaryotic genome.



Methods

PredictBias currently presents pre-computed bias results for all the completed microbial genomes in RefSeq [9], and the dataset is updated regularly. As input, PredictBias takes a genome file in GenBank format, performs the analysis and presents the results in tabular format. A general flow diagram of the steps involved in analysis with PredictBias is presented in Fig. 1 and described below.



Figure 1: Flow diagram showing various steps involved in analysis with PredictBias.


Composition bias analysis (sliding window approach)

PredictBias uses a cluster of six ORFs for calculating %GC bias, dinucleotide bias and codon bias (the nucleotide composition measures). ORF clusters are taken consecutively over the entire genome using a sliding window that shifts by one ORF at a time. A cluster of six ORFs was chosen because previous codon-based analysis has shown that a minimum of 1,500 codons or 4.5 kb (corresponding to about 6-8 ORFs) is necessary for a reliable estimate of bias [10]. The %GC bias for each ORF cluster is calculated as

%GC Bias (Cluster) = %GC (Cluster) − %GC (Genome)
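
The sliding-window %GC bias calculation can be illustrated with a minimal Python sketch. The function names and the representation of ORFs as plain DNA strings are assumptions made for illustration only; this is not the PredictBias implementation, and the genome %GC is assumed to be supplied by the caller.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA string."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_bias_per_cluster(orfs, genome_gc, window=6):
    """%GC bias of every ORF cluster: %GC(cluster) - %GC(genome).

    orfs      -- ORF nucleotide sequences in genome order (hypothetical input)
    genome_gc -- %GC of the whole genome, computed elsewhere
    window    -- number of ORFs per cluster; the window shifts by one ORF
    """
    biases = []
    for start in range(len(orfs) - window + 1):
        cluster = "".join(orfs[start:start + window])
        biases.append(gc_percent(cluster) - genome_gc)
    return biases
```

With a window of six ORFs, a genome of n ORFs yields n − 5 clusters, each identified by its first ORF.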

The dinucleotide and codon bias analyses are based on the algorithm developed by Karlin [11]. The dinucleotide bias, or average absolute dinucleotide relative abundance difference (δ*(f, g)), for each ORF cluster is calculated according to the following formula:

δ*(f, g) = (1/16) Σ |ρ*xy(f) − ρ*xy(g)|

where the sum extends over all dinucleotides. The ρ*xy(f) and ρ*xy(g) denote the dinucleotide relative abundance values for all the ORFs and their reverse complements in a cluster and the genome, respectively. The ρ*xy is calculated from the formula ρ*xy = f*xy / (f*x f*y), where f*x, f*y and f*xy denote the frequencies of mononucleotide x, mononucleotide y and dinucleotide xy, respectively, for each ORF cluster and genome. The codon bias for each ORF cluster is calculated by the following formula:

B(F|G) = Σa pa(F) [ Σ(x,y,z)=a |f(x,y,z) − g(x,y,z)| ]

where the outer sum runs over the amino acids a, the inner sum runs over the codons (x,y,z) that encode a, and pa(F) is the average of amino acid frequencies in the ORF cluster F. The f(x,y,z) and g(x,y,z) are the average codon frequencies for the ORF cluster F and genome G, normalized to 1 for each synonymous codon set.
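
The two Karlin-type measures can be sketched in Python as follows. The sketch assumes the standard forms δ*(f, g) = (1/16) Σ |ρ*xy(f) − ρ*xy(g)| and B(F|G) = Σa pa(F) Σ(x,y,z)=a |f(x,y,z) − g(x,y,z)|, non-empty in-frame ORF sequences, and the standard genetic code. All function names are hypothetical, and any scaling that PredictBias applies before comparing values with its thresholds is not reproduced here.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rho_star(orfs):
    """Symmetrized relative abundances rho*_xy = f*_xy / (f*_x f*_y)."""
    mono, di = Counter(), Counter()
    for orf in orfs:
        orf = orf.upper()
        # Count each ORF together with its reverse complement.
        for strand in (orf, orf.translate(COMPLEMENT)[::-1]):
            mono.update(strand)
            di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x, y in product(BASES, repeat=2))
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        rho[x + y] = (di[x + y] / n_di) / expected if expected else 0.0
    return rho


def dinucleotide_bias(cluster, genome):
    """delta*(f, g): mean absolute rho* difference over the 16 dinucleotides."""
    rf, rg = rho_star(cluster), rho_star(genome)
    return sum(abs(rf[k] - rg[k]) for k in rf) / 16


def codon_usage(orfs):
    """Amino acid frequencies and codon frequencies normalized per amino acid."""
    counts = Counter()
    for orf in orfs:
        orf = orf.upper()
        counts.update(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    aa_totals = Counter()
    for codon, n in counts.items():
        aa = CODON_TABLE.get(codon)
        if aa and aa != "*":          # skip stop and ambiguous codons
            aa_totals[aa] += n
    total = sum(aa_totals.values())
    aa_freq = {aa: n / total for aa, n in aa_totals.items()}
    codon_freq = {c: counts[c] / aa_totals[aa]
                  for c, aa in CODON_TABLE.items() if aa_totals[aa]}
    return aa_freq, codon_freq


def codon_bias(cluster, genome):
    """B(F|G) = sum_a p_a(F) * sum_{codons of a} |f(codon) - g(codon)|."""
    aa_freq, f = codon_usage(cluster)
    _, g = codon_usage(genome)
    return sum(aa_freq.get(aa, 0.0) * abs(f.get(c, 0.0) - g.get(c, 0.0))
               for c, aa in CODON_TABLE.items())
```

Both functions return 0 when a cluster is compared with itself, and `codon_bias` reaches its maximum of 2 when the cluster and the genome use entirely different synonymous codons.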


Identification of regions with significant bias

PredictBias examines the query genome for consecutive ORF clusters (≥6) in which the codon bias deviation and either the %GC bias or the dinucleotide bias deviation are above the threshold value, and marks the first ORF of each cluster as part of a genomic island in the output result. To determine the threshold values, seventy-three GIs listed in the Islander database were analyzed. Islander is a comprehensive database containing genomic islands identified in the completely sequenced genomes of 52 bacterial organisms [12].

Out of the 52 bacterial genomes in the Islander database, one representative genome from each genus was selected for analysis. This selection yielded 73 GIs distributed across 29 bacterial organisms. The genomes analyzed here belong to diverse phylogenetic groups and have genomic G+C contents ranging from 32% for Staphylococcus epidermidis to 72% for Streptomyces coelicolor. The dinucleotide bias deviation, codon bias deviation and %GC bias for all the GIs present in a single bacterial organism were averaged and are listed in Tab. 1. Individual bias values for each GI and their corresponding coordinates in the host genome are given in Supplementary Tab. S1. Based on these analyses, the threshold values were set to 2.0, 3.0 and 4.0 for dinucleotide bias deviation, %GC bias and codon bias deviation, respectively. Although most of the GIs in the studied bacterial organisms have bias values above the thresholds, the bias values for some GIs were well below them. This may be because these GIs were integrated much earlier in evolutionary time, so that their sequence has since adjusted to the base composition of the host genome through the process of amelioration [10]. Furthermore, almost none of the studied GIs had bias values below the threshold for all three nucleotide composition measures, suggesting that no single measure is sufficient on its own and that the three should be integrated for the efficient identification of potential islands in a genome sequence.
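
The selection rule and the thresholds given above can be sketched in Python as follows. The function names are hypothetical, the three bias series are assumed to hold one value per sliding ORF cluster, and taking the absolute value of the %GC bias is an assumption of this sketch (the source does not state how negative %GC bias is handled).

```python
def flag_clusters(codon_dev, gc_bias, dinuc_dev,
                  codon_thr=4.0, gc_thr=3.0, dinuc_thr=2.0):
    """Flag clusters whose codon bias deviation AND either the %GC bias
    or the dinucleotide bias deviation exceed their thresholds."""
    return [c > codon_thr and (abs(g) > gc_thr or d > dinuc_thr)
            for c, g, d in zip(codon_dev, gc_bias, dinuc_dev)]


def find_islands(flags, min_run=6):
    """Return inclusive (first, last) cluster indices of every run of at
    least min_run consecutive flagged clusters."""
    islands, start = [], None
    # A trailing False acts as a sentinel that closes a run at the end.
    for i, flagged in enumerate(list(flags) + [False]):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_run:
                islands.append((start, i - 1))
            start = None
    return islands
```

Because each cluster is identified by its first ORF, the returned cluster indices map directly onto the ORFs marked as part of an island.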


Table 1: Mean bias for seventy-three Genomic Islands studied in 29 bacterial genomes.
Organism                         %GC   Group                Mean B(F|G) dev. [a,d]  Mean δ*(f,g) dev. [b,d]  Mean %GC bias [c,d]
Staphylococcus epidermidis       32    Firmicutes           3.31                    2.53                     6.49
Lactococcus lactis               35.3  Firmicutes           3.85                    1.56                     1.02
Streptococcus mutans             36.8  Firmicutes           11.89                   2.25                     4.74
Enterococcus faecalis            37.4  Firmicutes           3.08                    1.09                     4.27
Listeria innocua                 37.4  Firmicutes           3.87                    2.21                     1.71
Haemophilus influenzae           38.1  Gammaproteobacteria  4.47                    0.53                     0.31
Nostoc sp. PCC 7120              41.3  Cyanobacteria        3.21                    2.18                     2.98
Bacteroides thetaiotaomicron     42.9  Bacteroidetes        8.05                    2.64                     4.31
Bacillus subtilis                43.5  Firmicutes           10.63                   2.04                     8.26
Lactobacillus plantarum          44.4  Firmicutes           6.01                    2.07                     4.45
Vibrio parahaemolyticus          45.4  Gammaproteobacteria  6.46                    3.88                     3.28
Shewanella oneidensis            45.9  Gammaproteobacteria  3.95                    1.64                     3.3
Yersinia pestis KIM              47.7  Gammaproteobacteria  15.16                   2.95                     5.75
Escherichia coli CFT073          50.5  Gammaproteobacteria  8.71                    3.4                      2.95
Shigella flexneri 2a str. 2457T  50.9  Gammaproteobacteria  9.73                    3.15                     3.35
Salmonella enterica Typhi Ty2    52.1  Gammaproteobacteria  9.86                    4.16                     3.19
Xylella fastidiosa               52.6  Gammaproteobacteria  29.64                   5.07                     12.18
Brucella melitensis              57.2  Alphaproteobacteria  18.15                   3.32                     5.68
Agrobacterium tumefaciens        59    Alphaproteobacteria  15.87                   4.1                      4.31
Bifidobacterium longum           60.1  Actinobacteria       7.12                    2.02                     3.26
Sinorhizobium meliloti           62.2  Alphaproteobacteria  14.13                   3.53                     4.06
Mesorhizobium loti               62.5  Alphaproteobacteria  11.07                   2.62                     4.06
Corynebacterium efficiens        63.1  Actinobacteria       13.11                   5.98                     3.27
Bradyrhizobium japonicum         64.1  Alphaproteobacteria  17.58                   2.82                     6.28
Xanthomonas campestris           65.1  Gammaproteobacteria  11.63                   2.87                     3.97
Pseudomonas aeruginosa           66.6  Gammaproteobacteria  20.07                   3.47                     5.71
Deinococcus radiodurans          66.6  Deinococcus-Thermus  18.67                   7.71                     2.51
Ralstonia solanacearum           67    Betaproteobacteria   8.42                    1.79                     2.59
Streptomyces coelicolor          72    Actinobacteria       16.05                   1.58                     4.18
[a] Mean codon bias deviation of a GI is calculated as the mean of the difference between the codon bias (ORF clusters in GI) and the mean codon bias (genome).
[b] Mean dinucleotide bias deviation of a GI is calculated as the mean of the difference between the dinucleotide bias (ORF clusters in GI) and the mean dinucleotide bias (genome).
[c] Mean bias in G+C frequency of a GI is calculated as the mean of the difference between the G+C frequency (ORF clusters in GI) and the G+C frequency (genome).
[d] Bias values shown are averaged over all the GIs in a bacterial organism.



Virulence factor profile database

A profile database of virulence factors (VFPD) has been developed to investigate the role of potential GIs in pathogenicity. Protein families related to virulence were retrieved from the Pfam [13] and PRINTS [14] databases by searching for the keywords 'Virulence', 'Adhesin', 'Siderophore', 'Invasin', 'Endotoxin' and 'Exotoxin'. The results were then curated manually and, after removing families with only a putative or potential role in virulence, a total of 213 protein families with a well-established role in virulence were obtained (Supplementary Tab. S2). For each family, four iterations of PSI-BLAST [15] search were carried out at a cut-off E-value of 0.001 with the -C and -u 1 parameters to create an output profile in ASCII format. The profiles thus created were used to build VFPD with the formatrpsdb program available with the NCBI toolkit (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML). PredictBias performs an RPSBLAST search for each potential GI against VFPD at a cut-off E-value of 0.00001, with low complexity regions filtered. GIs with at least one significant hit to VFPD are marked as potential PAI (biased-composition) in the output results of PredictBias.

Earlier studies have shown that many regions involved in virulence exhibit no significant composition bias, either because the transferred region has adjusted in composition through amelioration or because it was acquired by HGT from a species with similar nucleotide composition [10]; such regions are therefore undetectable through composition-based analysis. To identify them, PredictBias scans for regions (8 ORFs) having ≥4 ORFs with significant hits to VFPD. Overlapping regions are merged and the resulting continuous ORF stretch is marked as potential PAI (unbiased-composition) in the output results of PredictBias.
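
The two VFPD-based rules described in this section can be sketched in Python. The sketch assumes that the RPSBLAST results have already been reduced to one boolean per ORF (True for a significant VFPD hit); the function names and that representation are hypothetical, and the merged region is assumed to span from the first to the last ORF of the overlapping windows.

```python
def classify_biased_region(orf_hits):
    """Label a composition-biased region from its per-ORF VFPD hits."""
    if any(orf_hits):
        return "potential PAI (biased-composition)"
    return "GI"


def unbiased_pai_regions(hit_flags, window=8, min_hits=4):
    """Scan windows of 8 ORFs and merge overlapping windows that contain
    at least min_hits ORFs with a significant VFPD hit.

    Returns inclusive (first_orf, last_orf) index pairs.
    """
    regions = []
    for start in range(len(hit_flags) - window + 1):
        if sum(hit_flags[start:start + window]) >= min_hits:
            end = start + window - 1
            if regions and start <= regions[-1][1]:
                # Overlaps the previous qualifying window: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

For example, hits at ORFs 3, 4, 6 and 7 of a 20-ORF segment qualify the windows starting at ORFs 0 to 3, which merge into a single region covering ORFs 0 to 10.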


Compare genome feature

The 'compare genome feature' of PredictBias is provided as a final step to aid in defining the start and end regions of PAIs and in examining the role of a potential PAI in pathogenicity. Through this feature, potential PAI regions can be examined for their absence in related non-pathogenic species; if a region is absent, its role in pathogenicity gains further support. The feature is based on the blastp program of the BLAST [16] family. It compares protein sequences in potential PAI regions against a local BLAST database of microbial genomes developed using the NCBI toolkit. The local BLAST database is indexed so that a BLAST parser written in Perl can identify the relative position of an ORF within the host genome. The program arranges the significant BLAST hits in continuous stretches as present in the host genome and displays the results in tabular format, with each column containing a continuous ORF stretch significantly similar to the potential PAI. The longest ORF stretch is displayed first, followed by shorter stretches. The feature has a limitation in that it can be used only for microorganisms for which complete genome sequences of related non-pathogenic species are available. However, with more than 450 microbial genomes available and the sequencing of more than 700 in progress, genome comparison may become crucial in PAI identification.

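The arrangement of significant BLAST hits into continuous stretches, longest first, can be illustrated with a short Python sketch. It assumes that each significant hit has already been mapped to the ordinal position of the matching ORF in the compared genome; the function name and the `max_gap` parameter are hypothetical, and the Perl parser used by PredictBias is not reproduced here.

```python
def continuous_stretches(subject_positions, max_gap=1):
    """Group ORF positions in the compared genome into runs of neighbours.

    subject_positions -- ordinal ORF positions with a significant hit
    max_gap           -- largest position step still treated as continuous
    Returns the runs sorted from longest to shortest.
    """
    runs = []
    for pos in sorted(set(subject_positions)):
        if runs and pos - runs[-1][-1] <= max_gap:
            runs[-1].append(pos)      # continues the current stretch
        else:
            runs.append([pos])        # starts a new stretch
    return sorted(runs, key=len, reverse=True)
```

A potential PAI whose ORFs yield no stretch at all in a related non-pathogenic genome is a candidate for being absent from that genome.
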


Results and discussion

PredictBias is implemented using ASP.NET, Perl and the C programming language. It runs on an IIS web server with a SQL Server database and is accessible at http://www.davvbiotech.res.in/PredictBias.


Web interface

PredictBias provides a user-friendly interface that allows researchers to easily analyze the location of potential genomic and pathogenicity islands in a prokaryotic genome. Bias results are displayed in tabular format, with each potential island's start region, end region, bias values, similarity to virulence profiles, presence of insertion elements and prediction result displayed in adjacent columns (Fig. 2). The user can change the threshold values of %GC, dinucleotide or codon bias for a more stringent search. In addition, the 'compare genome feature' allows the user to select a potential island and compare it with a related non-pathogenic species. It should be noted that the 'compare genome feature' is meant to investigate the relative arrangement of a region in a non-pathogenic species and should not be regarded as a tool for whole genome comparison. To assist in the identification of related non-pathogenic species, a phylogenetic tree of bacterial species is also available. Although PredictBias is updated regularly, if the bias analysis for a microbial genome is not yet available, the user can upload a GenBank genome file for real-time analysis of the genome of interest.



Figure 2: A screenshot of PredictBias results for E. coli strain 536. (A) Main page shows a list of potential genomic and pathogenicity islands. (B) Composition bias measurement for an island with each bias value representing the difference in the bias for an ORF cluster and mean bias for genome. (C) An ORF in the potential PAI having significant similarity with a virulence factor.


Performance evaluation

To evaluate the performance of PredictBias, previously predicted and annotated sets of genomic and pathogenicity islands in B. subtilis and E. coli were compared with the results of PredictBias.


Genomic Island identification

Nicolas et al. [17] reported the identification of nine out of ten prophages integrated in the Bacillus genome and fourteen DNA segments potentially arising from HGT. The PredictBias results are largely in accordance with the findings of Nicolas et al., as shown in Tab. 2. In contrast to their study, PredictBias detected a previously unidentified island corresponding to the long repeats at 3608-3634 kb. PredictBias sub-divides this island into two patches, BSU35150 - BSU35210 and BSU35310 - BSU35360, corresponding to coordinates 3608 - 3619 kb and 3630 - 3635 kb, respectively. Furthermore, a potential PAI comprising 24 ORFs (BSU06840 - BSU07070) is identified by PredictBias. This region encodes various hypothetical proteins and three proteins (cotJA, cotJB, cotJC) previously established to be part of the cotJ operon [18]. The cotJ operon plays a significant role in the formation of a proteinaceous shell called the coat, which is essential for the survival of Bacillus spores under extreme conditions such as starvation and high temperature. Being present in a non-pathogenic species, these potential islands should be considered 'fitness islands' rather than islands involved in pathogenicity.


Table 2: Performance of PredictBias in the detection of potential GIs integrated into the genome of B. subtilis.
                                  Nicolas et al.              PredictBias
Island                     Genes  Repeats (kb)  HMM (kb)      Position (kb)  Position (ORF)
P1 'prophage'              19     202 - 213     202 - 220     202 - 219      BSU01790 - BSU01970
P2 'prophage'              39     555 - 567     529 - 570     528 - 564      BSU04790 - BSU05170
--                         31     --            570 - 600     --             --
P3 'prophage'              12     --            651 - 664     647 - 660      BSU05980 - BSU06090
Site-specific recombinase  9      --            738 - 747     --             --
Multidrug-efflux           4      --            818 - 822     --             --
--                         5      --            1124 - 1130   --             --
cotJ operon                24     --            --            752 - 822      BSU06840 - BSU07070
P4 'prophage'              10     --            1262 - 1270   --             --
PBSX 'prophage'            34     --            --            --             --
--                         4      1385 - 1424   1397 - 1399   --             --
--                         5      --            1442 - 1447   --             --
--                         4      --            1478 - 1482   --             --
P5 'prophage'              25     --            1879 - 1891   1877 - 1903    BSU17450 - BSU17690
--                         4      --            2038 - 2041   --             --
P6 'prophage'              33     2050 - 2060   2046 - 2073   2036 - 2070    BSU18660 - BSU18980
SPβ prophage               184    --            2151 - 2286   2151 - 2283    BSU19810 - BSU21640
Skin prophage              53     2654 - 2701   2652 - 2701   2652 - 2666    BSU25750 - BSU25930
                                                              2672 - 2700    BSU26020 - BSU26350
P7 'prophage'              42     2725 - 2735   2707 - 2756   2706 - 2745    BSU26450 - BSU26860
Competence                 5      --            3253 - 3257   --             --
Arsenic resistance regul.  6      3462 - 3469   3463 - 3467   --             --
--                         13     3608 - 3634   --            3608 - 3619    BSU35150 - BSU35210
                                                              3630 - 3635    BSU35310 - BSU35360
Cell wall synthesis        10     3665 - 3672   3658 - 3685   3660 - 3677    BSU35630 - BSU35720
ABC transporter            9      --            4123 - 4134   4122 - 4131    BSU40120 - BSU40200
ABC transporter            5      4170 - 4176   4171 - 4176   --             --
Streptothricin regul.      6      4189 - 4190   4184 - 4190   4182 - 4186    BSU40710 - BSU40760


Furthermore, two prophages, namely PBSX and P4, and eleven potential GIs were not identified by PredictBias, either because they have insignificant composition bias, as in the case of PBSX studied by Nicolas et al. [17], or because they are shorter than 6 ORFs, as with Competence, ABC transporter and six other islands. The inability to identify GIs shorter than 6 ORFs may be regarded as a limitation of PredictBias. However, earlier studies have shown that focusing on GIs rather than on individual putative alien genes in a genome reduces false positive results without missing relevant HGT events [6]; genome segments with fewer than 6 ORFs showing significant composition bias are therefore intentionally left out of the final prediction results. To aid in the detection of small islands like the P4 prophage, PredictBias provides a bar plot representing the composition bias (y-axis) for each ORF cluster (x-axis) along the genome. The bar plot helps to distinguish regions with significant bias from insignificant ones, which is crucial for the detection of small islands like the P4 'prophage' (Fig. 3).



Figure 3: Bias analysis measurements for P4 'prophage'. The genome is represented as two bar plots, the first representing dinucleotide bias deviation and the second representing codon bias deviation and %GC bias. Each bar represents the composition bias for a cluster of six ORFs taken consecutively over the whole genome using a sliding window that shifts by one ORF at a time. Bias measurements (y-axis) are the difference between the bias (dinucleotide or codon) for an ORF cluster and the mean bias for the genome. Also shown are the start and end regions of the island.


Pathogenicity island identification

The performance of PredictBias in the identification of PAIs was analyzed by comparing PredictBias results with the findings of Brzuszkiewicz et al. [19]. They carried out a comparative genome analysis of uropathogenic E. coli against the non-pathogenic E. coli strain K12 and reported the identification of many gene clusters that may contribute to virulence/fitness. Here, we compared each reported PAI (> 6 ORFs) with the results of PredictBias and compiled the comparison in tabular form, as shown in Tab. 3. While the sliding window approach significantly aids in the detection of PAIs, this approach alone is not sufficient for defining PAI boundaries, as is apparent from the discrepancies between the PredictBias composition analysis results (third column) and the published findings. The inability to define boundaries for genomic and pathogenicity islands has been one of the major limitations of the sliding window approach [7].

To overcome this limitation, the 'compare genome feature' was used, and for each potential PAI the flanking genome segments in E. coli K-12 significantly similar to the upstream and downstream regions of the predicted PAI were determined. In the case of PAI I, the upstream segment was found to be present till ECP_3763 and the downstream segment started from ECP_3865; therefore, ECP_3765 and ECP_3864 were defined as the boundaries of PAI I. The start and end regions of the other islands were determined similarly and, as is evident from Tab. 3, are in agreement with the findings of Brzuszkiewicz et al. [19]. In the 'compare genome' analysis, some islands, such as ECP_0142 - ECP_0147, showed perfect similarity with a corresponding genome segment in E. coli K12 and are therefore unlikely to have a role in pathogenicity. Since these islands have significant composition bias and are therefore likely to have been acquired by HGT, eighteen such regions present in the PredictBias results were designated as potential GIs (Supplementary Tab. S3).

Eight of the potential islands predicted by Brzuszkiewicz et al. [19] were not identified by PredictBias because they have fewer than 6 ORFs with significant composition bias. Although undetectable through composition-based measures, such regions were easily identified using the 'compare genome feature' of PredictBias, as shown in Tab. 3. It was further observed that all eight islands come from table 5 of Brzuszkiewicz et al. [19], which lists the GIs that are present in all the UPEC strains but not in E. coli K-12. In this context, these eight islands might have been deleted from the E. coli K-12 genome during evolution because they conferred no selective advantage on the host organism.

Furthermore, a potential PAI comprising 5 ORFs (ECP_1342 - ECP_1347) was identified by PredictBias. It encodes a transcriptional regulator and four probable multi-drug efflux proteins, of which ECP_1345 has significant similarity to the 'Acriflavin resistant protein family signature' at an E-value of 0.0. Acriflavin resistant proteins are believed to protect the bacterium from hydrophobic inhibitors, and mutations in these genes increase the susceptibility of E. coli to small inhibitor molecules like cephalothin and cephaloridine [20]. Interestingly, this potential PAI region has insignificant composition bias and is thus undetectable through composition-based parameters. This further underlines the importance of integrating multiple lines of evidence in PAI detection.


Table 3: Performance of PredictBias in the detection of potential GIs and PAIs integrated into the genome of E. coli strain 536.
                                                                            PredictBias          PredictBias          Flanking ORFs in E. coli K12
Island                                          Genes  Brzuszkiewicz et al. (composition)        (compare genome)     Upstream  Downstream
Colicin                                         6      ECP_0113 - ECP_0118  ECP_0113 - ECP_0121  ECP_0113 - ECP_0118  b0112     b0113
IAHP gene cluster                               24     ECP_0239 - ECP_0248  ECP_0224 - ECP_0230  ECP_0224 - ECP_0248  b0217     b0219
                                                                            ECP_0235 - ECP_0244
PAI III                                         75     ECP_0274 - ECP_0342  ECP_0274 - ECP_0297  ECP_0274 - ECP_0349  b0243     b0287
                                                                            ECP_0307 - ECP_0346
Putative membrane proteins                      8      ECP_0692 - ECP_0699  --                   ECP_0692 - ECP_0699  b0679     b0680
Prophage                                        66     ECP_1134 - ECP_1200  ECP_1132 - ECP_1144  ECP_1132 - ECP_1197  b1136     b1161
                                                                            ECP_1147 - ECP_1161
                                                                            ECP_1166 - ECP_1180
                                                                            ECP_1190 - ECP_1215
Acriflavin resistance                           5      --                   ECP_1342 - ECP_1347  ECP_1343 - ECP_1347  b1288     b1290
Rhs/Vgr-family protein                          5      ECP_1457 - ECP_1460  ECP_1455 - ECP_1461  ECP_1457 - ECP_1461  b1454     b1460
EmrE protein                                    5      ECP_1866 - ECP_1874  ECP_1867 - ECP_1884  ECP_1866 - ECP_1870  b1931     b1937
PAI-IV                                          39     ECP_1913 - ECP_1955  ECP_1896 - ECP_1930  ECP_1908 - ECP_1947  b1973     b1978
                                                                            ECP_1938 - ECP_1943
                                                                            ECP_1948 - ECP_1958
PAI-VI                                          83     ECP_1965 - ECP_2038  ECP_1980 - ECP_2027  ECP_1962 - ECP_2044  b1985     b2002
                                                                            ECP_2037 - ECP_2045
O-antigen syn.                                  9      ECP_2076 - ECP_2084  ECP_2071 - ECP_2080  ECP_2073 - ECP_2081  b2029     b2042
Hypothetical proteins, IS elements              14     ECP_2702 - ECP_2714  ECP_2695 - ECP_2705  ECP_2697 - ECP_2710  b2732     b2733
PTS, sucrose utilization                        5      ECP_2754 - ECP_2758  ECP_2749 - ECP_2754  ECP_2750 - ECP_2754  b2776     b2777
metV island, IHAP-like gene cluster             29     ECP_2804 - ECP_2832  --                   ECP_2800 - ECP_2828  b2813     b2817
PAI-V, K15 capsule                              75     ECP_2962 - ECP_3024  ECP_2960 - ECP_2972  ECP_2962 - ECP_3036  b2966     b2968
                                                                            ECP_2975 - ECP_3014
                                                                            ECP_3023 - ECP_3034
Dehydrogenases, putative allantoin degradation  7      ECP_3103 - ECP_3109  --                   ECP_3099 - ECP_3105  b4469     b3017
Putative galactitol                             8      ECP_3346 - ECP_3353  --                   ECP_3342 - ECP_3349  b3256     b3257
Fimbrial proteins                               10     ECP_3513 - ECP_3522  ECP_3511 - ECP_3517  ECP_3512 - ECP_3521  b3426     b3428
PTS-dependent fructose utilization              8      ECP_3753 - ECP_3760  --                   ECP_3754 - ECP_3761  b3655     b3656
PAI-I                                           100    ECP_3765 - ECP_3862  ECP_3763 - ECP_3858  ECP_3765 - ECP_3864  b3657     b3660
Hemolysin-coregulated (Hpc)                     23     ECP_4024 - ECP_4046  --                   ECP_4024 - ECP_4046  b3829     b3832
Sugar utilization                               4      ECP_4087 - ECP_4093  --                   ECP_4090 - ECP_4093  b3881     b3885
2-oxoglutarate utilization system               9      ECP_4275 - ECP_4282  --                   ECP_4274 - ECP_4282  b4054     b4055
Oxidoreductases, regulators                     10     ECP_4448 - ECP_4459  ECP_4444 - ECP_4449  ECP_4449 - ECP_4458  b4203     b4205
PAI-II, Fimbrial proteins                       121    ECP_4521 - ECP_4641  ECP_4521 - ECP_4573  ECP_4521 - ECP_4641  b4267     b4309
                                                                            ECP_4610 - ECP_4650


An important observation made during the analyses of PAI I, PAI II, PAI III and PAI V was the presence of a conserved region of 2 kb in all four PAIs (Fig. 4). It encodes three hypothetical proteins and two proteins involved in DNA repair. Being conserved in four of the well-established PAIs of E. coli strain 536, this region may have a vital role in pathogenicity.



Figure 4: 'Compare genome' analysis results for five annotated PAIs in E. coli strain 536 with E. coli K-12. Start and end regions for each PAI are shown along with homologous ORFs in E. coli K-12 at the PAI boundaries. Also shown is a 2 kb conserved region in four of the five PAIs (gray strip).



Conclusion

It is essential for any automated application to help users assess the significance of its prediction results. In the context of PAIs, PredictBias allows the threshold values of the composition bias parameters to be changed, thereby helping to distinguish regions with significant bias from less significant ones. In addition, significant similarity of a potential island to a virulence factor profile provides another line of evidence for its role in pathogenicity. Moreover, comparative analysis of an island in related non-pathogenic species further aids in validating the results. We have described PredictBias, an online application that allows researchers to quickly and simply visualize composition bias in its genomic context. PredictBias will be updated automatically as more genomes become available and will continue to support researchers in their search for genomic and pathogenicity islands.



Acknowledgements

This work was supported by the Grants received from the Department of Biotechnology, Ministry of Science and Technology, Government of India, New Delhi under the Bioinformatics Sub-Centre.




References


  1. Gal-Mor, O. and Finlay, B. B. (2006). Pathogenicity islands: a molecular toolbox for bacterial virulence. Cell Microbiol. 8, 1707-1719.

  2. McClelland, M., Sanderson, K. E., Spieth, J., Clifton, S. W., Latreille, P., Courtney, L., Porwollik, S., Ali, J., Dante, M., Du, F., Hou, S., Layman, D., Leonard, S., Nguyen, C., Scott, K., Holmes, A., Grewal, N., Mulvaney, E., Ryan, E., Sun, H., Florea, L., Miller, W., Stoneking, T., Nhan, M., Waterston, R. and Wilson, R. K. (2001). Complete genome sequence of Salmonella enterica serovar Typhimurium LT2. Nature 413, 852-856.

  3. Hacker, J. and Kaper, J. B. (2000). Pathogenicity islands and the evolution of microbes. Annu. Rev. Microbiol. 54, 641-679.

  4. Schmidt, H. and Hensel, M. (2004). Pathogenicity islands in bacterial pathogenesis. Clin. Microbiol. Rev. 17, 14-56.

  5. Hsiao, W., Wan, I., Jones, S. J. and Brinkman, F. S. (2003). IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics 19, 418-420.

  6. Waack, S., Keller, O., Asper, R., Brodag, T., Damm, C., Fricke, W. F., Surovcik, K., Meinicke, P. and Merkl, R. (2006). Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 7, 142.

  7. Yoon, S. H., Hur, C.-G., Kang, H.-Y., Kim, Y. H., Oh, T. K. and Kim, J. F. (2005). A computational approach for identifying pathogenicity islands in prokaryotic genomes. BMC Bioinformatics 6, 184.

  8. Yoon, S. H., Park, Y.-K., Lee, S., Choi, D., Oh, T. K., Hur, C.-G. and Kim, J. F. (2007). Towards pathogenomics: a web-based resource for pathogenicity islands. Nucleic Acids Res. 35, D395-D400.

  9. Pruitt, K. D., Tatusova, T. and Maglott, D. R. (2007). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61-D65.

  10. Lawrence, J. G. and Ochman, H. (1997). Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44, 383-397.

  11. Karlin, S. (2001). Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol. 9, 335-343.

  12. Mantri, Y. and Williams, K. P. (2004). Islander: a database of integrative islands in prokaryotic genomes, the associated integrases and their DNA site specificities. Nucleic Acids Res. 32, D55-D58.

  13. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R. (1998). Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26, 320-322.

  14. Attwood, T. K., Beck, M. E., Bleasby, A. J. and Parry-Smith, D. J. (1994). PRINTS--a database of protein motif fingerprints. Nucleic Acids Res. 22, 3590-3596.

  15. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

  16. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

  17. Nicolas, P., Bize, L., Muri, F., Hoebeke, M., Rodolphe, F., Ehrlich, S. D., Prum, B. and Bessières, P. (2002). Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res. 30, 1418-1426.

  18. Henriques, A. O., Beall, B. W., Roland, K. and Moran, C. P., Jr. (1995). Characterization of cotJ, a σ E-controlled operon affecting the polypeptide composition of the coat of Bacillus subtilis spores. J. Bacteriol. 177, 3394-3406.

  19. Brzuszkiewicz, E., Brüggemann, H., Liesegang, H., Emmerth, M., Olschläger, T., Nagy, G., Albermann, K., Wagner, C., Buchrieser, C., Emody, L., Gottschalk, G., Hacker, J. and Dobrindt, U. (2006). How to become a uropathogen: comparative genomic analysis of extraintestinal pathogenic Escherichia coli strains. Proc. Natl. Acad. Sci. USA 103, 12879-12884.

  20. Ma, D., Cook, D. N., Alberti, M., Pon, N. G., Nikaido, H. and Hearst, J. E. (1995). Genes acrA and acrB encode a stress-induced efflux system of Escherichia coli. Mol. Microbiol. 16, 45-55.