In Silico Biology 6, 0020 (2006); ©2006, Bioinformation Systems e.V.  

Prophage Finder: a prophage loci prediction tool for prokaryotic genome sequences


Michael Bose and Robert D. Barber*




Department of Biological Sciences, University of Wisconsin-Parkside
Kenosha, WI 53141-2000, USA



* Corresponding author

   Email: barber@uwp.edu





Edited by E. Wingender; received November 30, 2005; revised March 23, 2006; accepted March 26, 2006; published April 15, 2006



Abstract

Prophage loci often remain under-annotated or even unrecognized in prokaryotic genome sequencing projects. A PHP application, Prophage Finder, has been developed and implemented to predict prophage loci, based upon clusters of phage-related gene products encoded within DNA sequences. This application provides results detailing several facets of these clusters to facilitate rapid prediction and analysis of prophage sequences. Prophage Finder was tested using previously annotated prokaryotic genomic sequences with manually curated prophage loci as benchmarks. Additional analyses from Prophage Finder searches of several draft prokaryotic genome sequences are available through the Web site (http://bioinformatics.uwp.edu/~phage/DOEResults.php) to illustrate the potential of this application.

Keywords: prophage, phage, microbial, prokaryote, genome annotation, curation, lysogen



Introduction

The impact of phage lysogeny (or prophage occurrence) on prokaryotic diversity and bacterial pathogenesis has been well documented [Brüssow et al., 2004; Canchaya et al., 2003; Canchaya et al., 2004]. However, genetic diversity (including sequence divergence, gene loss, and gene rearrangements within prophage loci) and a general lack of rules corresponding to alterations in GC content, codon usage or integration sites pose significant obstacles for prophage identification in newly determined prokaryotic genome sequences [Casjens, 2003]. Comparative approaches highlighting clusters of phage-related genes appear to hold the most promise for reliable prophage loci assignment. For instance, a comparative method using protein sequences belonging to the lambdoid group of phage was previously implemented with a high measure of success to predict similar prophage loci in prokaryotic genome sequences [Mehta et al., 2004]. Here, a web application is presented for prokaryotic genome sequence investigators to identify prophage loci in newly determined DNA sequences based upon clustering of phage-related gene sequences derived from the completed phage genome sequences available at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/genomes/static/phg.html). This web application exhibits a marked improvement relative to previous automated methods of prophage loci identification and appears useful for predicting prophage loci in a wide range of prokaryotic taxonomic groups. The Prophage Finder application is available at: http://bioinformatics.uwp.edu/~phage/ProphageFinder.php.



Materials and methods

Prophage Finder is a PHP application offering several user-defined parameters to facilitate prediction of prophage loci within FASTA-formatted text files containing DNA sequences ranging from 5 kb to 10 Mb. Prophage Finder initially uses BLASTX [Altschul et al., 1997] with a user-defined threshold value (E-value) to identify sequence matches within a database of predicted amino acid sequences derived from all sequenced phage genomes available at www.ncbi.nlm.nih.gov/genomes/static/phg.html. Sequences belonging to members of the Caudovirales, double-stranded DNA phage that often exhibit a lysogenic phage cycle, comprise approximately 90% of this database, with the remaining sequences including representative archaeal double-stranded DNA phage (Lipothrixviridae, Rudiviridae, and Fuselloviridae) and bacterial phage, which have not been characterized as lysogenic. Following BLASTX analysis, Prophage Finder uses a Perl program to parse the output, sorting predicted prophage loci based upon clusters of significant, independent sequence matches neighboring each other within a user-defined number of base pairs (Hit Spacing) and with a specified number of neighboring matches (Hits per Prophage). Hit Spacing is a user-defined variable, between 3.5 kb and 6 kb, that defines the neighboring criterion for clustering: the maximum length in nucleotides allowed between independent, significant sequence matches. Hit Spacing values allow for additional sequences unrelated to known phage gene products (and not represented within the sequence database) between significant sequence matches, which obviates the need for adjacent phage-specific genes in prophage prediction. Hits per Prophage is a user-defined variable describing the minimal number of significant sequence matches, 5 to 10, that must satisfy the neighboring criterion to designate a putative prophage locus.
The threshold and default values for the user-defined variables were determined empirically through benchmark testing. Similar to previous text mining efforts used to identify prophage loci [Canchaya et al., 2004], a filter was generated from Prophage Finder benchmark testing that removes certain BLASTX hits (e.g. transposase and topoisomerase) that substantially increase the number of false positives.
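The clustering step described above can be illustrated with a minimal Python sketch. It is not the Perl program distributed with Prophage Finder. The Hit record, the function names, the keyword list, and the way the gap between matches is measured (end of the furthest preceding match to start of the next) are assumptions made for illustration only. Each record is assumed to represent one independent, significant BLASTX match that has already passed the E-value threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    start: int        # query coordinate where the match begins (bp)
    end: int          # query coordinate where the match ends (bp)
    description: str  # subject description taken from the BLASTX report


# Hypothetical exclusion list: the text names only these two examples.
EXCLUDED_KEYWORDS = ("transposase", "topoisomerase")


def filter_hits(hits: List[Hit]) -> List[Hit]:
    """Drop matches whose description contains an excluded keyword."""
    return [
        h for h in hits
        if not any(k in h.description.lower() for k in EXCLUDED_KEYWORDS)
    ]


def cluster_hits(hits: List[Hit], hit_spacing: int = 5500,
                 hits_per_prophage: int = 5) -> List[List[Hit]]:
    """Group neighboring matches and keep groups with enough members.

    Two consecutive matches are neighbors when the gap between them is at
    most hit_spacing base pairs. Reverse-frame matches (start > end) are
    handled by ordering each pair of coordinates first.
    """
    ordered = sorted(filter_hits(hits), key=lambda h: min(h.start, h.end))
    clusters: List[List[Hit]] = []
    current: List[Hit] = []
    current_end = 0
    for hit in ordered:
        lo, hi = sorted((hit.start, hit.end))
        if current and lo - current_end > hit_spacing:
            # Gap too large: close the running cluster and start a new one.
            if len(current) >= hits_per_prophage:
                clusters.append(current)
            current = []
        current.append(hit)
        current_end = hi if len(current) == 1 else max(current_end, hi)
    if len(current) >= hits_per_prophage:
        clusters.append(current)
    return clusters
```

With the default values (Hit Spacing = 5500, Hits per Prophage = 5), five matches separated by gaps of a few kilobases form one putative locus, whereas an isolated match tens of kilobases away is discarded.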

Additional analyses are available to assist the user in the identification of prophage loci, including tRNA prediction, GC content calculations, and codon usage frequency. Prophage loci are often, but not exclusively, found adjacent to tRNA genes due to a specific mode of integration [Canchaya et al., 2004]. In addition, fluctuations in overall GC content, codon position-specific GC content, and codon usage may be indicators of horizontal gene transfer events such as phage integration [Lawrence and Ochman, 1997; Ochman et al., 2005]. While none of these attributes is a clear indicator of prophage loci, taken together such data may offer corroboration for prophage assignment. Users can select the option to have Prophage Finder use tRNAscan-SE to predict tRNA genes within their input DNA sequence and generate a text file output, while all GC content and codon frequency calculations are performed by default and included in the output. These analyses are not used by Prophage Finder and do not affect predictions generated by this program. Instead, they provide the user with ancillary information that may be useful for the identification of prophage loci.
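The ancillary sequence statistics mentioned above can be sketched as follows. This is an illustrative Python example, not the code used by Prophage Finder. The function names are hypothetical, the input is assumed to be an in-frame coding sequence, and ambiguous bases such as N are simply ignored in the GC calculation.

```python
from collections import Counter
from typing import Dict, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G or C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def codon_position_gc(cds: str) -> Tuple[float, float, float]:
    """GC fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return tuple(gc_content(cds[pos:usable:3]) for pos in range(3))


def codon_usage(cds: str) -> Dict[str, float]:
    """Relative frequency of each codon in an in-frame sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

Comparing these values for a predicted locus with the same values for the whole genome is one way a user might look for the compositional fluctuations described above. As noted in the text, such differences are corroborating evidence only.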

Several text files are available as output for the user to examine and curate predicted prophage loci. The BLASTX output (BlastOut) generated by comparison of the input DNA sequence to the phage protein database and used to generate clusters of significant hits is available, together with the predicted prophage loci DNA sequences (ProphageFinderDNASeqs), predicted prophage loci gene sequences (ProphageFinderGeneSeqs), and predicted prophage loci protein sequences (ProphageFinderProtSeqs). Two summary files are also available: a brief summary of clustered significant hits with associated coordinates (ProphageFinderSummary) and a complete summary (ProphageFinderComplete) that contains all of the information available in the other individual files and ancillary information such as GC content calculations and codon frequency. Submitted jobs are placed in a queue on a SunFire V480 Server 2 with dual 900 MHz processors, and results are returned to the email address entered by the user. In addition, the Perl program for parsing BLASTX output can be downloaded from the website and installed locally to expedite analyses.



Results and discussion

For benchmark testing of Prophage Finder, prokaryotic genome sequences with existing well-defined, manual annotation data regarding prophage loci were analyzed and Prophage Finder results were compared to other automated and manual prophage prediction approaches [Casjens, 2003; Mehta et al., 2004]. Genome sequences were examined under varying parameter conditions to assess the effectiveness of Prophage Finder predictions using different settings and to derive default settings as a first-pass approach. Sample results reflecting 12 genome sequences with varying number and sizes of prophage sequences are presented in Tab. 1. As expected, strict parameters for Prophage Finder increase the specificity of the application, resulting in only one false positive among the various genome sequences, a locus in Xylella fastidiosa 9a5c that was described as a "phage related region" in the manual annotation [Casjens, 2003]. Permissive parameters increase the sensitivity of the application, resulting in only seven total false negatives in this dataset. Three of these false negative results correspond to prophage loci predicted by manual curation to contain three or fewer genes. These minimal prophage loci, together with gene duplication events and operons containing multiple genes with significant sequence similarity to phage proteins, present obvious challenges for use of clustering methodologies that minimize both false negatives and false positives.


Table 1: Benchmarks for Prophage Finder based upon manual annotations

Pos = Positive; FP = False Positive; FN = False Negative.

Organism | Manually annotated (a) | Strict (b): Pos / FP / FN | Default: Pos / FP / FN | Permissive: Pos / FP / FN | Mehta et al. (c)
Bacillus subtilis 168 | 5 | 3 / 0 / 2 | 4 / 7 / 1 | 4 / 15 / 1 | 2
Caulobacter crescentus | 1 | 0 / 0 / 1 | 0 / 2 / 1 | 0 / 7 / 1 | 0
Escherichia coli K12 | 10 | 5 / 0 / 5 | 6 / 2 / 4 | 7 / 13 / 3 | 4
Escherichia coli O157:H7 EDL933 | 20 | 17 / 0 / 3 | 20 / 8 / 0 | 20 / 16 / 0 | 14
Lactococcus lactis subsp. lactis IL1403 | 6 | 6 / 0 / 0 | 6 / 5 / 0 | 6 / 12 / 0 | ND
Mesorhizobium loti MAFF303099 | 3 | 1 / 0 / 2 | 3 / 6 / 0 | 3 / 16 / 0 | 1
Neisseria meningitidis Z2491 | 3 | 2 / 0 / 1 | 3 / 5 / 0 | 3 / 8 / 0 | 2
Salmonella enterica subsp. enterica serovar Typhi str. CT18 | 11 | 8 / 0 / 3 | 8 / 7 / 3 | 11 / 14 / 0 | ND
Streptococcus pyogenes M1 GAS | 4 | 3 / 0 / 1 | 4 / 1 / 0 | 4 / 6 / 0 | 2
Xylella fastidiosa 9a5c | 9 | 4 / 1 / 5 | 7 / 4 / 2 | 7 / 8 / 2 | 2
Xylella fastidiosa Temecula 1 | 8 | 3 / 0 / 5 | 8 / 1 / 0 | 8 / 9 / 0 | ND
Total Identified | 80 | 52 / 1 / 28 | 69 / 54 / 11 | 73 / 145 / 7 | 27
Percent Identified | 100 | 65 / - / 35 | 86 / - / 14 | 91 / - / 9 | 49

(a) Based on manual annotation results from Casjens, 2003.
(b) Settings reflect the following values for Prophage Finder searches. Strict: E-value = 0.001, Hits per Phage = 10, Hit spacing = 3500; Default: E-value = 0.5, Hits per Phage = 5, Hit spacing = 5500; Permissive: E-value = 1.0, Hits per Phage = 4, Hit spacing = 6000.
(c) Summary of results from Mehta et al., 2004. "ND" indicates Not Determined.


Within the range of specificity and sensitivity offered by Prophage Finder, suggested (default) settings were chosen following benchmark testing that specifically minimized false negatives relative to false positives. Although use of the default settings offers a good starting point, it is appropriate to perform Prophage Finder analyses using several distinct parameter settings. When compared to a previously described prophage prediction method [Mehta et al., 2004], Prophage Finder identifies ~16 percentage points more of the manually annotated prophage loci under strict conditions (65% versus 49%) and ~37 percentage points more under default conditions (86% versus 49%). In addition to the benchmark testing, several draft prokaryotic genome sequences available through the Department of Energy and Joint Genome Institute microbial genome sequencing efforts were analyzed using Prophage Finder under default settings to examine the utility of Prophage Finder on prokaryotic genome sequences representing a broad taxonomic context. Complete results from these analyses are available through the Web site and are currently undergoing manual annotation. Results from the benchmark testing and initial analysis of the draft genome sequences demonstrate that Prophage Finder is a valuable tool for aiding in the identification of prophage loci.
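The percentages compared above can be reproduced from the totals in Table 1 with a short calculation, sketched here in Python. The denominator used for the Mehta et al. column is an assumption: the three genomes marked "ND" (6, 11 and 8 manually annotated loci) are excluded, which leaves 55 loci. This reading is not stated in the table but is consistent with the reported 49%.

```python
def percent_identified(found: int, annotated: int) -> int:
    """Share of manually annotated loci recovered, as a rounded percentage."""
    return round(100 * found / annotated)


ANNOTATED_TOTAL = 80  # manually annotated loci across the listed genomes

# "Positive" totals from Table 1 for each Prophage Finder setting.
positives = {"Strict": 52, "Default": 69, "Permissive": 73}
prophage_finder = {
    name: percent_identified(found, ANNOTATED_TOTAL)
    for name, found in positives.items()
}

# Assumption: genomes marked ND for Mehta et al. are left out of the
# denominator (80 - 6 - 11 - 8 = 55 loci).
mehta_denominator = ANNOTATED_TOTAL - (6 + 11 + 8)
mehta = percent_identified(27, mehta_denominator)

# Differences in percentage points relative to Mehta et al.
gain_strict = prophage_finder["Strict"] - mehta
gain_default = prophage_finder["Default"] - mehta
```

Under these assumptions the strict and default settings exceed the earlier method by 16 and 37 percentage points, respectively, matching the figures quoted in the text.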

Although Prophage Finder appears to be an effective tool for locating prophage loci in wide-ranging bacterial taxonomic groups, this application currently appears limited for analysis of archaeal genome sequences. Despite inclusion of known archaeal phage in the database available at Prophage Finder, this application has not been successful in identifying new prophage loci in draft or completed archaeal genome sequences. Future work will concentrate on augmentation and refinement of the phage sequence database available at Prophage Finder to increase the sensitivity and specificity of this annotation tool and enhance the analysis of both bacterial and archaeal prophage sequences.




References