| In Silico Biology 7, 0020 (2007); ©2007, Bioinformation Systems e.V. |
1 Citrus Research Board, 323 W. Oak, P.O. Box 230, Visalia, CA 93279, USA
2 USDA-ARS, 9611 So. Riverbend Ave. Parlier, CA 93648, USA
3 University of California, Davis, Department of Viticulture and Enology, Davis, CA 95616, USA
* Corresponding author
Email: eciverolo@fresno.ars.usda.gov;
Phone: +1-559-596 2702
Edited by E. Wingender; received December 18, 2006; revised February 20, 2007; accepted February 24, 2007; published February 28, 2007
The growing number of whole-genome sequences of microorganisms has made genome-wide annotation and gene sequence comparison among multiple microorganisms increasingly complex. To address this problem, we have developed the nWayComp software, which compares DNA and protein sequences of phylogenetically related microorganisms. The package integrates a series of bioinformatics tools, including BLAST, ClustalW, ALIGN, PHYLIP and PRIMER3, for sequence comparison. It searches for homologous sequences among multiple organisms and identifies genes that are unique to a particular organism. The homologous gene sets are then ranked in ascending order of sequence similarity. For each set of homologous sequences, a table of sequence identities among the homologous genes is produced, together with sequence variations such as SNPs and INDELs, and a phylogenetic tree is constructed. In addition, a common set of primers that can amplify all the homologous sequences is generated. The nWayComp package provides users with a quick and convenient tool to compare genomic sequences among multiple organisms at the whole-genome level.
Keywords: sequence comparison, homology, unique, SNP, INDEL, Phylip, primer
Biologists frequently need to identify homologous or unique genes among multiple strains of a species or among phylogenetically related species [1]. Unique gene sequences are often used as templates for designing primers for strain/species detection by PCR [2]. Similarly, variations in homologous DNA sequences, such as SNPs (single nucleotide polymorphisms) and INDELs (insertions and deletions), can serve as reliable biomarkers for microbial strain differentiation [3, 4]. In addition, these biomarkers can be related to biological differences among species and strains, such as host specificity and pathogenicity [4]. However, none of the currently available bioinformatics tools can automatically identify homologous and unique genes among multiple species and/or strains to facilitate whole-genome comparisons. Comparative genomics tools for genome-wide applications are therefore needed, particularly as the number of whole microbial genome sequences is increasing rapidly. To date, the genomes of 286 bacterial species have been completely sequenced, and the number continues to increase. This includes multiple-strain sequence information for 81 bacterial species. To address this need, we developed a stand-alone software package, the nWay-comparison tool (nWayComp), that compares genomic sequences among multiple organisms.
nWayComp automates the task of sequence comparison by integrating various bioinformatics tools such as BLASTALL [5] for identifying homologous sequences, ALIGN [6] for calculating sequence identities, CLUSTALW [7] for making multiple sequence alignments, PRIMER3 [8] for designing primers and PHYLIP [9] tool kit for constructing phylogenetic trees.
We applied this package to compare the annotated genes and proteins of six Xanthomonas strains in order to identify homologous and unique genes. The software is written in Perl and is freely downloadable at http://fresno.ars.usda.gov/citrusdisease/downloads/nWayComp.htm; technical support is also provided. An accompanying program, 'translator', which translates DNA or RNA sequences into amino acid sequences in batch mode, is also freely downloadable. Future upgrades of the software will depend on users' suggestions and requests.
nWayComp requires the user to save each sequence file in FASTA format in a folder called 'Genomebase'. The command line has three basic options: -p (sequence type: n, nucleotide; p, amino acid); -e (E-value; the default is 1e-10 for nucleotide sequences and 1e-15 for proteins); -a (alignment option: f, without alignment, the default; t, with alignment). Other optional parameters for primer design and PHYLIP phylogenetic tree construction are described at the website: http://fresno.ars.usda.gov/citrusdisease/downloads/nWayComp.htm.
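The three basic options can be illustrated with a minimal Python sketch. This is not the nWayComp source, which is written in Perl. The program name `nwaycomp_sketch`, the function `parse_options`, and the choice to make -p mandatory are assumptions made for illustration; only the flag letters and default values come from the text above.

```python
import argparse

# Default E-values per sequence type, as stated in the text.
DEFAULT_EVALUE = {"n": 1e-10, "p": 1e-15}

def parse_options(argv):
    """Parse the three basic options (hypothetical helper, not the Perl code)."""
    parser = argparse.ArgumentParser(prog="nwaycomp_sketch")
    # Assumption: -p is treated as mandatory because the text gives no default.
    parser.add_argument("-p", dest="seq_type", choices=["n", "p"], required=True,
                        help="sequence type: n = nucleotide, p = amino acid")
    parser.add_argument("-e", dest="evalue", type=float, default=None,
                        help="BLAST E-value cut-off")
    parser.add_argument("-a", dest="align", choices=["f", "t"], default="f",
                        help="f = without alignment (default), t = with alignment")
    opts = parser.parse_args(argv)
    # The E-value default depends on the sequence type, so it is filled in last.
    if opts.evalue is None:
        opts.evalue = DEFAULT_EVALUE[opts.seq_type]
    return opts
```

Calling `parse_options(["-p", "n"])` would select the nucleotide defaults, while an explicit `-e` value overrides the type-specific default.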
For ease of programming, nWayComp renames each sequence file with a number starting from '001' according to the alphabetical order of the file names. Next, nWayComp generates all possible comparison combinations for the input files. For this purpose, it splits the input files in every possible way into two groups, A and B. A combination is expressed as the file name(s) in group A, each preceded by a "+" sign, followed by the file names in group B, each preceded by a "−" sign. The first file name in a combination always carries a "+" sign, which is therefore omitted. For example, "001+002+003+004" is a combination for which nWayComp searches for genes that have homologs in all four files; "001-002-003-004" is a combination for which nWayComp searches for genes of 001 that have no homologs in 002, 003 and 004 (i.e. unique genes of 001). In this paper, group A is not permitted to be empty. Therefore, for n input files there are 2^n − 1 combinations. After all the combinations among the input files have been determined, nWayComp searches for homologous genes among all the sequence files of group A (whose names have a preceding "+" sign) and filters out those that show homology to any gene in the sequence files of group B (whose names have a preceding "−" sign).
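The numbering, enumeration and filtering steps described above can be sketched as follows. This is an illustrative Python sketch, not the Perl implementation; all function names are hypothetical, and the representation of a gene set as the set of file labels in which it has a homolog is an assumption.

```python
from itertools import product

def number_files(file_names):
    """Map each file name to a three-digit label in alphabetical order."""
    return {name: f"{i:03d}" for i, name in enumerate(sorted(file_names), start=1)}

def enumerate_combinations(labels):
    """Yield (group_a, group_b) for every split in which group A is non-empty."""
    for signs in product("+-", repeat=len(labels)):
        group_a = [lab for lab, s in zip(labels, signs) if s == "+"]
        group_b = [lab for lab, s in zip(labels, signs) if s == "-"]
        if group_a:  # the all-minus split is excluded, leaving 2**n - 1 splits
            yield group_a, group_b

def format_combination(group_a, group_b):
    """Render a split such as '001+002-003'; the leading '+' is omitted."""
    return "+".join(group_a) + "".join("-" + lab for lab in group_b)

def matches(present_in, group_a, group_b):
    """True if a gene set has homologs in every file of A and in no file of B.

    present_in: set of file labels in which the gene set has a homolog
    (an assumed representation of the BLAST results).
    """
    return (all(lab in present_in for lab in group_a)
            and not any(lab in present_in for lab in group_b))
```

For four input files this enumeration yields 15 combinations, including "001+002+003+004" and "001-002-003-004"; for six files it yields 63.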
The output file of nWayComp (Fig. 1A) shows each sequence file combination and the corresponding number of homologous gene sets or unique genes. A sequence file combination that has no value (i.e. for which no homologous gene sets or unique genes are identified) is not listed in the output file.
For each sequence file combination, an HTML page (Fig. 1B) showing all sets of homologous genes is generated. The homologous gene sets are sorted in ascending order of sequence similarity, which is calculated as the standard deviation of the sequence identities among the homologous genes.
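The ranking criterion can be illustrated with a short Python sketch. The text does not say whether the population or the sample standard deviation is used, so the use of `statistics.pstdev` here is an assumption, as are the function names and the example gene-set names.

```python
from statistics import pstdev

def similarity_score(identities):
    """Standard deviation of the pairwise identity percentages of one gene set.

    Assumption: population standard deviation; the source does not specify.
    """
    return pstdev(identities)

def rank_gene_sets(gene_sets):
    """Return gene-set names in ascending order of the score.

    gene_sets: dict mapping a set name to its list of pairwise identities (%).
    """
    return sorted(gene_sets, key=lambda name: similarity_score(gene_sets[name]))

# Hypothetical identity values for three gene sets.
example = {
    "setA": [99.0, 98.5, 99.5],
    "setB": [70.0, 95.0, 80.0],
    "setC": [100.0, 100.0, 100.0],
}
ranking = rank_gene_sets(example)
```

Under this reading, a gene set whose members are all equally similar to each other has a score near zero and appears first in the list.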
For each set of homologous genes, the following files are created: an identity table showing the percentage of sequence identity among homologous genes (Fig. 1C), an HTML page showing the locations of SNPs and INDELs on the DNA sequences (Fig. 1D), a phylogenetic tree (Fig. 1E), common primers sorted in descending order of amplicon size (Fig. 1F), gene sequences in FASTA format (Fig. 1G) and a CLUSTALW sequence alignment file (Fig. 1F). However, if the sequence files contain amino acid sequences, the SNP, INDEL and common primer design files are not generated. We classified SNPs into two types, unique and non-unique. A unique SNP is a position in the alignment at which one nucleotide is unique to a single sequence, whereas a non-unique SNP is a position at which the variant nucleotide is shared by more than one input sequence.
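One possible reading of the SNP classification is sketched below in Python. The exact rules used by nWayComp are not given in the text, so the following are assumptions: a column with any gap is counted as an INDEL column rather than a SNP, a variable column is "unique" if at least one nucleotide occurs in exactly one sequence, and otherwise it is "non-unique". The function name and the result keys are hypothetical.

```python
from collections import Counter

def classify_columns(aligned):
    """Count unique SNPs, non-unique SNPs and INDEL columns in an alignment.

    aligned: list of equal-length strings that use '-' for gaps.
    """
    counts = {"unique_snp": 0, "non_unique_snp": 0, "indel_columns": 0}
    for column in zip(*aligned):
        if "-" in column:
            # Assumption: gapped columns are INDEL positions, never SNPs.
            counts["indel_columns"] += 1
            continue
        freq = Counter(base.upper() for base in column)
        if len(freq) == 1:
            continue  # conserved column, no variation
        if 1 in freq.values():
            # Some nucleotide is carried by exactly one sequence.
            counts["unique_snp"] += 1
        else:
            # Every nucleotide at this position is shared by several sequences.
            counts["non_unique_snp"] += 1
    return counts

# Hypothetical four-sequence alignment.
result = classify_columns(["ACGTA-", "ACGTAC", "ATGCAC", "ATGGAC"])
```

In this toy alignment the second column (C, C, T, T) is a non-unique SNP, the fourth column (T, T, C, G) is a unique SNP, and the last column is an INDEL position. Note that with only two input sequences every SNP is unique under this reading.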
nWayComp ranks all the homologous gene sets according to gene similarity, so that highly conserved genes appear at the top of the list. This ranking helps users identify genes that are highly conserved among multiple organisms. Identifying such genes has practical applications, because they provide more reliable detection loci for species or clade differentiation.
Currently, annotated genomes of six strains of four Xanthomonas species are available in GenBank. We used the nWayComp program to automate the comparative genome analysis of these strains. A description of these strains and species is available at: http://fresno.ars.usda.gov/citrus%2Ddisease/downloads/nwaycomp/Xanthomonas.htm.
nWayComp was tested with both the DNA sequences and the amino acid sequences of the six Xanthomonas strains. In the first step, the DNA sequences of all of the genes in the six strains were compared. The details of these comparisons are shown at: http://fresno.ars.usda.gov/citrus%2Ddisease/downloads/nwaycomp/DNAresult.htm. Next, we compared the amino acid sequences deduced from these DNA sequences. The results are available at: http://fresno.ars.usda.gov/citrus%2Ddisease/downloads/nwaycomp/AAresult.htm. There are 2^6 − 1 = 63 combinations for the above comparison set. Our analysis shows that there were 2,754 genes common to all the strains. Here we present the results of only seven combinations: one combination (001+002+003+004+005+006) showing the homologous gene sets among all six strains, and six others showing the unique genes of each strain (Tab. 1). The results indicate that 95.8% of the homologous gene sets among the six strains that were identified using DNA sequences were also included in the homologous gene sets identified using amino acid sequences. The small discrepancy (4.2%) between the two estimates could be due to individual sequences that lie at the border of the defined BLAST cut-off value (1e-05 for DNA sequences and 1e-10 for amino acid sequences).
| Table 1: Number of unique and homologous genes in six Xanthomonas strains. |
| Xanthomonas strain | Number of genes |
| X. axonopodis pv. citri str. 306 (unique genes) | 388 |
| X. campestris pv. campestris str. ATCC 33913 (unique genes) | 24 |
| X. campestris pv. vesicatoria str. 85-10 (unique genes) | 440 |
| X. campestris pv. campestris str. 8004 (unique genes) | 99 |
| X. oryzae pv. oryzae KACC10331 (unique genes) | 25 |
| X. oryzae pv. oryzae MAFF 311018 (unique genes) | 117 |
| All strain homologs | 2754 |
| The number of homologous genes in the six Xanthomonas strains and unique genes of each strain that are identified by nWayComp Program using DNA sequences (E-value=1e-05). |
In addition to the homologous genes, nWayComp found 388, 24, 440, 99, 25 and 111 genes that are unique to X. axonopodis pv. citri str. 306, X. campestris pv. campestris str. ATCC 33913, X. campestris pv. vesicatoria str. 85-10, X. campestris pv. campestris str. 8004, X. oryzae pv. oryzae KACC10331 and X. oryzae pv. oryzae MAFF 311018, respectively (Tab. 1). These unique genes could potentially be used as templates for strain/species detection. Currently, the 16S rDNA gene is frequently used as a biomarker for species and strain identification [10]. However, PCR primers designed from 16S rDNA give many false positive results [4]. One reason is that the sequence of this locus is conserved among many phylogenetically related and unrelated species, which makes strain detection and identification with such primers unreliable. In contrast, comparison of the DNA sequences of the six Xanthomonas strains showed that each strain/species has many unique genes. This information aids in designing primers for loci that are unique to specific Xanthomonas strains. Therefore, the overall quality of microbial detection and identification can potentially be improved significantly by reducing false positives.
In total, 112,927 unique SNPs, 447,866 non-unique SNPs and 20,704 nucleotides within INDELs were identified for the 2,639 homologous gene sets. The graphical view of the SNP and INDEL locations could assist users in designing TaqMan real-time PCR primers for SNP/INDEL detection (Fig. 1D).
Based on the MIPS functional category system [11], we classified the common and unique genes of the six Xanthomonas strains into functional categories. The result is available at: http://fresno.ars.usda.gov/citrus%2Ddisease/downloads/nwaycomp/category.xls. The six strains of four Xanthomonas species have homologous genes mainly in the categories of metabolism (25.5%), cellular transport (7.1%), energy (6.0%), protein fate (5.2%), and cell cycle and DNA processing (4.4%). Genes in these categories contribute primarily to cellular housekeeping activities. In addition, on average 87% of the unique genes of these Xanthomonas strains are unclassified (hypothetical proteins). However, some genes that are potentially associated with the pathogenicity of these strains occur among the remaining 13% of classified genes. For example, X. axonopodis pv. citri str. 306 has four pili genes (XAC2664, XAC2666, XAC2668 and XAC2669) that form a cluster. X. campestris pv. campestris str. 8004 has a unique sensor histidine kinase (XC_1382), which mediates adaptive responses to changes in environmental conditions [12]. X. campestris pv. vesicatoria str. 85-10 has nine unique transcriptional regulator genes (XCV2164, XCV2182, XCV2446, XCV2482, XCV0138, XCV4432, XCV2337, XCV2314 and XCV2335). These results provide a basis for identifying or differentiating strain-specific biological features (e.g., host specificity, pathogenicity and virulence). Many other unique genes fall into different functional categories and may provide data of interest to other researchers.
The common primer design function of nWayComp at the whole-genome level is a feature that has not been reported in other software. These primers can be used for microbial strain detection and identification, as well as for studying gene function using spotted microarray technology. The common primer feature avoids redundant primer design for homologous genes. The primer pairs are sorted in the descending order of the amplicon size for users' convenience.
Although nWayComp places no limit on the size of the input files, its running time increases significantly with input file size. Large files inevitably lead to long sequence alignment times, because alignment is performed by ClustalW in this program. Future improvements in running speed depend on the availability of multiple sequence alignment tools that are faster than ClustalW.
The nWayComp package is an easy and convenient tool for comprehensive comparative analyses of genomic sequences among multiple strains or among multiple phylogenetically related organisms. It automates a variety of analytical procedures for genome-wide gene comparison by bringing together various bioinformatics tools into one package. It also provides genome-wide primer design for spotted microarray analyses.
This project was supported by the California Citrus Research Board (CRB project No. 5300-05F).