In Silico Biology 6, 0035 (2006); ©2006, Bioinformation Systems e.V.  


ISHAN: sequence homology analysis package


Pratip Shil, Niraj Dudani and Pandit B. Vidyasagar*




Biophysics Laboratory, Department of Physics
University of Pune, Pune- 411007, India



* Corresponding author
   Email: pbv@physics.unipune.ernet.in
   Phone: +91-20-25692678





Edited by H. Michael; received June 03, 2006; revised and accepted July 04, 2006; published July 11, 2006



Abstract

Sequence-based homology studies play an important role in evolutionary tracing and classification of proteins. Various methods are available to analyze biological sequence information. However, with the advent of the proteomics era, there is a growing demand for analysis of large amounts of biological sequence information, and programs that provide speedy analysis have become necessary. ISHAN has been developed as a homology analysis package built on several sequence analysis tools, viz. FASTA, ALIGN, CLUSTALW, PHYLIP and CODONW (for DNA sequences). This JAVA application offers the user a choice of analysis tools. For testing, ISHAN was applied to perform phylogenetic analysis of sets of Caspase 3 DNA sequences and NF-κB p105 amino acid sequences. By integrating several tools, it makes analysis much faster and reduces manual intervention.

Keywords: homology analysis, pairwise alignment, multiple sequence alignment, phylogeny, JAVA, software



Introduction

With the advent of the genomic and proteomic era, the interest in and demand for homology analysis and evolutionary tracing of biological sequences have greatly increased. Such analysis has immense potential to provide background support to researchers in biotechnology, drug design, etc. Recently, cancer bioinformatics has also been developing into a branch of research. This is because the genomic era has compelled researchers to investigate the molecular basis of disease, which requires detailed knowledge of DNA and amino acid sequences. This necessitates the development of software tools that provide efficient sequence analysis. Though various tools are available for pairwise and multiple alignment of biological sequences, many of these, like FASTA, ALIGN and CLUSTALW, require commands to be issued manually and repetitively, especially when analyzing a large set of sequences. For example, in a set of n sequences, the number of pairs is n(n-1)/2, so the number of commands to be issued grows as O(n²). Most commonly used tools do not offer any solution to this problem. Such problems were faced by the authors in earlier projects related to the study of molecular evolution of RUBISCO [1, 2]. Web-based services for homology analysis provide either pairwise alignment (PA), multiple sequence alignment (MSA) or phylogenetic analysis [3]. The present application provides a marked improvement by integrating various tools under one wing, thereby reducing analysis time and manual intervention.
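
The quadratic growth in pairwise comparisons described above can be illustrated with a short sketch. It is not part of ISHAN (which is a Java application); the function names sequence_pairs and pair_count and the sequence labels are assumptions made for illustration only.

```python
from itertools import combinations


def sequence_pairs(names):
    """Return every unordered pair of sequence names, each pair exactly once."""
    return list(combinations(names, 2))


def pair_count(n):
    """Closed-form number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical labels; any list of sequence identifiers would do.
    names = ["seq%d" % i for i in range(1, 15)]  # 14 sequences
    pairs = sequence_pairs(names)
    # Both numbers agree: enumeration and the closed form give the same count.
    print(len(pairs), pair_count(len(names)))
```

For a query set of 14 sequences this gives 91 pairs, each of which would otherwise need its own manually issued alignment command.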

ISHAN integrates tools such as FASTA, ALIGN, CLUSTALW, CODONW and the PHYLIP toolkit for comprehensive analysis of biological sequences. It provides pairwise alignment of sequences based on dynamic programming methods such as ALIGN, with an option to use FASTA global alignment instead. It also offers progressive multiple sequence alignment through CLUSTALW and uses the PHYLIP toolkit for construction of phylogenetic trees. ISHAN has been used to perform phylogenetic analysis of sets of amino acid (AA) and DNA sequences for proteins related to apoptosis and cellular stress response. The ISHAN application is available at: http://physics.unipune.ernet.in/~pbv/ishan.html. Users are advised to install the JAVA Runtime Environment version 1.4 or higher; a download link is provided on the same webpage.



Materials and methods


ISHAN has been implemented in Java (Java 2 SDK version 1.4.2) [4]. This allows rapid development and reuse of components by users. It provides a graphical user interface (GUI) which is easy to operate. From the user interface window, the user can submit a query set of amino acid or nucleotide sequences, select the tools for analysis and set the parameters for each chosen tool (Fig. 1). The sequence analysis tools, viz. FASTA (version 3.4; ftp://ftp.virginia.edu/pub/fasta) [5, 6, 7], ALIGN (version 2.0; ftp://ftp.virginia.edu/pub/fasta) [8], CLUSTALW (version 1.83; http://www.cf.ac.uk/biosi/research/biosoft/downloads/clustalw.html) [9, 10], CODONW (version 1.3; http://codonw.sourceforge.net) [11] and the PHYLIP toolkit (version 3.63; http://evolution.genetics.washington.edu/phylip.html) [12, 13], were downloaded from the internet. The hardware and software requirements for installing and running ISHAN are: an IBM-compatible PC with a minimum 233 MHz processor, 32 MB RAM and 10 MB disk space; Microsoft Windows 98/ME/2000/NT/XP; and supporting software, namely Java 2 Runtime Environment version 1.4.2 (or higher), spreadsheet software (e.g. MS Excel) for tabulation of scores and image viewing software (e.g. MS Paint).

The output from ISHAN includes files containing the PA, MSA, phylogenetic tree and codon usage (for DNA sequences only). It also generates two tables: one listing the sequences and the other listing the identities (%) among the sequences (outputs from PA) (Tab. 1). This is advantageous, since most web-based services do not provide such output. ISHAN generates a report in HTML format which can be viewed in any web browser. This page displays the information about the sequences, the identity table and all the other outputs mentioned above.



Figure 1: Graphical user interface of ISHAN: form for submission of the query set. The user can choose the analysis tools and set the parameters for each.


Table 1: Summary of pairwise alignment (PA) outputs of ISHAN in tabulated form for Caspase 3 DNA sequences.
                 Rat   Human  Mouse  Rabbit  Hamster  Chick  Pig   Zebrafish  Cat   Dog   Chimp  Squirrel monkey  Chinese hamster  Pufferfish
Rat              100
Human            84.9  100
Mouse            90.5  83.6   100
Rabbit           84.8  87.1   84.5   100
Hamster          88.7  87.2   87.9   86.1    100
Chick            68.1  69.6   69     69      69.3     100
Pig              83.1  89     83.3   85.7    84.7     70.1   100
Zebrafish        63    60.7   59.6   60.9    61       61.7   60.7  100
Cat              83.6  88.2   81.9   85.1    85.1     68.3   88.2  58.9       100
Dog              84.1  89.4   82.5   85.5    84.9     69.4   88.8  60.2       89.9  100
Chimp            85    99.5   83.5   87.2    87.1     69.5   88.6  60.8       87.9  89.1  100
Squirrel monkey  85.3  95.9   83.6   86.8    86.5     69.1   88.5  61.1       88.2  89.3  95.6   100
Chinese hamster  84.3  99.3   83     86.5    86.6     68.8   88.2  60.8       87.5  88.7  98.8   95.2             100
Pufferfish       61.6  59.4   61.1   60.9    62       60.6   59.2  68.5       59.6  59.7  60.1   60.5             59.4             100
N.B.: Numbers represent the identity (%) between each pair of sequences.
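
A lower-triangular identity table of the kind shown in Table 1 could be assembled from pairwise alignment results along the following lines. This is a minimal sketch and not ISHAN's actual code (ISHAN is written in Java); the function name identity_table, the dictionary layout and the tab-separated output format are assumptions chosen so that the result opens directly in spreadsheet software.

```python
def identity_table(names, identities):
    """Format pairwise identities (%) as a lower-triangular, tab-separated table.

    names      -- sequence labels in the desired row/column order
    identities -- dict mapping frozenset({a, b}) to the percent identity of a and b
    """
    lines = ["\t".join([""] + list(names))]  # header row with an empty corner cell
    for i, row in enumerate(names):
        cells = [row]
        for col in names[:i]:
            # Look up each pair regardless of the order in which it was stored.
            cells.append(str(identities[frozenset((row, col))]))
        cells.append("100")  # diagonal: a sequence is identical to itself
        lines.append("\t".join(cells))
    return "\n".join(lines)


# Three values copied from Table 1 (Caspase 3 DNA sequences).
scores = {
    frozenset(("Rat", "Human")): 84.9,
    frozenset(("Rat", "Mouse")): 90.5,
    frozenset(("Human", "Mouse")): 83.6,
}
table = identity_table(["Rat", "Human", "Mouse"], scores)
print(table)
```

Keying the scores by an unordered pair avoids storing each value twice and makes the row and column order of the table independent of the order in which the alignments were run.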




Results and discussion

To test its functionality, ISHAN was used to carry out homology studies on various proteins related to apoptosis and considered important in cancer research, viz. Bcl2, NF-κB p100, NF-κB p105, Caspase 1, Caspase 2, Caspase 3 and Caspase 6. In this paper, results for the following query sets are discussed: Caspase 3 DNA sequences and NF-κB p105 AA sequences. Details of the results are available at our results webpage: http://physics.unipune.ernet.in/~pbv/ishan-results.ppt.

A total of 14 Caspase 3 DNA sequences from different sections of the animal kingdom were chosen as a query set and analyzed using ISHAN. DNA sequences were downloaded in FASTA format from the GenBank database (http://www.ncbi.nlm.nih.gov/Genbank/). Sequences from the following species were considered: human (Homo sapiens), rat (Rattus norvegicus), mouse (Mus musculus), rabbit (Oryctolagus cuniculus), hamster (Cricetinae sp.), chick (Gallus gallus), pig (Sus scrofa), zebrafish (Danio rerio), cat (Felis catus), dog (Canis familiaris), chimpanzee (Pan troglodytes), Bolivian squirrel monkey (Saimiri boliviensis), Chinese hamster (Cricetulus griseus) and pufferfish (Takifugu rubripes). All the analysis tools were run with default parameters. The human sequence appeared close to those of chimpanzee, squirrel monkey and Chinese hamster (above 90% identity) (Tab. 1), and these sequences appear in one cluster in the phylogenetic tree. The sequences from dog and cat form a separate branch; the sequences from the rodent species, viz. rat, mouse and hamster, are further away from the human sequence (83.6-87.2% identity; Tab. 1). The CODONW analysis reveals that the GC content of Caspase 3 DNA in human, chimpanzee and dog is considerably lower (~39%) than in all the other species studied (GC content >42%). For the fish species, i.e. zebrafish and pufferfish, the GC content was found to be very high (>48%). The codon usage analysis also reveals that in the higher mammals, viz. human, chimpanzee and dog, this gene contains similar codons. Similarities at the codon level indicate similarity of the corresponding protein sequences.
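
The two quantities discussed above, overall GC content and codon counts, can be illustrated with a minimal sketch. It is not CODONW's implementation, and CODONW reports additional, more refined indices; the function names gc_content and codon_counts and the example fragment are assumptions for illustration only.

```python
from collections import Counter


def gc_content(dna):
    """Percentage of G and C bases among the A/C/G/T bases of a DNA sequence."""
    seq = dna.upper()
    counted = [base for base in seq if base in "ACGT"]  # skip gaps and ambiguity codes
    if not counted:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


def codon_counts(dna):
    """Count codons in frame from the first base; a trailing partial codon is ignored."""
    seq = dna.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


# Made-up 18-base fragment, NOT a real Caspase 3 sequence.
example = "ATGGAGAACACTGAAAAC"
print(round(gc_content(example), 1))
print(codon_counts(example).most_common(2))
```

Comparing such codon counts between species is one simple way to see whether a gene uses similar codons, as reported here for human, chimpanzee and dog.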

Complete amino acid sequences for NF-κB p105 were downloaded in FASTA format from the UniProt database (http://www.pir2.uniprot.org/) and the NCBI database (http://www.ncbi.nlm.nih.gov/). The following species were selected for the study: human (Homo sapiens), rat (Rattus norvegicus), mouse (Mus musculus), dog (Canis familiaris) and chimpanzee (Pan troglodytes). The MSA (CLUSTALW) output indicates that the NF-κB p105 AA sequences are fairly conserved (733 amino acids matching out of an overlap of around 950). Results from the pairwise alignments (outputs of ISHAN using the ALIGN tool) suggest that the NF-κB p105 sequence from human is closer to that of chimpanzee than to the others. ISHAN has provisions for generating a table similar to Tab. 1 whenever a set of sequences is tested for PA. The phylogenetic tree shows the evolutionary relationship among the sequences (please see our results webpage). The results of phylogenetic analysis using ISHAN for amino acid and DNA sequences corresponding to various proteins (related to apoptosis and considered important in cancer research) are summarized on our results webpage (mentioned earlier).

ISHAN has successfully reduced manual intervention and working time for comprehensive sequence analysis. For example, for a set of 13 DNA sequences, complete analysis using the individual tools, including compilation of results and tabulation of percentage identities, would require approximately 5 working days (assuming about 7 working hours per day and allowing for human efficiency), whereas with ISHAN the compilation of results and tabulation of data take around 5 seconds (on a PC with an Intel Pentium 4 processor).



Conclusion

ISHAN provides a user-friendly, flexible platform for performing fast homology analysis and molecular phylogenetic studies on protein and DNA sequences by bringing together all the relevant tools in a single package. Since the framework facilitates speedy alignments and compilation of data, evolutionary tracing of proteins and genes can be carried out more quickly using ISHAN.



Acknowledgements

The authors would like to thank the Bhabha Atomic Research Centre-University of Pune Collaborative Research Program (BARC-PU CRP) for sponsoring the project. PBV and PS would like to thank AS-ICTP, Trieste, Italy, for providing the associateship scheme and library facilities.




References


  1. Vidyasagar, P. B., Shil, P. and Thomas, S. (2004). Conserved oligopeptides in the RUBISCO large chains: An Evolutionary perspective. In: Life in the Universe, Seckbach, J., Chela-Flores, J., Owen, T. and Raulin, F. (eds.), Kluwer Academic Publishers, Netherlands, pp. 133-134.

  2. Vidyasagar, P. B., Shil, P. and Thomas, S. (2005). Evolution of ribulose bisphosphate carboxylase / oxygenase (rubisco) large chains: in silico study. Physiol. Mol. Biol. Plants 11, 225-230.

  3. VBI Bioinformatics Web Services at http://ppdev.bioinformatics.vt.edu:6565?pathportWeb/genomeToolAction.do

  4. Holzner, S. (2000). Java 2, Swing, Servlets, JDBC & Java Beans Programming. The Coriolis Group, Arizona, USA.

  5. Lipman, D. J. and Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.

  6. Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.

  7. Pearson, W. R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 183, 63-98.

  8. Myers, E. and Miller, W. (1988). Optimal alignments in linear space. Comput. Appl. Biosci. 4, 11-17.

  9. Higgins, D. G. and Sharp, P. M. (1988). CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244.

  10. Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight-matrix choice. Nucleic Acids Res. 22, 4673-4680.

  11. Peden, J. (2005). Correspondence Analysis of Codon Usage. (http://www.molbiol.ox.ac.uk/cu/culong.html#Codonw)

  12. Felsenstein, J. (2004). PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.

  13. Felsenstein, J. (1989). PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5, 164-166.