| In Silico Biology 3, 0036 (2003); ©2003, Bioinformation Systems e.V. |
Bioinformatics Team, Scientific and Engineering Computing Group, Centre for Development of Advanced Computing
Pune University Campus, Ganeshkhind
Pune - 411007, India
phone: +91-20-5694084, fax: +91-20-5694081
email: rajendra@cdac.ernet.in
* corresponding author
Edited by H. Michael; received April 30, 2003; accepted June 04, 2003; published June 24, 2003
In the past decade there has been an increase in the number of completely sequenced genomes due to the race of multibillion-dollar genome-sequencing projects. The enormous biological sequence data thus flooding into the sequence databases necessitates the development of efficient tools for comparative genome sequence analysis. The information deduced by such analysis has various applications viz. structural and functional annotation of novel genes and proteins, finding gene order in the genome, gene fusion studies, constructing metabolic pathways etc. Such study also proves invaluable for pharmaceutical industries, such as in silico drug target identification and new drug discovery. There are various sequence analysis tools available for mining such useful information of which FASTA and Smith-Waterman algorithms are widely used. However, analyzing large datasets of genome sequences using the above codes seems to be impractical on uniprocessor machines. Hence there is a need for improving the performance of the above popular sequence analysis tools on parallel cluster computers. Performance of the Smith-Waterman (SSEARCH) and FASTA programs were studied on PARAM 10000, a parallel cluster of workstations designed and developed in-house. FASTA and SSEARCH programs, which are available from the University of Virginia, were ported on PARAM and were optimized. In this era of high performance computing, where the paradigm is shifting from conventional supercomputers to the cost-effective general-purpose cluster of workstations and PCs, this study finds extreme relevance. Good performance of sequence analysis tools on a cluster of workstations was demonstrated, which is important for accelerating identification of novel genes and drug targets by screening large databases.
Key words: comparative genomics, sequence analysis, Smith-Waterman, SSEARCH, FASTA, parallel computing, PARAM 10000, communication networks, fast ethernet, myrinet, PARAMNet, distributed computing, Disperse
The race of genome sequencing projects has led to a flood of data into the genome sequence databases like EMBL and GenBank, causing them to double in size almost every year [Stoesser et al., 2002; Benson et al., 2002]. Presently, 132 complete genome sequences are available and another 579 genome-sequencing projects are in progress [Bernal et al., 2001]. Mining the available genome sequence databases with analysis tools has a major role in comparative and functional genomics in that sequence similarities can be revealed and analyzed for structural and functional relationships between them. Comparative genomics, which involves the comparison of two complete genomes or sets of gene products of two different organisms, uses the similarity search tools to study the gene number, function and its order in the genome. Such study is also useful in detecting the unique genes present in the pathogenic organisms causing pathogenicity [Fraser et al., 2000] and these genes can be further studied as possible drug targets. In pharmacogenomic studies, the identification of single nucleotide polymorphs (SNPs), as well as various alternate splice sites using the sequence analysis tools, aid in understanding how an individual's genetic inheritance affects the body's response to drugs. However, mining the voluminous sequence databases to generate knowledge is a Herculean task.
There are various heuristic methods like FASTA [Pearson and Lipman, 1988] & BLAST [Altschul et al., 1990], and dynamic programming methods like Smith-Waterman [Smith and Waterman, 1981] to identify homologous sequences by finding similarities. Fast Alignment Search Tool (FASTA) is capable of providing the best local optimal alignments of genomic sequences using the combined heuristic and dynamic programming methods. The FASTA algorithm initially finds the regions shared by a pair of sequences having a high density of identities and subsequently re-scans the top ten best scoring regions using scoring matrices like PAM. A joining penalty can be given to connect the high scoring regions while an optimal alignment is obtained in the final step by using a dynamic programming method. Though the FASTA algorithm is comparatively slower than BLAST, the sensitivity in finding high-ranking scores among distantly related sequences is greater [Pearson, 1995]. FASTA, which has immense applications [Mao et al., 2000; Pearson, 2000], becomes more compute intensive especially for searching longer query sequences (more than 1 Mega bases) against large databases like the human genome sequence database, or the non-redundant (nr) database etc. A parallel implementation of FASTA [Deshpande et al., 1991], which was developed at the University of Virginia, can be used to accelerate huge database searches and also to perform all-against-all database searches [Sakharkar et al., 2002].
The Smith-Waterman (S-W) algorithm, which is a pair-wise sequence alignment method, implements dynamic programming to find the best local optimal alignment and implements length regression statistics to evaluate significant hits. It reports hits that can be missed by the heuristic methods like BLAST or FASTA and is hence used in various studies [Pearson, 1995; Rognes and Seeberg, 1998; Shpaer et al., 1996; Brutlag, 1998; Brenner et al., 1998]. The SSEARCH program, an implementation of S-W algorithm, is used in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database for building ortholog group tables by finding the sequence similarities among genes of all pairs of organisms [Ogata et al., 1999]. The metabolic pathways in the KEGG database are reconstructed based on these ortholog group tables. The Smith-Waterman algorithm is also used in the protein-protein interaction studies at the sequence level for identifying Rosetta stone sequences through gene fusion methods [Enright et al., 1999; Taniguchi and Kanehisa, 1998] and for sequence based structure predictions by transitive homology [Bolten et al., 2001]. Another application of S-W algorithm is its usage in building clusters of proteins from Swissprot and TrEMBL databases for specific taxonomic groups [Apweiler et al., 2001]. Also, the use of S-W algorithm is found to be advantageous for the pair-wise comparison of complete genomes [Martins et al., 2001] when compared to the other tools. Though the S-W algorithm has wide applications, its usage is constrained because of its computationally demanding nature. It is extremely slow to perform on a uniprocessor machine, as a scoring matrix of MxN size ('M' and 'N' are lengths of two sequences) has to be built by assigning value to each cell of the matrix dynamically, penalizing gap openings and extensions. Performing all-against-all searches using the S-W algorithm becomes impractical, as it has to calculate the matrix for every pair of sequences in the database. The all-against-all searches of Swissprot protein database using the S-W algorithm was found to be in the order of 1400 CPU days [Bolten et al., 2001].
To speed up database searching with S-W, several approaches such as using specialized hardware [Hughey, 1996; Compugen, Decypher, GeneMatcher], use of heuristic searches of S-W algorithm like ParAlign [Rognes, 2001] & SALSA [Rognes and Seeberg, 1998] and the use of parallel processing methods [Deshpande et al., 1991; Martins et al., 2001; Brutlag et al., 1993; Sturrock and Collings, 1993] have been tried. Utilizing specialized hardware for the S-W searches involves high cost [Brutlag, 1998], depending on which heuristic methods of the S-W algorithm have been implemented which thereby compromises on sensitivity [Rognes and Seeberg, 1998; Rognes, 2001]. An alternative method to speeding up database searches using the S-W algorithm without compromising on sensitivity is to implement the algorithm using parallel processing methods. Parallel SSEARCH is one such implementation of the S-W algorithm available from the University of Virginia. Such parallel implementation can be effectively used for studying the huge genome databases, by tuning and optimizing the codes on parallel clusters. The parallel clusters gained importance over conventional supercomputers, as they are preferred when one looks at the price to performance ratio. Keeping in mind the present challenges of performing similarity searches on large genomic sequence databases, the parallel FASTA and parallel SSEARCH programs were ported onto the PARAM 10000, a cluster of workstations designed and developed in-house. This paper deals with the optimization efforts of the above codes & reports the efforts to accelerate comparative genomics using various datasets, different hardware networks and message passing libraries on PARAM. The performance of the above codes in a distributed computing environment is also reported.
System
The performance of FASTA and SSEARCH programs was studied on PARAM, which is a distributed shared memory computing system, designed and developed in-house. The PARAM is a cluster of Sun Ultra e450 workstations perceived as an Open-Frame architecture with Solaris 2.6 as an operating system. It consists of 32 Symmetric MultiProcessor (SMP) compute nodes with each node having 4 Sun UltraSparc-II processors and 1 GB (Gigabytes) - 2 GB of physical memory. All nodes are NFS (Network File System) mounted and are connected through Fast Ethernet [Apon and Baker, 2001] and 8 nodes of the above are connected through Myrinet [Boden et al., 1995] and PARAMNet (an in-house product). PARAMNet is built around a high speed and low latency switched hub with Sbus/PCI based Network Interface Cards that uses C-DAC's Communication Co-Processor (CCP). Myrinet (1.28 Giga bits per second (Gbps)) and PARAMNet (400 Mega bits per second (Mbps)) are run through Active Messages (AM) [von Eicken et al., 1992] interface so as to obtain low latency and high bandwidth as compared to Fast Ethernet (100 Mbps). The AM library and KSHIPRA (a lightweight protocol that is used for providing an interface for layering Message passing Interface over AM), of which both were used for tuning FASTA and SSEARCH codes on PARAM, are part of the HPCC software version 1.2.7 [Mohan Ram et al., 1998] developed in-house.
Tuning of FASTA & SSEARCH on PARAM
FASTA package (version 3.3) was downloaded from the University of Virginia (ftp://ftp.virginia.edu/). The source codes for parallel FASTA and parallel SSEARCH with MPI implementation were available in the Fasta package. Without any changes in the source code, the two parallel codes were tuned to the PARAM system using the MPICH implementation [Gropp et al., 1996] of standard portable message-passing interface model (MPI version 1.1, http://www-unix.mcs.anl.gov/mpi/) developed by Argonne National Laboratory, Sun-MPI (http://www.sun.com/servers/hpc/software/, HPC ClusterTools 3.1) and AM-MPI libraries (http://www.cdacindia.com/html/hpcc.asp). Sun-MPI libraries are the optimized, thread-safe message passing libraries of Sun HPC ClusterTools 3.1 for executing on Sun Solaris operating systems. It uses a Remote Shared Memory (RSM) protocol for supporting shared memory over a remote connection to obtain a low-latency and a high-bandwidth for communications among processors within a node and also between processors across nodes. MPICH and Sun-MPI libraries were used to run the codes on a Fast Ethernet network whereas AM-MPI libraries were used to run on Myrinet and PARAMNet. The codes were also optimized using various compiler optimization flags so as to obtain better performance on PARAM.
The sequential codes of FASTA and Smith-Waterman were also ported on to the PARAM system so as to execute the codes in a distributed computing mode. Distributed computing involves integrating multiple computers which are remote from each other, each having a role in a computation problem or information processing, and makes use of Client/Server communication models. Disperse [Clifford and Mackey, 2000] was the tool used for task-distribution of serial codes among various clients with the network connection between the server and the clients using TCP/IP sockets. It uses two simple perl scripts, one used by the server for distributing the serial code and the databases (query and library) across all the processors, and the other used by all the clients so as to execute the code. Such parallel execution of serial codes in distributed computing mode could be beneficial where database-to-database searches are to be made.
Performance of FASTA and Smith-Waterman algorithms on PARAM was studied by two approaches, one by tuning the codes using message-passing libraries and the other by using low-level parallelism with a distributed computing approach. The Parallel FASTA and SSEARCH programs tuned using MPI libraries, are data parallel implementations with the master-worker approach in which one processor acts as master and the other processors as workers. The master processor reads the query, the database, and the number of processors (n), and splits the database into "n-1" parts, which is distributed among "n-1" processors so as to search the database in parallel. After searching the database, the processors send the calculated scores to the master, which further sorts the scores and displays the alignments. The speedup was calculated by dividing the elapsed time for a run using two processors (only a single worker, as the master does not perform a database search) with the elapsed time required by the 'n' processors ('n-1' workers). The scalability, which is the ability to yield good performance with an increasing number of processors, was studied by calculating speedup values for the above algorithms. In the case of the distributed computing approach the server also performs the database search, and the speedup was calculated as the time taken by a single processor divided by the time taken by 'n' processors.
Performance of Parallel SSEARCH
The performance of parallel SSEARCH (MPI Version) on PARAM was studied by taking three major aspects into consideration: the scalability with respect to varying query sequence length, varying database sizes, and varying networks on a different number of processors.
(i) Performance with queries of different lengths
With the advent of genome sequencing projects, nucleotide sequences ranging from a few kilobases to many million bases are available in the sequence databases. Searching such long query sequences against the databases is a tedious task. To understand the effect of query sequence size on the performance of parallel SSEARCH, a set of four nucleotide sequences of influenza virus having different lengths (AF255379, 400 nucleotides (nt); AF258520, 800 nt; AF222810, 1650 nt; A49702, 4023 nt) were searched against a viral database of 105 Megabytes (MB) having 100639748 nucleotides in 115056 sequences. The query and database sequences were downloaded from the EBI site (http://www.ebi.ac.uk/) and the ftp site of NCBI (ftp://ftp.ncbi.nih.gov/). Figure 1 shows the plots of speedups obtained while searching the above four sequences using parallel SSEARCH implemented using Sun-MPI libraries on Fast Ethernet. The speedups obtained for the queries of lengths 400, 800, 1650 and 4023 nucleotides on 64 processors were 24.15, 33.64, 37.62, and 45.01 respectively (Tab. 1). This study clearly illustrates that the long genome sequences could be effectively searched using the parallel S-W algorithm on PARAM.
| Table 1. | Execution time (secs) for searching 4 different query sequences of length 400, 800, 1650 and 4023 nucleotides against a viral database (105 MB) and the speedups obtained using parallel SSEARCH. |
| Number of processors |
400 nucleotides |
800 nucleotides |
1650 nucleotides |
4023 nucleotides |
||||
| Time (sec) | Speedup | Time (sec) | Speedup | Time (sec) | Speedup | Time (sec) | Speedup | |
| 2 (1+1) | 3430 | 1.00 | 8040 | 1.00 | 18360 | 1.00 | 47220 | 1.00 |
| 4 (1+3) | 1213 | 2.83 | 2715 | 2.96 | 6240 | 2.94 | 16440 | 2.87 |
| 8 (1+7) | 604 | 5.68 | 1329 | 6.05 | 3000 | 6.12 | 7831 | 6.03 |
| 16 (1+15) | 315 | 10.89 | 680 | 11.82 | 1519 | 12.09 | 3723 | 12.68 |
| 32 (1+31) | 215 | 15.95 | 404 | 19.90 | 892 | 20.58 | 2112 | 22.36 |
| 64 (1+63) | 142 | 24.15 | 239 | 33.64 | 488 | 37.62 | 1049 | 45.01 |
(ii) Performance with databases of different sizes
The genome sequence databases are rapidly increasing in size and it is becoming difficult to search huge databases on a uniprocessor machine. Mining large sequence databases becomes a major challenge in sequence similarity searches. In order to understand the performance of parallel SSEARCH to search databases of various sizes, a 414-nucleotide length query (AX036307; E. coli) was searched against two different databases of sizes 345.55 MB having 343,135,518 residues in 58,203 sequences (Invertebrate database) and 1.02 GB database having 733,150,580 residues in 19,59,668 sequences (EST database of mouse). This study was performed using Sun-MPI libraries on Fast Ethernet on 64 processors of PARAM. Speedups of 33.76 and 36.37 were obtained for searching the 345.55 MB and 1.02 GB databases respectively using 64 processors of PARAM (Tab. 2). Performance was seen to increase across 2 to 64 processors on Fast Ethernet for all the two-database searches including the large database of 1.02 GB (Fig. 2). This performance study also shows that the optimized codes implemented on PARAM can effectively be used to search large databases.
| Table 2. | Execution time (secs) for searching a query of 414 nucleotides (E. coli, AX036307) against 2 different databases of sizes 345.55 MB (Invertebrate database) & 1.02 GB (Mouse EST database) and the speedups obtained using parallel SSEARCH |
| Number of processors | Invertebrate database | Mouse EST database | ||
| Time (sec) | Speedup | Time (sec) | Speedup | |
| 2 (1+1) | 11986 | 1.00 | 29390 | 1.00 |
| 4 (1+3) | 4282 | 2.80 | 10206 | 2.88 |
| 8 (1+7) | 1929 | 6.21 | 4728 | 6.22 |
| 16 (1+15) | 988 | 12.13 | 2441 | 12.04 |
| 32 (1+31) | 539 | 22.24 | 1172 | 25.08 |
| 64 (1+63) | 355 | 33.76 | 808 | 36.37 |
(iii) Performance with different networks
A parallel cluster computer has an inter-processor communication network as the important component, where the latency and the bandwidth are the two crucial factors of a network. Huge database searches can be performed faster on a cluster using networks of higher bandwidth and lower latency as they overcome the communication overheads. The Fast Ethernet, Myrinet and PARAMNet are the three networks that are used for communication across the nodes of PARAM 10000.
To study the effect of various inter-processor communication networks while searching sequence databases on a cluster of workstations, a query of 414 nucleotides was searched against the Drosophila database (124.33 MB having 122655632 residues in 1170 sequences) using 2 to 32 processors. The database search was repeated with the programs running on the three different networks viz. Fast Ethernet, Myrinet and PARAMNet (Fig. 3). A linear speedup of nearly 3 fold was observed while searching on four processors (3 workers) using all three networks whereas linearity in speedup was not observed as the number of processors increased to more than 4 (Tab. 3). This is due to the better inter-processor communication within a single node, which is a SMP of 4 processors. As the number of processors increases above 4, the messages need to pass through the switch leading to communication overheads. It was observed that when the code was run on PARAMNet (AM-MPI) and Myrinet (AM-MPI), which used active messages for inter-processor communication, the query was searched in a shorter time (20%) as compared to Fast Ethernet. With 32 processors, the run time on Fast Ethernet using Sun-MPI was 10% shorter than using MPICH, due to the use of the RSM protocol by Sun-MPI. The use of optimized libraries like Sun-MPI can thus increase the performance of the codes on the clusters.
| Table 3. | Execution time (secs) for searching a query sequence of 414 nucleotides (Escherichia coli, AX036307) against a Drosophila database (124.33 MB) with parallel SSEARCH and the speedups obtained on three different inter-processor communication networks namely Fast Ethernet (MPICH and Sun-MPI), Myrinet - AM and PARAMNet - AM. |
| Number of processors | Fast Ethernet - MPICH |
Fast Ethernet - Sun MPI |
Myrinet - AM-MPI |
PARAMNet - AM-MPI |
||||
| Time (sec) | Speedup | Time (sec) | Speedup | Time (sec) | Speedup | Time (sec) | Speedup | |
| 2 (1+1) | 4237 | 1.00 | 4200 | 1.00 | 4212 | 1.00 | 3980 | 1.00 |
| 4 (1+3) | 1514 | 2.80 | 1455 | 2.89 | 1442 | 2.92 | 1360 | 2.93 |
| 8 (1+7) | 774 | 5.47 | 742 | 5.66 | 737 | 5.72 | 694 | 5.73 |
| 16 (1+15) | 359 | 11.80 | 350 | 12.00 | 330 | 12.76 | 314 | 12.68 |
| 32 (1+31) | 244 | 17.36 | 219 | 19.18 | 192 | 21.93 | 198 | 20.10 |
Performance of Parallel FASTA
As mentioned earlier, the FASTA program executes very quickly when a small query or database is chosen, but becomes compute intensive when searching a query of longer length against huge databases. To study the performance of parallel FASTA for such searches, a query sequence of length 1,48,849 nucleotides (AL158837) was searched against a human genome sequence dataset of size 1.12 GB (1,150,338,878 residues in 1,29,967 sequences). The parallel FASTA ported using Sun-MPI libraries was used to run on Fast Ethernet across 2 to 64 processors and a speedup of 44 fold was observed on 64 processors (Fig. 4). While searching a longer query sequence against a huge database using parallel FASTA, better speedup was observed than with smaller query lengths. Thus, parallel FASTA can be more effectively used when long genome sequences of human chromosomes, that is, those having more than 10 mega bases, need to be searched against large genome sequence databases.
Performance of SSEARCH and FASTA using a task-distribution system
All-against-all database searches of genome sequences play a major role in comparative genomics, where the basic structure and function of various genes among diverse group of organisms can be studied. But such database searches on a uniprocessor machine are impractical as discussed earlier. Hence, we studied the performance of SSEARCH and FASTA, which was implemented on PARAM using the distributed computing approach for such searches. Performance of SSEARCH was studied by searching a dataset of 2125 human heart EST sequences against another dataset of 7998 human heart and brain EST sequences. A speedup of almost 117 fold was observed on 128 processors of all the compute nodes of the PARAM cluster (Fig. 5). The performance of FASTA was also studied by searching a dataset of 7998 EST sequences of human brain and heart against itself using 128 processors of PARAM. Good scaling was seen across 1 to 128 processors and a speedup of 89 fold was gained on 128 processors (Fig. 5). Better speedups were obtained using SSEARCH (117-fold) rather than the FASTA algorithm (89-fold) for database comparisons.
|
Figure 5: Speedup curves for all-against-all database comparisons using Smith-Waterman and FASTA on 128 processors (distributed computing approach). |
The above study indicates that good speedups can be seen when algorithms are more compute intensive. It also shows that the use of TCP/IP sockets with client/server communication models in a loosely coupled system architecture can be more beneficial when parallel clusters and parallel implementation of the codes are not available. Scientists without much knowledge of parallel implementations can make use of all the available machines connected to a Local Area Network (LAN) to perform huge database searches using sequence analysis codes.
Although computing power is presently doubling every 18 months in agreement with Moore's law [Moore, 1965], the growth of biological sequence databases happens at an even faster pace, doubling every 15 months [Benson et al., 2002]. The execution time for the sequence similarity algorithms like FASTA and Smith-Waterman is proportional to the database size. And since the size of sequence databases is growing faster than processor speed, it is essential to take a different approach for accelerating database searches without waiting for the single processor speed to double.
Keeping in mind the current challenges of mining large sequence databases, the FASTA and Smith-Waterman programs were ported and optimized on a parallel cluster computer, PARAM 10000, developed in-house. Both parallel algorithms revealed good speedups on PARAM, illustrating significant gains in performance. Further, the use of networks such as Myrinet and PARAMNet enhanced the execution speed by more than 20% on 32 processors. The above codes demonstrated a 10% enhancement in the execution speed when the optimized Sun-MPI libraries were used instead of the public domain MPICH, both running on Fast Ethernet. An increase in query sequence and database size also led to a better scalability of the code, indicating a positive correlation of data size to performance. The approach of using the sequential version of FASTA and S-W algorithms and distributing the tasks among processors was found to be highly effective in utilizing the entire range of processors. The S-W algorithm, which is more compute intensive than FASTA, showed higher speedups than FASTA. This shows that a number of processors when loosely coupled can be effectively used for all-against-all database searches with algorithms like S-W. All-against-all database searches using this method proved to be more beneficial on a large number of processors as compared to the conventional approach.
Mining huge databases using the techniques of comparative genome sequence analysis has wide applications viz. building ortholog groups of genes in the KEGG database for constructing metabolic pathways, gene fusion studies, identifying pathogenic genes or the virulence factors etc. However, the task of mining information from huge databases has become highly laborious requiring new strategies. One approach is to use heuristic algorithms, but such algorithms are less sensitive compared to the explicit dynamic programming methods. Using specialized hardware could be another solution, but in many cases it is prohibitive due to the high cost of such hardware. A viable alternative may be to employ parallel computers for performing such sequence database searches. This approach seems to be more pragmatic considering the paradigm shift from conventional supercomputers to cost-effective clusters of workstations and PCs, making high performance computing more accessible to the scientific community. From this perspective, use of parallel clusters has significance in searching databases with accurate methods at a faster pace in order to derive useful information. The advent of genome-based drug discovery led to the possibility of drug target identification in silico. Screening large biological sequence databases to determine the function of a novel protein, which might act as a potential drug target, is one of the major challenges in the post-genomic era. Studying the 3-dimensional structure of a possible drug target protein using homology-modeling methods also involves sequence database searches to find a protein sequence with the known structure and having nearly 20% similarity to the possible target. Sequence similarity searches on large databases using parallel computers can play a major role in such intensive studies, wherein drug targets can be identified at a faster pace. The performance obtained on PARAM 10000 can have immense importance in such applications and is also useful for finding clusters of orthologous genes, detecting SNPs, alternate splice forms, regulatory elements etc. Such study can also have significant importance in comparative genomics where the whole genomes have to be aligned.