Microparallelism and high-performance protein matching
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
FPL '02 Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications
A Systolic Array for the Sequence Alignment Problem
A Systolic Array for the Sequence Alignment Problem
Hyper customized processors for bio-sequence database scanning on FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Families of FPGA-based accelerators for approximate string matching
Microprocessors & Microsystems
FCCM '07 Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Streamware: programming general-purpose multicore processors using streams
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Applying SIMD approach to whole genome comparison on commodity hardware
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
BLAS Comparison on FPGA, CPU and GPU
ISVLSI '10 Proceedings of the 2010 IEEE Annual Symposium on VLSI
A highly parameterized and efficient FPGA-based skeleton for pairwise biological sequence alignment
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Hi-index | 0.00 |
With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well-known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S-W) algorithm as an example, and demonstrate approaches that fully exploit its performance potentials on CPU, GPU, and field-programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data and multithread, with compiler options, to gain over 70 × speedups over naive CPU versions on quad-core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50 × speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12 × speedups, instead of 100 × reported by some studies, over Intel quadcore CPU Q9400, under the same manufacturing technology and both with fully optimized schemes. In addition, for FPGA platforms, we customize a linear systolic array for the S-W algorithm in a 45-nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. Under only 133 MHz clock rate, the FPGA platform reaches the highest performance and becomes the most power-efficient platform, using only 25 W compared with 190 W of the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.