Efficient Data Access for Parallel BLAST

Authors:
Heshan Lin;Xiaosong Ma;Praveen Chandramohan;Al Geist;Nagiza Samatova
Affiliations:
North Carolina State University;North Carolina State University;Oak Ridge National Laboratory;Oak Ridge National Laboratory;Oak Ridge National Laboratory
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Year:
2005



Abstract

Searching biological sequence databases is one of the most routine tasks in computational biology. This task is significantly hampered by the exponential growth in sequence database sizes. Recent advances in parallelization of biological sequence search applications have enabled bioinformatics researchers to utilize high-performance computing platforms and, as a result, greatly reduce the execution time of their sequence database searches. However, existing parallel sequence search tools have focused mostly on parallelizing the sequence alignment engine. While the computation-intensive alignment tasks become cheaper with larger machines, the data-intensive initial preparation and result merging tasks become more expensive. Inefficient handling of input and output data can easily create performance bottlenecks even on supercomputers. It also causes considerable data management overhead. In this paper, we present a set of techniques for efficient and flexible data handling in parallel sequence search applications. We demonstrate our optimizations by improving mpiBLAST, an open-source parallel BLAST tool that is rapidly gaining popularity. These optimization techniques aim at enabling flexible database partitioning, reducing I/O by caching small auxiliary files and results, enabling parallel I/O on shared files, and using scalable result processing protocols. As a result, we reduce mpiBLAST users' operational overhead by removing the requirement of prepartitioning databases. Meanwhile, our experiments show that these techniques can bring an order-of-magnitude improvement to both the overall performance and the scalability of mpiBLAST.
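To make the idea of flexible partitioning without prepartitioned database files more concrete, here is a minimal Python sketch. It is not the mpiBLAST implementation: the function names `partition_ranges` and `read_partition` are hypothetical, and the sketch assumes a database can be treated as a flat byte range split evenly among workers. A real BLAST database would need to be split on sequence boundaries, typically with the help of its index files, which this sketch ignores.

```python
from typing import List, Tuple


def partition_ranges(total_bytes: int, num_workers: int) -> List[Tuple[int, int]]:
    """Split [0, total_bytes) into contiguous (offset, length) ranges, one per worker.

    Hypothetical helper: ranges are computed at run time, so the shared file
    never has to be physically prepartitioned for a given worker count.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    base, extra = divmod(total_bytes, num_workers)
    ranges = []
    offset = 0
    for rank in range(num_workers):
        # The first `extra` workers take one additional byte each.
        length = base + (1 if rank < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def read_partition(path: str, offset: int, length: int) -> bytes:
    """Read one worker's byte range from a shared file (assumed layout: flat bytes)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Because the ranges depend only on the file size and the number of workers, the same shared file can serve runs with different worker counts. In an MPI setting, each rank would pick the range at its own index and could read it with a collective MPI-IO call instead of `open`/`seek`.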