SDAFT: a novel scalable data access framework for parallel BLAST

Authors:
Jiangling Yin;Junyao Zhang;Jun Wang;Wu-chun Feng
Affiliations:
University of Central Florida, Orlando, Florida;University of Central Florida, Orlando, Florida;University of Central Florida, Orlando, Florida;Virginia Tech, Blacksburg, VA
Venue:
DISCS-2013 Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems
Year:
2013

Abstract

To run search tasks in a parallel and load-balanced fashion, existing parallel BLAST schemes such as mpiBLAST introduce an initial data preparation stage that moves database fragments from shared storage to local cluster nodes. Unfortunately, in today's big data era, rapidly growing sequence databases have become too large to move across the network. In this paper, we develop a Scalable Data Access Framework (SDAFT) to solve this problem. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. It consists of two interlocked components: 1) a data-centric load-balanced scheduler (DC-scheduler) that enforces data-process locality, and 2) a translation layer that translates conventional parallel I/O operations into HDFS I/O. In experiments with our SDAFT prototype, using real-world databases and queries on a wide variety of computing platforms, we found that SDAFT reduces I/O cost by a factor of 4 to 10 and doubles overall execution performance compared with existing schemes.
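
The abstract does not describe how the DC-scheduler works internally. As a rough illustration of what enforcing data-process locality while balancing load can look like, the sketch below greedily assigns each database fragment to the least-loaded node that already holds a replica of it. This is not the paper's algorithm: the function `assign_fragments`, the `replicas` mapping, and the node and fragment names are all hypothetical, and a simple greedy heuristic stands in for whatever policy SDAFT actually uses.

```python
def assign_fragments(replicas, nodes):
    """Greedy locality-aware assignment (illustrative only, not SDAFT's DC-scheduler).

    replicas: dict mapping fragment name -> list of nodes storing a copy
              (hypothetical input, e.g. as reported by a DFS).
    nodes:    list of worker node names.
    Returns a dict mapping node -> list of fragments it should search.
    """
    load = {n: 0 for n in nodes}
    plan = {n: [] for n in nodes}
    # Handle fragments with the fewest replica holders first, since they
    # have the least freedom in where they can run locally.
    for frag in sorted(replicas, key=lambda f: (len(replicas[f]), f)):
        local = [n for n in replicas[frag] if n in load]
        # Fall back to any worker (a remote read) if no worker holds a replica.
        candidates = local or nodes
        target = min(candidates, key=lambda n: (load[n], n))
        plan[target].append(frag)
        load[target] += 1
    return plan


if __name__ == "__main__":
    # Hypothetical replica layout for four fragments on three nodes.
    example = {"f0": ["a", "b"], "f1": ["a", "c"], "f2": ["b", "c"], "f3": ["a"]}
    print(assign_fragments(example, ["a", "b", "c"]))
```

In this sketch every fragment is searched on a node that stores it whenever such a node exists, so no fragment needs to be copied before the search starts. The load counter treats all fragments as equal cost, which is a simplifying assumption.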