ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

Authors:
Christopher Oehmen;Jarek Nieplocha
Affiliations:
IEEE;IEEE Computer Society
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2006

Abstract

Genes in an organism's DNA (its genome) encode information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify the protein components (the proteome) being actively used under a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. However, the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and scales linearly to thousands of processors on both distributed memory and shared memory architectures. This represents a substantial improvement over the current state of the art in high-performance sequence alignment in both scaling and portability. To achieve high performance and scalability, ScalaBLAST relies on a collection of techniques: distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.
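Two of the techniques named in the abstract, distributing the target database over available memory and hiding latency through prefetching, can be illustrated with a small sketch. This is not the ScalaBLAST implementation: the function names (`partition`, `fetch`, `score`, `search`) are hypothetical, a background thread stands in for a remote memory access, and the scoring function is a toy shared 3-mer count rather than the BLAST heuristic.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, num_nodes):
    """Split the database into contiguous, near-equal shards, one per node."""
    base, extra = divmod(len(database), num_nodes)
    shards, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        shards.append(database[start:start + size])
        start += size
    return shards


def fetch(shards, node):
    """Assumed stand-in for fetching the shard held in another node's memory."""
    return shards[node]


def kmers(seq, k=3):
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score(query, target):
    """Toy similarity: number of distinct 3-mers shared (not the BLAST score)."""
    return len(kmers(query) & kmers(target))


def search(query, shards):
    """Scan every shard, requesting shard i+1 while shard i is being scored."""
    best = (0, None)
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, shards, 0)
        for node in range(len(shards)):
            current = pending.result()          # wait for the prefetched shard
            if node + 1 < len(shards):          # start the next fetch early
                pending = pool.submit(fetch, shards, node + 1)
            for target in current:              # compute overlaps the fetch
                s = score(query, target)
                if s > best[0]:
                    best = (s, target)
    return best


database = ["ACGTACGT", "TTTTGGGG", "ACGTTTTT", "GGGGCCCC", "ACGTACGA"]
shards = partition(database, num_nodes=2)
best_score, best_target = search("ACGTACGT", shards)
```

The point of the double-buffered loop is that the request for the next shard is issued before the current shard is scored, so transfer time can overlap with computation. In a real distributed setting the fetch would be a network operation, and the benefit depends on how transfer time compares with compute time per shard.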