Rapid parallel genome indexing with MapReduce

Authors:
Rohith K. Menon; Goutham P. Bhat; Michael C. Schatz
Affiliations:
Stony Brook University, Stony Brook, NY, USA; Stony Brook University, Stony Brook, NY, USA; Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
Venue:
Proceedings of the second international workshop on MapReduce and its applications
Year:
2011

Abstract

Sequence alignment is one of the most important applications in computational biology, and is used for such diverse tasks as identifying homologous proteins, analyzing gene expression, mapping variations between individuals, and assembling the genome of an organism de novo. Except for trivial cases involving just a small number of short sequences, virtually all sequence alignment tasks rely on a precomputed index of the sequence to accelerate the alignment. Two of the most important index structures are the suffix array, which consists of the lexicographically sorted list of suffixes of a genome, and the closely related Burrows-Wheeler Transform (BWT), which is a permutation of the genome based on the suffix array. Constructing these structures on large sequences, such as the human genome, requires several hours of serial computation and must be performed for each genome, or genome assembly, to be analyzed. Here we present a novel parallel algorithm for constructing the suffix array and the BWT of a sequence that leverages the unique features of the MapReduce parallel programming model. We demonstrate the performance of the algorithm by greatly accelerating suffix array and BWT construction on five significant genomes using as many as 120 cores leased from the Amazon Elastic Compute Cloud (EC2), reducing the end-to-end runtime from hours to mere minutes. The source code is available under an open-source GPL license at: http://code.google.com/p/genome-indexing/
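
To make the two index structures concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the `$` end-of-sequence sentinel, and the `prefix_len` parameter are assumptions introduced for illustration. The first two functions follow the definitions given in the abstract directly. The third only imitates a map/shuffle/reduce flow in a single process by bucketing suffixes on a short prefix and sorting each bucket independently; the abstract does not describe how the paper partitions the work, so this should be read as one plausible decomposition rather than the published algorithm.

```python
from collections import defaultdict


def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by direct comparison of suffixes; fine for a
    demonstration, far too slow and memory-hungry for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT: the character that precedes each suffix in sorted order.

    For the suffix starting at position 0, text[-1] wraps around to the
    final character (the sentinel in the example below).
    """
    return "".join(text[i - 1] for i in sa)


def bucketed_suffix_array(text, prefix_len=1):
    """Suffix array built in a map/shuffle/reduce style (hypothetical).

    "Map": key every suffix position by its first prefix_len characters.
    "Shuffle": group positions that share a key.
    "Reduce": sort each group on its own; groups could run in parallel.
    Concatenating the groups in key order gives the full suffix array.
    """
    buckets = defaultdict(list)
    for i in range(len(text)):
        buckets[text[i:i + prefix_len]].append(i)

    sa = []
    for key in sorted(buckets):
        sa.extend(sorted(buckets[key], key=lambda i: text[i:]))
    return sa


if __name__ == "__main__":
    genome = "banana$"  # "$" is an assumed sentinel that sorts before letters
    sa = suffix_array(genome)
    print(sa)                                  # [6, 5, 3, 1, 0, 4, 2]
    print(bwt_from_suffix_array(genome, sa))   # annb$aa
    print(bucketed_suffix_array(genome, 2) == sa)  # True
```

Bucketing by prefix works because any two suffixes with different prefixes are already ordered by those prefixes, so each bucket can be sorted without looking at the others. In practice the bucket sizes would need balancing, since repetitive genomes make some prefixes far more common than others.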