Understanding MapReduce-based next-generation sequencing alignment on distributed cyberinfrastructure

Authors:
Pradeep Kumar Mantha;Nayong Kim;Andre Luckow;Joohyun Kim;Shantenu Jha
Affiliations:
Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Rutgers University, Piscataway, NJ, USA
Venue:
Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences
Year:
2012

Abstract

Although localization of Next-Generation Sequencing (NGS) data is suitable for many analysis and usage scenarios, it is neither universally desirable nor always possible. However, most solutions impose the localization of data as a precondition for NGS analytics. We analyze several existing tools and techniques that use the MapReduce programming model for NGS data analysis to determine their effectiveness and extensibility in supporting distributed data scenarios. We find limitations at multiple levels. To overcome these limitations, we developed Pilot-based MapReduce (PMR), a novel implementation of MapReduce built on a Pilot task and data management implementation. PMR provides an effective means of carrying out a variety of new or existing methods for NGS and downstream analysis, while providing efficiency and scalability across multiple clusters. PMR circumvents the use of a global reduce and yet provides an effective, scalable and distributed solution for the MapReduce programming model. We compare and contrast the PMR approach with similar capabilities of Seqal and Crossbow, two other tools based on conventional Hadoop MapReduce, which perform NGS read alignment followed by duplicate read removal or SNP finding, respectively. We find that PMR is a viable tool for distributed NGS analytics, in particular because it provides a framework that supports parallelism at multiple levels.
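
One way to picture the idea of avoiding a global reduce is the toy sketch below. It is not PMR code and does not use any Pilot or Hadoop API. It assumes reads are already partitioned by site, runs the map and reduce steps entirely within each site, and merges only the small per-site summaries at the end. The reference string, the exact-match "alignment", and the names `map_read`, `local_mapreduce` and `run_distributed` are all hypothetical and chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy reference sequence (assumption, not from the paper).
REFERENCE = "ACGTACGTTAGC"


def map_read(read):
    """Map step: toy exact-match 'alignment' of one read.

    Returns the first matching position in REFERENCE, or None if unaligned.
    """
    pos = REFERENCE.find(read)
    return pos if pos >= 0 else None


def local_mapreduce(reads):
    """Run map and reduce within a single site.

    The reduce step counts aligned reads per reference position, so only a
    compact summary leaves the site, never the raw reads.
    """
    counts = Counter()
    for read in reads:
        pos = map_read(read)
        if pos is not None:
            counts[pos] += 1
    return counts


def run_distributed(site_to_reads):
    """Process each site independently, then merge the per-site summaries."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_mapreduce, site_to_reads.values()))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged


# Hypothetical input: reads that already reside on two different clusters.
sites = {
    "cluster_a": ["ACGT", "TAGC", "GGGG"],
    "cluster_b": ["ACGT", "CGTT"],
}
result = run_distributed(sites)
```

Here the threads merely stand in for separate clusters. The point of the sketch is the data flow: intermediate key-value pairs are never shuffled between sites, and only the reduced per-site output is combined.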