Understanding MapReduce-based next-generation sequencing alignment on distributed cyberinfrastructure

  • Authors:
  • Pradeep Kumar Mantha;Nayong Kim;Andre Luckow;Joohyun Kim;Shantenu Jha

  • Affiliations:
  • Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Rutgers University, Piscataway, NJ, USA

  • Venue:
  • Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences
  • Year:
  • 2012

Abstract

Although localization of Next-Generation Sequencing (NGS) data is suitable for many analysis and usage scenarios, it is neither universally desirable nor always possible. However, most solutions impose localization of data as a precondition for NGS analytics. We analyze several existing tools and techniques that use the MapReduce programming model for NGS data analysis to determine their effectiveness and extensibility in supporting distributed data scenarios, and we find limitations at multiple levels. To overcome these limitations, we developed Pilot-based MapReduce (PMR), a novel implementation of MapReduce built on Pilot task and data management. PMR provides an effective means by which a variety of new or existing methods for NGS and downstream analysis can be carried out efficiently and scalably across multiple clusters. PMR circumvents the use of a global reduce and yet provides an effective, scalable and distributed solution for the MapReduce programming model. We compare and contrast the PMR approach with similar capabilities of Seqal and Crossbow, two other tools based on conventional Hadoop MapReduce, which perform NGS read alignment followed by duplicate read removal and SNP finding, respectively. We find that PMR is a viable tool for distributed NGS analytics, particularly because it provides a framework that supports parallelism at multiple levels.
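
The idea of avoiding a global reduce over distributed data can be illustrated with a minimal sketch. It is not the PMR implementation or its API: the function names, the toy records of the form (read ID, chromosome), and the site names `cluster_a` and `cluster_b` are all hypothetical, and threads stand in for separate clusters. The sketch assumes one possible reading of the abstract, in which each site runs map and reduce on its own local data and only the small per-site results are moved and merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_read(read):
    """Map step: emit (chromosome, 1) for one toy alignment record."""
    _read_id, chromosome = read
    return chromosome, 1


def local_map_reduce(reads):
    """Run map and reduce on a single site's local data only."""
    counts = Counter()
    for key, value in map(map_read, reads):
        counts[key] += value
    return counts


def merge_partials(partials):
    """Combine the small per-site results; no raw reads are moved."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_distributed(sites):
    """Process every site concurrently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        partials = list(pool.map(local_map_reduce, sites.values()))
    return merge_partials(partials)


# Hypothetical data: each key stands for a cluster holding its own reads.
SITES = {
    "cluster_a": [("r1", "chr1"), ("r2", "chr2"), ("r3", "chr1")],
    "cluster_b": [("r4", "chr2"), ("r5", "chr2")],
}

if __name__ == "__main__":
    print(dict(run_distributed(SITES)))
```

In this sketch the only data crossing site boundaries are the per-site counters, which is the property that makes the approach attractive when the input reads cannot or should not be localized.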