MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bioinformatics
Bioinformatics
The Sequence Alignment/Map format and SAMtools
Bioinformatics
Future Generation Computer Systems
A hierarchical framework for cross-domain MapReduce execution
Proceedings of the second international workshop on Emerging computational methods for the life sciences
Proceedings of the second international workshop on Emerging computational methods for the life sciences
Exploring MapReduce efficiency with highly-distributed data
Proceedings of the second international workshop on MapReduce and its applications
MapReducing a genomic sequencing workflow
Proceedings of the second international workshop on MapReduce and its applications
Bioinformatics
Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery
Pilot-MapReduce: an extensible and flexible MapReduce implementation for distributed data
Proceedings of third international workshop on MapReduce and its Applications Date
Hi-index | 0.00 |
Although localization of Next-Generation Sequencing (NGS) data is suitable for many analysis and usage scenarios, it is not universally desirable, nor possible. However most solutions "impose" the localization of data as a pre-condition for NGS analytics. We analyze several existing tools and techniques that use MapReduce programming model for NGS data analysis to determine their effectiveness and extensibility to support distributed data scenarios. We find limitations at multiple levels. To overcome these limitations, we developed a Pilot-based MapReduce (PMR) -- which is a novel implementation of MapReduce using a Pilot task and data management implementation. PMR provides an effective means by which a variety of new or existing methods for NGS and downstream analysis can be carried out whilst providing efficiency and scalability across multiple clusters. Pilot-MapReduce (PMR) circumvents the use of global reduce and yet provides an effective, scalable and distributed solution for MapReduce programming model. We compare and contrast the PMR approach to similar capabilities of Seqal and Crossbow, two other tools which are based on conventional Hadoop-based MapReduce for NGS reads alignment and duplicate read removal or SNP finding, respectively. We find that PMR is a viable tool to support distributed NGS analytics, particularly providing a framework that supports parallelism at multiple levels.