Modern DNA sequencing machines have opened the floodgates of whole-genome data, and current processing techniques are being washed away. A medium-sized sequencing laboratory can produce 4-5 TB of data per week that need to be post-processed. Unfortunately, such laboratories often still rely on ad hoc scripts and shared storage volumes to handle the data, resulting in poor scalability and reliability problems. We present a MapReduce workflow that harnesses Hadoop to post-process the data produced by deep sequencing machines. The workflow takes the output of the sequencing machines, performs short read mapping with a novel parallel version of the popular BWA aligner, and removes duplicate reads---together, two thirds of the entire processing workflow. Our experiments show that it provides a scalable solution with significantly improved throughput over its predecessor. Thanks to the robust platform that Hadoop provides, it also greatly reduces the amount of operator attention needed to run the analyses. The workflow is going into production use at the CRS4 Sequencing and Genotyping Platform.
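The duplicate-removal step described above fits MapReduce naturally: the mapper keys each aligned read by its mapping coordinates, the shuffle groups reads mapped to the same locus, and the reducer keeps one representative per group. The following is a minimal stand-alone sketch of that idea in plain Python, not the paper's actual Hadoop/BWA implementation; the record layout, field names, and the highest-quality-wins tie-break rule are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical aligned-read records: (chrom, pos, strand, quality_sum, read_id)
reads = [
    ("chr1", 100, "+", 35, "r1"),
    ("chr1", 100, "+", 40, "r2"),  # duplicate of r1: same coordinates and strand
    ("chr1", 250, "-", 30, "r3"),
]

def map_phase(read):
    chrom, pos, strand, qual, rid = read
    # Key on alignment coordinates: reads mapped to the same position
    # and strand are treated as duplicates of one another.
    return ((chrom, pos, strand), (qual, rid))

def reduce_phase(values):
    # Keep the read with the highest summed base quality; drop the rest.
    best_qual, best_id = max(values)
    return best_id

# Simulate the MapReduce shuffle: sort mapper output and group by key.
pairs = sorted((map_phase(r) for r in reads), key=itemgetter(0))
kept = [reduce_phase([v for _, v in grp])
        for _, grp in groupby(pairs, key=itemgetter(0))]
# kept == ["r2", "r3"]
```

In a real Hadoop job the sort-and-group step is performed by the framework between the map and reduce phases, so only the two phase functions need to be written.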