MapReducing a genomic sequencing workflow

  • Authors: Luca Pireddu; Simone Leo; Gianluigi Zanetti

  • Affiliations: CRS4, Pula, Italy (all authors)

  • Venue: Proceedings of the second international workshop on MapReduce and its applications
  • Year: 2011

Abstract

Modern DNA sequencing machines have opened the floodgates of whole-genome data, and current processing techniques are being washed away. A medium-sized sequencing laboratory can produce 4-5 TB of data per week that needs to be post-processed. Unfortunately, such laboratories often still rely on ad hoc scripts and shared storage volumes to handle the data, which leads to poor scalability and reliability problems. We present a MapReduce workflow that harnesses Hadoop to post-process the data produced by deep sequencing machines. The workflow takes the output of the sequencing machines, performs short read mapping with a novel parallel version of the popular BWA aligner, and removes duplicate reads; these steps make up two thirds of the entire processing workflow. Our experiments show that it provides a scalable solution with significantly improved throughput over its predecessor. Thanks to the robust platform that Hadoop provides, it also greatly reduces the amount of operator attention needed to run the analyses. The workflow is going into production use at the CRS4 Sequencing and Genotyping Platform.
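
To illustrate how duplicate-read removal can be phrased as a MapReduce job, here is a minimal in-memory sketch in Python. It is not the authors' implementation. The `Read` record, its `quality` field, the choice of (chromosome, position, strand) as the grouping key, and the "keep the highest-quality read" rule are all assumptions made for this example, and `run_job` merely stands in for the shuffle that a framework such as Hadoop would perform.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical aligned-read record (not the paper's data model)."""
    name: str
    chrom: str
    pos: int
    strand: str
    quality: int  # assumed summary quality score used to rank duplicates


Key = Tuple[str, int, str]


def map_read(read: Read) -> Iterator[Tuple[Key, Read]]:
    """Map step: key each aligned read by its mapping coordinates."""
    yield (read.chrom, read.pos, read.strand), read


def reduce_duplicates(key: Key, reads: Iterable[Read]) -> Iterator[Read]:
    """Reduce step: keep one read per key, here the highest-quality one."""
    yield max(reads, key=lambda r: r.quality)


def run_job(reads: Iterable[Read]) -> List[Read]:
    """In-memory stand-in for the shuffle and sort between map and reduce."""
    groups: Dict[Key, List[Read]] = defaultdict(list)
    for read in reads:
        for key, value in map_read(read):
            groups[key].append(value)
    survivors: List[Read] = []
    for key in sorted(groups):
        survivors.extend(reduce_duplicates(key, groups[key]))
    return survivors


if __name__ == "__main__":
    sample = [
        Read("r1", "chr1", 100, "+", 30),
        Read("r2", "chr1", 100, "+", 40),  # same coordinates as r1
        Read("r3", "chr1", 100, "-", 20),
        Read("r4", "chr2", 5, "+", 10),
    ]
    for read in run_job(sample):
        print(read.name)
```

Because reads that map to the same coordinates share a key, they all reach the same reducer, so each group of duplicates can be resolved independently and in parallel.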