Modern DNA sequencing machines have opened the floodgates of whole-genome data, and current processing techniques are being washed away. A medium-sized sequencing laboratory can produce 4-5 TB of data per week that need to be post-processed. Unfortunately, such laboratories often still rely on ad hoc scripts and shared storage volumes to handle the data, resulting in poor scalability and reliability problems. We present a MapReduce workflow that harnesses Hadoop to post-process the data produced by deep sequencing machines. The workflow takes the output of the sequencing machines, performs short read mapping with a novel parallel version of the popular BWA aligner, and removes duplicate reads---together, two thirds of the entire processing workflow. Our experiments show that it provides a scalable solution with significantly improved throughput over its predecessor. Thanks to the robust platform that Hadoop provides, it also greatly reduces the amount of operator attention needed to run the analyses. The workflow is going into production use at the CRS4 Sequencing and Genotyping Platform.
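The duplicate-removal step described above fits MapReduce naturally: the mapper keys each aligned read by its mapping coordinates, the shuffle groups reads mapped to the same locus, and the reducer keeps one representative per group. The following is a minimal stand-alone sketch of that idea in plain Python, not the paper's actual Hadoop/BWA implementation; the record layout, field names, and the highest-quality-wins tie-break rule are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical aligned-read records: (chrom, pos, strand, quality_sum, read_id)
reads = [
    ("chr1", 100, "+", 35, "r1"),
    ("chr1", 100, "+", 40, "r2"),  # duplicate of r1: same coordinates and strand
    ("chr1", 250, "-", 30, "r3"),
]

def map_phase(read):
    chrom, pos, strand, qual, rid = read
    # Key on alignment coordinates: reads mapped to the same position
    # and strand are treated as duplicates of one another.
    return ((chrom, pos, strand), (qual, rid))

def reduce_phase(values):
    # Keep the read with the highest summed base quality; drop the rest.
    best_qual, best_id = max(values)
    return best_id

# Simulate the MapReduce shuffle: sort mapper output and group by key.
pairs = sorted((map_phase(r) for r in reads), key=itemgetter(0))
kept = [reduce_phase([v for _, v in grp])
        for _, grp in groupby(pairs, key=itemgetter(0))]
# kept == ["r2", "r3"]
```

In a real Hadoop job the sort-and-group step is performed by the framework between the map and reduce phases, so only the two phase functions need to be written.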