Massive genomic data processing and deep analysis

Authors:
Abhishek Roy;Yanlei Diao;Evan Mauceli;Yiping Shen;Bai-Lin Wu
Affiliations:
University of Massachusetts, Amherst;University of Massachusetts, Amherst;Harvard Medical School & Children's Hospital Boston;Harvard Medical School & Children's Hospital Boston;Harvard Medical School & Children's Hospital Boston
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

Today, large sequencing centers produce genomic data at a rate of 10 terabytes a day and require complicated processing to transform massive amounts of noisy raw data into biological information. To address these needs, we develop a system for end-to-end processing of genomic data, including alignment of short read sequences, variation discovery, and deep analysis. We also employ a range of quality control mechanisms to improve data quality and parallel processing techniques to improve performance. In the demo, we will use real genomic data to show the details of data transformation through the workflow, the usefulness of the end results (ready for use as testable hypotheses), the effects of our quality control mechanisms and improved algorithms, and finally the performance improvement.