Exploiting MapReduce and data compression for data-intensive applications

Authors:
Guangchen Ruan;Hui Zhang;Beth Plale
Affiliations:
Indiana University;Indiana University;Indiana University
Venue:
Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
Year:
2013

Abstract

HPC platforms have proven successful for predominantly compute-intensive jobs. Data-intensive jobs, however, still struggle on them, because large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data reside on local disks and that computation is scheduled where the data are located. On an HPC machine, however, data must first be staged from a broader file system (such as Lustre) to HDFS, where it can be accessed; this staging can add a substantial delay to processing. In this paper we examine the effect of data compression on the bandwidth needed to get data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications: a 3D time-series caries lesion assessment over a large-scale medical image dataset, and an HTRC word-counting task for large-scale text analysis running on XSEDE resources. Our extensive experimental results demonstrate significant improvements in storage space, data stage-in time, and job execution time.
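The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample corpus, the 100 MB/s link bandwidth, the choice of gzip, and the helper names (`compress_text`, `stage_in_seconds`, `word_count_from_gzip`) are all assumptions made for illustration. The sketch compresses a text corpus, compares an idealized stage-in time (size divided by bandwidth) for the raw and compressed forms, and then runs a word count directly on the compressed stream.

```python
import gzip
import io
from collections import Counter


def compress_text(text: str) -> bytes:
    """Gzip-compress UTF-8 text in memory."""
    return gzip.compress(text.encode("utf-8"))


def stage_in_seconds(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: size divided by link bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


def word_count_from_gzip(blob: bytes) -> Counter:
    """Count whitespace-separated words by streaming the compressed data."""
    counts = Counter()
    with gzip.open(io.BytesIO(blob), mode="rt", encoding="utf-8") as stream:
        for line in stream:
            counts.update(line.split())
    return counts


# Hypothetical sample corpus; highly repetitive text compresses very well,
# so real datasets will usually show a smaller gain.
sample = "the quick brown fox jumps over the lazy dog\n" * 10_000
raw = sample.encode("utf-8")
packed = compress_text(sample)

# Assumed link bandwidth of 100 MB/s, for illustration only.
bandwidth = 100e6
raw_time = stage_in_seconds(len(raw), bandwidth)
packed_time = stage_in_seconds(len(packed), bandwidth)
counts = word_count_from_gzip(packed)

print(f"raw bytes: {len(raw)}, compressed bytes: {len(packed)}")
print(f"idealized stage-in: {raw_time:.6f}s raw vs {packed_time:.6f}s compressed")
print(f"count of 'the': {counts['the']}")
```

The sketch ignores the CPU cost of compressing and decompressing, which is the other side of the compute-versus-I/O trade-off; whether compression pays off in practice depends on the data, the codec, and the cluster.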