Exploiting MapReduce and data compression for data-intensive applications

  • Authors:
  • Guangchen Ruan;Hui Zhang;Beth Plale

  • Affiliations:
  • Indiana University;Indiana University;Indiana University

  • Venue:
  • Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
  • Year:
  • 2013

Abstract

HPC platforms work well for predominantly compute-intensive jobs. Data-intensive jobs, however, still struggle on them, because large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data reside on local disks and that computation is scheduled where the data are located. On an HPC machine, however, data must first be staged from a broader file system (such as Lustre) to HDFS, where it can be accessed; this staging can add a substantial delay to processing. In this paper we look at the effect of data compression on reducing the bandwidth needed to get data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications: a 3D time-series caries lesion assessment on a large-scale medical image dataset, and an HTRC word-counting task for large-scale text analysis, running on XSEDE resources. Our extensive experimental results demonstrate significant performance improvements in storage space, data stage-in time, and job execution time.
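
As a rough illustration of the trade-off the abstract describes, the following Python sketch compresses text chunks before "staging" them, estimates the transfer time for raw and compressed data, and then runs a word count in a map/reduce style directly over the compressed chunks. It is not the authors' implementation: the sample documents, the gzip codec, the `ASSUMED_BANDWIDTH` value, and all function names are assumptions made for this example, and the simple size-over-bandwidth estimate ignores compression and decompression cost.

```python
import gzip
from collections import Counter


def map_word_counts(compressed_chunk: bytes) -> Counter:
    """Map step: decompress one gzip chunk and count its words."""
    text = gzip.decompress(compressed_chunk).decode("utf-8")
    return Counter(text.lower().split())


def reduce_word_counts(partials) -> Counter:
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def transfer_seconds(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Idealized stage-in time: data size divided by link bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical input: two small, highly repetitive text "documents".
documents = [
    "the quick brown fox jumps over the lazy dog " * 200,
    "the dog barks and the fox runs " * 200,
]
raw_chunks = [doc.encode("utf-8") for doc in documents]
compressed_chunks = [gzip.compress(chunk) for chunk in raw_chunks]

raw_size = sum(len(chunk) for chunk in raw_chunks)
compressed_size = sum(len(chunk) for chunk in compressed_chunks)

# Assumed link bandwidth in bytes per second (illustrative only).
ASSUMED_BANDWIDTH = 100e6
raw_stage_in = transfer_seconds(raw_size, ASSUMED_BANDWIDTH)
compressed_stage_in = transfer_seconds(compressed_size, ASSUMED_BANDWIDTH)

# Word count computed from the compressed chunks only.
totals = reduce_word_counts(map_word_counts(c) for c in compressed_chunks)

if __name__ == "__main__":
    print(f"raw bytes: {raw_size}, compressed bytes: {compressed_size}")
    print(f"estimated stage-in: {raw_stage_in:.6f}s raw, "
          f"{compressed_stage_in:.6f}s compressed")
    print(totals.most_common(3))
```

In practice the benefit depends on how compressible the data are and on the CPU cost of decompression relative to the bandwidth saved, which is the kind of trade-off the paper's experiments measure.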