HPC platforms serve predominantly compute-intensive jobs well, but data-intensive jobs still struggle on them, because large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data reside on local disks and that computation is scheduled where the data are located. On an HPC machine, however, data must first be staged from a broader file system (such as Lustre) to HDFS, where they can be accessed; this staging can add a substantial delay to processing. In this paper we examine the effect of data compression on the bandwidth needed to get data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications: a 3D time-series caries lesion assessment focusing on a large-scale medical image dataset, and an HTRC word-counting task concerning large-scale text analysis, running on XSEDE resources. Our extensive experimental results demonstrate significant improvements in storage space, data stage-in time, and job execution time.