Compression-aware I/O performance analysis for big data clustering

Authors:
Zhenghua Xue;Geng Shen;Jianhui Li;Qian Xu;Yang Zhang;Jing Shao
Affiliations:
Computer Network Information Center, Chinese Academy of Sciences;Computer Network Information Center, Chinese Academy of Sciences;Computer Network Information Center, Chinese Academy of Sciences;Baidu.com Inc., China;Computer Network Information Center, Chinese Academy of Sciences;Computer Network Information Center, Chinese Academy of Sciences
Venue:
Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
Year:
2012

Abstract

As data volumes grow, the I/O bottleneck has become a major challenge for data analysis, and data compression can alleviate it effectively. Taking the K-means algorithm as an example, this paper proposes a compression-aware performance improvement model for big-data clustering. The model quantitatively analyzes the effect of a variety of compression-related factors over the entire computational process. We perform clustering experiments on 10-dimensional data of up to 1.114 TB on a cluster computer with hundreds of computing cores. The measurements show that compression contributes significantly to improving I/O performance and empirically confirm our theoretical analysis. Furthermore, the proposed model can effectively determine when and how to use compression to improve I/O performance for big-data analysis.
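
The abstract does not give the model's equations, so the following is only a minimal sketch of the kind of trade-off such a model might capture, not the authors' formulation. It assumes that an iterative algorithm such as K-means rereads the full dataset on every iteration, that reading and decompression do not overlap, and that compute time is the same with and without compression, so it cancels out. All function and parameter names here are hypothetical.

```python
def uncompressed_io_time(size, bandwidth, iterations):
    """Total read time when every iteration rereads the raw data.

    size: dataset size in bytes; bandwidth: read rate in bytes/second.
    """
    return iterations * size / bandwidth


def compressed_io_time(size, ratio, bandwidth, decompress_rate,
                       iterations, compress_rate=None):
    """Total read-plus-decompress time with compressed input.

    ratio: compressed size / original size (assumed, 0 < ratio <= 1).
    decompress_rate: bytes of output produced per second (assumed).
    compress_rate: if given, adds a one-time compression cost.
    """
    # Each pass reads the smaller file, then decompresses it fully.
    per_pass = size * ratio / bandwidth + size / decompress_rate
    one_time = size / compress_rate if compress_rate else 0.0
    return one_time + iterations * per_pass


def compression_speedup(size, ratio, bandwidth, decompress_rate,
                        iterations, compress_rate=None):
    """Return the I/O speedup factor; values above 1 favor compression."""
    raw = uncompressed_io_time(size, bandwidth, iterations)
    packed = compressed_io_time(size, ratio, bandwidth, decompress_rate,
                                iterations, compress_rate)
    return raw / packed
```

Under these assumptions, compression pays off only when the read time saved per pass exceeds the decompression time, and a one-time compression cost is amortized over the iterations. A real system might also overlap decompression with I/O, which this sketch ignores.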