Compression-aware I/O performance analysis for big data clustering

  • Authors:
  • Zhenghua Xue;Geng Shen;Jianhui Li;Qian Xu;Yang Zhang;Jing Shao

  • Affiliations:
  • Computer Network Information Center, Chinese Academy of Sciences;Computer Network Information Center, Chinese Academy of Sciences;Computer Network Information Center, Chinese Academy of Sciences;Baidu.com Inc., China;Computer Network Information Center, Chinese Academy of Sciences;Computer Network Information Center, Chinese Academy of Sciences

  • Venue:
  • Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
  • Year:
  • 2012

Abstract

As data volumes grow, the I/O bottleneck has become a major challenge for data analysis, and data compression can alleviate it effectively. Taking the K-means algorithm as an example, this paper proposes a compression-aware performance improvement model for big-data clustering. The model quantitatively analyzes how a variety of compression-related factors affect performance over the entire computation. We perform clustering experiments on 10-dimensional data of up to 1.114 TB on a cluster computer with hundreds of computing cores. The measurements show that compression contributes significantly to improving I/O performance, and they confirm our theoretical analysis empirically. Furthermore, the proposed model can effectively determine when and how to use compression to improve I/O performance for big-data analysis.
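
The abstract does not give the model's equations, so the following is only a minimal sketch of the kind of trade-off such a model has to capture, not the authors' formulation. It assumes a simplified single-pass read in which decompression does not overlap with I/O. The function names and the parameters `io_bandwidth`, `ratio`, and `decompress_rate` are hypothetical and chosen for illustration.

```python
def read_time_uncompressed(size_bytes, io_bandwidth):
    """Seconds to read the raw data at io_bandwidth bytes/second."""
    return size_bytes / io_bandwidth


def read_time_compressed(size_bytes, io_bandwidth, ratio, decompress_rate):
    """Seconds to read and decompress the data (assumed model, no overlap).

    ratio           -- uncompressed size / compressed size (> 1 means smaller on disk)
    decompress_rate -- bytes of uncompressed output produced per second
    """
    io_time = size_bytes / (ratio * io_bandwidth)   # fewer bytes cross the I/O path
    cpu_time = size_bytes / decompress_rate         # extra CPU work to decompress
    return io_time + cpu_time


def compression_helps(size_bytes, io_bandwidth, ratio, decompress_rate):
    """True when the compressed read is faster under the assumptions above."""
    compressed = read_time_compressed(size_bytes, io_bandwidth, ratio, decompress_rate)
    return compressed < read_time_uncompressed(size_bytes, io_bandwidth)


def break_even_decompress_rate(io_bandwidth, ratio):
    """Decompression rate above which compression pays off (requires ratio > 1)."""
    return io_bandwidth * ratio / (ratio - 1)
```

Under this simplified model, compression pays off only when the decompression rate exceeds `io_bandwidth * ratio / (ratio - 1)`. An iterative algorithm such as K-means rereads its input on every pass, so the read-side saving would accrue once per iteration, while the one-time cost of compressing the data would be amortized across iterations.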