As data volumes grow, the I/O bottleneck has become a major challenge for data analysis, and data compression can alleviate it effectively. Taking the K-means algorithm as an example, this paper proposes a compression-aware performance improvement model for big-data clustering. The model quantitatively analyzes the effect of a variety of compression-related factors over the entire computation. We perform clustering experiments on 10-dimensional data of up to 1.114 TB on a cluster computer with hundreds of computing cores. The measurements show that compression contributes significantly to improving I/O performance and empirically confirm our theoretical analysis. Furthermore, the proposed model can effectively determine when and how to use compression to improve I/O performance for big-data analysis.
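To illustrate the kind of trade-off such a model weighs, here is a minimal sketch of a break-even check. It is not the model proposed in the paper, whose details the abstract does not give. It assumes that reading and decompression run one after the other with no overlap, and that the compression ratio (original size divided by compressed size), I/O bandwidth, and decompression rate are known constants. All function and parameter names are hypothetical.

```python
def read_time_uncompressed(size_bytes, io_bandwidth):
    """Seconds to read the raw data at io_bandwidth bytes per second."""
    return size_bytes / io_bandwidth


def read_time_compressed(size_bytes, io_bandwidth, ratio, decompress_rate):
    """Seconds to read the compressed data and then decompress it.

    Assumptions (not from the paper): ratio = original / compressed size,
    decompress_rate is in bytes of decompressed output per second, and
    I/O and decompression do not overlap.
    """
    io_seconds = (size_bytes / ratio) / io_bandwidth
    cpu_seconds = size_bytes / decompress_rate
    return io_seconds + cpu_seconds


def compression_helps(size_bytes, io_bandwidth, ratio, decompress_rate):
    """True when the compressed path is faster under the assumptions above."""
    return (read_time_compressed(size_bytes, io_bandwidth, ratio, decompress_rate)
            < read_time_uncompressed(size_bytes, io_bandwidth))


if __name__ == "__main__":
    size = 1e9          # 1 GB of input data (illustrative value)
    bandwidth = 100e6   # 100 MB/s storage bandwidth (illustrative value)
    for rate in (500e6, 100e6):
        print(rate, compression_helps(size, bandwidth, 4.0, rate))
```

Under these assumptions, compression pays off only when the decompression rate exceeds `io_bandwidth * ratio / (ratio - 1)`. A slow decompressor can therefore make the compressed path slower than reading the raw data, which is why the question of when to compress needs a quantitative answer.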