The proliferation of data-parallel programming on large clusters has opened a new research avenue: supporting the many types of data-intensive applications efficiently. Across these research efforts, we observe that a nontrivial amount of redundant I/O occurs in the execution of data-intensive applications. Even the locality-aware scheduling policy of a MapReduce framework is ineffective in cluster environments where storage nodes cannot provide a computation service. In this paper, we introduce SplitCache, which improves the performance of data-intensive OLAP-style applications by reducing redundant I/O in a MapReduce framework. The key strategy is to eliminate the I/O redundancy of reading common input data across applications: SplitCache caches the first input stream on the computing nodes and reuses it for future requests. Running the TPC-H benchmark, we achieved 65.5% faster execution and an 87% reduction in network traffic on average.
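The caching idea described above can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: a compute node keeps previously read input splits in a local cache and serves repeated reads from it instead of re-fetching them from remote storage nodes. All names (`SplitCache`, `fetch_remote`) are hypothetical.

```python
# Hypothetical sketch of split caching on a compute node: the first read of
# an input split goes to remote storage; later reads of the same split are
# served from the local cache, avoiding redundant network I/O.

class SplitCache:
    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: (path, offset, length) -> bytes
        self._cache = {}                   # (path, offset, length) -> bytes
        self.hits = 0
        self.misses = 0

    def read_split(self, path, offset, length):
        key = (path, offset, length)
        if key in self._cache:
            self.hits += 1                 # reuse: no remote read needed
            return self._cache[key]
        self.misses += 1
        data = self._fetch_remote(path, offset, length)  # first read: remote fetch
        self._cache[key] = data            # keep for later jobs on this node
        return data


# Two "jobs" scanning the same split: only the first triggers a remote read.
remote_reads = []

def fetch_remote(path, offset, length):
    remote_reads.append((path, offset, length))
    return b"x" * length

cache = SplitCache(fetch_remote)
cache.read_split("/tpch/lineitem", 0, 64)  # miss -> remote read
cache.read_split("/tpch/lineitem", 0, 64)  # hit  -> served locally
print(cache.hits, cache.misses, len(remote_reads))  # 1 1 1
```

In a real deployment the cache would be bounded and shared across tasks; this sketch only shows why caching common input data cuts both execution time and network traffic when applications scan the same tables, as in the TPC-H workload.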