A multi-object operation incurs communication or synchronization overhead when the requested objects are distributed over different nodes. The object pair correlations (the probability that a pair of objects is requested together in an operation) are often highly skewed yet stable over time in real-world distributed applications. Thus, placing strongly correlated objects on the same node (subject to node space constraints) tends to reduce communication overhead for multi-object operations. This paper studies the optimization of correlation-aware data placement. First, we formalize a restricted form of the problem as a variant of the classic Quadratic Assignment Problem and show that it is NP-hard. Based on a linear programming relaxation, we then propose a polynomial-time approximation algorithm that finds an object placement whose communication overhead is at most twice that of the optimal placement. We further show that the computation cost can be reduced by limiting the optimization scope to a relatively small number of the most important objects. We quantitatively evaluate our approach on keyword index placement for full-text search engines using real traces of 3.7 million web pages and 6.8 million search queries. Compared with correlation-oblivious random object placement, our approach achieves a 37–86% reduction in communication overhead across a range of optimization scopes and system sizes. Compared with a correlation-aware greedy approach, the reduction is 30–78%.
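The objective described in the abstract can be illustrated with a small sketch. It is not the paper's LP-based approximation algorithm, and the greedy heuristic shown is only a simple illustrative one, not necessarily the greedy baseline used in the evaluation. The function names, the toy correlation values, and the assumption that every object occupies one unit of node space are all hypothetical. The cost of a placement is taken to be the total correlation of the object pairs that end up on different nodes.

```python
def communication_cost(placement, correlations):
    """Total correlation of object pairs that are split across nodes.

    placement:    dict mapping object -> node index
    correlations: dict mapping (obj_a, obj_b) -> probability that the
                  pair is requested together in one operation
    """
    return sum(p for (a, b), p in correlations.items()
               if placement[a] != placement[b])


def greedy_placement(objects, correlations, num_nodes, capacity):
    """Illustrative greedy heuristic (an assumption, not the paper's method).

    Visits pairs in decreasing order of correlation and co-locates them
    when the node still has room. Every object is assumed to take one
    unit of space; capacity is the number of objects a node can hold.
    """
    placement = {}
    load = [0] * num_nodes

    def put(obj, node):
        placement[obj] = node
        load[node] += 1

    def emptiest():
        return min(range(num_nodes), key=lambda n: load[n])

    for (a, b), _ in sorted(correlations.items(), key=lambda kv: -kv[1]):
        if a not in placement and b not in placement:
            node = emptiest()
            if capacity - load[node] >= 2:
                put(a, node)
                put(b, node)
        elif a in placement and b not in placement:
            if load[placement[a]] < capacity:
                put(b, placement[a])
        elif b in placement and a not in placement:
            if load[placement[b]] < capacity:
                put(a, placement[b])

    # Place whatever is left on the least loaded node.
    for obj in objects:
        if obj not in placement:
            node = emptiest()
            if load[node] >= capacity:
                raise ValueError("not enough total capacity")
            put(obj, node)
    return placement


# Toy data (hypothetical): four objects, two nodes holding two objects each.
objects = ["a", "b", "c", "d"]
correlations = {("a", "b"): 0.5, ("c", "d"): 0.3,
                ("a", "c"): 0.1, ("b", "d"): 0.1}

# Correlation-oblivious baseline: round-robin by position in the list.
oblivious = {obj: i % 2 for i, obj in enumerate(objects)}
aware = greedy_placement(objects, correlations, num_nodes=2, capacity=2)

oblivious_cost = communication_cost(oblivious, correlations)
aware_cost = communication_cost(aware, correlations)
print(oblivious_cost, aware_cost)
```

On this toy input the correlation-aware placement keeps the two strongest pairs together, so only the weakly correlated pairs cross node boundaries. Real workloads would also need per-object sizes and a much larger object set, which is where restricting the optimization scope to the most important objects becomes relevant.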