Optimized union of non-disjoint distributed data sets

Authors:
Itay Dar;Tova Milo;Elad Verbin
Affiliations:
Tel Aviv University;Tel Aviv University;Tel Aviv University
Venue:
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Year:
2009

Abstract

In a variety of applications, ranging from data integration to distributed query evaluation, there is a need to obtain sets of data items from several sources (peers) and compute their union. As these sets often contain common data items, avoiding the transmission of redundant information is essential for effective union computation. In this paper we define the notion of optimal union plans for non-disjoint data sets residing on distinct peers, and present efficient algorithms for computing and executing such optimal plans. Our algorithms avoid redundant data transmission and make optimal use of the available network bandwidth. A challenge in the design of optimal plans is the lack of a complete map of the distribution of the data items among peers. We analyze the information required for optimal planning and propose novel techniques to obtain a compact, cheap-to-communicate description of the data sources. We then exploit this description for efficient union computation with reasonable accuracy. Experiments show that our approach outperforms the common naive union computation, improving performance by an order of magnitude.
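
To make the idea of avoiding redundant transmission concrete, here is a minimal sketch, not the paper's algorithm. It assumes a Bloom filter as the compact description of what has already been collected; the abstract does not say which summary structure the authors use. The names `BloomFilter`, `naive_union`, and `filtered_union` are hypothetical, peers are visited in a fixed order, and the cost of shipping the filter itself is ignored.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))


def naive_union(peers):
    """Every peer ships its whole set. Returns (union, number of items sent)."""
    union, sent = set(), 0
    for items in peers:
        sent += len(items)
        union |= items
    return union, sent


def filtered_union(peers, num_bits=1024, num_hashes=3):
    """Each peer ships only the items that the shared filter has not seen.

    A false positive in the filter can cause a genuinely new item to be
    skipped, so the result may be a subset of the exact union.
    """
    seen = BloomFilter(num_bits, num_hashes)
    union, sent = set(), 0
    for items in peers:
        for item in items:
            if item not in seen:
                seen.add(item)
                union.add(item)
                sent += 1
    return union, sent


if __name__ == "__main__":
    peers = [{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}]
    exact, naive_sent = naive_union(peers)
    approx, filtered_sent = filtered_union(peers)
    print("naive items sent:", naive_sent)
    print("filtered items sent:", filtered_sent)
    print("items missed by the filtered plan:", len(exact - approx))
```

The filtered plan never sends an item twice, at the price of possibly missing items when the filter is too small. That trade-off is in the same spirit as the "reasonable accuracy" the abstract mentions, although the paper's own techniques and plans may differ substantially.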