Finding global icebergs over distributed data sets

Authors:
Qi (George) Zhao;Mitsunori Ogihara;Haixun Wang;Jun (Jim) Xu
Affiliations:
Georgia Tech;Univ. of Rochester;IBM T.J. Watson Research Center;Georgia Tech
Venue:
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2006

Citing (19):

Summary cache: a scalable wide-area web cache sharing protocol
IEEE/ACM Transactions on Networking (TON)

Space/time trade-offs in hash coding with allowable errors
Communications of the ACM

Computing Iceberg Queries Efficiently
VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases

Finding Frequent Items in Data Streams
ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming

Frequency Estimation of Internet Packet Streams with Limited Space
ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms

A simple algorithm for finding frequent elements in streams and bags
ACM Transactions on Database Systems (TODS)

What's hot and what's not: tracking most frequent items dynamically
Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Distributed top-k monitoring
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Spectral bloom filters
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Chain: operator scheduling for memory minimization in data stream systems
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Adaptive filters for continuous queries over distributed data streams
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

A Sampling-Based Estimator for Top-k Query
ICDE '02 Proceedings of the 18th International Conference on Data Engineering

Space-code bloom filter for efficient traffic flow measurement
Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement

Data streaming algorithms for efficient and accurate estimation of flow size distribution
Proceedings of the joint international conference on Measurement and modeling of computer systems

Finding (Recently) Frequent Items in Distributed Data Streams
ICDE '05 Proceedings of the 21st International Conference on Data Engineering

Approximate counts and quantiles over sliding windows
PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

An improved data stream summary: the count-min sketch and its applications
Journal of Algorithms

Approximate frequency counts over data streams
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Distributed set-expression cardinality estimation
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Cited by (7):

Computing Frequent Elements Using Gossip
SIROCCO '08 Proceedings of the 15th international colloquium on Structural Information and Communication Complexity

Identifying frequent items in a network using gossip
Journal of Parallel and Distributed Computing

Distributed threshold querying of general functions by a difference of monotonic representation
Proceedings of the VLDB Endowment

Uncovering Global Icebergs in Distributed Streams: Results and Implications
Journal of Network and Systems Management

Distributed frequent items detection on uncertain data
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I

Building wavelet histograms on large data in MapReduce
Proceedings of the VLDB Endowment

Lower bounds for number-in-hand multiparty communication complexity, made easy
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms

Abstract

Finding icebergs (items whose frequency of occurrence is above a certain threshold) is an important problem with a wide range of applications. Most existing work focuses on iceberg queries at a single node. In many real-life applications, however, data sets are distributed across a large number of nodes. Two naïve approaches might be considered. In the first, each node ships its entire data set to a central server, which uses single-node algorithms to find icebergs. This approach may incur prohibitive communication overhead. In the second, each node submits its local icebergs, and the central server combines them to find global icebergs. This approach may fail because, in many important applications, globally frequent items may not be frequent at any single node. In this work, we propose two novel schemes that provide accurate and efficient solutions to this problem: a sampling-based scheme and a counting-sketch-based scheme. In particular, the latter scheme incurs a communication cost at least an order of magnitude smaller than the naïve scheme of shipping all data, yet is able to achieve very high accuracy. Through rigorous theoretical and experimental analysis, we establish the statistical properties of our proposed algorithms, including their accuracy bounds.
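
The failure mode of the second naïve approach can be seen in a small sketch. This is only an illustration of the problem statement, not either of the schemes proposed in the paper. The node contents, the item names, the threshold of 200, and the choice to use the same threshold locally and globally are all assumptions made for the example.

```python
from collections import Counter

def global_icebergs(nodes, threshold):
    """First naive approach: ship every local count, sum, then apply the threshold."""
    total = Counter()
    for counts in nodes:
        total.update(counts)  # Counter.update adds counts item by item
    return {item for item, c in total.items() if c >= threshold}

def merged_local_icebergs(nodes, local_threshold, threshold):
    """Second naive approach: a node reports only items at or above local_threshold."""
    total = Counter()
    for counts in nodes:
        for item, c in counts.items():
            if c >= local_threshold:
                total[item] += c
    return {item for item, c in total.items() if c >= threshold}

# Hypothetical data: "spread" occurs 30 times at each of 10 nodes (300 overall),
# while "hot" occurs 250 times at node 0 only.
nodes = [Counter({"spread": 30, f"noise{i}": 5}) for i in range(10)]
nodes[0]["hot"] = 250

THRESHOLD = 200  # assumed threshold, used both locally and globally
exact = global_icebergs(nodes, THRESHOLD)
naive = merged_local_icebergs(nodes, THRESHOLD, THRESHOLD)
missed = exact - naive

print("exact: ", sorted(exact))
print("naive: ", sorted(naive))
print("missed:", sorted(missed))
```

Here "spread" exceeds the global threshold but never reaches it at any single node, so merging local icebergs misses it. Lowering the local threshold would recover it, at the price of shipping many more counts.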