Summary cache: a scalable wide-area web cache sharing protocol
IEEE/ACM Transactions on Networking (TON)
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
Computing Iceberg Queries Efficiently
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Finding Frequent Items in Data Streams
ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Frequency Estimation of Internet Packet Streams with Limited Space
ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
A simple algorithm for finding frequent elements in streams and bags
ACM Transactions on Database Systems (TODS)
What's hot and what's not: tracking most frequent items dynamically
Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Chain: operator scheduling for memory minimization in data stream systems
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive filters for continuous queries over distributed data streams
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Sampling-Based Estimator for Top-k Query
ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Space-code bloom filter for efficient traffic flow measurement
Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement
Data streaming algorithms for efficient and accurate estimation of flow size distribution
Proceedings of the joint international conference on Measurement and modeling of computer systems
Finding (Recently) Frequent Items in Distributed Data Streams
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Approximate counts and quantiles over sliding windows
PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
An improved data stream summary: the count-min sketch and its applications
Journal of Algorithms
Approximate frequency counts over data streams
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Distributed set-expression cardinality estimation
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Computing Frequent Elements Using Gossip
SIROCCO '08 Proceedings of the 15th international colloquium on Structural Information and Communication Complexity
Identifying frequent items in a network using gossip
Journal of Parallel and Distributed Computing
Distributed threshold querying of general functions by a difference of monotonic representation
Proceedings of the VLDB Endowment
Uncovering Global Icebergs in Distributed Streams: Results and Implications
Journal of Network and Systems Management
Distributed frequent items detection on uncertain data
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Building wavelet histograms on large data in MapReduce
Proceedings of the VLDB Endowment
Lower bounds for number-in-hand multiparty communication complexity, made easy
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
Hi-index | 0.00 |
Finding icebergs–items whose frequency of occurrence is above a certain threshold–is an important problem with a wide range of applications. Most of the existing work focuses on iceberg queries at a single node. However, in many real-life applications, data sets are distributed across a large number of nodes. Two naïve approaches might be considered. In the first, each node ships its entire data set to a central server, and the central server uses single-node algorithms to find icebergs. But it may incur prohibitive communication overhead. In the second, each node submits local icebergs, and the central server combines local icebergs to find global icebergs. But it may fail because in many important applications, globally frequent items may not be frequent at any node. In this work, we propose two novel schemes that provide accurate and efficient solutions to this problem: a sampling-based scheme and a counting-sketch-based scheme. In particular, the latter scheme incurs a communication cost at least an order of magnitude smaller than the naïve scheme of shipping all data, yet is able to achieve very high accuracy. Through rigorous theoretical and experimental analysis we establish the statistical properties of our proposed algorithms, including their accuracy bounds.