Finding (Recently) Frequent Items in Distributed Data Streams

Authors:
Amit Manjhi;Vladislav Shkapenyuk;Kedar Dhamdhere;Christopher Olston
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University
Venue:
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Year:
2005

Abstract

We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naïve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naïve approaches while providing the same error guarantees on answers.
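
The precision-gradient idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes a two-level hierarchy and exact counting at the leaves, and the names `leaf_summary`, `merge_summaries`, `eps_leaf`, and `eps_root` are hypothetical. The only property it aims to show is that each summary undercounts an item by at most its error tolerance times the number of items seen, and that letting the tolerance grow from leaves to root allows summaries to be pruned as they move up the hierarchy.

```python
from collections import Counter


def leaf_summary(stream, eps):
    """Summarize one stream (assumed sketch, not the paper's method).

    Every count is reduced by eps * n and non-positive counts are dropped,
    so each estimate undercounts the true frequency by at most eps * n.
    """
    n = len(stream)
    cut = eps * n
    counts = Counter(stream)
    return {x: c - cut for x, c in counts.items() if c - cut > 0}, n


def merge_summaries(children, eps_child, eps_parent):
    """Combine child summaries and coarsen them from eps_child to eps_parent.

    children is a list of (summary, n) pairs. After merging, each estimate
    undercounts by at most eps_parent * total_n.
    """
    total_n = sum(n for _, n in children)
    merged = Counter()
    for summary, _ in children:
        merged.update(summary)  # adds the children's estimates item by item
    cut = (eps_parent - eps_child) * total_n
    return {x: c - cut for x, c in merged.items() if c - cut > 0}, total_n


if __name__ == "__main__":
    # Hypothetical input: two leaf nodes, each observing its own stream.
    eps_leaf, eps_root = 0.05, 0.1  # tolerance increases toward the root
    s1 = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
    s2 = ["a"] * 50 + ["d"] * 40 + ["c"] * 10
    leaves = [leaf_summary(s1, eps_leaf), leaf_summary(s2, eps_leaf)]
    root, n = merge_summaries(leaves, eps_leaf, eps_root)
    print(n, sorted(root.items()))
```

In this sketch the leaves keep tighter tolerances than the root, so infrequent items can be discarded at the merge step while the root still meets its overall error bound. How to choose the per-level tolerances is the optimization problem the abstract refers to and is not addressed here.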