Finding top-k elements in data streams

Authors:
Nuno Homem;Joao Paulo Carvalho
Affiliations:
TULisbon - Instituto Superior Técnico, INESC-ID, R. Alves Redol 9, 1000-029 Lisboa, Portugal;TULisbon - Instituto Superior Técnico, INESC-ID, R. Alves Redol 9, 1000-029 Lisboa, Portugal
Venue:
Information Sciences: an International Journal
Year:
2010

Abstract

Identifying the most frequent elements in a data stream is a well-known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. Fast algorithms with a small memory footprint are essential when the number of individuals is very large. In many situations, such analysis must be performed and kept up to date in near real time. Fortunately, approximate answers are usually adequate for this problem. This paper presents a new algorithm that addresses the problem by merging the commonly used counter-based and sketch-based techniques for top-k identification. The algorithm provides the top-k list of elements, their frequencies, and an error estimate for each frequency value. It also provides strong guarantees on the error estimate, the order of elements, and the inclusion of elements in the list depending on their real frequency. Additionally, the algorithm provides stochastic bounds on the error and expected error estimates. Telecommunications customer behavior and voice call data are used to present concrete results obtained with this algorithm and to illustrate improvements over previously existing algorithms.
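
The abstract does not give the algorithm's details, so the following is only a minimal sketch of how a counter-based list and a sketch-style hashed filter might be combined to report top-k elements with a frequency and an error estimate each. It should not be read as the authors' exact method. The class name FilteredTopK, its parameters k and filter_cells, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib


class FilteredTopK:
    """Illustrative hybrid of a monitored-counter list and a hashed filter.

    Assumed design, not taken from the paper: the filter keeps, per hash
    cell, an upper bound on the hits of any unmonitored element in that cell.
    """

    def __init__(self, k, filter_cells):
        self.k = k                        # maximum number of monitored elements
        self.cells = filter_cells         # number of cells in the hashed filter
        self.alpha = [0] * filter_cells   # per-cell bound for unmonitored elements
        self.monitored = {}               # element -> [estimate, max_error]

    def _cell(self, item):
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        return zlib.crc32(repr(item).encode("utf-8")) % self.cells

    def add(self, item):
        if item in self.monitored:
            self.monitored[item][0] += 1
            return
        cell = self._cell(item)
        prior = self.alpha[cell]          # bound on hits seen while unmonitored
        if len(self.monitored) < self.k:
            self.monitored[item] = [prior + 1, prior]
            return
        victim = min(self.monitored, key=lambda e: self.monitored[e][0])
        if prior + 1 >= self.monitored[victim][0]:
            # The newcomer may be at least as frequent as the weakest entry:
            # evict that entry and remember its estimate in the filter.
            victim_count = self.monitored.pop(victim)[0]
            vcell = self._cell(victim)
            self.alpha[vcell] = max(self.alpha[vcell], victim_count)
            self.monitored[item] = [prior + 1, prior]
        else:
            # Too rare to displace anything: only the filter cell is updated.
            self.alpha[cell] += 1

    def top(self, n):
        """Return up to n (element, estimate, max_error) tuples, largest first."""
        ranked = sorted(self.monitored.items(), key=lambda kv: -kv[1][0])
        return [(e, c, err) for e, (c, err) in ranked[:n]]


stream = ["a"] * 50 + ["b"] * 30 + [f"n{i % 7}" for i in range(20)]
summary = FilteredTopK(k=4, filter_cells=16)
for element in stream:
    summary.add(element)
print(summary.top(2))  # [('a', 50, 0), ('b', 30, 0)]
```

In this sketch each reported estimate never undercounts, and the estimate minus its error never overcounts, so the true frequency lies between the two. The filter lets rare elements be absorbed without disturbing the monitored list, which is the intuition behind combining the two families of techniques.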