We study the problem of finding the k most frequent items in a stream of items under the recently proposed max-frequency measure. The max-frequency of an item is counted over a sliding window whose length changes dynamically, based on the properties of that item. Besides being parameterless, this way of measuring the support of items was shown to detect bursts in a stream faster, especially if the set of items is heterogeneous. The algorithm that was proposed for maintaining all frequent items, however, scales poorly when the number of items becomes large. Therefore, in this paper we propose to mine only the top-k most frequent items instead of reporting all frequent items. First we prove that solving this problem exactly still requires a prohibitive amount of memory (at least linear in the number of items). Yet, under some reasonable conditions, we show both theoretically and empirically that a memory-efficient algorithm exists. A prototype of this algorithm is implemented, and we present its performance with respect to memory efficiency on real-life data and in controlled experiments with synthetic data.
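To make the measure concrete, here is a minimal brute-force sketch. It assumes that the max-frequency of an item is the highest relative frequency the item reaches over any window ending at the current position of the stream, which is how the flexible-window measure is usually stated; the abstract itself does not spell out the definition. The function names `max_frequencies` and `top_k` and the optional `min_len` parameter (a minimal window length) are hypothetical and chosen for illustration. This is not the memory-efficient algorithm of the paper: it stores the whole stream and one counter per item, which is exactly the linear memory cost the abstract describes as prohibitive.

```python
from collections import Counter
import heapq


def max_frequencies(stream, min_len=1):
    """Return {item: max-frequency} for the current end of `stream`.

    Assumed definition: the maximum, over all windows that end at the
    last element and have length >= min_len, of the item's relative
    frequency inside that window.
    """
    counts = Counter()  # occurrences inside the window grown so far
    best = {}           # best relative frequency seen per item
    # Grow the window backwards from the most recent element.
    for length, item in enumerate(reversed(stream), start=1):
        counts[item] += 1
        if length < min_len:
            continue
        for seen, count in counts.items():
            freq = count / length
            if freq > best.get(seen, 0.0):
                best[seen] = freq
    return best


def top_k(stream, k, min_len=1):
    """Return the k items with the highest max-frequency, ties broken by item."""
    best = max_frequencies(stream, min_len)
    return heapq.nsmallest(k, best.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    stream = list("aabab")
    print(max_frequencies(stream))
    print(top_k(stream, k=1))
```

With `min_len=1` the most recent item always reaches a max-frequency of 1, because the window of length one contains only that item; a larger `min_len` removes this degenerate case.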