On Finding Similar Items in a Stream of Transactions

Authors:
Andrea Campagna;Rasmus Pagh
Affiliations:
-;-
Venue:
ICDMW '10 Proceedings of the 2010 IEEE International Conference on Data Mining Workshops
Year:
2010

Citing 0
Cited 1

Improved counter based algorithms for frequent pairs mining in transactional data streams

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Quantified Score

Hi-index	0.01

Abstract

While there has been a lot of work on finding frequent item sets in transaction data streams, none of it solves the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent item sets: any algorithm that (even only approximately and with a chance of error) finds the most frequent k-item set must use space Ω(min{mb, n^k, (mb/φ)^k}) bits, where mb is the number of items in the stream so far, n is the number of distinct items, and φ is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions arrive in random order, and show that, surprisingly, not only is small-space similarity mining possible for the most common similarity measures, but the mining accuracy also improves with the length of the stream for any fixed support threshold.
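
To make "similar pairs according to standard similarity measures" concrete, the sketch below scores item pairs in a small transaction stream using two common measures, Jaccard and cosine similarity. It is an exact, naive baseline that stores a counter for every co-occurring pair. It is not the small-space algorithm of the paper; the function name `pair_similarities` and the sample stream are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pair_similarities(transactions):
    """Naive exact baseline (hypothetical helper, not the paper's method).

    Counts the support of every item and of every co-occurring pair,
    then scores each pair with Jaccard and cosine similarity.
    """
    item_support = Counter()
    pair_support = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each pair has one canonical key (a, b), a < b.
        items = sorted(set(transaction))
        item_support.update(items)
        pair_support.update(combinations(items, 2))

    scores = {}
    for (a, b), both in pair_support.items():
        sa, sb = item_support[a], item_support[b]
        scores[(a, b)] = {
            # |A and B| / |A or B|, over the transactions containing each item
            "jaccard": both / (sa + sb - both),
            # |A and B| / sqrt(|A| * |B|)
            "cosine": both / sqrt(sa * sb),
        }
    return scores


# Illustrative stream of four transactions (made-up data).
stream = [["milk", "bread"], ["milk", "bread", "eggs"], ["eggs"], ["milk"]]
scores = pair_similarities(stream)
```

Because this baseline keeps one counter per distinct co-occurring pair, its memory grows with the number of such pairs, which is the kind of cost that motivates looking for small-space alternatives.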