Sketching for big data recommender systems using fast pseudo-random fingerprints

Authors:
Yoram Bachrach; Ely Porat
Affiliations:
Microsoft Research, Cambridge, UK; Bar-Ilan University, Ramat-Gan, Israel
Venue:
ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part II
Year:
2013

References:

GroupLens: an open architecture for collaborative filtering of netnews. CSCW '94 Proceedings of the 1994 ACM conference on Computer supported cooperative work
The space complexity of approximating the frequency moments. Journal of Computer and System Sciences
Min-wise independent permutations. Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
A small approximately min-wise independent family of hash functions. Journal of Algorithms
Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th international conference on World Wide Web
Estimating Rarity and Similarity over Data Stream Windows. ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS)
On the Resemblance and Containment of Documents. SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. VLDB '05 Proceedings of the 31st international conference on Very large data bases
Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM)
Data Streams: Models and Algorithms (Advances in Database Systems)
Google news personalization: scalable online collaborative filtering. Proceedings of the 16th international conference on World Wide Web
Less hashing, same performance: building a better bloom filter. ESA'06 Proceedings of the 14th conference on Annual European Symposium - Volume 14
Range-Efficient Counting of Distinct Elements in a Massive Data Stream. SIAM Journal on Computing
Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems. SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Sketching techniques for collaborative filtering. IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
A survey of collaborative filtering techniques. Advances in Artificial Intelligence
b-Bit minwise hashing. Proceedings of the 19th international conference on World wide web
An optimal algorithm for the distinct elements problem. Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On the k-independence required by linear probing and minwise independence. ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming
Fingerprinting ratings for collaborative filtering: theoretical and empirical analysis. SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Fast moment estimation in data streams in optimal space. Proceedings of the forty-third annual ACM symposium on Theory of computing
Fast locality-sensitive hashing. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Exponential time improvement for min-wise based algorithms. Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms


Abstract

A key building block of collaborative filtering recommender systems is finding users with similar consumption patterns. Given the full data on the items consumed by each user, one can directly compute the similarity between any two users. For massive recommender systems, however, this naive approach has a high running time and may be intractable in terms of the space needed to store the full data. One way to overcome this is sketching, a technique that represents massive datasets concisely while still allowing properties of these datasets to be computed. Sketching methods maintain very short fingerprints of users' item sets, from which the similarity between the item sets of different users can be approximated. The state-of-the-art sketch [22] has a very low space complexity, and a recent technique [14] shows how to exponentially speed up the computation involved in building the fingerprints. Unfortunately, these methods are incompatible, forcing a choice between a low running time and a small sketch size. We propose an alternative sketching approach that achieves both a low space complexity similar to that of [22] and a low time complexity similar to that of [14]. We empirically evaluate our algorithm on the Netflix dataset, analyzing the running time and sketch size of our approach and comparing them to alternatives. Further, we show that in practice the accuracy achieved by our approach is better than the accuracy guaranteed by the theoretical bounds, so even shorter fingerprints suffice to obtain high-quality results.
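
To make the fingerprint idea concrete, the following is a minimal sketch of a generic min-hash fingerprint for estimating the Jaccard similarity between two users' item sets. It is not the algorithm proposed in the paper, nor the methods of [22] or [14]; it only illustrates how a short fingerprint can stand in for a full item set. The function names, the linear hash family, the modulus, and the toy item sets are all assumptions made for this illustration.

```python
import random

# Assumption: item ids are non-negative integers smaller than PRIME.
PRIME = (1 << 61) - 1  # a Mersenne prime used as the hash modulus


def make_hashes(k, seed=0):
    """Draw k random linear hash functions h(x) = (a*x + b) mod PRIME.

    Linear hashes are only approximately min-wise independent; they are
    used here for simplicity, not because the paper prescribes them.
    """
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def fingerprint(items, hashes):
    """Return the min-hash fingerprint of a non-empty item set.

    The fingerprint has one entry per hash function: the smallest hash
    value over all items in the set.
    """
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]


def estimate_jaccard(fp1, fp2):
    """Estimate Jaccard similarity as the fraction of matching entries."""
    matches = sum(1 for u, v in zip(fp1, fp2) if u == v)
    return matches / len(fp1)


def exact_jaccard(s1, s2):
    """Exact Jaccard similarity, which needs the full item sets."""
    return len(s1 & s2) / len(s1 | s2)


# Hypothetical toy data: two users with partially overlapping item sets.
alice = set(range(0, 600))
bob = set(range(300, 900))

hashes = make_hashes(256, seed=42)
fp_alice = fingerprint(alice, hashes)
fp_bob = fingerprint(bob, hashes)

print("exact:   ", exact_jaccard(alice, bob))
print("estimate:", estimate_jaccard(fp_alice, fp_bob))
```

Each fingerprint here holds 256 integers regardless of how many items a user consumed, so comparing two users touches only the fingerprints. Building a fingerprint in this naive form costs one hash evaluation per item per hash function, which is the kind of running time the abstract says faster constructions aim to reduce.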