Finding duplicates in a data stream

Authors:
Parikshit Gopalan;Jaikumar Radhakrishnan
Affiliations:
University of Washington & Microsoft Research SVC;TIFR, Mumbai
Venue:
SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Year:
2009

Citing 12

On the power of two-point based sampling. Journal of Complexity
The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics
On the distributional complexity of disjointness. Theoretical Computer Science
Communication complexity
The space complexity of approximating the frequency moments. Journal of Computer and System Sciences
Efficient URL caching for world wide web crawling. WWW '03 Proceedings of the 12th international conference on World Wide Web
Adaptive duplicate detection using learnable string similarity measures. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust Identification of Fuzzy Duplicates. ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Duplicate detection in click streams. WWW '05 Proceedings of the 14th international conference on World Wide Web
Approximately detecting duplicates for streaming data using stable bloom filters. Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM)
Finding a duplicate and a missing item in a stream. TAMC'07 Proceedings of the 4th international conference on Theory and applications of models of computation

Cited 6

Best-Order Streaming Model. TAMC '09 Proceedings of the 6th Annual Conference on Theory and Applications of Models of Computation
Pseudorandom generators for polynomial threshold functions. Proceedings of the forty-second ACM symposium on Theory of computing
Real-time approximate Range Motif discovery & data redundancy removal algorithm. Proceedings of the 14th International Conference on Extending Database Technology
Best-order streaming model. Theoretical Computer Science
Tight bounds for Lp samplers, finding duplicates in streams, and related problems. Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Bounded Independence Fools Halfspaces. SIAM Journal on Computing

Abstract

Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)^3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this problem could be solved using sub-linear space and one pass over the input. Our algorithm solves the more general problem of finding a positive frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a Dictatorial variable in a Boolean halfspace. We present various relaxations of the condition n > m, under which one can find duplicates efficiently.
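
To make the connection between duplicates and positive frequencies concrete, here is a minimal Python sketch. It is not the paper's algorithm: it stores one counter per symbol, so it uses linear space rather than polylogarithmic space. It only illustrates one natural way to read the reduction, under the assumption that each symbol of [m] = {1, ..., m} starts at frequency -1 and each stream item adds +1. The frequencies then sum to n - m, which is positive when the stream is longer than the alphabet, and any symbol with positive frequency occurs at least twice. The function name and the initialization are illustrative choices, not taken from the paper.

```python
from typing import Dict, Iterable, Optional


def find_duplicate_via_frequencies(stream: Iterable[int], m: int) -> Optional[int]:
    """Return a symbol of {1..m} that occurs at least twice, or None.

    Illustrative only: uses O(m) counters, unlike the sub-linear-space
    algorithm described in the abstract.
    """
    # Assumed encoding: every symbol starts at frequency -1.
    freq: Dict[int, int] = {x: -1 for x in range(1, m + 1)}

    # Each stream item is a +1 frequency update for its symbol.
    for x in stream:
        freq[x] += 1

    # The frequencies sum to n - m. A symbol with positive frequency
    # appeared at least twice, so it is a duplicate.
    for x, f in freq.items():
        if f > 0:
            return x
    return None
```

For example, on the stream 3, 1, 4, 1, 5, 2 over [5] the frequencies sum to 6 - 5 = 1, and the only symbol with positive frequency is 1.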