On the power of two-point based sampling
Journal of Complexity
The probabilistic communication complexity of set intersection
SIAM Journal on Discrete Mathematics
On the distributional complexity of disjointness
Theoretical Computer Science
Communication complexity
The space complexity of approximating the frequency moments
Journal of Computer and System Sciences
Efficient URL caching for world wide web crawling
WWW '03 Proceedings of the 12th international conference on World Wide Web
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust Identification of Fuzzy Duplicates
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Duplicate detection in click streams
WWW '05 Proceedings of the 14th international conference on World Wide Web
Approximately detecting duplicates for streaming data using stable bloom filters
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Stable distributions, pseudorandom generators, embeddings, and data stream computation
Journal of the ACM (JACM)
Finding a duplicate and a missing item in a stream
TAMC'07 Proceedings of the 4th international conference on Theory and applications of models of computation
TAMC '09 Proceedings of the 6th Annual Conference on Theory and Applications of Models of Computation
Pseudorandom generators for polynomial threshold functions
Proceedings of the forty-second ACM symposium on Theory of computing
Real-time approximate Range Motif discovery & data redundancy removal algorithm
Proceedings of the 14th International Conference on Extending Database Technology
Theoretical Computer Science
Tight bounds for Lp samplers, finding duplicates in streams, and related problems
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Bounded Independence Fools Halfspaces
SIAM Journal on Computing
Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)^3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked whether this problem could be solved using sublinear space and one pass over the input. Our algorithm solves the more general problem of finding a positive-frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a dictatorial variable in a Boolean halfspace. We also present various relaxations of the condition n > m under which one can find duplicates efficiently.
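To make the connection between the two problems in the abstract concrete, the sketch below shows one natural way to view duplicate finding as a positive-frequency problem: start every symbol of [m] at frequency -1 and treat each stream item as a +1 update. The frequencies then sum to n - m, which is positive when n > m, and any symbol that ends with positive frequency occurred at least twice. This is only an illustration of that reduction. It stores all m counters, so it uses linear space and is not the O((log m)^3)-space algorithm of the paper. The function name and the alphabet convention {1, ..., m} are assumptions made for the example.

```python
from typing import Dict, Iterable, Optional


def find_duplicate_via_frequencies(stream: Iterable[int], m: int) -> Optional[int]:
    """Illustrative, linear-space reduction (NOT the paper's small-space algorithm).

    Assumes the alphabet is {1, ..., m}; this convention is hypothetical.
    """
    # Start every symbol at frequency -1.
    freq: Dict[int, int] = {x: -1 for x in range(1, m + 1)}

    # Each stream item is a +1 frequency update for that symbol.
    for item in stream:
        freq[item] += 1

    # After n items the frequencies sum to n - m. If n > m the sum is
    # positive, so some symbol has positive frequency, i.e. it appeared
    # at least twice in the stream.
    for symbol, f in freq.items():
        if f > 0:
            return symbol
    return None  # Only possible when no symbol is repeated.
```

For example, calling the function on the stream [3, 1, 3, 2] with m = 3 leaves symbol 3 with frequency +1, so 3 is reported as a duplicate.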