Comparing Data Streams Using Hamming Norms (How to Zero In)

Authors:
Graham Cormode; Mayur Datar; Piotr Indyk; S. Muthukrishnan
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2003


Abstract

Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "l_0 sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
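
To make the two definitions in the abstract concrete, the following is a minimal Python sketch that computes the Hamming norm exactly by keeping a full table of item counts. It is not the approximation technique proposed in the paper: it uses memory proportional to the number of items, which is what a small-space sketch is meant to avoid. The stream format (pairs of an item and a signed change to its count) and the function names `accumulate`, `hamming_norm`, and `hamming_distance` are assumptions made for illustration and do not come from the source.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

# Assumed stream format: each update is (item, signed change to its count).
Update = Tuple[Hashable, int]


def accumulate(stream: Iterable[Update]) -> Counter:
    """Fold a stream of updates into a table of net counts per item."""
    counts: Counter = Counter()
    for item, delta in stream:
        counts[item] += delta
    return counts


def hamming_norm(stream: Iterable[Update]) -> int:
    """Exact Hamming norm of one stream: the number of items whose net
    count is nonzero, i.e. the number of distinct items present."""
    counts = accumulate(stream)
    return sum(1 for c in counts.values() if c != 0)


def hamming_distance(stream_a: Iterable[Update],
                     stream_b: Iterable[Update]) -> int:
    """Exact Hamming norm of the difference of two streams: the number
    of items whose net counts differ between the streams."""
    a = accumulate(stream_a)
    b = accumulate(stream_b)
    # A Counter returns 0 for a missing key, so items seen in only one
    # stream are compared against a count of zero.
    return sum(1 for item in a.keys() | b.keys() if a[item] != b[item])


# Hypothetical example data.
stream_a = [("x", 1), ("y", 2), ("x", 1), ("z", 1), ("z", -1)]
stream_b = [("x", 2), ("y", 1), ("w", 3)]

print(hamming_norm(stream_a))                 # items with nonzero count in A
print(hamming_distance(stream_a, stream_b))   # items whose counts differ
```

In this example, item "z" is inserted and then removed from the first stream, so it does not contribute to the norm, while "y" and "w" are the items whose counts differ across the two streams. An exact table like this serves mainly as a baseline against which an approximate estimate could be checked.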