Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large-scale diagnostic data streams. Such streams are rarely stored in traditional databases; instead, they must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. With the growing focus on manipulating data streams, there is a need to identify the basic operations of interest in managing them and to support those operations efficiently. We propose computation of the Hamming norm as one such basic operation. The Hamming norm formalises ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items present in that stream, a statistic of great interest in databases. When applied to a pair of streams, it gives an important measure of (dis)similarity: the number of items whose counts differ between the two streams. Hamming norms therefore have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm of massive data streams; it relies on what we call the "l0 sketch", and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data and show that the estimates are accurate to within a few percentage points.
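To make the two definitions concrete, the following is a minimal Python sketch of the quantity being estimated, computed exactly. It is not the "l0 sketch" approximation described in the abstract: it keeps one counter per item, which is precisely the memory cost a streaming method is meant to avoid. The update format `(item, delta)`, the function names `hamming_norm` and `hamming_distance`, and the sample streams are assumptions made for illustration only.

```python
from collections import Counter


def hamming_norm(updates):
    """Exact Hamming norm of one stream.

    `updates` is an iterable of (item, delta) pairs (an assumed format).
    The result is the number of items whose net count is non-zero.
    """
    counts = Counter()
    for item, delta in updates:
        counts[item] += delta
    return sum(1 for c in counts.values() if c != 0)


def hamming_distance(updates_a, updates_b):
    """Exact Hamming norm of the difference of two streams.

    Counts the items whose net counts differ between the two streams,
    by negating stream B's updates and reusing hamming_norm.
    """
    negated_b = [(item, -delta) for item, delta in updates_b]
    return hamming_norm(list(updates_a) + negated_b)


# Hypothetical example streams: "z" is inserted and then deleted in stream A.
stream_a = [("x", 1), ("y", 2), ("z", 1), ("z", -1)]
stream_b = [("x", 1), ("y", 1), ("w", 3)]

print(hamming_norm(stream_a))                # 2 (x and y have non-zero counts)
print(hamming_distance(stream_a, stream_b))  # 2 (y and w have unequal counts)
```

Treating the pairwise case as the single-stream norm of the difference stream reflects the definition given in the abstract: an item contributes to the distance exactly when its counts in the two streams are unequal.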