Sketching probabilistic data streams

Authors:
Graham Cormode; Minos Garofalakis
Affiliations:
AT&T Labs-Research, Florham Park, NJ; Yahoo! Research and UC Berkeley, Santa Clara, CA
Venue:
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Year:
2007

Abstract

The management of uncertain, probabilistic data has recently emerged as a useful paradigm for dealing with the inherent unreliabilities of several real-world application domains, including data cleaning, information integration, and pervasive, multi-sensor computing. Unlike conventional data sets, a set of probabilistic tuples defines a probability distribution over an exponential number of possible worlds (i.e., "grounded", deterministic databases). This "possible-worlds" interpretation allows for clean query semantics but also raises hard computational problems for probabilistic database query processors. To further complicate matters, in many scenarios (e.g., large-scale process and environmental monitoring using multiple sensor modalities), probabilistic data tuples arrive and need to be processed in a streaming fashion; that is, using limited memory and CPU resources and without the benefit of multiple passes over a static probabilistic database. Such probabilistic data streams raise a host of new research challenges for stream-processing engines that, to date, remain largely unaddressed. In this paper, we propose the first space- and time-efficient algorithms for approximating complex aggregate queries (including the number of distinct values and join/self-join sizes) over probabilistic data streams. Following the possible-worlds semantics, such aggregates essentially define probability distributions over the space of possible aggregation results, and our goal is to characterize such distributions through efficient approximations of their key moments (such as expectation and variance). Our algorithms offer strong randomized estimation guarantees while using only sublinear space in the size of the stream(s), and rely on novel, concise streaming sketch synopses that extend conventional sketching ideas to the probabilistic streams setting. Our experimental results verify the effectiveness of our approach.
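To make the possible-worlds semantics concrete, the following is a minimal illustrative sketch, not the paper's sketching algorithm. It assumes a simple model in which each stream tuple is an (item, probability) pair that appears independently of the others; the abstract does not fix this model, so treat it as an assumption. The code enumerates every possible world by brute force (exponential in the stream length, so only usable on toy inputs) and computes the expectation and variance of two aggregates named in the abstract: the number of distinct values and the self-join size. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def possible_worlds(stream):
    """Yield (world, probability) pairs for a list of (item, p) tuples.

    Assumption: each tuple is present independently with probability p.
    """
    for mask in product([False, True], repeat=len(stream)):
        prob = 1.0
        world = []
        for present, (item, p) in zip(mask, stream):
            prob *= p if present else 1.0 - p
            if present:
                world.append(item)
        yield world, prob


def distinct_count(world):
    """Number of distinct items in one deterministic world."""
    return len(set(world))


def self_join_size(world):
    """Sum of squared item frequencies in one deterministic world."""
    return sum(c * c for c in Counter(world).values())


def moments(stream, aggregate):
    """Exact expectation and variance of an aggregate over all worlds."""
    mean = 0.0
    second = 0.0
    for world, prob in possible_worlds(stream):
        value = aggregate(world)
        mean += prob * value
        second += prob * value * value
    return mean, second - mean * mean


def expected_distinct_closed_form(stream):
    """Expected distinct count under independence: sum of P(item appears)."""
    absent = {}
    for item, p in stream:
        absent[item] = absent.get(item, 1.0) * (1.0 - p)
    return sum(1.0 - q for q in absent.values())


if __name__ == "__main__":
    toy_stream = [("a", 0.5), ("a", 0.5), ("b", 0.25)]
    print(moments(toy_stream, distinct_count))
    print(moments(toy_stream, self_join_size))
    print(expected_distinct_closed_form(toy_stream))
```

A stream of n tuples induces 2^n worlds under this model, which is why exact enumeration is infeasible at scale and why the abstract targets sublinear-space approximations of these moments instead.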