Efficient and scalable monitoring and summarization of large probabilistic data

Authors:
Mingwang Tang
Affiliations:
University of Utah, Salt Lake City, UT, USA
Venue:
Proceedings of the 2013 SIGMOD/PODS Ph.D. Symposium
Year:
2013


Abstract

In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to treat uncertain data as a first-class citizen, representing it as probabilistic relations. My thesis work has focused on the monitoring and summarization of large probabilistic data. Specifically, we extended the distributed threshold monitoring problem to distributed probabilistic data. In this setting, we need to monitor the aggregated value (e.g., the sum) of distributed probabilistic data against both a score threshold and a probability threshold, which means that techniques designed for deterministic data are not directly applicable. Our algorithms significantly reduce both communication and computation costs, as shown by an extensive experimental evaluation on large real data sets. On the other hand, building histograms to summarize the distribution of a certain feature in a large data set is a fundamental problem in data management. Recent work has extended these studies to probabilistic data, but the resulting methods suffer from limited scalability. We present novel methods to build scalable histograms over large probabilistic data using distributed and parallel algorithms. Extensive experiments on large real data sets demonstrate the scalability and efficiency achieved by our MapReduce implementations when compared to existing, state-of-the-art centralized methods.
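
To make the two-threshold condition concrete, the following is a minimal Python sketch of a brute-force check, not the algorithms developed in the thesis. It assumes each site reports an independent discrete random variable given as a value-to-probability mapping, and it tests whether Pr[sum > gamma] > delta by computing the exact distribution of the sum at a single coordinator. The names `sum_distribution`, `exceeds_threshold`, `gamma`, and `delta` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def sum_distribution(pdfs):
    """Exact distribution of the sum of independent discrete random variables.

    Each element of `pdfs` is a dict mapping value -> probability
    (assumed to sum to 1 and to be independent across sites).
    """
    dist = {0: 1.0}
    for pdf in pdfs:
        nxt = defaultdict(float)
        # Convolve the running distribution with the next site's distribution.
        for s, ps in dist.items():
            for v, pv in pdf.items():
                nxt[s + v] += ps * pv
        dist = dict(nxt)
    return dist


def exceeds_threshold(pdfs, gamma, delta):
    """Return True if Pr[sum > gamma] > delta.

    gamma is the score threshold, delta the probability threshold.
    """
    dist = sum_distribution(pdfs)
    tail = sum(p for s, p in dist.items() if s > gamma)
    return tail > delta


# Three hypothetical sites, each holding one uncertain reading.
sites = [
    {0: 0.5, 10: 0.5},
    {0: 0.2, 5: 0.8},
    {1: 1.0},
]

alarm_low = exceeds_threshold(sites, gamma=10, delta=0.4)
alarm_high = exceeds_threshold(sites, gamma=10, delta=0.6)
```

This centralized check requires every site to ship its full distribution to one place on every update, and the support of the sum can grow quickly with the number of sites. That communication and computation cost is what the abstract says the thesis algorithms reduce.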