Balancing histogram optimality and practicality for query result size estimation
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Wavelet-based histograms for selectivity estimation
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Fast, small-space algorithms for approximate histogram maintenance
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Optimal Histograms with Quality Guarantees
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Evaluating probabilistic queries over imprecise data
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
MYSTIQ: a system for finding more answers by using probabilities
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximation and streaming algorithms for histogram construction problems
ACM Transactions on Database Systems (TODS)
Trio: a system for data, uncertainty, and lineage
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Estimating statistical aggregates on probabilistic data streams
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient aggregation algorithms for probabilistic data
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Model-driven data acquisition in sensor networks
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Efficient query evaluation on probabilistic databases
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Efficient indexing methods for probabilistic threshold queries over uncertain data
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Data integration with uncertainty
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Using Data Mining to Estimate Missing Sensor Data
ICDMW '07 Proceedings of the Seventh IEEE International Conference on Data Mining Workshops
Real-Time Monitoring of Uncertain Data Streams Using Probabilistic Similarity
RTSS '07 Proceedings of the 28th IEEE International Real-Time Systems Symposium
Algorithms for distributed functional monitoring
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Orion 2.0: native support for uncertain data
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Efficient Constraint Monitoring Using Adaptive Thresholds
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Histograms and Wavelets on Probabilistic Data
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Continuously monitoring top-k uncertain data streams: a probabilistic threshold method
Distributed and Parallel Databases
Fast and effective histogram construction
Proceedings of the 18th ACM conference on Information and knowledge management
Probabilistic histograms for probabilistic data
Proceedings of the VLDB Endowment
Evaluation of probabilistic threshold queries in MCDB
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Sampling based algorithms for quantile computation in sensor networks
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
SPROUT2: a squared query engine for uncertain web data
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Probabilistic Databases
Building wavelet histograms on large data in MapReduce
Proceedings of the VLDB Endowment
Efficient Threshold Monitoring for Distributed Probabilistic Data
ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Hi-index | 0.00 |
In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to incorporate and handle data with uncertainties as a first-class citizen, where uncertain data are represented as probabilistic relations. In my thesis, my work has focused on monitoring and summarization of large probabilistic data. Specifically, we extended the distributed threshold monitoring problem to distributed probabilistic data. Instead, we actually need to monitor the aggregated value (e.g. sum) of distributed probabilistic data against both the score threshold and the probability threshold, which make the techniques designed for deterministic data are not directly applicable. Our algorithms have significantly reduced both the communication and computation costs as shown by an extensive experimental evaluation on large real datasets. On the other hand, building histograms to summarize the distribution of certain feature in a large data set is a fundamental problem in data management. Recent work have extended this studies to probabilistic data, but their methods suffer from the limited scalability. We present novel methods to build scalable histograms over large probabilistic data using distributed and parallel algorithms. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our implementations in MapReduce, when compared to the existing, state-of-the-art centralized methods.