Efficient and scalable monitoring and summarization of large probabilistic data

  • Authors:
  • Mingwang Tang

  • Affiliations:
  • University of Utah, Salt Lake City, UT, USA

  • Venue:
  • Proceedings of the 2013 SIGMOD/PODS Ph.D. Symposium
  • Year:
  • 2013

Abstract

In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to treat uncertain data as a first-class citizen, representing it as probabilistic relations. My thesis work focuses on the monitoring and summarization of large probabilistic data. Specifically, we extend the distributed threshold monitoring problem to distributed probabilistic data. In this setting, we must monitor the aggregated value (e.g., the sum) of distributed probabilistic data against both a score threshold and a probability threshold, so techniques designed for deterministic data are not directly applicable. Our algorithms significantly reduce both communication and computation costs, as shown by an extensive experimental evaluation on large real data sets. In addition, building histograms to summarize the distribution of a certain feature in a large data set is a fundamental problem in data management. Recent work has extended these studies to probabilistic data, but the proposed methods suffer from limited scalability. We present novel methods to build scalable histograms over large probabilistic data using distributed and parallel algorithms. Extensive experiments on large real data sets demonstrate the superb scalability and efficiency achieved by our MapReduce implementations when compared with existing, state-of-the-art centralized methods.
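
To make the two-threshold condition concrete, the following is a minimal sketch and not the thesis's algorithm. It assumes that each site holds an independent discrete random variable given as a value-to-probability map, and that an alarm should be raised when Pr[sum > gamma] > delta, where gamma is the score threshold and delta is the probability threshold. The abstract does not state the exact form of the condition, so this reading is an assumption. The names sum_distribution, exceeds_thresholds, gamma, and delta are hypothetical, and the computation is a centralized brute-force convolution with none of the communication or computation savings the abstract describes.

```python
from collections import defaultdict


def sum_distribution(sites):
    """Exact distribution of the sum of independent discrete random variables.

    Each element of `sites` is a dict mapping a value to its probability
    (an assumed representation of one site's uncertain data item).
    """
    dist = {0: 1.0}  # distribution of the empty sum
    for pmf in sites:
        nxt = defaultdict(float)
        for partial_sum, p in dist.items():
            for value, q in pmf.items():
                nxt[partial_sum + value] += p * q
        dist = dict(nxt)
    return dist


def exceeds_thresholds(sites, gamma, delta):
    """Return (alarm, tail), where tail = Pr[sum > gamma] and alarm = tail > delta."""
    dist = sum_distribution(sites)
    tail = sum(p for s, p in dist.items() if s > gamma)
    return tail > delta, tail


# Two hypothetical sites, each reporting one uncertain value.
sites = [
    {1: 0.5, 3: 0.5},
    {2: 0.5, 4: 0.5},
]
alarm, tail = exceeds_thresholds(sites, gamma=4, delta=0.5)
print(alarm, tail)
```

The support of the exact sum distribution can grow quickly with the number of sites, and every site's full distribution must be shipped to one place. This illustrates why a naive centralized check is costly and why dedicated distributed techniques are of interest.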