The Monte Carlo database system: Stochastic analysis close to the data

  • Authors:
  • Ravi Jampani;Fei Xu;Mingxi Wu;Luis Perez;Chris Jermaine;Peter J. Haas

  • Affiliations:
  • University of Florida, Gainesville, FL;Microsoft Corporation, Redmond, WA;Oracle Corporation, Redwood Shores, CA;Rice University, Houston, TX;Rice University, Houston, TX;IBM Almaden Research Center, Armonk, NY

  • Venue:
  • ACM Transactions on Database Systems (TODS)
  • Year:
  • 2011

Abstract

The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab) and then running the analysis there. This extract-and-model paradigm is typically error-prone and slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses. In this article, we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.
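
The idea of querying a random relation with standard SQL and studying the distribution of the answer can be illustrated with a small sketch. It is not MCDB and does not use MCDB's syntax: it assumes Python's standard sqlite3 module as a stand-in database, and the tables `params` and `sales`, their columns, the normal demand model, and all numeric values are hypothetical. The sketch also takes the naive route of materializing one instance of the random relation per Monte Carlo iteration and rerunning the query each time, which is the kind of overhead the tuple-bundle techniques are meant to control.

```python
import random
import sqlite3
import statistics


def simulate_total_sales(n_worlds=200, seed=0):
    """Naive Monte Carlo: build one instance of a random relation per
    iteration, run the same SQL aggregate on it, and collect the answers."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")

    # Ordinary (deterministic) table holding model parameters.
    # Hypothetical schema and values, chosen only for illustration.
    conn.execute("CREATE TABLE params (cid INTEGER, mean REAL, sd REAL)")
    conn.executemany(
        "INSERT INTO params VALUES (?, ?, ?)",
        [(1, 100.0, 10.0), (2, 50.0, 5.0), (3, 75.0, 20.0)],
    )
    params = conn.execute("SELECT cid, mean, sd FROM params").fetchall()

    # "Random" relation: its contents are redrawn from the model each iteration.
    conn.execute("CREATE TABLE sales (cid INTEGER, amount REAL)")

    totals = []
    for _ in range(n_worlds):
        conn.execute("DELETE FROM sales")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [(cid, rng.gauss(mean, sd)) for cid, mean, sd in params],
        )
        # Standard SQL over the current instance of the random relation.
        (total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()
        totals.append(total)

    conn.close()
    return totals


totals = simulate_total_sales()
print("estimated mean of SUM(amount):", statistics.mean(totals))
print("estimated std dev of SUM(amount):", statistics.stdev(totals))
```

The list of per-iteration answers is an empirical sample from the distribution of the query result, so any summary of interest (mean, variance, quantiles) can be estimated from it. Rebuilding the table and rerunning the query for every iteration keeps the sketch short but scales poorly as the data and the number of iterations grow.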