Bayesian analysis of massive datasets via particle filters

  • Authors:
  • Greg Ridgeway; David Madigan

  • Affiliations:
  • RAND, Santa Monica, CA; Rutgers University, Piscataway, NJ

  • Venue:
  • Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year:
  • 2002


Abstract

Markov chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need for statistically sound methods that scale to these large problems. Except in the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, eliminating them as feasible data mining techniques.

In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling; computing the importance weights requires only a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the "rejuvenation" step used in particle filters for dynamic systems models, sidesteps this loss of efficiency with only a slight increase in the number of data accesses.

As a proof of concept, we demonstrate the method on a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is unaffected while data accesses are reduced by 95%.
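The three-step scheme the abstract describes (sample on a subset, reweight the rest in one scan, rejuvenate when weights degenerate) can be sketched compactly. The following is a minimal illustration, not the paper's implementation: it assumes a conjugate Bernoulli model so every posterior has closed form, a subset of 1,000 of 100,000 observations, an effective-sample-size trigger of N/2, and a hand-tuned random-walk proposal. All of these choices are assumptions for illustration; the paper's actual experiments use a mixture of transition models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "massive" Bernoulli dataset; the model is chosen for its
# conjugacy so every step below is exact and easy to check.
data = rng.binomial(1, 0.3, size=100_000)
subset, remainder = data[:1_000], data[1_000:]

# Step 1: simulate from the posterior conditioned on the manageable subset.
# With a uniform Beta(1, 1) prior this posterior is Beta(1 + s0, 1 + f0).
n_draws = 5_000
s0 = subset.sum()
theta = rng.beta(1 + s0, 1 + len(subset) - s0, size=n_draws)

# Step 2: a single scan of the remaining observations yields their
# sufficient statistics, from which the importance weights (the
# likelihood of the remainder under each draw) follow in log space.
s1 = remainder.sum()
f1 = len(remainder) - s1
log_w = s1 * np.log(theta) + f1 * np.log1p(-theta)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Step 3 ("rejuvenation"): if the effective sample size collapses,
# resample the draws by weight and move each one with a Metropolis-
# Hastings step that leaves the full-data posterior invariant. Here the
# cached sufficient statistics make the step free of new data accesses;
# in general it costs a small number of extra accesses.
ess = 1.0 / np.sum(w ** 2)
if ess < n_draws / 2:
    theta = theta[rng.choice(n_draws, size=n_draws, p=w)]
    S, F = s0 + s1, len(data) - (s0 + s1)
    prop = theta + rng.normal(0.0, 0.002, size=n_draws)  # tuned walk scale
    inside = (prop > 0.0) & (prop < 1.0)
    safe = np.where(inside, prop, 0.5)  # placeholder avoids log warnings
    log_acc = S * np.log(safe / theta) + F * (np.log1p(-safe) - np.log1p(-theta))
    accept = inside & (np.log(rng.uniform(size=n_draws)) < log_acc)
    theta = np.where(accept, safe, theta)
    w = np.full(n_draws, 1.0 / n_draws)

print("weighted posterior mean:", float(np.sum(w * theta)))
```

Computing the weights in log space and subtracting the maximum before exponentiating avoids numerical underflow, which matters when the remainder contributes likelihood terms from tens of thousands of observations.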