Towards high-throughput Gibbs sampling at scale: a study across storage managers

  • Authors:
  • Ce Zhang; Christopher Ré

  • Affiliations:
  • University of Wisconsin-Madison, Madison, WI, USA; University of Wisconsin-Madison, Madison, WI, USA

  • Venue:
  • Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
  • Year:
  • 2013

Abstract

Factor graphs and Gibbs sampling are a popular combination for Bayesian statistical methods that are used to solve diverse problems including insurance risk models, pricing models, and information extraction. Given a fixed sampling method and a fixed amount of time, a sampler implementation that achieves higher sample throughput will produce higher-quality results than a lower-throughput sampler. We study how (and whether) traditional data-processing choices about materialization, page layout, and buffer-replacement policy need to be changed to achieve high-throughput Gibbs sampling for factor graphs that are larger than main memory. We find that both new theoretical and new algorithmic techniques are required to understand the tradeoff space for each choice. On both real and synthetic data, we demonstrate that traditional baseline approaches may achieve two orders of magnitude lower throughput than an optimal approach. For a handful of popular tasks across several storage backends, including HBase and traditional Unix files, we show that our simple prototype achieves competitive (and sometimes better) throughput compared to specialized state-of-the-art approaches on factor graphs that are larger than main memory.
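
For readers unfamiliar with the primitive the paper optimizes, the sketch below shows single-site Gibbs sampling over a tiny in-memory factor graph with binary variables. The variable names, factor functions, and graph structure are illustrative assumptions, not the paper's data model or implementation; the point is only that each step resamples one variable from its conditional distribution given the factors that touch it, so end-to-end quality hinges on how many such sweeps can be executed per second.

```python
import math
import random

# Toy factor graph (hypothetical, for illustration only): three binary
# variables and three factors expressed as log-potentials over assignments.
variables = {"x0": 0, "x1": 0, "x2": 0}

factors = [
    (["x0", "x1"], lambda a: 1.5 if a["x0"] == a["x1"] else 0.0),  # x0 and x1 tend to agree
    (["x1", "x2"], lambda a: 1.5 if a["x1"] == a["x2"] else 0.0),  # x1 and x2 tend to agree
    (["x0"], lambda a: 0.8 if a["x0"] == 1 else 0.0),              # prior nudging x0 toward 1
]

def gibbs_step(var, assignment, factors):
    """Resample one variable conditioned on the current values of all others."""
    touching = [f for f in factors if var in f[0]]  # only factors over `var` matter
    log_scores = []
    for value in (0, 1):
        assignment[var] = value
        log_scores.append(sum(fn(assignment) for _, fn in touching))
    # Normalize the log-potentials into a distribution and draw a sample.
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    assignment[var] = 0 if random.random() * sum(weights) < weights[0] else 1

def gibbs_sampling(assignment, factors, num_sweeps=1000):
    """One sweep resamples every variable once; throughput = sweeps per second."""
    counts = {v: 0 for v in assignment}
    for _ in range(num_sweeps):
        for var in assignment:
            gibbs_step(var, assignment, factors)
        for v, val in assignment.items():
            counts[v] += val
    # Marginal estimate: fraction of sweeps in which each variable took value 1.
    return {v: c / num_sweeps for v, c in counts.items()}

if __name__ == "__main__":
    random.seed(0)
    print(gibbs_sampling(variables, factors))
```

The questions the paper studies arise when the variables and factors no longer fit in main memory, so each resampling step may trigger I/O whose cost depends on materialization, page layout, and buffer-replacement choices.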