Accelerating Bayesian network parameter learning using Hadoop and MapReduce

  • Authors:
  • Aniruddha Basak, Irina Brinster, Xianheng Ma, Ole J. Mengshoel

  • Affiliation:
  • Carnegie Mellon University, Moffett Field, CA (all authors)

  • Venue:
  • Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
  • Year:
  • 2012

Abstract

Learning the conditional probability tables (CPTs) of large Bayesian networks (BNs) with hidden nodes using the Expectation Maximization (EM) algorithm is computationally intensive. There are at least two bottlenecks: the potentially huge size of the data set and the demand for computation and memory resources. This work applies the distributed computing framework MapReduce to Bayesian parameter learning from both complete and incomplete data. We formulate traditional parameter learning (complete data) and the classical EM algorithm (incomplete data) within the MapReduce framework. We analyze, both analytically and experimentally, the speed-up that can be obtained by means of MapReduce. We present the details of our Hadoop implementation, report speed-ups over the sequential case, and compare various Hadoop configurations in experiments with Bayesian networks of different sizes and structures. For Bayesian networks with large junction trees, we find, surprisingly, that MapReduce can give a speed-up over the sequential EM algorithm even when learning from 20 or fewer cases. The benefit of MapReduce for learning various Bayesian networks is investigated on data sets with up to 1,000,000 records.
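The paper's implementation is not reproduced here, but the complete-data case maps naturally onto a word-count-style Hadoop job: each mapper emits one count per (node, parent configuration, value) triple per record, and reducers sum these sufficient statistics, from which the maximum-likelihood CPT entries follow by normalization. The sketch below illustrates this idea using the standard Hadoop MapReduce API; the class names and the hard-coded PARENTS table are illustrative assumptions for a toy three-node network, not the authors' code.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input line = one fully observed case, e.g. "T,F,T".
// For each node X_i it emits the composite key (node, parent
// configuration, value) with count 1 -- the sufficient statistic
// N(x_i, pa_i) needed for the maximum-likelihood CPT estimate.
public class CptCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Hypothetical structure table: PARENTS[i] lists the column
    // indices of node i's parents; the record layout is assumed
    // to match (X0 has no parents, X1 has parent X0, etc.).
    private static final int[][] PARENTS = {{}, {0}, {0, 1}};
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] vals = line.toString().split(",");
        for (int i = 0; i < PARENTS.length; i++) {
            StringBuilder key = new StringBuilder("X" + i);
            for (int p : PARENTS[i]) key.append('|').append(vals[p]);
            key.append('=').append(vals[i]);
            ctx.write(new Text(key.toString()), ONE);
        }
    }
}

// Reducer: sums the counts for each (node, parent config, value)
// key. Dividing each sum by the total count for the same parent
// configuration (a small post-processing step) yields the CPT.
class CptCountReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts,
            Context ctx) throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) sum += c.get();
        ctx.write(key, new LongWritable(sum));
    }
}
```

In the incomplete-data (EM) case the same reduce-side aggregation applies, but each mapper would instead run junction-tree inference per record to emit expected rather than observed counts, which is why the per-record cost dominates and parallelism can pay off even for very small data sets.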