Parallelization of module network structure learning and performance tuning on SMP

  • Authors and affiliations:
  • Hongshan Jiang (Tsinghua University, Dept. of Computer Science, Beijing, China); Chunrong Lai (Intel China Research Center Ltd., Beijing, China); Wenguang Chen (Tsinghua University, Dept. of Computer Science, Beijing, China); Yurong Chen (Intel China Research Center Ltd., Beijing, China); Wei Hu (Intel China Research Center Ltd., Beijing, China); Weimin Zheng (Tsinghua University, Dept. of Computer Science, Beijing, China); Yimin Zhang (Intel China Research Center Ltd., Beijing, China)

  • Venue:
  • IPDPS'06: Proceedings of the 20th International Conference on Parallel and Distributed Processing
  • Year:
  • 2006


Abstract

As an extension of the Bayesian network, the module network is an appropriate model for inferring a causal network over a large number of variables from insufficient evidence. However, learning such a model is still a time-consuming process. In this paper, we propose a parallel implementation of a module network learning algorithm using OpenMP. We propose a static task-partitioning strategy that distributes sub-search-spaces over worker threads to trade off load balance against software-cache contention. To overcome performance penalties arising from shared-memory contention, we adopt several optimization techniques, such as memory pre-allocation, memory alignment, and the use of static functions. These optimizations influence sequential performance and parallel speedup in different ways. Experiments validate their effectiveness: for a 2,200-node dataset, they improve parallel speedup by up to 88%, together with a 2X sequential performance improvement. With resource contention reduced, workload imbalance becomes the main hurdle to parallel scalability, and the program behaves more consistently across platforms.