The triple jump extrapolation method is an effective approximation of Aitken's acceleration that can speed up the convergence of many data mining algorithms, including EM and generalized iterative scaling (GIS). It comes in two variants: global and componentwise extrapolation. Empirical studies showed that neither variant dominates the other, and it was not known under which conditions one is preferable. In this paper, we investigate this problem and conclude that componentwise extrapolation is more effective when the Jacobian of the iteration mapping is (block) diagonal. We derive two hints for determining block diagonality. The first is that, for a highly sparse data set, the Jacobian of the EM mapping for training a Bayesian network will be block diagonal. The second is that the block diagonality of the Jacobian of the GIS mapping for training a CRF is negatively correlated with the strength of feature dependencies. We empirically verify these hints on controlled and real-world data sets and show that they accurately predict which variant will be superior. We also show that both global and componentwise extrapolation provide substantial acceleration. In particular, when applied to train large-scale CRF models, the GIS variant accelerated by componentwise extrapolation not only outperforms its global counterpart, as our hint predicts, but also competes with limited-memory BFGS (L-BFGS), the de facto standard for CRF training, in terms of both computational efficiency and F-scores. Although none of these methods is as fast as stochastic gradient descent (SGD), SGD requires careful tuning, and the results in this paper provide a useful foundation for automatic tuning.
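To make the distinction between the two variants concrete, the sketch below applies Aitken-style extrapolation to a generic fixed-point mapping (such as one EM or GIS update). The function names, the two-plain-steps-then-jump schedule, and the safeguards are illustrative assumptions rather than the authors' exact triple jump formulation: the global version estimates a single contraction rate from the norms of successive differences, while the componentwise version applies the classic Aitken delta-squared formula to each coordinate, the regime the abstract argues is favored by a (block) diagonal Jacobian.

```python
import numpy as np

def global_extrapolation(theta0, theta1, theta2, eps=1e-12):
    """Global jump: estimate one scalar contraction rate from the norms of
    successive differences and extrapolate every coordinate with it."""
    d1 = theta1 - theta0
    d2 = theta2 - theta1
    rate = np.linalg.norm(d2) / (np.linalg.norm(d1) + eps)
    if rate >= 1.0:                      # mapping not contracting here; skip the jump
        return theta2
    return theta1 + d2 / (1.0 - rate)    # sum of the remaining geometric series

def componentwise_extrapolation(theta0, theta1, theta2, eps=1e-12):
    """Componentwise jump: apply Aitken's delta-squared formula to each
    coordinate, i.e. estimate one contraction rate per coordinate."""
    d1 = theta1 - theta0
    d2 = theta2 - theta1
    denom = d2 - d1                      # per-coordinate second difference
    jump = theta2.copy()
    safe = np.abs(denom) > eps           # avoid dividing by ~0 on converged coordinates
    jump[safe] = theta0[safe] - d1[safe] ** 2 / denom[safe]
    return jump

def accelerated_fixed_point(mapping, theta, n_rounds=50, componentwise=True):
    """Interleave plain fixed-point steps (e.g. EM or GIS updates) with
    extrapolation jumps: two plain steps, then one jump."""
    extrapolate = componentwise_extrapolation if componentwise else global_extrapolation
    for _ in range(n_rounds):
        theta1 = mapping(theta)
        theta2 = mapping(theta1)
        theta = extrapolate(theta, theta1, theta2)
    return theta

if __name__ == "__main__":
    # Toy linear mapping with a diagonal Jacobian and very different rates per
    # coordinate; the componentwise jump reaches the fixed point (zero) faster.
    rates = np.array([0.99, 0.5, 0.1])
    mapping = lambda theta: rates * theta
    theta0 = np.ones(3)
    print(accelerated_fixed_point(mapping, theta0, n_rounds=5, componentwise=True))
    print(accelerated_fixed_point(mapping, theta0, n_rounds=5, componentwise=False))
```

The toy mapping illustrates the intuition behind the paper's hints: when the Jacobian is diagonal and coordinates converge at very different rates, a single global rate is a poor summary of the dynamics, so the per-coordinate jump tends to win, whereas strongly coupled coordinates blur the per-coordinate rate estimates and favor the global jump.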