The triple jump extrapolation method is an effective approximation of Aitken's acceleration that can speed up the convergence of many data mining algorithms, including EM and generalized iterative scaling (GIS). It comes in two variants: global and componentwise extrapolation. Empirical studies showed that neither variant dominates the other, and it was not known under which conditions one is preferable. In this paper, we investigate this problem and conclude that componentwise extrapolation is more effective when the Jacobian of the iteration mapping is (block) diagonal. We derive two hints for determining block diagonality. The first is that, for a highly sparse data set, the Jacobian of the EM mapping for training a Bayesian network will be block diagonal. The second is that the block diagonality of the Jacobian of the GIS mapping for training a CRF is negatively correlated with the strength of feature dependencies. We empirically verify these hints on controlled and real-world data sets and show that they accurately predict which variant will be superior. We also show that both global and componentwise extrapolation provide substantial acceleration. In particular, when applied to train large-scale CRF models, the GIS variant accelerated by componentwise extrapolation not only outperforms its global counterpart, as our hint predicts, but also competes with limited-memory BFGS (L-BFGS), the de facto standard for CRF training, in terms of both computational efficiency and F-scores. Although none of these methods is as fast as stochastic gradient descent (SGD), SGD requires careful tuning, and the results in this paper provide a useful foundation for automatic tuning.
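To make the distinction between the two variants concrete, the sketch below applies Aitken-style extrapolation to a generic fixed-point mapping (such as one EM or GIS update). The function names, the two-plain-steps-then-jump schedule, and the safeguards are illustrative assumptions rather than the authors' exact triple jump formulation: the global version estimates a single contraction rate from the norms of successive differences, while the componentwise version applies the classic Aitken delta-squared formula to each coordinate, the regime the abstract argues is favored by a (block) diagonal Jacobian.

```python
import numpy as np

def global_extrapolation(theta0, theta1, theta2, eps=1e-12):
    """Global jump: estimate one scalar contraction rate from the norms of
    successive differences and extrapolate every coordinate with it."""
    d1 = theta1 - theta0
    d2 = theta2 - theta1
    rate = np.linalg.norm(d2) / (np.linalg.norm(d1) + eps)
    if rate >= 1.0:                      # mapping not contracting here; skip the jump
        return theta2
    return theta1 + d2 / (1.0 - rate)    # sum of the remaining geometric series

def componentwise_extrapolation(theta0, theta1, theta2, eps=1e-12):
    """Componentwise jump: apply Aitken's delta-squared formula to each
    coordinate, i.e. estimate one contraction rate per coordinate."""
    d1 = theta1 - theta0
    d2 = theta2 - theta1
    denom = d2 - d1                      # per-coordinate second difference
    jump = theta2.copy()
    safe = np.abs(denom) > eps           # avoid dividing by ~0 on converged coordinates
    jump[safe] = theta0[safe] - d1[safe] ** 2 / denom[safe]
    return jump

def accelerated_fixed_point(mapping, theta, n_rounds=50, componentwise=True):
    """Interleave plain fixed-point steps (e.g. EM or GIS updates) with
    extrapolation jumps: two plain steps, then one jump."""
    extrapolate = componentwise_extrapolation if componentwise else global_extrapolation
    for _ in range(n_rounds):
        theta1 = mapping(theta)
        theta2 = mapping(theta1)
        theta = extrapolate(theta, theta1, theta2)
    return theta

if __name__ == "__main__":
    # Toy linear mapping with a diagonal Jacobian and very different rates per
    # coordinate; the componentwise jump reaches the fixed point (zero) faster.
    rates = np.array([0.99, 0.5, 0.1])
    mapping = lambda theta: rates * theta
    theta0 = np.ones(3)
    print(accelerated_fixed_point(mapping, theta0, n_rounds=5, componentwise=True))
    print(accelerated_fixed_point(mapping, theta0, n_rounds=5, componentwise=False))
```

The toy mapping illustrates the intuition behind the paper's hints: when the Jacobian is diagonal and coordinates converge at very different rates, a single global rate is a poor summary of the dynamics, so the per-coordinate jump tends to win, whereas strongly coupled coordinates blur the per-coordinate rate estimates and favor the global jump.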