It has been established that second-order stochastic gradient descent (SGD) can potentially reach generalization performance as good as the empirical optimum in a single pass through the training examples. However, second-order SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive for structured prediction problems, since they typically involve very high-dimensional feature spaces. This paper presents a new second-order SGD method called Periodic Step-size Adaptation (PSA). PSA approximates the Jacobian matrix of the mapping function and exploits a linear relation between the Jacobian and the Hessian to approximate the Hessian, which proves simpler and more effective than approximating the Hessian directly in an on-line setting. We tested PSA on a wide variety of models and tasks, including large-scale sequence labeling with conditional random fields and large-scale classification with linear support vector machines and convolutional neural networks. Experimental results show that the single-pass performance of PSA is consistently very close to the empirical optimum.
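The abstract does not give PSA's update equations, so the following is only a minimal illustrative sketch of the general idea it describes: plain SGD whose per-coordinate step sizes are periodically refreshed from a cheap diagonal curvature estimate, obtained by relating changes in the parameters to changes in the gradient rather than by forming the Hessian directly. The logistic-regression loss, the probe-batch secant estimate, the function names, and all constants below are assumptions made for the example, not the paper's actual method.

```python
# Illustrative sketch only (not the paper's exact PSA algorithm):
# single-pass SGD with per-coordinate step sizes that are refreshed every
# `period` steps from a diagonal secant estimate of the Hessian.
import numpy as np

def logistic_grad(w, X, Y):
    """Average gradient of the logistic loss over a batch (labels in {-1, +1})."""
    z = Y * (X @ w)
    coef = -Y / (1.0 + np.exp(z))
    return (coef[:, None] * X).mean(axis=0)

def psa_like_sgd(X, Y, eta0=0.1, period=100, eps=1e-3, probe=64, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    eta = np.full(d, eta0)          # per-coordinate step sizes
    w_prev = w.copy()

    for t in range(n):              # single pass over the training examples
        g = logistic_grad(w, X[t:t + 1], Y[t:t + 1])
        w = w - eta * g             # diagonally preconditioned SGD step
        if (t + 1) % period == 0:
            # Diagonal secant curvature estimate on a small probe batch:
            #   h_i ~= |grad_i(w) - grad_i(w_prev)| / |w_i - w_prev,i|
            idx = rng.choice(n, size=min(probe, n), replace=False)
            dg = logistic_grad(w, X[idx], Y[idx]) - logistic_grad(w_prev, X[idx], Y[idx])
            dw = w - w_prev
            h = np.abs(dg) / (np.abs(dw) + eps)
            eta = eta0 / (h + eps)  # periodic step-size adaptation
            w_prev = w.copy()
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 20))
    w_true = rng.normal(size=20)
    Y = np.sign(X @ w_true + 0.1 * rng.normal(size=2000))
    w_hat = psa_like_sgd(X, Y)
    print("single-pass training accuracy:", np.mean(np.sign(X @ w_hat) == Y))
```

The point of the sketch is the cost profile: curvature is estimated only once per period and only along the diagonal, so the per-example work stays at the level of first-order SGD while the step sizes still adapt to the loss geometry, which is the property the abstract emphasizes for high-dimensional structured prediction.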