It has been established that second-order stochastic gradient descent (SGD) can potentially reach generalization performance as good as the empirical optimum in a single pass through the training examples. However, second-order SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive for structured prediction problems, since they typically involve very high-dimensional feature spaces. This paper presents a new second-order SGD method called Periodic Step-size Adaptation (PSA). PSA approximates the Jacobian matrix of the mapping function and exploits a linear relation between the Jacobian and the Hessian to approximate the Hessian, which proves simpler and more effective than approximating the Hessian directly in an on-line setting. We tested PSA on a wide variety of models and tasks, including large-scale sequence labeling with conditional random fields and large-scale classification with linear support vector machines and convolutional neural networks. Experimental results show that the single-pass performance of PSA is consistently very close to the empirical optimum.
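The abstract does not give PSA's update equations, so the following is only a minimal illustrative sketch of the general idea it describes: plain SGD whose per-coordinate step sizes are periodically refreshed from a cheap diagonal curvature estimate, obtained by relating changes in the parameters to changes in the gradient rather than by forming the Hessian directly. The logistic-regression loss, the probe-batch secant estimate, the function names, and all constants below are assumptions made for the example, not the paper's actual method.

```python
# Illustrative sketch only (not the paper's exact PSA algorithm):
# single-pass SGD with per-coordinate step sizes that are refreshed every
# `period` steps from a diagonal secant estimate of the Hessian.
import numpy as np

def logistic_grad(w, X, Y):
    """Average gradient of the logistic loss over a batch (labels in {-1, +1})."""
    z = Y * (X @ w)
    coef = -Y / (1.0 + np.exp(z))
    return (coef[:, None] * X).mean(axis=0)

def psa_like_sgd(X, Y, eta0=0.1, period=100, eps=1e-3, probe=64, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    eta = np.full(d, eta0)          # per-coordinate step sizes
    w_prev = w.copy()

    for t in range(n):              # single pass over the training examples
        g = logistic_grad(w, X[t:t + 1], Y[t:t + 1])
        w = w - eta * g             # diagonally preconditioned SGD step
        if (t + 1) % period == 0:
            # Diagonal secant curvature estimate on a small probe batch:
            #   h_i ~= |grad_i(w) - grad_i(w_prev)| / |w_i - w_prev,i|
            idx = rng.choice(n, size=min(probe, n), replace=False)
            dg = logistic_grad(w, X[idx], Y[idx]) - logistic_grad(w_prev, X[idx], Y[idx])
            dw = w - w_prev
            h = np.abs(dg) / (np.abs(dw) + eps)
            eta = eta0 / (h + eps)  # periodic step-size adaptation
            w_prev = w.copy()
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 20))
    w_true = rng.normal(size=20)
    Y = np.sign(X @ w_true + 0.1 * rng.normal(size=2000))
    w_hat = psa_like_sgd(X, Y)
    print("single-pass training accuracy:", np.mean(np.sign(X @ w_hat) == Y))
```

The point of the sketch is the cost profile: curvature is estimated only once per period and only along the diagonal, so the per-example work stays at the level of first-order SGD while the step sizes still adapt to the loss geometry, which is the property the abstract emphasizes for high-dimensional structured prediction.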