Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty

Authors:
Yoshimasa Tsuruoka;Jun'ichi Tsujii;Sophia Ananiadou
Affiliations:
University of Manchester, UK and National Centre for Text Mining (NaCTeM), UK;University of Manchester, UK and National Centre for Text Mining (NaCTeM), UK and University of Tokyo, Japan;University of Manchester, UK and National Centre for Text Mining (NaCTeM), UK
Venue:
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
Year:
2009

Abstract

Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1-regularization, which is becoming popular in natural language processing because of its ability to produce compact models, cannot be efficiently applied in SGD training, due to the large dimensions of feature vectors and the fluctuations of approximate gradients. We present a simple method to solve these problems by penalizing the weights according to cumulative values for L1 penalty. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
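
The cumulative-penalty idea summarized in the abstract can be illustrated with a minimal sketch. It is one reading of the technique applied to binary logistic regression (a simple log-linear model), not the authors' implementation. The function name `train_sgd_l1_cumulative`, the sparse dict feature format, the per-epoch learning-rate decay, and the default hyperparameters are all assumptions made for illustration. The sketch keeps a running total `u` of the L1 penalty every weight could have received and a per-weight total `q[i]` of the penalty actually applied, and it touches only the weights of features active in the current example.

```python
import math
import random


def train_sgd_l1_cumulative(data, n_features, C=1.0, eta0=0.5, epochs=20, seed=0):
    """Sketch of SGD with a cumulative L1 penalty for binary logistic regression.

    data: list of (features, label); features is a dict {index: value},
    label is 0 or 1. All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    w = [0.0] * n_features  # model weights
    q = [0.0] * n_features  # total L1 penalty actually applied to each weight
    u = 0.0                 # total L1 penalty each weight could have received
    n = len(data)
    order = list(range(n))
    for epoch in range(epochs):
        rng.shuffle(order)
        eta = eta0 / (1.0 + epoch)  # assumed decay schedule, not from the source
        for idx in order:
            x, y = data[idx]
            score = sum(w[i] * v for i, v in x.items())
            score = max(-30.0, min(30.0, score))  # avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-score))
            u += eta * C / n
            # Only weights of active features are updated and penalized.
            for i, v in x.items():
                w[i] += eta * (y - p) * v  # gradient step on the log-likelihood
                z = w[i]
                # Apply the penalty the weight still "owes", clipping at zero
                # so that the penalty never flips the sign of the weight.
                if w[i] > 0:
                    w[i] = max(0.0, w[i] - (u + q[i]))
                elif w[i] < 0:
                    w[i] = min(0.0, w[i] + (u - q[i]))
                q[i] += w[i] - z
    return w
```

A hypothetical call such as `train_sgd_l1_cumulative([({0: 1.0}, 1), ({0: -1.0}, 0)] * 10, n_features=3)` returns one weight per feature. Features that never occur keep a weight of exactly zero because they are never visited, which is what keeps the per-example cost proportional to the number of active features rather than to the full feature dimension.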