Scalable training of L1-regularized log-linear models

Authors:
Galen Andrew; Jianfeng Gao
Affiliations:
Microsoft Research, One Microsoft Way, Redmond, WA; Microsoft Research, One Microsoft Way, Redmond, WA
Venue:
Proceedings of the 24th international conference on Machine learning
Year:
2007

Abstract

The L-BFGS limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be applied to an L1-regularized loss, which is non-differentiable whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they become impractical when the number of parameters is very large. We present an algorithm, Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse reranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than L-BFGS on the analogous L2-regularized problem. We also present a proof that OWL-QN is guaranteed to converge to a globally optimal parameter vector.
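
The abstract does not spell out how the non-differentiability at zero is handled, so the following is only an illustrative sketch, not the authors' implementation. It assumes an objective of the form f(x) = loss(x) + c * sum(|x_i|) and shows one common way to obtain a usable descent direction at zero: pick between the left-hand and right-hand derivatives of each coordinate. The function name `l1_pseudo_gradient`, the constant `c`, and the sample values are hypothetical.

```python
def l1_pseudo_gradient(x, grad_loss, c):
    """Per-coordinate pseudo-gradient of f(x) = loss(x) + c * sum(|x_i|).

    x         -- current parameter values
    grad_loss -- gradient of the smooth loss term at x
    c         -- L1 regularization strength (assumed c > 0)
    """
    result = []
    for xi, gi in zip(x, grad_loss):
        if xi > 0:
            result.append(gi + c)       # |x_i| has derivative +1 here
        elif xi < 0:
            result.append(gi - c)       # |x_i| has derivative -1 here
        else:
            left = gi - c               # left-hand derivative at zero
            right = gi + c              # right-hand derivative at zero
            if left > 0:
                result.append(left)     # moving x_i below zero lowers f
            elif right < 0:
                result.append(right)    # moving x_i above zero lowers f
            else:
                result.append(0.0)      # zero is optimal for this coordinate
    return result


# Hypothetical values: two nonzero parameters and three parameters at zero.
x = [0.5, -2.0, 0.0, 0.0, 0.0]
grad_loss = [1.0, 1.0, 3.0, -3.0, 0.5]
pg = l1_pseudo_gradient(x, grad_loss, c=1.0)
print(pg)  # [2.0, 0.0, 2.0, -2.0, 0.0]
```

The last coordinate illustrates why L1 regularization yields sparse solutions: when the loss gradient is smaller in magnitude than `c`, neither direction away from zero decreases the objective, so the parameter stays at zero.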