We consider the problems of numerical stability and model density growth that arise when training a sparse linear model on massive data. We focus on scalable algorithms that minimize a loss function by gradient descent with l0 or l1 regularization. We observe numerical stability problems in several existing methods that lead to divergence and low accuracy. In addition, these methods typically exert weak control over sparsity, so model density grows faster than necessary. We propose a framework that addresses both problems. First, its update rule is numerically stable, comes with a convergence guarantee, and produces better-behaved models. Second, beyond l1 regularization, it exploits the sparsity of the data distribution and achieves a higher degree of sparsity with a PAC generalization error bound. Finally, it is parallelizable and suitable for training large-margin classifiers on huge datasets. Experiments show that the proposed method converges consistently and, while using only 10% of the features, outperforms other baselines by as much as a 6% reduction in error rate on average. Datasets and software are available from the authors.
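To make the setting concrete, below is a minimal Python sketch of the kind of baseline the abstract describes: stochastic gradient descent on the logistic loss with an l1 penalty applied by periodic soft-thresholding, in the spirit of truncated gradient. The function name, defaults, and thresholding schedule are illustrative assumptions, not the proposed framework.

import numpy as np

def l1_sgd_truncated(X, y, lam=1e-4, lr=0.1, epochs=5, truncate_every=10):
    # SGD on the logistic loss; the l1 penalty is applied every
    # `truncate_every` steps by soft-thresholding, which clips weights
    # that cross zero and so yields exact sparsity.
    # All names and defaults here are illustrative assumptions.
    n, d = X.shape
    w = np.zeros(d)
    step = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            step += 1
            # Clip the margin to avoid overflow in exp.
            margin = np.clip(y[i] * X[i].dot(w), -30.0, 30.0)
            # Gradient of the logistic loss for one example.
            g = -y[i] * X[i] / (1.0 + np.exp(margin))
            w -= lr * g
            if step % truncate_every == 0:
                # Shrink weights by the l1 penalty accumulated since the
                # last truncation step.
                shrink = lr * lam * truncate_every
                w = np.sign(w) * np.maximum(np.abs(w) - shrink, 0.0)
    return w

# Toy usage: 5 informative features out of 50, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = 1.0
y = np.sign(X.dot(true_w) + 0.1 * rng.normal(size=200))
w = l1_sgd_truncated(X, y)
print("nonzero weights:", np.count_nonzero(w))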