Large-scale distributed non-negative sparse coding and sparse dictionary learning

  • Authors: Vikas Sindhwani; Amol Ghoting
  • Affiliations: IBM T.J. Watson Research Center, Yorktown Heights, NY, USA (both authors)
  • Venue: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year: 2012

Abstract

We consider the problem of building compact, unsupervised representations of large, high-dimensional, non-negative data using sparse coding and dictionary learning schemes, with an emphasis on executing the algorithm in a Map-Reduce environment. The proposed algorithms may be seen as parallel optimization procedures for constructing sparse non-negative factorizations of large, sparse matrices. Our approach alternates between a parallel sparse coding phase, implemented using greedy or convex (l1) regularized risk minimization procedures, and a sequential dictionary learning phase in which we solve a set of l0 optimization problems exactly. These two-fold sparsity constraints lead to better statistical performance on text analysis tasks and, at the same time, make it possible to implement each iteration as a single Map-Reduce job. We detail the implementation choices and optimizations that allow us to factor matrices with more than 100 million rows and billions of non-zero entries in just a few hours on a small commodity cluster.
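To make the alternating structure described above concrete, the following is a minimal single-machine sketch in Python/NumPy: a non-negative sparse coding step (here approximated by an l1-regularized projected-gradient update) alternated with a dictionary refit. All function and parameter names are illustrative assumptions; the paper's actual scheme runs the coding phase in parallel as a Map-Reduce job and solves exact l0 problems in the dictionary phase, which this toy version does not reproduce.

import numpy as np

def nonneg_sparse_code(X, D, lam=0.1, steps=100):
    """Code each row of X as a sparse, non-negative combination of the rows of D.
    Uses projected gradient on 0.5*||X - H D||^2 + lam*||H||_1 with H >= 0
    (a convex-relaxation stand-in for the coding phase)."""
    H = np.zeros((X.shape[0], D.shape[0]))
    lr = 1.0 / np.linalg.norm(D @ D.T, 2)          # safe step size from the Lipschitz constant
    for _ in range(steps):
        grad = (H @ D - X) @ D.T + lam             # gradient of the smooth part plus l1 subgradient
        H = np.maximum(H - lr * grad, 0.0)         # projection keeps the codes non-negative
    return H

def update_dictionary(X, H, eps=1e-8):
    """Least-squares dictionary refit, projected to non-negativity and row-normalized
    (a simplification of the exact l0 dictionary updates in the paper)."""
    D, *_ = np.linalg.lstsq(H, X, rcond=None)
    D = np.maximum(D, 0.0)
    return D / (np.linalg.norm(D, axis=1, keepdims=True) + eps)

# Toy usage on a small dense non-negative matrix (the paper targets huge sparse matrices).
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((200, 50)))
D = np.abs(rng.standard_normal((10, 50)))
D /= np.linalg.norm(D, axis=1, keepdims=True)
for _ in range(5):                                  # alternate coding and dictionary phases
    H = nonneg_sparse_code(X, D)
    D = update_dictionary(X, H)
print("reconstruction error:", np.linalg.norm(X - H @ D))

In the distributed setting, the coding step is the embarrassingly parallel part (each row of X is coded independently, one map task per block of rows), while the dictionary update aggregates sufficient statistics and is carried out sequentially, which is what keeps each outer iteration within a single Map-Reduce job.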