Novel document detection for massive data streams using distributed dictionary learning

Authors:
S. P. Kasiviswanathan;G. Cong;P. Melville;R. D. Lawrence
Affiliations:
GE Global Research Center, Camino Ramon, CA;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY
Venue:
IBM Journal of Research and Development
Year:
2013

Abstract

Given the high volume of content being generated online, it becomes necessary to employ automated techniques to separate out the documents belonging to novel topics from the background discussion, in a robust and scalable manner (with respect to the size of the document set). We present a solution to this challenge based on sparse coding, in which a stream of documents (where each document is modeled as an m-dimensional vector y) can be used to learn a dictionary matrix A of dimension m × k, such that the documents can be approximately represented by a linear combination of a few columns of A. If a new document cannot be represented with low error as a sparse linear combination of these columns, then this is a strong indicator of novelty of the document. We scale up this approach to handle millions of documents by parallelizing sparse coding and dictionary learning, and by using the alternating-directions method to solve the resulting optimization problems. We conduct our experiments on high-performance computing clusters with differing architectures and evaluate our approach on news streams and streaming data from Twitter®. Based on the analysis, we share our insights on the distributed optimization and machine architecture that can help the design of exascale systems supporting data analytics.
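
The novelty test described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not state the exact objective, so the sketch assumes a squared-error reconstruction loss with an l1 penalty on the coefficients, and it solves that problem with plain iterative soft-thresholding (ISTA) in place of the alternating-directions method and the parallelization the paper uses. The function names, the penalty weight `lam`, and the decision `threshold` are hypothetical choices made for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(A, y, lam=0.1, n_iter=500):
    # ISTA for: min_x 0.5 * ||y - A x||_2^2 + lam * ||x||_1
    # A is the m x k dictionary, y is an m-dimensional document vector.
    step_inv = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - grad / step_inv, lam / step_inv)
    return x

def novelty_score(A, y, lam=0.1):
    # Value of the sparse-coding objective at the computed code:
    # high when y has no low-error sparse representation in A.
    x = sparse_code(A, y, lam)
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x))

def is_novel(A, y, threshold, lam=0.1):
    # Flag the document as novel when its score exceeds the (assumed) threshold.
    return novelty_score(A, y, lam) > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    A /= np.linalg.norm(A, axis=0)        # unit-norm dictionary columns
    y_known = A[:, 0] + 0.5 * A[:, 3]     # built from two dictionary columns
    y_other = rng.standard_normal(50)     # unrelated to the dictionary
    y_other /= np.linalg.norm(y_other)
    print(novelty_score(A, y_known), novelty_score(A, y_other))
```

In this toy setup a vector built from two dictionary columns should receive a lower score than an unrelated vector; in practice the threshold would have to be tuned on held-out data, and the dictionary A would be re-learned as new batches of the stream arrive.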