Novel document detection for massive data streams using distributed dictionary learning

  • Authors:
  • S. P. Kasiviswanathan; G. Cong; P. Melville; R. D. Lawrence

  • Affiliations:
  • GE Global Research Center, Camino Ramon, CA; IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY; IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY; IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2013

Abstract

Given the high volume of content being generated online, it becomes necessary to employ automated techniques to separate out the documents belonging to novel topics from the background discussion, in a robust and scalable manner (with respect to the size of the document set). We present a solution to this challenge based on sparse coding, in which a stream of documents (where each document is modeled as an m-dimensional vector y) can be used to learn a dictionary matrix A of dimension m × k, such that the documents can be approximately represented by a linear combination of a few columns of A. If a new document cannot be represented with low error as a sparse linear combination of these columns, then this is a strong indicator of novelty of the document. We scale up this approach to handle millions of documents by parallelizing sparse coding and dictionary learning, and by using the alternating-directions method to solve the resulting optimization problems. We conduct our experiments on high-performance computing clusters with differing architectures and evaluate our approach on news streams and streaming data from Twitter®. Based on the analysis, we share our insights on the distributed optimization and machine architecture that can help the design of exascale systems supporting data analytics.
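
The novelty test described in the abstract can be illustrated with a small single-machine sketch. It assumes a dictionary A has already been learned, codes a document y by approximately solving min_x 0.5·||y − Ax||₂² + λ·||x||₁ with a generic alternating-directions (ADMM) lasso solver, and uses the relative reconstruction error as a novelty score. The squared-error loss, the function names (`soft_threshold`, `sparse_code_admm`, `novelty_score`), and the parameter values are assumptions made for illustration. The abstract does not specify the exact objective, and this sketch does not cover the distributed dictionary learning the paper describes.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_admm(A, y, lam=0.01, rho=1.0, n_iter=200):
    """Approximately solve min_x 0.5*||y - A x||_2^2 + lam*||x||_1 via ADMM.

    A is an m x k dictionary, y an m-dimensional document vector.
    The loss, lam, rho, and n_iter are illustrative assumptions.
    """
    k = A.shape[1]
    # The x-update solves a fixed linear system, so invert once up front
    # (acceptable for a small k; a factorization would be preferable at scale).
    M = np.linalg.inv(A.T @ A + rho * np.eye(k))
    Aty = A.T @ y
    z = np.zeros(k)  # sparse copy of the code
    u = np.zeros(k)  # scaled dual variable
    for _ in range(n_iter):
        x = M @ (Aty + rho * (z - u))         # least-squares step
        z = soft_threshold(x + u, lam / rho)  # sparsity step
        u = u + x - z                         # dual update
    return z

def novelty_score(A, y, lam=0.01):
    """Relative reconstruction error of y under a sparse code over A."""
    x = sparse_code_admm(A, y, lam=lam)
    return np.linalg.norm(y - A @ x) / (np.linalg.norm(y) + 1e-12)

# Toy dictionary: 3 atoms in a 6-dimensional term space (hypothetical data).
A = np.eye(6)[:, :3]
y_known = np.array([2.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # lies in the span of A
y_novel = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])  # orthogonal to every atom

print(novelty_score(A, y_known))  # small: well represented by the dictionary
print(novelty_score(A, y_novel))  # close to 1: candidate novel document
```

In this toy setting a document is flagged as novel when its score exceeds a chosen threshold. How that threshold is set, and how A is updated as the stream evolves, are left to the paper itself.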