Language pyramid and multi-scale text analysis

  • Authors:
  • Shuang-Hong Yang;Hongyuan Zha

  • Affiliations:
  • Georgia Institute of Technology, Atlanta, GA, USA;Georgia Institute of Technology, Atlanta, GA, USA

  • Venue:
  • CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

The classical Bag-of-Word (BOW) model represents a document as a histogram of word occurrence, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both semantic and spatial contents of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale imagic view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, among which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results illustrate that: for classification, LaP outperforms BOW by (up to) 4% on moderate-length texts (RCV1 text benchmark) and 15% on short texts (Yahoo! queries); and for retrieval, LaP gains 12% MAP improvement over uni-gram language models on the OHSUMED data set.