Local Word Bag Model for Text Categorization

  • Authors:
  • Wen Pu;Ning Liu;Shuicheng Yan;Jun Yan;Kunqing Xie;Zheng Chen

  • Venue:
  • ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
  • Year:
  • 2007

Abstract

Many text processing applications adopt the Bag of Words (BOW) model to represent documents: each document is represented as a vector of weighted terms or n-grams, and the cosine distance between two vectors is used as the similarity measure. Despite its great success in information retrieval and text categorization, the conventional BOW model ignores detailed local text information, i.e. the co-occurrence pattern of words at the sentence or paragraph level. In this paper, we propose a novel approach that represents a document as a set of local tf-idf vectors, which we call local word bags (LWBs). By encapsulating the local information distributed across a document into multiple LWBs, we can measure the similarity of two documents via the partial match of their corresponding local bags. To perform the matching efficiently, we introduce the Local Word Bag kernel (LWB kernel), a variant of the VG-Pyramid match kernel. The new kernel enables discriminative machine learning methods such as SVMs to compute the partial matching between two sets of LWBs in linear time, after a one-time hierarchical clustering procedure over all local bags at the initialization stage. Experiments on real-world datasets demonstrate the effectiveness of our new approach.
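
To make the representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes sentences as the local unit, a naive regex sentence splitter, and a smoothed idf computed over all local units; all function names are hypothetical. The partial match is approximated by a simple best-match average of cosine similarities, which is quadratic in the number of local bags. It stands in for, and is not, the linear-time LWB kernel described in the abstract.

```python
import math
import re
from collections import Counter


def split_into_local_units(text):
    """Split text into sentences (naive regex splitter, assumed for illustration)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(unit):
    """Lowercase a local unit and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", unit.lower())


def build_idf(documents):
    """Compute a smoothed idf over the local units of all documents (an assumption)."""
    units = [tokenize(u) for doc in documents for u in split_into_local_units(doc)]
    n_units = len(units)
    doc_freq = Counter()
    for tokens in units:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_units) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def local_word_bags(text, idf):
    """Represent a document as a list of L2-normalized local tf-idf vectors."""
    bags = []
    for unit in split_into_local_units(text):
        tf = Counter(tokenize(unit))
        vec = {t: count * idf.get(t, 0.0) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            bags.append({t: w / norm for t, w in vec.items()})
    return bags


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def partial_match_similarity(bags_a, bags_b):
    """Average best-match cosine from the smaller set of bags into the larger one.

    A quadratic-time stand-in for partial matching, not the LWB kernel itself.
    """
    if not bags_a or not bags_b:
        return 0.0
    small, large = (bags_a, bags_b) if len(bags_a) <= len(bags_b) else (bags_b, bags_a)
    return sum(max(cosine(s, l) for l in large) for s in small) / len(small)
```

Under this sketch, two documents that share one sentence but differ elsewhere still receive a nonzero score from the shared local bag, which is the kind of partial match that a single document-level BOW vector blurs.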