Semantic smoothing of document models for agglomerative clustering

  • Authors:
  • Xiaohua Zhou;Xiaodan Zhang;Xiaohua Hu

  • Affiliations:
  • Drexel University, College of Information Science & Technology;Drexel University, College of Information Science & Technology;Drexel University, College of Information Science & Technology

  • Venue:
  • IJCAI'07: Proceedings of the 20th International Joint Conference on Artificial Intelligence
  • Year:
  • 2007

Abstract

In this paper, we argue that agglomerative clustering with the vector cosine similarity measure performs poorly for two reasons. First, the nearest neighbors of a document often belong to different classes, because any pair of documents shares many "general" words. Second, the sparsity of class-specific "core" words causes documents with the same class label to be grouped into different clusters. Both problems can be resolved by suitably smoothing the document models and using the Kullback-Leibler divergence between two smoothed models as the pairwise document distance. Inspired by recent work in information retrieval, we propose a novel context-sensitive semantic smoothing method that automatically identifies multiword phrases in a document and then statistically maps those phrases to individual document terms. We evaluate the new model-based similarity measure on three datasets using the complete linkage criterion for agglomerative clustering and find that it significantly improves clustering quality over the traditional vector cosine measure.
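
The pipeline the abstract outlines (smooth each document model, take the Kullback-Leibler divergence between smoothed models as the pairwise distance, then cluster with complete linkage) can be illustrated with a minimal sketch. This is not the authors' method: the context-sensitive semantic smoothing with multiword phrases is replaced here by simple linear interpolation with a corpus-wide background model, the KL divergence is symmetrized by summing both directions, and the toy documents, the weight `lam`, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def smoothed_model(tokens, background, vocab, lam=0.5):
    """Unigram document model interpolated with a background model.

    Assumption: plain linear interpolation stands in for the paper's
    semantic smoothing, which is not reproduced here.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (1 - lam) * counts[w] / total + lam * background[w] for w in vocab}


def kl(p, q):
    """KL divergence D(p || q) for two models over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def sym_kl(p, q):
    """Symmetrized KL divergence (an assumption, so the distance is symmetric)."""
    return kl(p, q) + kl(q, p)


def complete_linkage(dist, k):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members is closest, until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy corpus: two documents about cats, two about markets.
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell as the market closed".split(),
    "the market rallied and stocks rose".split(),
]

vocab = sorted({w for d in docs for w in d})
corpus_counts = Counter(w for d in docs for w in d)
corpus_size = sum(corpus_counts.values())
background = {w: corpus_counts[w] / corpus_size for w in vocab}

models = [smoothed_model(d, background, vocab) for d in docs]
dist = [[sym_kl(p, q) for q in models] for p in models]
clusters = complete_linkage(dist, k=2)
print(clusters)
```

Because every vocabulary word has nonzero background probability, each smoothed model assigns nonzero mass to every word, so the KL divergence stays finite even for documents that share no terms. That property is what makes a smoothed model usable as a distance where raw term frequencies would not be.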