Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

  • Authors:
  • Vipin Balachandran; Deepak P; Deepak Khemani

  • Affiliations:
  • Indian Institute of Technology Madras, Chennai, India; IBM Research - India, Bangalore, India; Indian Institute of Technology Madras, Chennai, India

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Quantified Score

Hi-index 0.01

Abstract

Clusters of text documents produced by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that call for reconfigurable and highly interpretable clusters, and outline the problem of generating clusterings with interpretable and reconfigurable cluster models. Toward this goal we develop a clustering algorithm that generates rules, built from disjunctions and conditions on word frequencies, to decide whether a document belongs to a cluster. Each cluster consists of precisely the set of documents that satisfy the corresponding rule. We show that our approach outperforms the unsupervised decision tree approach by large margins. We also show that the purity and F-measure lost in exchange for interpretability can be as little as 5% and 3%, respectively, using our approach.
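To make the idea of a word-based cluster rule concrete, the following is a minimal sketch, not the authors' algorithm. It only illustrates how a rule made of a disjunction of word-frequency conditions could decide cluster membership. The names `Condition`, `Rule`, and `cluster_members`, the "frequency at least a threshold" form of each condition, and the sample documents are all assumptions made for illustration; the abstract does not specify the exact form of the conditions or how the rules are learned.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical condition: the word occurs at least `min_count` times."""
    word: str
    min_count: int

    def holds(self, counts: Counter) -> bool:
        # Counter returns 0 for words that never occur in the document.
        return counts[self.word] >= self.min_count


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a disjunction (logical OR) of conditions."""
    conditions: tuple

    def matches(self, text: str) -> bool:
        # Naive whitespace tokenization, assumed here for simplicity.
        counts = Counter(text.lower().split())
        return any(c.holds(counts) for c in self.conditions)


def cluster_members(rule: Rule, docs: list) -> list:
    """Return indices of the documents that satisfy the rule."""
    return [i for i, doc in enumerate(docs) if rule.matches(doc)]


# Made-up documents, for illustration only.
docs = [
    "the match ended with a late goal and another goal in extra time",
    "the election results gave the party a clear majority",
    "the striker scored as the team won the league",
]

# Reads as: "goal" appears at least twice OR "league" appears at least once.
sports_rule = Rule((Condition("goal", 2), Condition("league", 1)))
members = cluster_members(sports_rule, docs)
print(members)
```

Because the rule itself is the cluster model, a reader can see directly why a document was included, and editing a threshold or adding a word to the disjunction changes the cluster in a predictable way. That is the sense of "interpretable and reconfigurable" this sketch is meant to convey.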