Clustering and understanding documents via discrimination information maximization

  • Authors:
  • Malik Tahir Hassan; Asim Karim

  • Affiliations:
  • Dept. of Computer Science, LUMS School of Science and Engineering, Lahore, Pakistan; Dept. of Computer Science, LUMS School of Science and Engineering, Lahore, Pakistan

  • Venue:
  • PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
  • Year:
  • 2012

Abstract

Text document clustering is a popular task for understanding and summarizing large document collections. Besides being efficient, document clustering methods should produce clusters that are readily understandable as collections of documents relating to particular contexts or topics. Existing clustering methods often rely on geometric similarity measures and ignore term-document semantics. In this paper, we present an efficient iterative partitional clustering method, CDIM, that maximizes the sum of discrimination information provided by documents. The discrimination information of a document is computed from the discrimination information provided by the terms in it, and term discrimination information is estimated from the currently labeled document collection. A key advantage of CDIM is that its clusters are describable by their highly discriminating terms, that is, terms with high semantic relatedness to their clusters' contexts. We evaluate CDIM both qualitatively and quantitatively on ten text data sets. In the clustering quality evaluation, we find that CDIM produces high-quality clusters superior to those generated by the best competing methods. We also demonstrate the understandability provided by CDIM, suggesting its suitability for practical document clustering.
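
The loop described in the abstract (estimate term discrimination from the current labeling, score each document per cluster, reassign, repeat) can be illustrated with a small sketch. The abstract does not give the discrimination measure, so the one used here is an assumption: a smoothed log ratio of a term's document frequency inside a cluster to its document frequency outside it, floored at zero. The function names `term_scores`, `doc_score`, and `cluster` are hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter


def term_scores(docs, labels, k):
    """Per-cluster term scores from the current labeling.

    ASSUMPTION: the score is a smoothed log ratio of in-cluster to
    out-of-cluster document frequency, floored at 0. The abstract does
    not specify the actual measure used by CDIM.
    """
    vocab = {t for d in docs for t in d}
    sizes = Counter(labels)
    df = [Counter() for _ in range(k)]      # document frequency per cluster
    for d, c in zip(docs, labels):
        df[c].update(set(d))
    total_df = Counter()
    for c in range(k):
        total_df.update(df[c])
    n = len(docs)
    scores = []
    for c in range(k):
        inside, outside = sizes[c], n - sizes[c]
        s = {}
        for t in vocab:
            # Add-one smoothing keeps both probabilities strictly positive.
            p_in = (df[c][t] + 1) / (inside + 2)
            p_out = (total_df[t] - df[c][t] + 1) / (outside + 2)
            s[t] = max(0.0, math.log(p_in / p_out))
        scores.append(s)
    return scores


def doc_score(doc, s):
    """Document score for one cluster: sum of its terms' scores."""
    return sum(s.get(t, 0.0) for t in doc)


def cluster(docs, labels, k, max_iter=20):
    """Reassign documents to their best-scoring cluster until stable."""
    labels = list(labels)
    for _ in range(max_iter):
        scores = term_scores(docs, labels, k)
        new = [max(range(k), key=lambda c: doc_score(d, scores[c]))
               for d in docs]
        if new == labels:
            break
        labels = new
    return labels, term_scores(docs, labels, k)
```

Under these assumptions, each cluster can then be described by its highest-scoring terms, for example `sorted(scores[c], key=scores[c].get, reverse=True)[:5]`, which mirrors the understandability claim in the abstract.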