Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

Authors:
Vipin Balachandran;Deepak P;Deepak Khemani
Affiliations:
Indian Institute of Technology Madras, Chennai, India;IBM Research - India, Bangalore, India;Indian Institute of Technology Madras, Chennai, India
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009



Abstract

Clusters of text documents produced by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that call for reconfigurable and highly interpretable clusters, and we outline the problem of generating clusterings with interpretable and reconfigurable cluster models. Toward this goal, we develop a clustering algorithm that generates rules, built from disjunctions and conditions on word frequencies, that decide whether a document belongs to a cluster. Each cluster comprises precisely the set of documents that satisfy the corresponding rule. We show that our approach outperforms the unsupervised decision tree approach by a wide margin, and that the purity and F-measure losses incurred to achieve interpretability are as little as 5% and 3%, respectively.
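
To make the idea of word-based rules concrete, the following is a minimal sketch of how a rule made of a disjunction of word-frequency conditions could decide cluster membership. It is not the authors' algorithm: the abstract does not say how rules are derived, so the rules here are written by hand, and the rule format, the function names `satisfies` and `assign_clusters`, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter

# Assumed rule format (hypothetical, not taken from the paper):
# a rule is a list of (word, minimum_frequency) conditions joined by OR.


def satisfies(doc_tokens, rule):
    """Return True if at least one condition of the rule holds for the document."""
    counts = Counter(doc_tokens)
    return any(counts[word] >= min_freq for word, min_freq in rule)


def assign_clusters(docs, rules):
    """Map each cluster name to the indices of the documents satisfying its rule."""
    tokenized = [doc.lower().split() for doc in docs]
    return {
        name: [i for i, tokens in enumerate(tokenized) if satisfies(tokens, rule)]
        for name, rule in rules.items()
    }


# Toy documents and hand-written rules, for illustration only.
docs = [
    "the striker scored a goal and then another goal",
    "the election results favoured the incumbent senator",
    "a late goal decided the match",
]
rules = {
    "sports": [("goal", 2), ("match", 1)],         # goal >= 2 OR match >= 1
    "politics": [("election", 1), ("senator", 1)],  # election >= 1 OR senator >= 1
}

clusters = assign_clusters(docs, rules)
for name, members in clusters.items():
    print(name, members)
```

Because each cluster is exactly the set of documents that satisfy its rule, a user could reconfigure a cluster by editing a threshold or adding a condition and then re-running the assignment. In this sketch nothing prevents a document from satisfying several rules or none; whether the original method allows that is not stated in the abstract.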