High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets

Authors:
Hassan H. Malik; John R. Kender
Affiliations:
Columbia University; Columbia University
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006

Citing 0
Cited 7

Self-sufficient itemsets: An approach to screening potentially interesting associations between items
ACM Transactions on Knowledge Discovery from Data (TKDD)

Dynamic hierarchical algorithms for document clustering
Pattern Recognition Letters

A document clustering algorithm for discovering and describing topics
Pattern Recognition Letters

Hierarchical document clustering using local patterns
Data Mining and Knowledge Discovery

Evolutionary clustering using frequent itemsets
Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques

Frequent itemset based hierarchical document clustering using Wikipedia as external knowledge
KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part II

A pattern discovery model for effective text mining
MLDM'12 Proceedings of the 8th international conference on Machine Learning and Data Mining in Pattern Recognition

Quantified Score

Hi-index	0.01

Abstract

High dimensionality remains a significant challenge for document clustering. Recent approaches have used frequent itemsets and closed frequent itemsets to reduce dimensionality and to improve the efficiency of hierarchical document clustering. In this paper, we introduce the notion of "closed interesting" itemsets (i.e., closed itemsets with high interestingness). We provide heuristics such as "super item" to mine these itemsets efficiently, and show that they provide significant dimensionality reduction over closed frequent itemsets. Using "closed interesting" itemsets, we propose a new, sub-linearly scalable, hierarchical document clustering method that outperforms state-of-the-art agglomerative, partitioning and frequent-itemset-based methods in terms of both clustering quality and runtime performance, without requiring dataset-specific parameter tuning. We evaluate twenty interestingness measures and show that, when used to generate "closed interesting" itemsets and to select parent nodes, Mutual Information, Added Value, Yule's Q and Chi-Square offer the best clustering performance.
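As an illustration of the four measures named in the abstract, the sketch below computes them for a single pair of items from a 2x2 co-occurrence table. It uses the common textbook definitions and is not taken from the paper: the function names, the directional form of Added Value, the unnormalized form of Mutual Information, the zero-denominator convention and the toy documents are all assumptions, and the paper's treatment of itemsets with more than two items is not reproduced here.

```python
import math

def contingency(docs, a, b):
    """Count the 2x2 table (f11, f10, f01, f00) for items a and b.

    docs is a list of sets of items (hypothetical toy representation).
    """
    f11 = sum(1 for d in docs if a in d and b in d)
    f10 = sum(1 for d in docs if a in d and b not in d)
    f01 = sum(1 for d in docs if a not in d and b in d)
    f00 = len(docs) - f11 - f10 - f01
    return f11, f10, f01, f00

def added_value(f11, f10, f01, f00):
    """Directional Added Value: P(b | a) - P(b)."""
    n = f11 + f10 + f01 + f00
    if n == 0 or f11 + f10 == 0:
        return 0.0  # assumed convention when a never occurs
    return f11 / (f11 + f10) - (f11 + f01) / n

def yules_q(f11, f10, f01, f00):
    """Yule's Q: (f11*f00 - f10*f01) / (f11*f00 + f10*f01)."""
    ad, bc = f11 * f00, f10 * f01
    if ad + bc == 0:
        return 0.0  # assumed convention for a degenerate table
    return (ad - bc) / (ad + bc)

def chi_square(f11, f10, f01, f00):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = f11 + f10 + f01 + f00
    den = (f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00)
    if den == 0:
        return 0.0  # assumed convention when a marginal count is zero
    return n * (f11 * f00 - f10 * f01) ** 2 / den

def mutual_information(f11, f10, f01, f00):
    """Unnormalized mutual information (in bits) between the two indicators."""
    n = f11 + f10 + f01 + f00
    rows = (f11 + f10, f01 + f00)
    cols = (f11 + f01, f10 + f00)
    cells = ((f11, 0, 0), (f10, 0, 1), (f01, 1, 0), (f00, 1, 1))
    mi = 0.0
    for count, i, j in cells:
        if count > 0:  # empty cells contribute nothing
            mi += (count / n) * math.log2(count * n / (rows[i] * cols[j]))
    return mi

# Toy documents, each reduced to a set of terms (made-up data).
docs = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}, {"c"}]
table = contingency(docs, "a", "b")
scores = {
    "added_value": added_value(*table),
    "yules_q": yules_q(*table),
    "chi_square": chi_square(*table),
    "mutual_information": mutual_information(*table),
}
```

All four scores are zero when the two items occur independently and grow as co-occurrence departs from independence, so each could serve as a threshold for keeping or discarding a candidate pair.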