Hierarchical document clustering using local patterns

Authors:
Hassan H. Malik;John R. Kender;Dmitriy Fradkin;Fabian Moerchen
Affiliations:
Thomson Reuters, New York, USA 10007;Columbia University, New York, USA 10027;Siemens Corporate Research, Princeton, USA 08540;Siemens Corporate Research, Princeton, USA 08540
Venue:
Data Mining and Knowledge Discovery
Year:
2010


Abstract

The global pattern mining step in existing pattern-based hierarchical clustering algorithms may produce an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to "vote" for its representative size-2 patterns in a way that balances local pattern frequency against pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then constructed directly, using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained through an iterative cluster refinement process. By exploiting instance-to-cluster relationships, this process directly identifies the clusters for each level of the hierarchy and efficiently prunes duplicate clusters. Furthermore, IDHC produces more descriptive cluster labels (patterns are not artificially restricted) and adopts a soft clustering scheme that allows instances to exist in suitable nodes at various levels of the cluster hierarchy. We present results of experiments performed on 16 standard text datasets and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.
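
The voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual scoring function, so everything specific here is an assumption made for illustration: the function names `vote_for_local_patterns` and `initial_clusters`, the `top_k` parameter, and the score itself (the product of the two terms' in-document frequencies, multiplied by the lift of the term pair over the collection). The sketch only shows the general shape of the idea: each document picks its own best size-2 patterns, and each selected pattern seeds a possibly overlapping initial cluster.

```python
from collections import Counter
from itertools import combinations


def vote_for_local_patterns(docs, top_k=2):
    """Let each document vote for its top_k size-2 term patterns.

    Hypothetical scoring, NOT taken from the paper:
      score = local weight * global significance
    where the local weight is the product of the two terms' frequencies
    inside the document, and the global significance is the lift of the
    term pair over the whole collection.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing each term pair
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))

    votes = {}
    for doc_id, doc in enumerate(docs):
        tf = Counter(doc)
        scored = []
        for a, b in combinations(sorted(tf), 2):
            local = tf[a] * tf[b]
            lift = (pair_df[(a, b)] * n_docs) / (term_df[a] * term_df[b])
            scored.append((local * lift, (a, b)))
        # Highest score first; ties are broken alphabetically by pattern.
        scored.sort(key=lambda item: (-item[0], item[1]))
        votes[doc_id] = [pair for _, pair in scored[:top_k]]
    return votes


def initial_clusters(votes):
    """Each voted pattern forms an initial, possibly overlapping, cluster."""
    clusters = {}
    for doc_id, pairs in votes.items():
        for pair in pairs:
            clusters.setdefault(pair, set()).add(doc_id)
    return clusters


if __name__ == "__main__":
    docs = [
        ["apple", "apple", "pie", "pie", "oven"],
        ["apple", "pie", "recipe"],
        ["stock", "market", "stock", "market", "bank"],
    ]
    votes = vote_for_local_patterns(docs, top_k=1)
    for pattern, members in initial_clusters(votes).items():
        print(pattern, sorted(members))
```

Because patterns are chosen per document rather than by a global support threshold, the number of candidate patterns in this sketch is bounded by `top_k` times the number of documents. The iterative refinement that builds the remaining hierarchy levels is not shown, since the abstract does not describe it in enough detail to sketch.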