Criterion functions for document clustering

Authors:
George Karypis; Ding-Zhu Du; Ying Zhao
Affiliations:
University of Minnesota; University of Minnesota; University of Minnesota
Venue:
Criterion functions for document clustering
Year:
2005

Abstract

Fast, high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. This thesis focuses on a class of clustering algorithms that treat clustering as an optimization process, seeking to maximize or minimize a particular criterion function defined over the entire clustering solution. We present a comprehensive study of the desirable characteristics and feasibility of various criterion functions under the different clustering requirements raised by real-world applications. In particular, we focus on seven global criterion functions for clustering large document datasets, three of which are introduced by us. The first part of this thesis consists of a detailed experimental evaluation using 15 different datasets and three different partitional clustering approaches, followed by a theoretical analysis of the characteristics of the various criterion functions. Our analysis shows that criterion functions that are more robust to differences in cluster tightness and that produce more balanced clusters tend to perform well. Our three new criterion functions are among those achieving the best overall results.
We further discuss how the various criterion functions perform when producing hierarchical and soft clustering solutions. For hierarchical clustering, we present a comprehensive experimental evaluation of six partitional and nine agglomerative hierarchical clustering methods using twelve datasets. We also propose a new class of agglomerative algorithms, constrained agglomerative algorithms, which achieves the best results. For soft clustering, we focus on four criterion functions, derive their soft-clustering extensions, present a comprehensive experimental evaluation involving twelve different datasets, and analyze their overall characteristics.
Finally, we extend the various criterion functions to incorporate prior knowledge about the natural topics present in a dataset. Specifically, we define the problem of topic-driven clustering, which organizes a document collection according to a given set of topics. We propose three topic-driven schemes that simultaneously consider the similarity between documents and topics and the relationships among the documents themselves. Our experimental results show that the proposed topic-driven schemes are efficient and effective with topic prototypes of different levels of specificity.
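To make the idea of clustering as optimization of a global criterion function concrete, here is a minimal sketch. The abstract does not list the seven criterion functions, so the code assumes one illustrative criterion: the sum, over clusters, of the norm of each cluster's composite (summed) vector. For unit-length document vectors, the squared norm of a composite vector equals the sum of pairwise cosine similarities within that cluster, so a larger value means more cohesive clusters. The names `normalize`, `criterion`, and `refine`, the toy data, and the greedy refinement loop are all assumptions made for illustration, not the thesis's actual algorithms.

```python
import math

def normalize(v):
    """Scale a term vector to unit length (assumed preprocessing)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def criterion(docs, labels, k):
    """Illustrative global criterion: sum of composite-vector norms."""
    dim = len(docs[0])
    composites = [[0.0] * dim for _ in range(k)]
    for d, c in zip(docs, labels):
        for j, x in enumerate(d):
            composites[c][j] += x
    return sum(math.sqrt(sum(x * x for x in comp)) for comp in composites)

def refine(docs, labels, k, max_passes=10):
    """Greedy refinement: keep any single-document move that raises the criterion."""
    labels = list(labels)
    best = criterion(docs, labels, k)
    for _ in range(max_passes):
        moved = False
        for i in range(len(docs)):
            for c in range(k):
                old = labels[i]
                if c == old:
                    continue
                labels[i] = c
                score = criterion(docs, labels, k)
                if score > best + 1e-12:
                    best = score      # accept the move
                    moved = True
                else:
                    labels[i] = old   # revert the move
        if not moved:
            break                     # local optimum reached
    return labels

# Toy data: four documents over three terms (hypothetical counts).
docs = [normalize(v) for v in ([2, 1, 0], [3, 1, 0], [0, 1, 3], [0, 2, 3])]
start = [0, 1, 0, 1]                  # deliberately poor initial partition
final = refine(docs, start, k=2)
print(final, criterion(docs, start, 2), criterion(docs, final, 2))
```

On this toy input the refinement groups the first two documents together and the last two together, and the criterion value increases relative to the starting partition. Swapping in a different `criterion` function changes which partitions are preferred, which is the kind of behavior the thesis compares across criterion functions.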