Comprehensible and accurate cluster labels in text clustering

Authors:
Jerzy Stefanowski;Dawid Weiss
Affiliations:
Poznan University of Technology, Poland;Poznan University of Technology, Poland
Venue:
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
Year:
2007

Abstract

The purpose of text clustering in information retrieval is to discover groups of semantically related documents. Accurate and comprehensible cluster descriptions (labels) let the user grasp the collection's content faster and are essential for various document browsing interfaces. The task of creating descriptive, sensible cluster labels is difficult: typical text clustering algorithms focus on optimizing proximity between documents inside a cluster and rely on a keyword representation for describing discovered clusters. In the approach called Description Comes First (DCF), cluster labels are as important as document groups: DCF promotes machine discovery of comprehensible candidate cluster labels, which are later used to discover related document groups. In this paper we describe an application of DCF to the k-Means algorithm, including results of experiments performed on the 20-newsgroups document collection. Experimental evaluation showed that DCF does not decrease the metrics used to assess the quality of document assignment and offers good cluster labels in return. The algorithm uses a search engine's data structures directly to scale to large document collections.
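
The idea of pairing k-Means with label discovery can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not specify the steps, so everything below is an assumption made for illustration: candidate labels are word bigrams found in at least two documents, k-Means runs on raw term counts with cosine similarity and naive evenly spaced seeds, each centroid is described by its most similar candidate label, and documents are then assigned to a label by phrase containment. All function names are hypothetical, and no search engine index is involved.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def candidate_labels(docs, min_df=2):
    """Assumed label discovery: bigrams occurring in at least min_df documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(zip(tokens, tokens[1:])))
    return sorted(" ".join(bg) for bg, n in df.items() if n >= min_df)

def kmeans(vectors, k, iters=10):
    """Plain k-means under cosine similarity; seeds are evenly spaced documents."""
    step = len(vectors) // k
    centroids = [dict(vectors[i * step]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            best = max(range(k), key=lambda c: cosine(v, centroids[c]))
            groups[best].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster becomes empty
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    return centroids

def dcf_kmeans_sketch(texts, k):
    """Map each chosen label to the indices of documents containing that phrase."""
    docs = [t.lower().split() for t in texts]
    labels = candidate_labels(docs)
    if not labels:
        return {}
    centroids = kmeans([Counter(d) for d in docs], k)
    clusters = {}
    for centroid in centroids:
        # Describe the centroid with the most similar candidate label.
        # Two centroids may pick the same label; they then share one group.
        label = max(labels, key=lambda l: cosine(Counter(l.split()), centroid))
        # Documents are assigned by the label itself, not by the centroid.
        clusters[label] = [i for i, d in enumerate(docs)
                           if f" {label} " in f" {' '.join(d)} "]
    return clusters

texts = [
    "space shuttle launch delayed",
    "nasa space shuttle mission",
    "space shuttle crew returns",
    "ice hockey playoff game",
    "ice hockey team wins",
    "ice hockey season opens",
]
result = dcf_kmeans_sketch(texts, k=2)
print(result)
```

On this toy input the two groups are described by the phrases "space shuttle" and "ice hockey" rather than by isolated keywords, which is the kind of label the abstract argues for.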