The purpose of text clustering in information retrieval is to discover groups of semantically related documents. Accurate and comprehensible cluster descriptions (labels) let the user grasp the content of a collection faster and are essential for document browsing interfaces. Creating descriptive, sensible cluster labels is difficult: typical text clustering algorithms focus on optimizing the proximity between documents inside a cluster and rely on a keyword representation to describe the discovered clusters. In the approach called Description Comes First (DCF), cluster labels are as important as document groups: DCF promotes machine discovery of comprehensible candidate cluster labels, which are later used to discover related document groups. In this paper we describe an application of DCF to the k-Means algorithm, including the results of experiments performed on the 20-newsgroups document collection. The experimental evaluation showed that DCF does not decrease the metrics used to assess the quality of document assignment and offers good cluster labels in return. The algorithm uses a search engine's data structures directly to scale to large document collections.
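The label-first idea can be illustrated with a small sketch. It is not the paper's implementation: the function names (`tf_vector`, `kmeans`, `label_then_assign`), the raw term-frequency weighting, the fixed seed documents, and the toy data are all assumptions made for illustration, and the final "query by label" step is only a stand-in for a real search engine lookup. The sketch runs plain k-Means, picks for each centroid the most similar candidate label, and then rebuilds each group from the documents that match its label.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting, not the paper's)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, seeds, iterations=10):
    """Plain k-Means with cosine similarity; `seeds` are indices of the
    documents used as initial centroids (fixed here for reproducibility)."""
    centroids = [dict(vectors[i]) for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return centroids, assignment


def label_then_assign(docs, candidate_labels, seeds):
    """Choose one candidate label per centroid, then rebuild each group from
    the documents that match the label (a stand-in for a search query).
    If two centroids choose the same label, the later one overwrites the
    earlier one; a fuller version would have to resolve that."""
    vectors = [tf_vector(d) for d in docs]
    centroids, _ = kmeans(vectors, seeds)
    label_vecs = {lab: tf_vector(lab) for lab in candidate_labels}
    clusters = {}
    for c in centroids:
        best = max(candidate_labels, key=lambda lab: cosine(label_vecs[lab], c))
        clusters[best] = [
            i for i, v in enumerate(vectors) if cosine(label_vecs[best], v) > 0.0
        ]
    return clusters


docs = [
    "space shuttle launch nasa orbit",
    "nasa space shuttle mission orbit",
    "hockey team wins playoff game",
    "playoff game hockey team season",
]
labels = ["space shuttle", "hockey team", "graphics card"]
print(label_then_assign(docs, labels, seeds=[0, 2]))
# {'space shuttle': [0, 1], 'hockey team': [2, 3]}
```

In this sketch the final groups come from the labels rather than from the k-Means assignment itself, which is why a label always describes the documents placed under it. The candidate label "graphics card" matches no centroid and is simply never used.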