Leightweight Document Clustering

Authors:
Sholom M. Weiss;Brian F. White;Chidanand Apté
Affiliations:
-;-;-
Venue:
PKDD '00 Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2000

Citing 7
Cited 2

Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Techniques

Readings in information retrieval
Term-weighting approaches in automatic text retrieval

Readings in information retrieval
Using interdocument similarity information in document retrieval systems

Readings in information retrieval
Fast and effective text mining using linear-time document clustering

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Lightweight Document Matching for Help-Desk Applications

IEEE Intelligent Systems
Model Selection in Unsupervised Learning with Applications To Document Clustering

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning

Lightweight Collaborative Filtering Method for Binary-Encoded Data

PKDD '01 Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery
An Empirical Comparison of Machine Learning Techniques in Predicting the Bug Severity of Open and Closed Source Projects

International Journal of Open Source Software and Processes

Quantified Score

Hi-index	0.00

Visualization

Abstract

A lightweight document clustering method is described that operates in high dimensions, processes tens of thousands of documents and groups them into several thousand clusters, or by varying a single parameter, into a few dozen clusters. The method uses a reduced indexing view of the original documents, where only the k best keywords of each document are indexed. An efficient procedure for clustering is speci fied in two parts (a) compute k most similar documents for each document in the collection and (b) group the documents into clusters using these similarity scores. The method has been evaluated on a database of over 50,000 customer service problem reports that are reduced to 3,000 clusters and 5,000 exemplar documents. Results demonstrate efficient clustering performance with excellent group similarity measures.