Document databases may be ill-formed, containing redundant and poorly organized documents. For example, a database of customers' descriptions of problems with products and the vendor's descriptions of their resolution may contain many descriptions of the same problem. A highly desirable goal is to transform the database into a concise set of summarized reports (model cases), which in turn are more amenable to search and problem resolution without expert intervention. In this paper, we describe techniques for automating the reduction of a database to its essential components. Our initial application is self-help for resolution of product problems. We describe a lightweight document clustering method that operates in high dimensionality, processing tens of thousands of documents and grouping them into several thousand clusters. We also describe techniques for summarization and exemplar selection that further refine the database contents. The method has been evaluated on a database of over 100,000 customer-service problem reports, which are reduced to 3,000 clusters and 5,000 exemplar documents. Preliminary results are promising and demonstrate efficient clustering performance with excellent group similarity measures, reducing the original database size by several orders of magnitude.
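
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea of grouping near-duplicate problem reports and picking one exemplar per group. It assumes a bag-of-words representation, cosine similarity, and a greedy single-pass assignment against each cluster's first document. The function names (`vectorize`, `cosine`, `cluster`, `exemplar`) and the `threshold` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering (an assumed stand-in, not the paper's method).

    Each document joins the cluster whose seed (first member) is most
    similar to it, provided that similarity reaches `threshold`;
    otherwise it starts a new cluster.
    """
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of document indices, seed first
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, vectors[members[0]])
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters, vectors


def exemplar(members, vectors):
    """Pick the member with the highest mean similarity to the other members."""
    def mean_sim(i):
        others = [j for j in members if j != i]
        if not others:
            return 1.0
        return sum(cosine(vectors[i], vectors[j]) for j in others) / len(others)
    return max(members, key=mean_sim)


if __name__ == "__main__":
    reports = [
        "printer will not print",
        "printer will not print pages",
        "screen is blank",
        "screen is blank after boot",
    ]
    groups, vecs = cluster(reports, threshold=0.5)
    for members in groups:
        print(members, "exemplar:", reports[exemplar(members, vecs)])
```

Comparing each document only against cluster seeds keeps the work proportional to the number of clusters rather than the number of documents seen so far, which is one plausible way to stay lightweight at the scale the abstract mentions. The result depends on document order and on the chosen threshold.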