Enhancing Document Clustering through Heuristics and Summary-Based Pre-processing

Authors:
Sri Harsha Allamraju;Robert Chun
Affiliations:
Department of Computer Science, San Jose State University, San Jose, CA 95192;Department of Computer Science, San Jose State University, San Jose, CA 95192
Venue:
Proceedings of the Symposium on Human Interface 2009 on Human Interface and the Management of Information. Information and Interaction. Part II: Held as part of HCI International 2009
Year:
2009

Abstract

Knowledge workers are burdened with information overload. The information they need may be scattered across many places: buried in a file system, in their email, or on the web. Traditional clustering algorithms help assimilate these diverse sources of information and identify meaningful relationships among them. Typical clustering pre-processing involves tokenization, stop-word removal, stemming, pruning, and similar steps. In this paper, we propose using a document's summary and heuristics as a pre-processing technique. The technique preserves a document's formatting and uses that information to produce better clusters. In addition, only a summary of the document, rather than the whole document, serves as the basis for clustering. On formatted documents, clustering algorithms that used the proposed pre-processing technique produced improved and more meaningful clusters.
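
For readers unfamiliar with the conventional pre-processing steps the abstract lists (tokenization, stop-word removal, stemming, pruning), the following is a minimal Python sketch of that baseline pipeline. It is not the authors' implementation and does not cover the proposed summary-based technique. The stop-word list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), the `min_doc_freq` parameter, and the reading of "pruning" as dropping terms that occur in too few documents are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(documents, min_doc_freq=2):
    """Tokenize, remove stop words, stem, then prune rare terms.

    Pruning here means dropping any term that appears in fewer than
    `min_doc_freq` documents (one common interpretation, assumed here).
    """
    processed = [
        [crude_stem(t) for t in tokenize(doc) if t not in STOP_WORDS]
        for doc in documents
    ]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for terms in processed for term in set(terms))
    return [[t for t in terms if doc_freq[t] >= min_doc_freq] for terms in processed]


docs = [
    "Clustering documents is useful.",
    "The clusters of a document collection.",
    "Email and web documents are scattered.",
]
result = preprocess(docs)
print(result)
```

The resulting term lists would typically be turned into term-frequency vectors before being handed to a clustering algorithm. The approach described in the abstract would apply such steps to a summary of each document rather than to its full text.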