Enhancing Document Clustering through Heuristics and Summary-Based Pre-processing

  • Authors:
  • Sri Harsha Allamraju;Robert Chun

  • Affiliations:
  • Department of Computer Science, San Jose State University, San Jose, CA 95192;Department of Computer Science, San Jose State University, San Jose, CA 95192

  • Venue:
  • Proceedings of the Symposium on Human Interface 2009 on Human Interface and the Management of Information. Information and Interaction. Part II: Held as part of HCI International 2009
  • Year:
  • 2009

Abstract

Knowledge workers are burdened with information overload. The information they need may be scattered across many places: buried in a file system, in their email, or on the web. Traditional clustering algorithms help assimilate these wide-ranging sources of information and generate meaningful relationships among them. Typical clustering pre-processing involves tokenization, stop-word removal, stemming, and pruning. In this paper, we propose using a document's summary and heuristics as a pre-processing technique. This technique preserves the formatting of a document and uses this information to produce better clusters. In addition, only a summary of the document, rather than the whole document, is used as the basis for clustering. Applied to formatted documents, clustering algorithms using the proposed pre-processing technique produced improved and more meaningful clusters.
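
The pre-processing steps named in the abstract (tokenization, stop-word removal, stemming, pruning) and the idea of clustering on a summary can be illustrated with a minimal Python sketch. The abstract does not specify the paper's summarizer or its formatting heuristics, so everything here is an assumption made for illustration: the lead-sentence summarizer `lead_summary`, the suffix-stripping `crude_stem`, the tiny `STOP_WORDS` list, and the `title_weight` heuristic are all hypothetical and are not the authors' method.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "or", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lead_summary(text, max_sentences=2):
    """Hypothetical summarizer: keep only the first few sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def term_vector(title, body, title_weight=3, min_count=1):
    """Build a term-count vector from the summary, up-weighting title terms.

    The title weight is an assumed formatting heuristic. Pruning drops terms
    whose weighted count falls below min_count.
    """
    counts = Counter(preprocess(lead_summary(body)))
    for term in preprocess(title):
        counts[term] += title_weight
    return {t: c for t, c in counts.items() if c >= min_count}


vec = term_vector(
    "Clustering documents",
    "Clustering groups similar documents. It is used in search. "
    "This third sentence is ignored.",
)
print(vec)
```

The resulting vectors could then be passed to any standard clustering algorithm. Because only the summary and the title contribute terms, the vectors stay small, and terms from prominently formatted text carry more weight.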