Shape pattern matching: A tool to cluster unstructured text documents

  • Authors:
  • Durga Toshniwal; Rishiraj Saha Roy

  • Affiliations:
  • (Correspd. E-mail: durgafec@iitr.ernet.in) Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee - 247 667, Uttarakhand, India; Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee - 247 667, Uttarakhand, India

  • Venue:
  • Journal of Computational Methods in Sciences and Engineering - Special Supplement Issue in Section A and B: Selected Papers from the ISCA International Conference on Software Engineering and Data Engineering, 2009
  • Year:
  • 2010

Abstract

Research in text mining has gained importance with the rapid growth in the number of electronic news articles, books, research papers, and e-mail messages. Clustering organizes text documents in an unsupervised fashion. In this paper, we propose an algorithm for clustering unstructured text documents using shape pattern matching. The Vector Space Model is used to represent the dataset as a term-weight matrix. The high-dimensional vector space is mapped to a two-dimensional plane in which the term weights are plotted against a time axis, so that each text document is represented as a time sequence. The documents are first broadly grouped into categories determined using domain knowledge. The relevant portion of each document vector is then clipped out, and the shape patterns present in the clipped portions are observed. These shape patterns are indexed by preparing their alphabet. Grouping the documents within a category that share the same shape pattern yields the required clusters.
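
The pipeline described in the abstract can be illustrated with a minimal Python sketch for a single category. The abstract does not specify the weighting scheme, the clipping rule, or the shape alphabet, so everything below is an assumption made for illustration: smoothed TF-IDF weights, a caller-supplied clip range (standing in for the domain knowledge), the term index used as the "time" axis, and a three-symbol alphabet (U = up, D = down, S = steady) built from the sign of consecutive weight differences. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def term_weight_matrix(docs, vocabulary):
    """Return one row of smoothed TF-IDF weights per document (assumed scheme)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t in vocabulary
        ])
    return rows


def shape_word(sequence, tolerance=1e-9):
    """Encode a weight sequence with the assumed alphabet U (up), D (down), S (steady)."""
    symbols = []
    for prev, curr in zip(sequence, sequence[1:]):
        delta = curr - prev
        if delta > tolerance:
            symbols.append("U")
        elif delta < -tolerance:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def cluster_by_shape(docs, vocabulary, clip):
    """Group document indices by the shape word of their clipped weight sequence.

    `clip` is a (start, stop) pair of term indices chosen for the category.
    """
    start, stop = clip
    clusters = defaultdict(list)
    for index, row in enumerate(term_weight_matrix(docs, vocabulary)):
        clusters[shape_word(row[start:stop])].append(index)
    return dict(clusters)


vocabulary = ["goal", "match", "team"]
docs = [
    "goal match match team",
    "goal match match match team",
    "goal goal goal match match team",
]
clusters = cluster_by_shape(docs, vocabulary, clip=(0, 3))
print(clusters)  # {'UD': [0, 1], 'DD': [2]}
```

In this toy category the first two documents rise and then fall across the clipped terms, so they share the shape word `UD` and land in the same cluster, while the third falls throughout (`DD`) and forms its own. A coarser or finer alphabet (for example, one that also encodes the magnitude of the change) would change how strictly two shapes have to agree.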