Shape pattern matching: A tool to cluster unstructured text documents

Authors:
Durga Toshniwal;Rishiraj Saha Roy
Affiliations:
(Corresponding e-mail: durgafec@iitr.ernet.in) Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee - 247 667, Uttarakhand, India;Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee - 247 667, Uttarakhand, India
Venue:
Journal of Computational Methods in Sciences and Engineering - Special Supplement Issue in Section A and B: Selected Papers from the ISCA International Conference on Software Engineering and Data Engineering, 2009
Year:
2010

Abstract

Research in text mining has grown in importance with the large increase in the number of electronic news articles, books, research papers, and e-mail messages. Clustering organizes text documents in an unsupervised fashion. In this paper, we propose an algorithm for clustering unstructured text documents using shape pattern matching. The Vector Space Model is used to represent the dataset as a term-weight matrix. The high-dimensional vector space is mapped to a two-dimensional plane in which the term weights are plotted against a time axis, so that each text document is represented as a time sequence. The documents are first grouped broadly into categories that are determined using domain knowledge. The relevant portion of each document vector is then clipped out, and the shape patterns present in these clipped portions are observed. These shape patterns are indexed by preparing an alphabet for them. Grouping the documents within a category that share the same shape pattern yields the required clusters.
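The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, how the relevant portion of a vector is chosen, or what the shape alphabet looks like, so everything concrete here is an assumption made for illustration: raw term counts stand in for the term weights, a hypothetical `category_slices` table says which vocabulary positions are relevant to each category, and the alphabet has three symbols (`U` for a rise, `D` for a fall, `S` for no change between consecutive weights). It is not the authors' implementation.

```python
from collections import Counter, defaultdict


def term_weights(doc, vocabulary):
    """Represent a document as a sequence of weights, one per vocabulary term.

    Assumption: raw term counts are used as weights. The vocabulary order
    plays the role of the time axis.
    """
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]


def shape_word(sequence):
    """Encode the shape of a sequence over an assumed alphabet {U, D, S}."""
    symbols = []
    for prev, curr in zip(sequence, sequence[1:]):
        if curr > prev:
            symbols.append("U")  # weight rises
        elif curr < prev:
            symbols.append("D")  # weight falls
        else:
            symbols.append("S")  # weight stays the same
    return "".join(symbols)


def cluster_by_shape(docs, categories, vocabulary, category_slices):
    """Group documents that share both a category and a shape word."""
    clusters = defaultdict(list)
    for doc_id, (doc, category) in enumerate(zip(docs, categories)):
        weights = term_weights(doc, vocabulary)
        start, stop = category_slices[category]  # hypothetical clipping rule
        clipped = weights[start:stop]
        clusters[(category, shape_word(clipped))].append(doc_id)
    return dict(clusters)


# Toy data: the vocabulary, categories, and slices are all made up.
vocabulary = ["goal", "match", "team", "stock", "market", "profit"]
category_slices = {"sports": (0, 3), "finance": (3, 6)}
docs = [
    "team won the match match with a late goal",
    "goal after goal but the team lost",
    "the match match was decided by one goal and one team",
    "stock market market profit profit profit",
]
categories = ["sports", "sports", "sports", "finance"]

clusters = cluster_by_shape(docs, categories, vocabulary, category_slices)
for key, members in clusters.items():
    print(key, members)
```

In this toy run, documents 0 and 2 have the same clipped sequence within the sports category and therefore the same shape word, so they fall into one cluster, while document 1 forms its own. In practice the category labels would come from domain knowledge, as the abstract states, and the choice of alphabet determines how coarse the resulting clusters are.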