Research in text mining has grown in importance with the rapid increase in electronic news articles, books, research papers, and e-mail messages. Clustering organizes text documents in an unsupervised fashion. In this paper, we propose an algorithm for clustering unstructured text documents using shape pattern matching. The Vector Space Model is used to represent the dataset as a term-weight matrix. The high-dimensional vector space is then mapped to a two-dimensional plane in which the term weights are plotted against a time axis, so that each text document is represented as a time sequence. The documents are first grouped broadly into categories determined from domain knowledge. The relevant portion of each document vector is then clipped out, and the shape patterns present in the clipped portions are observed. These shape patterns are indexed by constructing an alphabet for them. Grouping the documents within a category that share the same shape pattern yields the required clusters.
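The steps above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three-symbol alphabet (`U` for a rise, `D` for a fall, `F` for flat), the tolerance `tol`, the function names, the per-category lists of relevant term indices, and the toy term weights are all assumptions made for illustration, since the abstract does not specify the actual alphabet or clipping rule.

```python
from collections import defaultdict


def shape_string(weights, tol=0.05):
    """Encode a sequence of term weights as a string over the assumed
    alphabet U (rise), D (fall), F (flat), one symbol per adjacent pair."""
    symbols = []
    for prev, curr in zip(weights, weights[1:]):
        delta = curr - prev
        if delta > tol:
            symbols.append("U")
        elif delta < -tol:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)


def cluster_by_shape(doc_vectors, categories, relevant_terms, tol=0.05):
    """Group documents that share both a category and a shape string.

    doc_vectors:    doc id -> list of term weights (one row of the matrix)
    categories:     doc id -> broad category assigned from domain knowledge
    relevant_terms: category -> term indices to clip out (hypothetical rule)
    """
    clusters = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        category = categories[doc_id]
        # Clip out the portion of the vector relevant to this category.
        clipped = [vector[i] for i in relevant_terms[category]]
        clusters[(category, shape_string(clipped, tol))].append(doc_id)
    return dict(clusters)


# Toy term-weight matrix (illustrative values, not from the paper).
doc_vectors = {
    "d1": [0.1, 0.5, 0.2, 0.0],
    "d2": [0.2, 0.7, 0.3, 0.0],
    "d3": [0.6, 0.1, 0.1, 0.0],
    "d4": [0.0, 0.0, 0.3, 0.8],
}
categories = {"d1": "sports", "d2": "sports", "d3": "sports", "d4": "finance"}
relevant_terms = {"sports": [0, 1, 2], "finance": [2, 3]}

clusters = cluster_by_shape(doc_vectors, categories, relevant_terms)
for key, members in clusters.items():
    print(key, members)
```

In this toy run, `d1` and `d2` have different weights but the same rise-then-fall shape, so they fall into one cluster, while `d3` stays in the same category but forms a separate cluster. Matching on the symbol string rather than on raw weights is what makes the grouping depend on shape instead of magnitude.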