Text document clustering based on frequent word sequences

Authors:
Yanjun Li;Soon M. Chung
Affiliations:
Wright State University, Dayton, OH;Wright State University, Dayton, OH
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005

Abstract

In this paper, we propose a new text clustering algorithm, named Clustering based on Frequent Word Sequences (CFWS). A word sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. In the past, the vector space model was commonly used for information retrieval, but it treats documents as bags of words, ignoring the sequential pattern of word occurrences in the documents. However, the meaning of natural language depends strongly on word sequences, and frequent word sequences can provide compact and valuable information about the text database. The bisecting k-means and FIHC algorithms are evaluated on text clustering performance and compared with the proposed CFWS algorithm. The results show that CFWS performs much better.
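The definition of a frequent word sequence, and the contrast with the bag-of-words view, can be illustrated with a minimal Python sketch. This is not the CFWS algorithm itself: the function name `frequent_word_sequences`, the `min_support` and `length` parameters, the restriction to contiguous fixed-length sequences, and the toy documents are all assumptions made for illustration only.

```python
from collections import Counter


def frequent_word_sequences(docs, min_support, length=2):
    """Return contiguous word sequences of the given length whose document
    support (fraction of documents containing them) exceeds min_support.

    Hypothetical helper: contiguity and fixed length are simplifying
    assumptions, not details taken from the paper.
    """
    doc_counts = Counter()
    for text in docs:
        words = text.lower().split()
        # Use a set so each sequence is counted at most once per document.
        seqs = {tuple(words[i:i + length]) for i in range(len(words) - length + 1)}
        doc_counts.update(seqs)
    return {seq for seq, n in doc_counts.items() if n / len(docs) > min_support}


# Toy text database (invented for this example).
docs = [
    "dog bites man",
    "man bites dog",
    "dog bites man again",
]

# Bag-of-words view: the first two documents are indistinguishable.
same_bag = Counter(docs[0].split()) == Counter(docs[1].split())

# Word-sequence view: keep sequences found in more than half the documents.
frequent = frequent_word_sequences(docs, min_support=0.5)
```

Here the first two documents have identical word counts, so a bag-of-words representation cannot tell them apart, while the two-word sequences they contain differ. Only the sequences shared by the first and third documents pass the 50% support threshold.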