Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering

Authors:
Yingbo Miao;Vlado Kešelj;Evangelos Milios
Affiliations:
Dalhousie University;Dalhousie University;Dalhousie University
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005

Citing 1
Cited 1

Text Mining with Information-Theoretic Clustering

Computing in Science and Engineering

MALEF: Framework for distributed machine learning and data mining

International Journal of Intelligent Information and Database Systems

Abstract

We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, documents are represented as vectors in which each dimension corresponds to a word. We propose a document representation based on the most frequent character N-grams, with a window size of up to 10 characters. We derive a new distance measure, which produces uniformly better results than the word-based and term-based methods. The result is all the more significant given the robustness of the N-gram method, which requires no language-dependent preprocessing. Experiments with a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both word and term representations. The comparison between word and term representations depends on the data set and the selected dimensionality.
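
To make the representation concrete, the following is a minimal Python sketch of building a profile from a document's most frequent character N-grams and comparing two profiles. The abstract does not state the distance formula, so the squared relative-difference dissimilarity below is an assumption chosen for illustration and is not necessarily the measure derived in the paper. The names `ngram_profile`, `profile_distance`, and `top_k` are likewise hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=1000):
    """Map the top_k most frequent character n-grams to relative frequencies.

    No tokenization, stemming, or stop-word removal is applied: the window
    simply slides over the raw characters, spaces included.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(top_k)}


def profile_distance(p1, p2):
    """Assumed dissimilarity (illustrative only, not taken from the abstract).

    Sums the squared relative frequency difference over every n-gram that
    appears in either profile; an n-gram missing from a profile counts as 0.
    """
    dist = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because gram occurs in at least one profile.
        dist += ((f1 - f2) / ((f1 + f2) / 2.0)) ** 2
    return dist


if __name__ == "__main__":
    a = ngram_profile("document clustering with character n-grams")
    b = ngram_profile("clustering documents using word vectors")
    print(profile_distance(a, b))
```

Under this assumed measure, identical profiles have distance 0, and every N-gram present in only one profile contributes a fixed penalty of 4. A pairwise distance matrix built this way could then be passed to any standard clustering algorithm.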