Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering

  • Authors:
  • Yingbo Miao;Vlado Kešelj;Evangelos Milios

  • Affiliations:
  • Dalhousie University;Dalhousie University;Dalhousie University

  • Venue:
  • Proceedings of the 14th ACM international conference on Information and knowledge management
  • Year:
  • 2005

Abstract

We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, documents are represented as vectors in which each dimension corresponds to a word. We instead propose a document representation based on the most frequent character N-grams, with a window size of up to 10 characters. We derive a new distance measure, which produces uniformly better results than the word-based and term-based methods. The result is more significant in light of the robustness of the N-gram method, which requires no language-dependent preprocessing. Experiments on the performance of a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both word and term representations. The comparison between word and term representations depends on the data set and the selected dimensionality.
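
The representation described in the abstract can be illustrated with a minimal Python sketch. It builds a profile of the most frequent character N-grams for each document and compares two profiles with a squared relative-difference distance. The abstract does not state the formula of the new distance measure, so the distance used here is an assumption chosen for illustration, and the names `ngram_profile`, `profile_distance`, and the `size` parameter are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, size=1000):
    """Return normalized frequencies of the `size` most frequent character n-grams."""
    # Slide a window of n characters over the raw text; no tokenization,
    # stemming, or other language-dependent preprocessing is applied.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {g: c / total for g, c in counts.most_common(size)}


def profile_distance(p, q):
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumption: this formula is illustrative and is not taken from the abstract.
    """
    dist = 0.0
    for g in set(p) | set(q):
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        # fp + fq > 0 because g occurs in at least one profile.
        dist += ((fp - fq) / ((fp + fq) / 2.0)) ** 2
    return dist


if __name__ == "__main__":
    docs = [
        "document clustering with character n-grams",
        "clustering documents using character n-grams",
        "stock prices fell sharply on monday",
    ]
    profiles = [ngram_profile(d, n=3) for d in docs]
    # Pairwise distances like these could be passed to any clustering algorithm.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            print(i, j, round(profile_distance(profiles[i], profiles[j]), 2))
```

Because each N-gram that appears in only one profile contributes a fixed penalty of 4 under this assumed distance, documents that share little character-level vocabulary end up far apart regardless of language.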