Text Similarity Using Google Tri-grams

  • Authors:
  • Aminul Islam; Evangelos Milios; Vlado Kešelj

  • Affiliations:
  • Faculty of Computer Science, Dalhousie University, Halifax, Canada (all three authors)

  • Venue:
  • Canadian AI'12: Proceedings of the 25th Canadian Conference on Advances in Artificial Intelligence
  • Year:
  • 2012

Abstract

This paper proposes an unsupervised approach for measuring the similarity of texts that can compete with supervised approaches. The approach derives the similarity between texts from a corpus in the form of a word n-gram data set, and it is competitive with other text similarity techniques in both performance and practicality. Experimental results on a standard data set show that the proposed unsupervised method outperforms the state-of-the-art supervised method, and that the improvement is statistically significant at the 0.05 level. The approach is language-independent: it can be applied to other languages as long as n-grams are available.
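
As a rough illustration of the general idea of scoring text similarity from word tri-gram counts, the following minimal sketch may help. It is not the scoring formula from the paper, which the abstract does not give. The toy TRIGRAMS table, the Dice-style relatedness function, and the best-match averaging in text_similarity are all assumptions made for this example only.

```python
from collections import Counter

# Hypothetical toy tri-gram counts (assumption for illustration only);
# a real system would read counts from a large word n-gram data set.
TRIGRAMS = Counter({
    ("the", "cat", "sat"): 40,
    ("a", "cat", "purred"): 10,
    ("the", "kitten", "sat"): 20,
    ("cat", "and", "kitten"): 30,
    ("the", "car", "stopped"): 50,
})


def word_count(word):
    """Total count of tri-grams that contain `word`."""
    return sum(c for tri, c in TRIGRAMS.items() if word in tri)


def cooccurrence(a, b):
    """Total count of tri-grams that contain both `a` and `b`."""
    return sum(c for tri, c in TRIGRAMS.items() if a in tri and b in tri)


def relatedness(a, b):
    """Dice-style word relatedness in [0, 1] (an assumed measure,
    not the one defined in the paper)."""
    if a == b:
        return 1.0
    total = word_count(a) + word_count(b)
    if total == 0:
        return 0.0
    return 2.0 * cooccurrence(a, b) / total


def text_similarity(text_a, text_b):
    """Average, in both directions, of each word's best match
    in the other text."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    if not words_a or not words_b:
        return 0.0

    def directed(src, dst):
        return sum(max(relatedness(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(words_a, words_b) + directed(words_b, words_a))


if __name__ == "__main__":
    print(relatedness("cat", "kitten"))
    print(text_similarity("the cat sat", "the kitten sat"))
```

Because the sketch needs only n-gram counts and no labeled training data, it is unsupervised in the same sense as the approach described above, and the same code would run unchanged on counts for another language.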