A Variant of N-Gram Based Language Classification

  • Authors:
  • Andrija Tomović; Predrag Janičić

  • Affiliations:
  • Friedrich Miescher Institute for Biomedical Research, Part of the Novartis Research Foundation, Maulbeerstrasse 66, CH-4058 Basel, Switzerland; Faculty of Mathematics, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia

  • Venue:
  • AI*IA '07 Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence on AI*IA 2007: Artificial Intelligence and Human-Oriented Computing
  • Year:
  • 2007

Abstract

Rapid classification of documents by language is of high importance in many multilingual settings (such as international institutions or Internet search engines). This problem has been well known for years and has been addressed by various techniques with excellent results. We address it with a simple n-gram-based technique, a variant of existing techniques of this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification, and even for short text strings. We give a detailed study for different string lengths and n-gram sizes, and we explore which classification parameters give the best performance. The approach requires no vocabularies, only a few training documents. As the main corpus, we used an EU set of documents in 20 languages. An experimental comparison shows that our approach gives better results than four other popular approaches.
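
To make the general idea concrete, the following is a minimal sketch of character n-gram language classification. It is not the specific variant proposed in the paper, whose details are not given in the abstract. The function names (`ngram_profile`, `cosine`, `classify`), the use of cosine similarity, the trigram default, and the tiny training sentences are all assumptions chosen for illustration.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of overlapping character n-grams."""
    # Normalise case and whitespace, and pad so word boundaries are captured.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def classify(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    query = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data; a real system would use whole documents.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat sleeps in the house",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze schläft im haus",
    "italian": "la volpe veloce salta sopra il cane pigro "
               "e il gatto dorme nella casa",
}

profiles = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    for snippet in ("der hund schläft im haus", "il gatto dorme nella casa"):
        print(snippet, "->", classify(snippet, profiles))
```

The sketch needs no vocabulary, only one training text per language, which mirrors the requirement stated in the abstract. The n-gram size `n` and the length of the classified string are the kind of parameters the abstract says the paper studies; the values used here are arbitrary.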