N-grams and morphological normalization in text classification: a comparison on a Croatian-English parallel corpus

  • Authors:
  • Artur Šilić;Jean-Hugues Chauchat;Bojana Dalbelo Bašić;Annie Morin

  • Affiliations:
  • University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Zagreb, Croatia;Université de Lyon 2, Faculté de Sciences Economique et de Gestion, Laboratoire Eric, Bron Cedex, France;University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Zagreb, Croatia;Université de Rennes 1, IRISA, Rennes Cedex, France

  • Venue:
  • EPIA'07 Proceedings of the 13th Portuguese conference on Progress in artificial intelligence
  • Year:
  • 2007

Abstract

In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.