N-grams and morphological normalization in text classification: a comparison on a Croatian-English parallel corpus

Authors:
Artur Šilić;Jean-Hugues Chauchat;Bojana Dalbelo Bašić;Annie Morin
Affiliations:
University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Zagreb, Croatia;Université de Lyon 2, Faculté de Sciences Economique et de Gestion, Laboratoire Eric, Bron Cedex, France;University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Zagreb, Croatia;Université de Rennes 1, IRISA, Rennes Cedex, France
Venue:
EPIA'07 Proceedings of the artificial intelligence 13th Portuguese conference on Progress in artificial intelligence
Year:
2007

Abstract

In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
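
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character-level n-grams with n = 3, and the function names and the two-word example are hypothetical. It shows how two inflected forms of one Croatian word share no word-level feature yet share most of their character trigrams, and how the n-gram representation produces more features per document.

```python
from collections import Counter


def word_features(text):
    """Bag-of-words counts from lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def char_ngram_features(text, n=3):
    """Counts of overlapping character n-grams (n=3 is an assumed setting)."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


# Hypothetical example: two inflected forms of the Croatian word for "book".
docs = ["knjiga", "knjige"]
words = [word_features(d) for d in docs]
grams = [char_ngram_features(d) for d in docs]

# Without morphological normalization the two word forms are distinct features.
shared_words = set(words[0]) & set(words[1])

# Character trigrams overlap on the common stem.
shared_grams = set(grams[0]) & set(grams[1])

print("shared word features:", sorted(shared_words))
print("shared trigram features:", sorted(shared_grams))
print("features per document (words vs trigrams):", len(words[0]), len(grams[0]))
```

Even in this toy case each one-word document yields one word feature but four trigram features, which hints at the higher time and memory cost the abstract reports for n-grams.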