Language morphology offset: Text classification on a Croatian-English parallel corpus

Authors:
M. Malenica;T. Šmuc;J. Šnajder;B. Dalbelo Bašić
Affiliations:
Division of Electronics, Rudjer Bošković Institute, Bijenička 54, 10000 Zagreb, Croatia;Division of Electronics, Rudjer Bošković Institute, Bijenička 54, 10000 Zagreb, Croatia;University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia;University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia
Venue:
Information Processing and Management: an International Journal
Year:
2008

Abstract

We investigate how, and to what extent, the morphological complexity of a language influences text classification using support vector machines (SVM). The Croatian-English parallel corpus provides the basis for a direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance in a series of parallel experiments on both languages, carried out over a wide range of feature subset sizes obtained by different feature selection methods, and applying different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments show that the improvements in SVM classifier performance due to morphological normalisation are statistically significant; they are greater for small and medium numbers of features, especially for Croatian, whereas for large numbers of features the improvements are rather small and may be negligible in practice for both languages.
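
To make the link between morphological normalisation and feature space size concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a bag-of-words representation and uses two tiny hand-written lexicons (`CROATIAN_LEMMAS`, `ENGLISH_LEMMAS`) that are hypothetical and stand in for a real normalisation resource. The token lists are toy data, not drawn from the corpus used in the paper.

```python
from collections import Counter

# Hypothetical toy lexicons mapping inflected forms to a base form (lemma).
# Illustrative only; these are not the resources used in the paper.
CROATIAN_LEMMAS = {
    "kuća": "kuća", "kuće": "kuća", "kući": "kuća",
    "kuću": "kuća", "kućom": "kuća", "kućama": "kuća",
}
ENGLISH_LEMMAS = {"house": "house", "houses": "house"}


def bag_of_words(tokens, lemmas=None):
    """Count features, optionally mapping each token to its lemma first."""
    if lemmas is None:
        return Counter(tokens)
    # Tokens missing from the lexicon are kept unchanged.
    return Counter(lemmas.get(tok, tok) for tok in tokens)


def feature_space_sizes(tokens, lemmas):
    """Return (raw_size, normalised_size): distinct features before and after."""
    raw = bag_of_words(tokens)
    normalised = bag_of_words(tokens, lemmas)
    return len(raw), len(normalised)


# Toy data: six case forms of one Croatian noun versus the two forms
# (singular and plural) of its English counterpart.
croatian_tokens = ["kuća", "kuće", "kući", "kuću", "kućom", "kućama"]
english_tokens = ["house", "houses", "house", "house", "house", "houses"]

hr_raw, hr_norm = feature_space_sizes(croatian_tokens, CROATIAN_LEMMAS)
en_raw, en_norm = feature_space_sizes(english_tokens, ENGLISH_LEMMAS)
print(f"Croatian: {hr_raw} features -> {hr_norm}")
print(f"English:  {en_raw} features -> {en_norm}")
```

In this toy case the Croatian word forms collapse from six features to one, while the English forms collapse from two to one. This is one intuition for why normalisation might matter more for the morphologically richer language when the number of selected features is small.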