Automatic acquisition of inflectional lexica for morphological normalisation

Authors:
J. Šnajder; B. Dalbelo Bašić; M. Tadić
Affiliations:
Department of Electronics, Microelectronics, Computer and Intelligent Systems, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia; Department of Electronics, Microelectronics, Computer and Intelligent Systems, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia; Department of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb, Ivana Lučića 3, Zagreb, Croatia
Venue:
Information Processing and Management: an International Journal
Year:
2008

Abstract

Due to natural language morphology, words can take on various morphological forms. Morphological normalisation, often used in information retrieval and text mining systems, conflates the morphological variants of a word into a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. The approach lies between stemming and lemmatisation and is suitable for inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of automatically acquiring an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
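
As a rough illustration of what lexicon-based inflectional normalisation means in practice, the following Python sketch maps inflected word-forms to a single representative form through a lookup table. It is not the formalism or the acquisition procedure proposed in the paper: the suffix list, the two Croatian sample nouns, and the function names `generate_forms`, `build_lookup`, and `normalise` are all assumptions made for this example, and the lexicon is written by hand rather than acquired from a corpus.

```python
from typing import Dict, Iterable, List, Set

# Assumed toy paradigm: singular and plural case endings of Croatian
# feminine nouns ending in -a (illustrative, not taken from the paper).
FEMININE_A_SUFFIXES = ["a", "e", "i", "u", "o", "om", "ama"]


def generate_forms(stem: str, suffixes: Iterable[str]) -> Set[str]:
    """Attach every suffix of a paradigm to the stem."""
    return {stem + suffix for suffix in suffixes}


def build_lookup(stems: Dict[str, str]) -> Dict[str, str]:
    """Map each generated word-form to its representative form.

    `stems` maps a representative form (here the nominative singular)
    to the stem that the suffixes attach to.
    """
    lookup: Dict[str, str] = {}
    for representative, stem in stems.items():
        for form in generate_forms(stem, FEMININE_A_SUFFIXES):
            # Keep the first entry if two words share a form.
            lookup.setdefault(form, representative)
    return lookup


def normalise(tokens: Iterable[str], lookup: Dict[str, str]) -> List[str]:
    """Replace known word-forms; leave unknown tokens lowercased."""
    return [lookup.get(t.lower(), t.lower()) for t in tokens]


# Hand-written toy lexicon: "žena" (woman) and "škola" (school).
LOOKUP = build_lookup({"žena": "žen", "škola": "škol"})

print(normalise(["Ženama", "školu", "grad"], LOOKUP))
```

Tokens missing from the lexicon pass through unchanged apart from lowercasing, so the coverage of the lexicon bounds how much conflation such a lookup can achieve.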