Unsupervised word decomposition with the PROMODES algorithm

Authors:
Sebastian Spiegler;Bruno Golénia;Peter A. Flach
Affiliations:
Computer Science Department, University of Bristol, UK;Computer Science Department, University of Bristol, UK;Computer Science Department, University of Bristol, UK
Venue:
CLEF'09 Proceedings of the 10th cross-language evaluation forum conference on Multilingual information access evaluation: text retrieval experiments
Year:
2009

Abstract

We present PROMODES, an algorithm for unsupervised word decomposition based on a probabilistic generative model. The model treats segment boundaries as hidden variables and includes probabilities for letter transitions within segments. For Morpho Challenge 2009, we demonstrate three versions of PROMODES. The first uses a simple segmentation algorithm on a subset of the data and applies maximum likelihood estimates of the model parameters when decomposing words of the original language data. The second estimates its parameters through expectation maximization (EM). The third is a committee of unsupervised learners, where each learner corresponds to a different EM initialization; the solution is found by majority vote, which decides whether or not to segment at each word position. In this paper, we describe the probabilistic model, parameter estimation, and how the most likely decomposition of an input word is found. We tested PROMODES on non-vowelized and vowelized Arabic as well as on English, Finnish, German and Turkish. All three methods achieved competitive results.
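
The committee step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `majority_vote_boundaries` and `segment`, the boolean encoding of boundaries, and the rule that a tie yields no boundary are all assumptions made for illustration. Each learner is assumed to have already produced one yes/no decision per position between adjacent letters.

```python
from typing import List, Sequence


def majority_vote_boundaries(votes: Sequence[Sequence[bool]]) -> List[bool]:
    """Combine per-position boundary decisions from several learners.

    votes[k][i] is True if learner k places a boundary after letter i.
    Assumption: a boundary is kept only on a strict majority, so ties
    result in no boundary.
    """
    if not votes:
        raise ValueError("need at least one learner")
    n_positions = len(votes[0])
    if any(len(v) != n_positions for v in votes):
        raise ValueError("all learners must vote on the same positions")
    n_learners = len(votes)
    return [
        2 * sum(1 for v in votes if v[i]) > n_learners
        for i in range(n_positions)
    ]


def segment(word: str, boundaries: Sequence[bool]) -> List[str]:
    """Split a word at the positions marked True (boundary after letter i)."""
    if len(boundaries) != len(word) - 1:
        raise ValueError("expected one decision per gap between letters")
    pieces, start = [], 0
    for i, cut in enumerate(boundaries):
        if cut:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])
    return pieces


# Hypothetical votes from three learners for the word "walked".
votes = [
    [False, False, False, True, False],  # walk|ed
    [False, False, False, True, False],  # walk|ed
    [False, True, False, False, False],  # wa|lked
]
combined = majority_vote_boundaries(votes)
print(segment("walked", combined))  # ['walk', 'ed']
```

Voting per position rather than per whole segmentation means the committee can output a decomposition that no single learner proposed, which is one plausible reading of "decides whether to segment at a word position or not".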