Weakly supervised morphology learning for agglutinating languages using small training sets

Authors:
Ksenia Shalonova;Bruno Golénia
Affiliations:
University of Bristol;University of Bristol
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Year:
2010

Abstract

This paper describes a weakly supervised approach to decomposing words into their constituent morphemes (stems, prefixes and suffixes). Because we concentrate on under-resourced languages, training data is scarce, so supervision is limited to a small number of wordforms with marked stems. In the first stage, we introduce a new Supervised Stem Extraction (SSE) algorithm. Once stems have been extracted, an improved unsupervised segmentation algorithm, GBUMS (Graph-Based Unsupervised Morpheme Segmentation), segments the remaining suffix or prefix sequences into individual suffixes and prefixes. The approach, experimentally validated on Turkish and isiZulu, achieves high performance on test data, comparable to that of a fully supervised method.
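
To make the two-stage data flow concrete, the following is a minimal Python sketch. It is not the SSE or GBUMS algorithm, neither of which is specified in this abstract. As stand-ins, stage one looks up the longest stem seen in the stem-marked training wordforms, and stage two splits the leftover affix sequence by greedy longest match against a hand-written suffix inventory. The names `split_on_stem`, `segment_affixes` and `analyse`, the toy Turkish training pairs, and the suffix inventory are all assumptions made for illustration.

```python
# Illustrative stand-ins only: NOT the paper's SSE or GBUMS algorithms.

def split_on_stem(wordform, stems):
    """Stage 1 stand-in: locate the longest known stem inside the wordform.

    Returns (prefix_sequence, stem, suffix_sequence), or None if no
    known stem occurs in the wordform.
    """
    for stem in sorted(stems, key=len, reverse=True):
        idx = wordform.find(stem)
        if idx != -1:
            return wordform[:idx], stem, wordform[idx + len(stem):]
    return None


def segment_affixes(seq, inventory):
    """Stage 2 stand-in: greedy longest-match split of an affix sequence.

    Any tail that matches nothing in the inventory is kept as one piece.
    """
    by_length = sorted((a for a in inventory if a), key=len, reverse=True)
    pieces = []
    i = 0
    while i < len(seq):
        match = next((a for a in by_length if seq.startswith(a, i)), None)
        if match is None:
            pieces.append(seq[i:])
            break
        pieces.append(match)
        i += len(match)
    return pieces


def analyse(wordform, stems, prefixes, suffixes):
    """Run both stages and return (prefix_list, stem, suffix_list) or None."""
    split = split_on_stem(wordform, stems)
    if split is None:
        return None
    prefix_seq, stem, suffix_seq = split
    return (segment_affixes(prefix_seq, prefixes),
            stem,
            segment_affixes(suffix_seq, suffixes))


# Hypothetical training data: (wordform, marked stem) pairs.
training = [("evler", "ev"), ("kitaplar", "kitap")]
stems = {stem for _, stem in training}

# Hypothetical, hand-written suffix inventory (GBUMS would induce this).
suffixes = {"ler", "lar", "im", "de"}

result = analyse("evlerimde", stems, prefixes=set(), suffixes=suffixes)
print(result)
```

The sketch only mirrors the division of labour described above: supervision enters through the stem-marked wordforms, while the affix sequences on either side of the stem are segmented without labels. A real system would replace both stand-ins with learned components.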