Unsupervised morphological segmentation with log-linear models

Authors:
Hoifung Poon;Colin Cherry;Kristina Toutanova
Affiliations:
University of Washington, Seattle, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2009

Citing 9
Cited 19

Unsupervised learning of the morphology of a natural language

Computational Linguistics
Knowledge-free induction of inflectional morphologies

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Unsupervised models for morpheme segmentation and morphology learning

ACM Transactions on Speech and Language Processing (TSLP)
Contrastive estimation: training log-linear models on unlabeled data

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
A hybrid Markov/semi-Markov conditional random field for sequence segmentation

EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Joint unsupervised coreference resolution with Markov logic

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Unsupervised multilingual learning for POS tagging

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Cross-lingual propagation for morphological analysis

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2

Using a maximum entropy model to build segmentation lattices for MT

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Joint inference for knowledge extraction from biomedical literature

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Unsupervised search for the optimal segmentation for statistical machine translation

ACLstudent '10 Proceedings of the ACL 2010 Student Research Workshop
Semi-supervised learning of concatenative morphology

SIGMORPHON '10 Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology
Predicting the semantic compositionality of prefix verbs

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
EMMA: a novel Evaluation Metric for Morphological Analysis

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Simulating morphological analyzers with stochastic taggers for confidence estimation

CLEF'09 Proceedings of the 10th cross-language evaluation forum conference on Multilingual information access evaluation: text retrieval experiments
Unsupervised discriminative language model training for machine translation using simulated confusion sets

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Learning sub-word units for open vocabulary speech recognition

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Modeling syntactic context improves morphological segmentation

CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning
Investigating the Relationship Between Linguistic Representation and Computation through an Unsupervised Model of Human Morphology Learning

Research on Language and Computation
Improving Korean verb-verb morphological disambiguation using lexical knowledge from unambiguous unlabeled data and selective web counts

Pattern Recognition Letters
Universal morphological analysis using structured nearest neighbor prediction

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Fast generation of translation forest for large-scale SMT discriminative training

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Automatic event extraction with structured preference modeling

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Unsupervised morphology rivals supervised morphology for Arabic MT

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
The study of effect of length in morphological segmentation of agglutinative languages

MM '12 Proceedings of the First Workshop on Multilingual Modeling
Unsupervised segmentation for different types of morphological processes using multiple sequence alignment

SLSP'13 Proceedings of the First international conference on Statistical Language and Speech Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Morphological segmentation breaks words into morphemes (the basic semantic units). It is a key component for natural language processing systems. Unsupervised morphological segmentation is attractive, because in every language there are virtually unlimited supplies of text, but very few labeled resources. However, most existing model-based systems for unsupervised morphological segmentation use directed generative models, making it difficult to leverage arbitrary overlapping features that are potentially helpful to learning. In this paper, we present the first log-linear model for unsupervised morphological segmentation. Our model uses overlapping features such as morphemes and their contexts, and incorporates exponential priors inspired by the minimum description length (MDL) principle. We present efficient algorithms for learning and inference by combining contrastive estimation with sampling. Our system, based on monolingual features only, outperforms a state-of-the-art system by a large margin, even when the latter uses bilingual information such as phrasal alignment and phonetic correspondence. On the Arabic Penn Treebank, our system reduces F1 error by 11% compared to Morfessor.