Simulating morphological analyzers with stochastic taggers for confidence estimation

Authors:
Christian Monson; Kristy Hollingshead; Brian Roark
Affiliations:
Center for Spoken Language Understanding, Oregon Health & Science University; Center for Spoken Language Understanding, Oregon Health & Science University; Center for Spoken Language Understanding, Oregon Health & Science University
Venue:
CLEF'09 Proceedings of the 10th cross-language evaluation forum conference on Multilingual information access evaluation: text retrieval experiments
Year:
2009

Abstract

We propose a method for providing stochastic confidence estimates for rule-based and black-box natural language (NL) processing systems. Our method does not require labeled training data: we simply train stochastic models on the output of the original NL systems. Numeric confidence estimates enable both minimum Bayes risk-style optimization and principled system combination for these knowledge-based and black-box systems. In our specific experiments, we enrich ParaMor, a rule-based system for unsupervised morphology induction, with probabilistic segmentation confidences by training a statistical natural language tagger to simulate ParaMor's morphological segmentations. By adjusting the numeric threshold above which the simulator proposes morpheme boundaries, we improve F1 of morpheme identification on a Hungarian corpus by 5.9% absolute. With numeric confidences in hand, we also combine ParaMor's segmentation decisions with those of a second (black-box) unsupervised morphology induction system, Morfessor. Our joint ParaMor-Morfessor system improves F1 by a further 3.4% absolute, ultimately moving F1 from 41.4% to 50.7%.
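
The threshold adjustment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy English words, and the confidence values are all invented for illustration, and the score is a simplified boundary-level F1 rather than the morpheme identification F1 reported above. The sketch assumes a simulator has already assigned each gap between characters a confidence that a morpheme boundary falls there.

```python
from typing import List, Sequence, Set, Tuple


def segment(word: str, boundary_conf: Sequence[float], threshold: float) -> List[str]:
    """Split `word` at every gap whose confidence is >= `threshold`.

    `boundary_conf[i]` is the (hypothetical) confidence of a boundary
    between word[i] and word[i + 1], so it has len(word) - 1 entries.
    """
    if len(boundary_conf) != len(word) - 1:
        raise ValueError("need exactly one confidence per character gap")
    pieces, start = [], 0
    for i, conf in enumerate(boundary_conf):
        if conf >= threshold:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])
    return pieces


def boundaries(pieces: Sequence[str]) -> Set[int]:
    """Character offsets at which a segmentation places a boundary."""
    cuts, offset = set(), 0
    for piece in pieces[:-1]:
        offset += len(piece)
        cuts.add(offset)
    return cuts


def boundary_f1(predicted: Sequence[Sequence[str]], gold: Sequence[Sequence[str]]) -> float:
    """Micro-averaged F1 over boundary positions (a simplified metric)."""
    tp = fp = fn = 0
    for pred_pieces, gold_pieces in zip(predicted, gold):
        pred_cuts, gold_cuts = boundaries(pred_pieces), boundaries(gold_pieces)
        tp += len(pred_cuts & gold_cuts)
        fp += len(pred_cuts - gold_cuts)
        fn += len(gold_cuts - pred_cuts)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    words: Sequence[str],
    confs: Sequence[Sequence[float]],
    gold: Sequence[Sequence[str]],
    candidates: Sequence[float],
) -> Tuple[float, float]:
    """Return the candidate threshold with the highest boundary F1, and that F1."""
    def score(threshold: float) -> float:
        predicted = [segment(w, c, threshold) for w, c in zip(words, confs)]
        return boundary_f1(predicted, gold)

    best = max(candidates, key=score)  # first candidate wins on ties
    return best, score(best)


# Toy data: every value below is made up for illustration.
WORDS = ["walked", "cats"]
CONFS = [[0.05, 0.1, 0.2, 0.6, 0.3], [0.1, 0.1, 0.4]]
GOLD = [["walk", "ed"], ["cat", "s"]]
CANDIDATES = [0.15, 0.35, 0.5]
```

On this toy data, a threshold of 0.5 misses the boundary in "cats", while 0.15 over-segments "walked"; sweeping the candidates selects the intermediate value. A numeric confidence makes this precision/recall trade-off tunable, which a system that emits only hard segmentation decisions does not allow.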