Bootstrapping a multilingual part-of-speech tagger in one person-day

Authors:
Silviu Cucerzan;David Yarowsky
Affiliations:
Johns Hopkins University, Baltimore, MD;Johns Hopkins University, Baltimore, MD
Venue:
COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Year:
2002

Citing 9
Cited 10

Tagging English text with a probabilistic model

Computational Linguistics
Unsupervised learning of the morphology of a natural language

Computational Linguistics
Memory-based morphological analysis

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Inducing multilingual text analysis tools via robust projection across aligned corpora

HLT '01 Proceedings of the first international conference on Human language technology research
A Bayesian model for morpheme and paradigm identification

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Minimally supervised morphological analysis by multimodal alignment

ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
Language independent, minimally supervised induction of lexical probabilities

ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
Knowledge-free induction of morphology using latent semantic analysis

ConLL '00 Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7

Minimally supervised induction of grammatical gender

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Language independent NER using a unified model of internal and contextual evidence

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Statistical machine translation using coercive two-level syntactic transduction

EMNLP '03 Proceedings of the 2003 conference on Empirical methods in natural language processing
BiFrameNet: bilingual frame semantics resource construction by cross-lingual induction

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Tagging Portuguese with a Spanish tagger using cognates

CrossLangInduction '06 Proceedings of the International Workshop on Cross-Language Knowledge Induction
A low-budget tagger for Old Czech

LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora

IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing
Verb analysis in a highly inflective language with an MFF algorithm

PROPOR'12 Proceedings of the 10th international conference on Computational Processing of the Portuguese Language
Part-of-speech tagging for Chinese-English mixed texts with dynamic features

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Semi-automatic acquisition of two-level morphological rules for iban language

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one person-day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of context-window feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models are accomplished for fine-grained POS tag spaces. Experiments show high accuracy, fine-grained tag resolution with minimal new human effort.