A backoff model for bootstrapping resources for non-English languages

Authors:
Chenhai Xi;Rebecca Hwa
Affiliations:
University of Pittsburgh;University of Pittsburgh
Venue:
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Year:
2005

Citing 14
Cited 13

Machine translation divergences: a formal description and proposed solution

Computational Linguistics
Combining labeled and unlabeled data with co-training

COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
An Introduction to the Theory of Computation

An Introduction to the Theory of Computation
Coping with ambiguity and unknown words through probabilistic models

Computational Linguistics - Special issue on using large corpora: II
Unsupervised word sense disambiguation rivaling supervised methods

ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
Minimizing manual annotation cost in supervised training from corpora

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
An unsupervised method for word sense tagging using parallel corpora

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Evaluating translational correspondence using annotation projection

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Applying co-training methods to statistical parsing

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Active learning for HPSG parse selection

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Bootstrapping POS taggers using unlabelled data

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Active learning with statistical models

Journal of Artificial Intelligence Research

Optimal constituent alignment with edge covers for semantic projection

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Innovations in Natural Language Document Processing for Requirements Engineering

Innovations for Requirement Analysis. From Stakeholders' Needs to Formal Designs
Unsupervised multilingual learning for POS tagging

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Unsupervised multilingual grammar induction

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Cross-lingual annotation projection of semantic roles

Journal of Artificial Intelligence Research
Multilingual part-of-speech tagging: two unsupervised approaches

Journal of Artificial Intelligence Research
Using cross-lingual projections to generate semantic role labeled corpus for Urdu: a resource poor language

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Unsupervised part-of-speech tagging with bilingual graph-based projections

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Semi-supervised Learning Framework for Cross-Lingual Projection

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 03
Unsupervised multilingual learning

Unsupervised multilingual learning
Unsupervised structure prediction with non-parallel multilingual guidance

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Selective sharing for multilingual dependency parsing

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Learning to map into a universal POS tagset

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Quantified Score

Hi-index	0.00

Visualization

Abstract

The lack of annotated data is an obstacle to the development of many natural language processing applications; the problem is especially severe when the data is non-English. Previous studies suggested the possibility of acquiring resources for non-English languages by bootstrapping from high quality English NLP tools and parallel corpora; however, the success of these approaches seems limited for dissimilar language pairs. In this paper, we propose a novel approach of combining a bootstrapped resource with a small amount of manually annotated data. We compare the proposed approach with other bootstrapping methods in the context of training a Chinese Part-of-Speech tagger. Experimental results show that our proposed approach achieves a significant improvement over EM and self-training and systems that are only trained on manual annotations.