Inducing multilingual text analysis tools via robust projection across aligned corpora

Authors:
David Yarowsky; Grace Ngai; Richard Wicentowski
Affiliations:
Johns Hopkins University, Baltimore, MD; Johns Hopkins University, Baltimore, MD; Johns Hopkins University, Baltimore, MD
Venue:
HLT '01 Proceedings of the first international conference on Human language technology research
Year:
2001

Citing 10

A statistical approach to machine translation

Computational Linguistics
Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging

Computational Linguistics
Stochastic inversion transduction grammars and bilingual parsing of parallel corpora

Computational Linguistics
Bitext maps and alignment via pattern recognition

Computational Linguistics
An algorithm for simultaneously bracketing parallel texts by aligning words

ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
K-vec: a new approach for aligning parallel texts

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
Transformation-based learning in the fast lane

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Minimally supervised morphological analysis by multimodal alignment

ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
Improved statistical alignment models

ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics

Cited 87

Cross-Language Access to Recorded Speech in the MALACH Project

TSD '02 Proceedings of the 5th International Conference on Text, Speech and Dialogue
A systematic comparison of various statistical alignment models

Computational Linguistics
Automatic construction of English/Chinese parallel corpora

Journal of the American Society for Information Science and Technology
The Web as a parallel corpus

Computational Linguistics - Special issue on web as corpus
Lexical triggers and latent semantic analysis for cross-lingual language model adaptation

ACM Transactions on Asian Language Information Processing (TALIP)
Inducing information extraction systems for new languages via cross-language projection

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Crosslinguistic transfer in automatic verb classification

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
A multilingual paradigm for automatic verb classification

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Language model based Arabic word segmentation

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus

Natural Language Engineering
Bootstrapping parsers via syntactic projection across parallel texts

Natural Language Engineering
Optimization of word alignment clues

Natural Language Engineering
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics
Phrasal cohesion and statistical machine translation

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Bootstrapping a multilingual part-of-speech tagger in one person-day

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Using 'smart' bilingual projection to feature-tag a monolingual dictionary

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
A projection extension algorithm for statistical machine translation

EMNLP '03 Proceedings of the 2003 conference on Empirical methods in natural language processing
Cross-lingual lexical triggers in statistical language modeling

EMNLP '03 Proceedings of the 2003 conference on Empirical methods in natural language processing
Unsupervised models for morpheme segmentation and morphology learning

ACM Transactions on Speech and Language Processing (TSLP)
Aligning words using matrix factorisation

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Improving bitext word alignments via syntax-based reordering of English

ACLdemo '04 Proceedings of the ACL 2004 on Interactive poster and demonstration sessions
A framework for unsupervised natural language morphology induction

ACLstudent '04 Proceedings of the ACL 2004 workshop on Student research
Optimal constituent alignment with edge covers for semantic projection

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Evaluating cross-language annotation transfer in the MultiSemCor corpus

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
NeurAlign: combining word alignments using neural networks

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Alignment link projection using transformation-based learning

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Cross-linguistic projection of role-semantic information

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
A FrameNet-based semantic role labeler for Swedish

COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
On the application of different evolutionary algorithms to the alignment problem in statistical machine translation

Neurocomputing
Ripple Down Rule learning for automated word lemmatisation

AI Communications
Statistical machine translation

ACM Computing Surveys (CSUR)
The bootstrapping of the Yarowsky algorithm in real corpora

Information Processing and Management: an International Journal
Data-driven dependency parsing of new languages using incomplete and noisy training data

CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
Semantically rich human-aided machine annotation

CorpusAnno '05 Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky
Tagging Portuguese with a Spanish tagger using cognates

CrossLangInduction '06 Proceedings of the International Workshop on Cross-Language Knowledge Induction
Projecting POS tags and syntactic dependencies from English and French to Polish in aligned corpora

CrossLangInduction '06 Proceedings of the International Workshop on Cross-Language Knowledge Induction
Mention detection crossing the language barrier

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Unsupervised multilingual learning for POS tagging

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Morphological analysis for statistical machine translation

HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
Combination of statistical word alignments based on multiple preprocessing schemes

NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Cross-lingual bootstrapping of semantic lexicons: the case of FrameNet

AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 3
Cross-lingual propagation for morphological analysis

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Unsupervised induction of natural language morphology inflection classes

SIGMorPhon '04 Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology
Multilingual noise-robust supervised morphological analysis using the WordFrame model

SIGMorPhon '04 Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology
Cross-Language Information Propagation for Arabic Mention Detection

ACM Transactions on Asian Language Information Processing (TALIP)
Induction of fine-grained part-of-speech taggers via classifier combination and crosslingual projection

ParaText '05 Proceedings of the ACL Workshop on Building and Using Parallel Texts
A hybrid approach to align sentences and words in English-Hindi parallel corpora

ParaText '05 Proceedings of the ACL Workshop on Building and Using Parallel Texts
Dependency grammar induction via bitext projection constraints

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
An extensible crosslinguistic readability framework

BUCC '09 Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
Exploiting translational correspondences for pattern-independent MWE identification

MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications
Unsupervised morphological segmentation and clustering with document boundaries

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Cross-lingual annotation projection of semantic roles

Journal of Artificial Intelligence Research
Multilingual part-of-speech tagging: two unsupervised approaches

Journal of Artificial Intelligence Research
Translation by iterative collaboration between monolingual users

Proceedings of Graphics Interface 2010
Finding cognate groups using phylogenies

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
A statistical model for lost language decipherment

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
A cross-lingual induction technique for German adverbial participles

NLPLING '10 Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground
Learning better monolingual models with unannotated bilingual text

CoNLL '10 Proceedings of the Fourteenth Conference on Computational Natural Language Learning
Improving translation via targeted paraphrasing

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Enhancing mention detection using projection via aligned corpora

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
A cross-lingual annotation projection approach for relation detection

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Using cross-lingual projections to generate semantic role labeled corpus for Urdu: a resource poor language

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Cross-lingual induction for deep broad-coverage syntax: a case study on German participles

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Covariance in Unsupervised Learning of Probabilistic Grammars

The Journal of Machine Learning Research
Partial parsing from bitext projections

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Scaling up automatic cross-lingual semantic role annotation

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili

Language Resources and Evaluation
Unsupervised multilingual learning

Unsupervised multilingual learning
Improving statistical word alignments with morpho-syntactic transformations

FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
A low-budget tagger for Old Czech

LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Experiments in cross-language morphological annotation transfer

CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
The value of monolingual crowdsourcing in a real-world translation scenario: simulation using Haitian Creole emergency SMS messages

WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Universal morphological analysis using structured nearest neighbor prediction

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A correction model for word alignments

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora

IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing
Cross-lingual alignment of FrameNet annotations through hidden Markov models

CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
A survey of methods to ease the development of highly multilingual text mining applications

Language Resources and Evaluation
Nudging the envelope of direct transfer methods for multilingual named entity recognition

WILS '12 Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure
Multilingual named entity recognition using parallel data and metadata from Wikipedia

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
A graph-based cross-lingual projection approach for weakly supervised relation extraction

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Universal grapheme-to-phoneme prediction over Latin alphabets

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Part-of-speech tagging for Chinese-English mixed texts with dynamic features

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Accurate unsupervised joint named-entity extraction from unaligned parallel text

NEWS '12 Proceedings of the 4th Named Entity Workshop
Learning multilingual named entity recognition from Wikipedia

Artificial Intelligence
Using targeted paraphrasing and monolingual crowdsourcing to improve translation

ACM Transactions on Intelligent Systems and Technology (TIST) - Special Sections on Paraphrasing; Intelligent Systems for Socially Aware Computing; Social Computing, Behavioral-Cultural Modeling, and Prediction
Cross-Lingual Annotation Projection for Weakly-Supervised Relation Extraction

ACM Transactions on Asian Language Information Processing (TALIP)

Abstract

This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish.

Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections.

Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system.

This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection.
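
As a rough illustration of the direct annotation projection step that the abstract describes, the sketch below copies English POS tags onto target-language words through a word alignment and then builds a majority-vote tag lexicon from the projected output. The function names, the alignment format (pairs of source and target indices) and the majority-vote smoothing are all assumptions made for this example; in particular, the majority vote is only a crude stand-in and is not the noise-robust training procedure the paper presents.

```python
from collections import Counter, defaultdict


def project_tags(src_tags, alignment, tgt_len):
    """Copy source-side tags to aligned target positions (hypothetical helper).

    alignment is an iterable of (src_index, tgt_index) pairs.
    Target positions with no alignment link are left as None.
    """
    projected = [None] * tgt_len
    for s, t in alignment:
        # Simplification: if several source words align to the same
        # target word, the first link seen decides the tag.
        if projected[t] is None:
            projected[t] = src_tags[s]
    return projected


def majority_lexicon(tagged_sentences):
    """Map each target word to its most frequent projected tag.

    tagged_sentences is an iterable of (words, tags) pairs, where tags
    may contain None for unaligned words. This is an illustrative
    smoothing step only, not the paper's method.
    """
    counts = defaultdict(Counter)
    for words, tags in tagged_sentences:
        for word, tag in zip(words, tags):
            if tag is not None:
                counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Toy example: English "the red car" aligned to French "la voiture rouge".
english_tags = ["DT", "JJ", "NN"]
french_words = ["la", "voiture", "rouge"]
links = [(0, 0), (1, 2), (2, 1)]  # assumed output of a word aligner

french_tags = project_tags(english_tags, links, len(french_words))
lexicon = majority_lexicon([(french_words, french_tags)])
```

In this toy case the adjective and noun swap positions across the alignment, so the projected tag sequence follows the French word order. Real alignments contain missing and incorrect links, which is the noise the abstract refers to.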