Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Authors:
Dragos Stefan Munteanu; Daniel Marcu
Venue:
Computational Linguistics
Year:
2005

Citing 28
Cited 71

Abstract

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
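
The core idea of the abstract, a maximum entropy classifier that decides whether two sentences are translations of each other, can be illustrated with a minimal sketch. A binary maximum entropy model is equivalent to logistic regression, which is what the code below trains. Everything specific here is an assumption made for illustration and not taken from the paper: the toy French-English lexicon `LEXICON`, the two features (lexicon coverage and length ratio), the training pairs, and the function names.

```python
import math

# Hypothetical toy bilingual lexicon (illustrative only, not from the paper).
LEXICON = {
    "maison": {"house", "home"},
    "rouge": {"red"},
    "chat": {"cat"},
    "noir": {"black"},
    "le": {"the"},
    "la": {"the"},
    "est": {"is"},
}

def features(src, tgt):
    """Map a sentence pair to a small feature vector (assumed features)."""
    s, t = src.lower().split(), tgt.lower().split()
    t_set = set(t)
    # Fraction of source words with at least one lexicon translation in tgt.
    covered = sum(1 for w in s if LEXICON.get(w, set()) & t_set)
    coverage = covered / len(s) if s else 0.0
    # Ratio of the shorter sentence length to the longer one.
    length_ratio = min(len(s), len(t)) / max(len(s), len(t)) if s and t else 0.0
    return [1.0, coverage, length_ratio]  # leading 1.0 acts as the bias term

def prob_parallel(weights, x):
    """Probability that the pair is parallel under the logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, epochs=500, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    xs = [features(s, t) for s, t in pairs]
    weights = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            err = y - prob_parallel(weights, x)
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights

# Tiny hand-made training set: 1 = translations of each other, 0 = not.
TRAIN_PAIRS = [
    ("le chat est noir", "the cat is black"),
    ("la maison est rouge", "the house is red"),
    ("le chat est noir", "stock markets fell sharply on monday"),
    ("la maison est rouge", "he left"),
]
TRAIN_LABELS = [1, 1, 0, 0]

if __name__ == "__main__":
    w = train(TRAIN_PAIRS, TRAIN_LABELS)
    for src, tgt in [("le chat est rouge", "the cat is red"),
                     ("le chat est rouge", "stock markets fell sharply on monday")]:
        print(src, "|", tgt, "->", round(prob_parallel(w, features(src, tgt)), 3))
```

In a setting like the one the abstract describes, candidate sentence pairs drawn from comparable news articles would be scored this way, and only pairs above a chosen probability threshold would be kept as extracted parallel data. The threshold and the candidate selection step are left out of this sketch.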