Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora

  • Authors:
  • Dekai Wu; Pascale Fung

  • Affiliations:
  • Department of Computer Science, Human Language Technology Center, HKUST; Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

  • Venue:
  • IJCNLP'05: Proceedings of the Second International Joint Conference on Natural Language Processing
  • Year:
  • 2005


Abstract

We present a new implication of Wu's (1997) Inversion Transduction Grammar (ITG) Hypothesis for the problem of retrieving truly parallel sentence translations from large collections of highly non-parallel documents. Our approach leverages a strong language-universal constraint posited by the ITG Hypothesis, which can serve as a strong inductive bias for various language learning problems, yielding both efficiency and accuracy gains. The task we attack is highly practical, since non-parallel multilingual data exists in far greater quantities than parallel corpora, yet parallel sentences are a far more useful resource. Our aim here is to mine truly parallel sentences, as opposed to the comparable sentence pairs or loose translations targeted by most previous work. The method we introduce exploits Bracketing ITGs to produce the first known results for this problem. Experiments show that it obtains large accuracy gains on this task compared to the expected performance of state-of-the-art models developed for the less stringent task of mining comparable sentence pairs.
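To make the constraint concrete: a Bracketing ITG generates exactly those cross-lingual word reorderings that avoid the two "inside-out" permutation patterns 2413 and 3142 (Wu, 1997). The sketch below, a hypothetical illustration not taken from the paper, checks a candidate word alignment (expressed as a 0-based permutation) against this constraint by brute-force pattern search; the function name and O(n⁴) strategy are our own choices, adequate for sentence-length inputs.

```python
from itertools import combinations

# Forbidden order patterns, 0-based: 2413 -> (1,3,0,2), 3142 -> (2,0,3,1)
FORBIDDEN = {(1, 3, 0, 2), (2, 0, 3, 1)}

def is_itg_permutation(perm):
    """Return True iff `perm` (a reordering of 0..n-1) could be produced
    by a Bracketing ITG, i.e. it contains no 2413 or 3142 subpattern."""
    for idxs in combinations(range(len(perm)), 4):
        vals = [perm[i] for i in idxs]
        # Compute the rank pattern of the four chosen values.
        order = sorted(range(4), key=lambda j: vals[j])
        rank = [0] * 4
        for r, j in enumerate(order):
            rank[j] = r
        if tuple(rank) in FORBIDDEN:
            return False
    return True
```

For example, the fully inverted reordering `[2, 3, 0, 1]` is ITG-admissible, while the inside-out reordering `[1, 3, 0, 2]` is not; it is this pruning of inside-out alignments that supplies the inductive bias.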