Large scale parallel document mining for machine translation

Authors:
Jakob Uszkoreit; Jay M. Ponte; Ashok C. Popat; Moshe Dubiner
Affiliations:
Google, Inc.; Google, Inc.; Google, Inc.; Google, Inc.
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Year:
2010

Abstract

A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an initial, low-quality batch translation. Unlike other approaches, which require specialized metadata, the system uses only the textual content of the documents. Results are presented for a corpus of over two billion web pages and for a large collection of digitized public-domain books.
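
To make the idea of cross-language near-duplicate detection concrete, the following is a minimal single-machine sketch, not the system described in the paper. It assumes the foreign-language documents have already been machine-translated into English, and it scores candidate pairs by Jaccard overlap of word trigrams. The abstract does not specify the matching features, scoring function, or threshold, so the trigram shingles, the Jaccard score, the threshold of 0.3, and all function names here are illustrative assumptions.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in text (an assumed feature choice)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_candidate_pairs(translated_foreign, english, n=3, threshold=0.3):
    """Pair translated foreign documents with near-duplicate English documents.

    translated_foreign: dict of doc id -> text already translated into English
                        (the rough batch translation is assumed to exist).
    english:            dict of doc id -> original English text.
    Returns (foreign_id, english_id, score) tuples, best score first.
    """
    # Inverted index from shingle to the English documents containing it,
    # so only documents sharing at least one shingle are ever compared.
    index = {}
    english_shingles = {}
    for doc_id, text in english.items():
        doc_shingles = shingles(text, n)
        english_shingles[doc_id] = doc_shingles
        for sh in doc_shingles:
            index.setdefault(sh, set()).add(doc_id)

    pairs = []
    for foreign_id, text in translated_foreign.items():
        foreign_shingles = shingles(text, n)
        candidates = set()
        for sh in foreign_shingles:
            candidates |= index.get(sh, set())
        for english_id in candidates:
            score = jaccard(foreign_shingles, english_shingles[english_id])
            if score >= threshold:
                pairs.append((foreign_id, english_id, score))
    return sorted(pairs, key=lambda p: -p[2])
```

The sketch uses only the textual content of the documents, with no URLs or other metadata, which mirrors the constraint stated in the abstract. The inverted index is what keeps the comparison from being all-pairs; at the scale of billions of pages it would have to be built and queried in a distributed fashion, which this sketch does not attempt.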