Large scale parallel document mining for machine translation

  • Authors: Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, Moshe Dubiner
  • Affiliations: Google, Inc. (all authors)
  • Venue: COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
  • Year: 2010

Abstract

A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an initial, low-quality batch translation. In contrast to other approaches which require specialized metadata, the system uses only the textual content of the documents. Results are presented for a corpus of over two billion web pages and for a large collection of digitized public-domain books.
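
The idea of treating parallel-text mining as cross-language near-duplicate detection can be illustrated with a small sketch. It is not the system described in the paper: it assumes every document has already been translated into English by some rough batch translation step, and it swaps in a generic technique (word n-gram shingles compared by Jaccard similarity) for whatever matching the authors actually use. The names `shingles`, `jaccard`, `mine_parallel_pairs`, the `threshold` value, and the input layout are all hypothetical.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mine_parallel_pairs(docs, threshold=0.5, n=3):
    """Find candidate parallel document pairs.

    docs: list of (doc_id, source_language, english_translation) tuples.
    The translation is assumed to come from a rough batch translation
    step that is outside the scope of this sketch.
    Returns (id_a, id_b, score) for pairs in different source languages
    whose translated text overlaps by at least `threshold`.
    """
    signatures = [(doc_id, lang, shingles(text, n)) for doc_id, lang, text in docs]
    pairs = []
    # Brute-force comparison of all pairs: fine for a toy example only.
    for (id_a, lang_a, s_a), (id_b, lang_b, s_b) in combinations(signatures, 2):
        if lang_a == lang_b:
            continue  # only cross-language pairs are of interest
        score = jaccard(s_a, s_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs
```

Only the textual content enters the comparison, which mirrors the abstract's point that no specialized metadata is needed. The all-pairs loop is quadratic in the number of documents, so a corpus of billions of pages would need some form of indexing and distributed processing rather than this direct comparison.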