Finding translations in scanned book collections

Authors:
Ismet Zeki Yalniz;R. Manmatha
Affiliations:
University of Massachusetts, Amherst, Amherst, MA, USA;University of Massachusetts, Amherst, Amherst, MA, USA
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012



Abstract

This paper describes an approach for identifying translations of books in large scanned book collections with OCR errors. The method rests on the observation that, although individual sentences do not necessarily preserve word order when translated, a book must preserve the linear progression of ideas to be a valid translation. Consider two books in two different languages, say English and German. The English book is represented by the sequence of words that appear only once in the book, in the order they appear in the text. Similarly, the German book is represented by its sequence of words that appear only once. An English-German dictionary is used to transform the word sequence of the English book into German by translating individual words in place. It is not necessary to translate all the words, and the method works even with small dictionaries. Both sequences are now in German and can therefore be aligned using a Longest Common Subsequence (LCS) algorithm. We describe two scoring functions, TRANS-cs and TRANS-its, which account for both the LCS length and the lengths of the original word sequences. Experiments demonstrate that TRANS-its is particularly successful in finding translations of books and outperforms several baselines, including metadata search based on matching titles and authors. Experiments on the Europarl parallel corpus for four language pairs (English-Finnish, English-French, English-German, and English-Spanish) and on a scanned book collection of 50K English-German books show that the proposed method retrieves translations of books with an average MAP score of 1.0 at a speed of 10K book pair comparisons per second on a single core.
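
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, the toy dictionary, and the example sentences are all assumptions. In particular, `illustrative_score` is a hypothetical normalization chosen only to show how LCS length and sequence lengths might be combined; the abstract does not define TRANS-cs or TRANS-its, and this function is neither of them.

```python
from collections import Counter
from math import sqrt


def unique_word_sequence(text):
    """Words that occur exactly once in the text, in order of appearance."""
    words = text.lower().split()  # assumption: plain whitespace tokenization
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def translate_in_place(words, dictionary):
    """Map each word through the dictionary; words without an entry stay as-is."""
    return [dictionary.get(w, w) for w in words]


def lcs_length(a, b):
    """Textbook O(len(a) * len(b)) dynamic program for LCS length (two rows)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def illustrative_score(lcs, len_a, len_b):
    """Hypothetical normalization; NOT the paper's TRANS-cs or TRANS-its."""
    if len_a == 0 or len_b == 0:
        return 0.0
    return lcs / sqrt(len_a * len_b)


# Toy example (invented data): a deliberately small English-German dictionary.
toy_dictionary = {"old": "alte", "house": "haus", "river": "fluss"}

english_seq = unique_word_sequence("the old house stands near the river")
german_seq = unique_word_sequence("das alte haus steht nahe dem fluss")

translated_seq = translate_in_place(english_seq, toy_dictionary)
lcs = lcs_length(translated_seq, german_seq)
score = illustrative_score(lcs, len(translated_seq), len(german_seq))
```

In this toy case "the" is dropped from the English sequence because it occurs twice, only three of the five remaining words have dictionary entries, and those three still align in order with the German sequence. That mirrors the abstract's point that a small dictionary can be enough when the overall order of ideas is preserved.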