This paper describes an approach for identifying translations of books in large scanned book collections that contain OCR errors. The method rests on the observation that, although individual sentences do not necessarily preserve word order when translated, a valid translation of a book must preserve the linear progression of its ideas. Consider two books in two different languages, say English and German. The English book is represented by the sequence of words that appear only once in the book, in the order they appear in the text. The German book is likewise represented by its sequence of words that appear only once. An English-German dictionary is used to transform the word sequence of the English book into German by translating individual words in place. It is not necessary to translate every word, and the method works even with small dictionaries. Both sequences are now in German and can therefore be aligned using a Longest Common Subsequence (LCS) algorithm. We describe two scoring functions, TRANS-cs and TRANS-its, which account for both the LCS length and the lengths of the original word sequences. Experiments demonstrate that TRANS-its is particularly successful at finding translations of books and outperforms several baselines, including metadata search based on matching titles and authors. Experiments on a Europarl parallel corpus for four language pairs (English-Finnish, English-French, English-German, and English-Spanish) and on a scanned collection of 50K English-German books show that the proposed method retrieves translations of books with an average MAP score of 1.0 at a speed of 10K book pair comparisons per second on a single core.
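The pipeline outlined in the abstract (unique-word sequences, in-place dictionary translation, LCS alignment) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenization, and the toy dictionary are all hypothetical, and `illustrative_score` is a simple stand-in normalization, since the abstract does not define TRANS-cs or TRANS-its. The LCS routine is the textbook quadratic dynamic program, chosen for clarity; it is not claimed to reach the throughput reported above.

```python
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once, in order of appearance."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def translate_in_place(words, dictionary):
    """Replace each word by its dictionary entry; unknown words stay unchanged."""
    return [dictionary.get(w, w) for w in words]


def lcs_length(a, b):
    """Textbook O(len(a) * len(b)) LCS length, keeping one table row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def illustrative_score(source_text, target_text, dictionary):
    """Hypothetical score: LCS length over the shorter sequence length.

    This is NOT TRANS-cs or TRANS-its; it only shows how the LCS length and
    the sequence lengths could be combined into a single number.
    """
    src = translate_in_place(unique_word_sequence(source_text), dictionary)
    tgt = unique_word_sequence(target_text)
    if not src or not tgt:
        return 0.0
    return lcs_length(src, tgt) / min(len(src), len(tgt))


# Toy example with a deliberately small, hypothetical dictionary.
toy_dictionary = {"old": "alte", "house": "haus", "river": "fluss"}
english = "the old house stands near the quiet river"
german = "das alte haus steht nahe dem stillen fluss"
score = illustrative_score(english, german, toy_dictionary)
```

In the toy example only three of the six unique English words have dictionary entries, yet those three still align in order with the German text, which mirrors the abstract's point that the dictionary does not need to cover every word.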