Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Adaptive Parallel Sentences Mining from Web Bilingual News Collection
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Computational Linguistics - Special issue on web as corpus
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora
Computational Linguistics
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Language and translation model adaptation using comparable corpora
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A fast method for parallel document identification
NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Extracting parallel sentences from comparable corpora using document level alignment
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Large scale parallel document mining for machine translation
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
Mining relational structure from millions of books: position paper
Proceedings of the 4th ACM workshop on Online books, complementary social media and crowdsourcing
Finding translations in scanned book collections
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Efficient Nearest-Neighbor Search in the Probability Simplex
Proceedings of the 2013 Conference on the Theory of Information Retrieval
Hi-index | 0.00 |
We describe an approach for generating a ranked list of candidate document translation pairs without the use of bilingual dictionary or machine translation system. We developed this approach as an initial, filtering step, for extracting parallel text from large, multilingual---but non-parallel---corpora. We represent bilingual documents in a vector space whose basis vectors are the overlapping tokens found in both languages of the collection. Using this representation, weighted by tf·idf, we compute cosine document similarity to create a ranked list of candidate document translation pairs. Unlike cross-language information retrieval, where a ranked list in the target language is evaluated for each source query, we are interested in, and evaluate, the more difficult task of finding translated document pairs. We first perform a feasibility study of our approach on parallel collections in multiple languages, representing multiple language families and scripts. The approach is then applied to a large bilingual collection of around 800k books. To avoid the computational cost of O(n2) document pair comparisons, we employ locality sensitive hashing (LSH) approximation algorithm for cosine similarity, which reduces our time complexity to O(n log n).