No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Authors:
Ferhan Ture; Tamer Elsayed; Jimmy Lin
Affiliations:
University of Maryland, College Park, MD, USA; King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; University of Maryland, College Park, MD, USA
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Abstract

This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multi-lingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints.
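
The abstract names the technique but gives no implementation detail, so the following is only a minimal single-machine sketch of a sort-based sliding window over LSH signatures, not the authors' MapReduce implementation. It assumes that document vectors have already been projected into one language's feature space (the CLIR step), and it assumes random-hyperplane bit signatures compared by Hamming distance, which the abstract does not specify. All function names and parameters (`make_hyperplanes`, `signature`, `sliding_window_pairs`, `num_perms`, `window`, `max_dist`) are hypothetical.

```python
import random


def make_hyperplanes(dim, num_bits, rng):
    """Draw num_bits random Gaussian hyperplanes in a dim-dimensional space."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the dot product with vec is non-negative."""
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0.0)
        for h in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def sliding_window_pairs(sigs, num_perms, window, max_dist, rng):
    """Find cross-lingual candidate pairs.

    sigs maps doc_id -> (lang, bit_tuple). For each of num_perms random bit
    permutations, the permuted signatures are sorted and each entry is
    compared with the next (window - 1) entries in sorted order. A pair is
    kept if the two documents differ in language and their Hamming distance
    is at most max_dist.
    """
    num_bits = len(next(iter(sigs.values()))[1])
    found = set()
    for _ in range(num_perms):
        perm = list(range(num_bits))
        rng.shuffle(perm)
        table = sorted(
            (tuple(bits[i] for i in perm), lang, doc_id)
            for doc_id, (lang, bits) in sigs.items()
        )
        for i, (bits_a, lang_a, id_a) in enumerate(table):
            for bits_b, lang_b, id_b in table[i + 1 : i + window]:
                if lang_a != lang_b and hamming(bits_a, bits_b) <= max_dist:
                    found.add(tuple(sorted((id_a, id_b))))
    return found


# Toy usage: "de1" is assumed to be already projected into the English space.
rng = random.Random(0)
planes = make_hyperplanes(dim=3, num_bits=64, rng=rng)
sigs = {
    "de1": ("de", signature([1.0, 0.2, 0.0], planes)),
    "en1": ("en", signature([1.0, 0.2, 0.0], planes)),
    "en2": ("en", signature([-1.0, -0.2, 0.0], planes)),
}
pairs = sliding_window_pairs(sigs, num_perms=4, window=3, max_dist=8, rng=rng)
```

In this sketch, more permutations and a wider window raise the chance that a truly similar pair ends up inside a shared window, at the cost of more comparisons. That is the kind of effectiveness-efficiency tradeoff the abstract describes, and a brute-force baseline corresponds to comparing every cross-lingual pair directly.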