This paper contributes to an important variant of cross-language information retrieval, called cross-language high similarity search. Given a collection D of documents and a query q in a language different from the language of D, the task is to retrieve highly similar documents with respect to q. Use cases for this task include cross-language plagiarism detection and translation search. The current line of research in cross-language high similarity search resorts to the comparison of q and the documents in D in a multilingual concept space—which, however, requires a linear scan of D. Monolingual high similarity search can be tackled in sub-linear time, either by fingerprinting or by “brute force n-gram indexing”, as it is done by Web search engines. We argue that neither fingerprinting nor brute force n-gram indexing can be applied to tackle cross-language high similarity search, and that a linear scan is inevitable. Our findings are based on theoretical and empirical insights.
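To make the linear scan concrete, the following is a minimal sketch under stated assumptions, not the method of the paper. It assumes that q and every document in D have already been mapped into a shared multilingual concept space and are available as sparse vectors. Cosine similarity, the threshold of 0.8, the function names `cosine` and `linear_scan`, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {concept: weight} dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def linear_scan(query_vec, collection, threshold=0.8):
    """Return (doc_id, similarity) pairs at or above the threshold, best first.

    Every document in the collection is compared with the query, so the
    running time grows linearly with the size of the collection.
    """
    hits = []
    for doc_id, doc_vec in collection.items():
        sim = cosine(query_vec, doc_vec)
        if sim >= threshold:
            hits.append((doc_id, sim))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy, hand-made concept vectors (hypothetical data, not from the paper).
collection = {
    "d1": {"c1": 1.0, "c2": 1.0},
    "d2": {"c3": 1.0},
    "d3": {"c1": 1.0, "c2": 0.9},
}
query = {"c1": 1.0, "c2": 1.0}
results = linear_scan(query, collection)
```

In this toy run, `d1` and `d3` share concepts with the query and pass the threshold, while `d2` does not. The loop touches every document once, which is the cost that fingerprinting or n-gram indexing avoids in the monolingual case.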