Cross-language high similarity search: why no sub-linear time bound can be expected

Authors:
Maik Anderka;Benno Stein;Martin Potthast
Affiliations:
Faculty of Media, Bauhaus University Weimar, Weimar, Germany;Faculty of Media, Bauhaus University Weimar, Weimar, Germany;Faculty of Media, Bauhaus University Weimar, Weimar, Germany
Venue:
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Year:
2010

Citing 6
Cited 2

A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Locality-sensitive hashing scheme based on p-stable distributions

SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
The ESA retrieval model revisited

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
A Wikipedia-based multilingual retrieval model

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Cross-Language high similarity search using a conceptual thesaurus

CLEF'12 Proceedings of the Third international conference on Information Access Evaluation: multilinguality, multimodality, and visual analytics

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper contributes to an important variant of cross-language information retrieval, called cross-language high similarity search. Given a collection D of documents and a query q in a language different from the language of D, the task is to retrieve highly similar documents with respect to q. Use cases for this task include cross-language plagiarism detection and translation search. The current line of research in cross-language high similarity search resorts to the comparison of q and the documents in D in a multilingual concept space—which, however, requires a linear scan of D. Monolingual high similarity search can be tackled in sub-linear time, either by fingerprinting or by “brute force n-gram indexing”, as it is done by Web search engines. We argue that neither fingerprinting nor brute force n-gram indexing can be applied to tackle cross-language high similarity search, and that a linear scan is inevitable. Our findings are based on theoretical and empirical insights.