Candidate document retrieval for web-scale text reuse detection

Authors:
Matthias Hagen;Benno Stein
Affiliations:
Faculty of Media, Bauhaus-Universität Weimar, Germany;Faculty of Media, Bauhaus-Universität Weimar, Germany
Venue:
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Year:
2011

Citing 15
Cited 2

Predicting query performance

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Using Noun Phrase Heads to Extract Document Keyphrases

AI '00 Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence
Maximal termsets as a query structuring mechanism

Proceedings of the 14th ACM international conference on Information and knowledge management
What makes a query difficult?

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Random sampling from a search engine's index

Journal of the ACM (JACM)
A survey of pre-retrieval query performance predictors

Proceedings of the 17th ACM conference on Information and knowledge management
Query by document

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Finding text reuse on the web

Proceedings of the Second ACM International Conference on Web Search and Data Mining
The Top Ten Algorithms in Data Mining

The Top Ten Algorithms in Data Mining
A case for improved evaluation of query difficulty prediction

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Automatic retrieval of similar content using search engine query interface

Proceedings of the 18th ACM conference on Information and knowledge management
Capacity-constrained query formulation

ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
Introducing the user-over-ranking hypothesis

ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval

The impact of spelling errors on patent search

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
From keywords to keyqueries: content descriptors for the web

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Abstract

Given a document d, the task of text reuse detection is to find those passages in d that also appear, in identical or paraphrased form, in other documents. To solve this problem at web scale, keywords representing d's topics have to be combined into web queries. The retrieved web documents can then be passed to a text reuse detection system for in-depth analysis. We focus on query formulation as the crucial first step of the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10, 14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the quality of the candidate documents, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4, 8].