As online document collections continue to expand, both on the Web and in proprietary environments, duplicate detection becomes increasingly critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near-duplicate documents and (b) algorithmic approaches to minimizing their negative effect on search results. Harnessing the expertise of both client users and professional searchers, we establish principled methods for generating a test collection for identifying and handling inexact duplicate documents.