Detecting duplicate web documents using clickthrough data

Authors:
Filip Radlinski; Paul N. Bennett; Emine Yilmaz
Affiliations:
Microsoft, Vancouver, BC, Canada; Microsoft Research, Redmond, WA, USA; Microsoft, Cambridge, United Kingdom
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Abstract

The web contains many duplicate and near-duplicate documents. Given that user satisfaction is negatively affected by redundant information in search results, a significant amount of research has been devoted to developing duplicate detection algorithms. However, most such algorithms rely solely on document content to detect duplication, ignoring the fact that a primary goal of duplicate detection is to identify documents that contain redundant information with respect to a particular user query. Similarly, although query-dependent result diversification algorithms compute a query-dependent ranking, they tend to do so on the basis of a query-independent content similarity score. In this paper, we bridge the gap between query-dependent redundancy and query-independent duplication by showing how user click behavior following a query provides evidence about the relative novelty of web documents. While most previous work on interpreting user clicks on search results has assumed that they reflect just result relevance, we show that clicks also provide information about duplication between web documents since users consider search results in the context of previously seen documents. Moreover, we find that duplication explains a substantial amount of presentation bias observed in clicking behavior. We identify three distinct types of redundancy that commonly occur on the web and show how click data can be used to detect these different types.
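
To make the intuition in the abstract concrete, here is a minimal sketch of one click-based redundancy signal. It is not the method from the paper: the log format, the field names (`upper`, `lower`, `clicked`), and the function `both_clicked_rate` are all assumptions made for illustration. The idea is that if two results are redundant for a query, a user who has clicked one has little reason to click the other, so the share of impressions in which both are clicked should be low.

```python
from collections import defaultdict


def both_clicked_rate(impressions):
    """For each unordered pair of adjacent results, return the fraction of
    impressions with a click on both documents, among impressions with a
    click on at least one of them.

    `impressions` is a hypothetical log format: each entry is a dict with
    the document shown higher ("upper"), the one shown lower ("lower"),
    and the collection of clicked documents ("clicked").
    """
    # pair -> [impressions with both clicked, impressions with any click]
    counts = defaultdict(lambda: [0, 0])
    for imp in impressions:
        pair = frozenset((imp["upper"], imp["lower"]))
        clicked = set(imp["clicked"]) & pair
        if clicked:
            counts[pair][1] += 1
            if len(clicked) == 2:
                counts[pair][0] += 1
    return {pair: both / any_ for pair, (both, any_) in counts.items()}


# Toy data (invented): a.com and a-mirror.com behave like duplicates,
# while a.com and b.com carry different information.
log = [
    {"upper": "a.com", "lower": "a-mirror.com", "clicked": ["a.com"]},
    {"upper": "a-mirror.com", "lower": "a.com", "clicked": ["a-mirror.com"]},
    {"upper": "a.com", "lower": "a-mirror.com", "clicked": ["a.com"]},
    {"upper": "a.com", "lower": "b.com", "clicked": ["a.com", "b.com"]},
    {"upper": "b.com", "lower": "a.com", "clicked": ["b.com"]},
]
rates = both_clicked_rate(log)
```

In this toy log the mirror pair is never clicked together, while the other pair is clicked together in half of its impressions. Both presentation orders are included because a click on the lower result may depend on what the user has already seen above it; a real system would need far more data and would have to control for position effects.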