Redundant documents and search effectiveness

Authors:
Yaniv Bernstein; Justin Zobel
Affiliations:
RMIT University, Melbourne, Australia; RMIT University, Melbourne, Australia
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005

Abstract

The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques, particularly document fingerprinting, for detecting content equivalence. Applying these techniques to the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics accurately identified content equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: we found that 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant.
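
For readers unfamiliar with the term, the following is a minimal sketch of one generic form of syntactic fingerprinting: hash every run of n consecutive words and compare two documents by the overlap of their hash sets. It is an illustration only and is not taken from the paper. The function names `chunk_hashes` and `resemblance`, the chunk length of 3 words, and the use of SHA-256 are all assumptions made for this example, and the paper's own chunk selection and scoring may differ.

```python
import hashlib


def chunk_hashes(text, n=3):
    """Return the set of hashes of every run of n consecutive words.

    Hypothetical helper: whitespace tokenisation, lowercasing, and n=3
    are assumptions for illustration only.
    """
    words = text.lower().split()
    chunks = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks}


def resemblance(doc_a, doc_b, n=3):
    """Jaccard overlap of the two documents' chunk-hash sets, in [0, 1]."""
    hashes_a = chunk_hashes(doc_a, n)
    hashes_b = chunk_hashes(doc_b, n)
    if not hashes_a or not hashes_b:
        # A document shorter than n words yields no chunks to compare.
        return 0.0
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)
```

A pair of documents scoring near 1.0 shares almost all of its word sequences and is a candidate for content equivalence. Where to place a cut-off is a separate judgment that this sketch does not attempt to settle.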