Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Reexamining the cluster hypothesis: scatter/gather on retrieval results
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Finding replicated Web collections
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Information Retrieval
Novelty and redundancy detection in adaptive filtering
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
The effect of topic set size on retrieval experiment error
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
Beyond independent relevance: methods and evaluation metrics for subtopic retrieval
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Retrieval and novelty detection at the sentence level
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
ACSC '04 Proceedings of the 27th Australasian conference on Computer science - Volume 26
Information retrieval system evaluation: effort, sensitivity, and reliability
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Resource-adaptive real-time new event detection
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Distributed text retrieval from overlapping collections
ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Novelty and diversity in information retrieval evaluation
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Improving web information indexing and retrieval based on center block duplication detection
International Journal of Innovative Computing and Applications
Novelty as a form of contextual re-ranking: efficient KLD models and mixture models
Proceedings of the second international symposium on Information interaction in context
Detecting the origin of text segments efficiently
Proceedings of the 18th international conference on World wide web
A framework for corroborating answers from multiple web sources
Information Systems
Rules of thumb for information acquisition from large and redundant data
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Quantifying test collection quality based on the consistency of relevance judgements
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Cross-lingual text fragment alignment using divergence from randomness
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Coreference aware web object retrieval
Proceedings of the 20th ACM international conference on Information and knowledge management
Compact features for detection of near-duplicates in distributed retrieval
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
The case of the duplicate documents measurement, search, and science
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Clustering near-identical sequences for fast homology search
RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Fast discovery of similar sequences in large genomic collections
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
On aggregating labels from multiple crowd workers to infer relevance of documents
ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
Measuring the coverage and redundancy of information search services on e-commerce platforms
Electronic Commerce Research and Applications
PROMISE'12 Proceedings of the 2012 international conference on Information Retrieval Meets Information Visualization
Hi-index | 0.00 |
The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques --- particularly document fingerprinting --- for detecting content equivalence. Using these techniques on the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics were accurately identifying content-equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: we found that 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant.