Many of the documents in large text collections are duplicates and versions of each other. In recent research, we developed new methods for finding such duplicates; however, as there was no directly comparable prior work, we had no measure of whether we had succeeded. Worse, the concept of “duplicate” not only proved difficult to define, but on reflection was not logically defensible. Our investigation highlighted a paradox of computer science research: objective measurement of outcomes involves a subjective choice of preferred measure; and attempts to define measures can easily founder in circular reasoning. Also, some measures are abstractions that simplify complex real-world phenomena, so success by a measure may not be meaningful outside the context of the research. These are not merely academic concerns, but are significant problems in the design of research projects. In this paper, the case of the duplicate documents is used to explore whether and when it is reasonable to claim that research is successful.
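To make concrete why "duplicate" resists definition, consider one common family of near-duplicate measures from the literature: resemblance computed as the Jaccard coefficient of word-shingle sets. The sketch below is illustrative only, not the methods developed in this work; the shingle size `k` and any acceptance threshold a system applies are arbitrary parameters, which is precisely the kind of subjective choice of measure the abstract describes.

```python
# Illustrative sketch (not this paper's method): word k-gram "shingles"
# and Jaccard resemblance, one common basis for near-duplicate detection.
# The shingle size k is an arbitrary, subjective choice of measure.

def shingles(text, k=3):
    """Return the set of word k-grams (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard coefficient of the two documents' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents: treat as identical
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
print(resemblance(doc1, doc2))  # 0.4: one changed word breaks 3 shingles
```

Note that whether 0.4 makes these two documents "duplicates" depends entirely on a threshold the researcher must choose, and changing `k` changes the score; there is no objective point at which two documents become versions of each other.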