Variations in relevance judgments and the evaluation of retrieval performance
Information Processing and Management: an International Journal
Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Assessing agreement on classification tasks: the kappa statistic
Computational Linguistics
Variations in relevance assessments and the measurement of retrieval effectiveness
Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Overview of the sixth text REtrieval conference (TREC-6)
Information Processing and Management: an International Journal - The sixth text REtrieval conference (TREC-6)
Variations in relevance judgments and the measurement of retrieval effectiveness
Information Processing and Management: an International Journal
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Machine Learning
Analysis of lexical signatures for finding lost or related documents
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Novelty and redundancy detection in adaptive filtering
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting similar documents using salient terms
Proceedings of the eleventh international conference on Information and knowledge management
Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
A large-scale study of the evolution of web pages
WWW '03 Proceedings of the 12th international conference on World Wide Web
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Online duplicate document detection: signature reliability in a dynamic retrieval environment
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Improved robustness of signature-based near-replica detection via lexicon randomization
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust Identification of Fuzzy Duplicates
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents
Journal of the American Society for Information Science and Technology - Research Articles
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Hi-index | 0.00 |
As massive document repositories and knowledge management systems continue to expand, in proprietary environments as well as on the Web, the need for duplicate detection becomes increasingly important. In business enterprises such as law firms, effective retrieval applications depend upon such functionality. Today's Internet-savvy users are not interested in search results containing numerous sets of duplicate documents, whether exact duplicates or near variants. This report addresses our work in the domain of legal information retrieval, working with a large, transactional knowledge management system. We specifically explore the occurrence and treatment of identical, near-identical, and fuzzy duplicate sub-documents ('clauses') in a contracts database. To our knowledge, we are the first to use principled methods to construct a test collection of transactional documents for such research purposes, one which identifies a variety of duplicate types and is deployed to establish baseline algorithmic approaches to deduplication. We subsequently investigate the application of digital signature techniques to characterize and compare similar clauses in order to identify duplicates and near duplicates. This approach establishes a baseline using methods and algorithms first developed in a parallel domain. It produces a set of promising results following an extensive assessment phase involving direct comparisons with gold training and test data created by expert attorneys working in the transactional domain.