Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
A study of smoothing methods for language models applied to Ad Hoc information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Clustering with Instance-level Constraints
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
Online duplicate document detection: signature reliability in a dynamic retrieval environment
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Near-duplicate detection for eRulemaking
dg.o '05 Proceedings of the 2005 national conference on Digital government research
Similarity measures for tracking information flow
Proceedings of the 14th ACM international conference on Information and knowledge management
DURIAN: a demo for near-duplicate detection
dg.o '06 Proceedings of the 2006 international conference on Digital government research
Identifying and classifying subjective claims
dg.o '07 Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains
A bootstrapping approach for identifying stakeholders in public-comment corpora
dg.o '07 Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains
Ontology generation for large email collections
dg.o '08 Proceedings of the 2008 international conference on Digital government research
Fixing the threshold for effective detection of near duplicate web documents in web crawling
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Hi-index | 0.00 |
Large volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents, which increases processing and storage costs, but is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking.This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing challenges to moving near-duplicate detection into operational rulemaking environments.