Next steps in near-duplicate detection for eRulemaking

Authors:
Hui Yang; Jamie Callan; Stuart Shulman
Affiliations:
Carnegie Mellon University; Carnegie Mellon University; University of Pittsburgh
Venue:
dg.o '06 Proceedings of the 2006 international conference on Digital government research
Year:
2006

References (9):
Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web

A study of smoothing methods for language models applied to Ad Hoc information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)

Clustering with Instance-level Constraints
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology

Online duplicate document detection: signature reliability in a dynamic retrieval environment
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management

Near-duplicate detection for eRulemaking
dg.o '05 Proceedings of the 2005 national conference on Digital government research

Similarity measures for tracking information flow
Proceedings of the 14th ACM international conference on Information and knowledge management

Cited by (5):
DURIAN: a demo for near-duplicate detection
dg.o '06 Proceedings of the 2006 international conference on Digital government research

Identifying and classifying subjective claims
dg.o '07 Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains

A bootstrapping approach for identifying stakeholders in public-comment corpora
dg.o '07 Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains

Ontology generation for large email collections
dg.o '08 Proceedings of the 2008 international conference on Digital government research

Fixing the threshold for effective detection of near duplicate web documents in web crawling
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I

Abstract

Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents. These documents increase processing and storage costs, but that is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing challenges to moving near-duplicate detection into operational rulemaking environments.
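
The abstract does not give DURIAN's algorithm, so the following is only a minimal sketch of the general idea it names: comparing bag-of-words representations to find edited copies of a form letter, then surfacing the text a commenter added. Cosine similarity, the 0.8 threshold, and all function names here are assumptions chosen for illustration, not details taken from the paper, and the sketch ignores the metadata and content-structure signals that DURIAN also uses.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(form_letter, comments, threshold=0.8):
    """Indices of comments whose similarity to the form letter meets the
    threshold (the threshold value is an illustrative assumption)."""
    reference = bag_of_words(form_letter)
    return [
        i for i, comment in enumerate(comments)
        if cosine_similarity(reference, bag_of_words(comment)) >= threshold
    ]


def unique_words(form_letter, comment):
    """Words that occur more often in the comment than in the form letter."""
    extra = bag_of_words(comment) - bag_of_words(form_letter)
    return sorted(extra)


form = "Please protect our national forests from new road construction"
comments = [
    form + ". My family hikes there.",
    "I support the proposed rule on mercury emissions",
]
matches = near_duplicates(form, comments)
added = unique_words(form, comments[0])
```

In this toy input, the first comment is the form letter plus one added sentence and the second is unrelated. The words returned by `unique_words` stand in for the customized text that the abstract warns agencies are likely to overlook.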