Next steps in near-duplicate detection for eRulemaking

  • Authors: Hui Yang; Jamie Callan; Stuart Shulman
  • Affiliations: Carnegie Mellon University; Carnegie Mellon University; University of Pittsburgh
  • Venue: dg.o '06 Proceedings of the 2006 international conference on Digital government research
  • Year: 2006

Abstract

Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents. This increases processing and storage costs, but is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing challenges to moving near-duplicate detection into operational rulemaking environments.
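
To make the idea of bag-of-words near-duplicate detection concrete, the following is a minimal Python sketch. It is not DURIAN itself: the abstract does not specify the similarity measure, thresholds, or how metadata and content structure are used, so the cosine similarity, the 0.7 threshold, the sentence-level comparison, and all function names here are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical cutoff for illustration; the abstract gives no threshold.
NEAR_DUPLICATE_THRESHOLD = 0.7


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(form_letter, comment):
    """Flag a comment whose word counts closely match the form letter."""
    score = cosine_similarity(bag_of_words(form_letter), bag_of_words(comment))
    return score >= NEAR_DUPLICATE_THRESHOLD


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def unique_text(form_letter, comment):
    """Return sentences in the comment that are not verbatim in the form letter."""
    known = set(split_sentences(form_letter))
    return [s for s in split_sentences(comment) if s not in known]


form = ("Please protect our rivers. "
        "The proposed rule weakens clean water standards.")
edited = form + " My family's well was contaminated last year."
unrelated = "I support the new fuel economy targets for light trucks."

print(is_near_duplicate(form, edited))
print(is_near_duplicate(form, unrelated))
print(unique_text(form, edited))
```

In this toy setting, the edited copy is flagged as a near-duplicate of the form letter, and `unique_text` surfaces the sentence the commenter added, which is the kind of substantive customization the abstract warns agencies may overlook. A system for real comment collections would need more robust tokenization, term weighting, and handling of edits within sentences.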