DURIAN: a demo for near-duplicate detection

  • Authors:
  • Hui Yang; Jamie Callan; Stuart Shulman

  • Affiliations:
  • Carnegie Mellon University; Carnegie Mellon University; University of Pittsburgh

  • Venue:
  • dg.o '06: Proceedings of the 2006 International Conference on Digital Government Research
  • Year:
  • 2006

Abstract

The recent move from paper to electronic public comments makes it much easier for individuals to customize form letters, but harder for agencies to identify substantive information, because many near-duplicate comments express the same viewpoint in slightly different language. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, are important components of data cleaning and integration processes for eRulemaking. This brief paper describes a demonstration of DURIAN (DUplicate Removal In lArge collectioN), a near-duplicate detection system that identifies and organizes near-duplicates for eRulemaking applications.
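
The abstract does not say how DURIAN detects duplicates, so the following is only a minimal illustrative sketch of the two tasks it names: separating exact and near-duplicates of a form letter, and recovering the unique text a commenter added. It assumes a generic technique (word shingles compared by Jaccard similarity), not DURIAN's actual method. The function names, the shingle size `k=3`, and the `0.5` threshold are all hypothetical choices made for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(form_letter, comment, threshold=0.5):
    """Label a comment relative to a form letter.

    'exact' means identical shingle sets after normalization; 'near' means
    similarity at or above the (assumed) threshold; otherwise 'distinct'.
    """
    a, b = shingles(form_letter), shingles(comment)
    if a == b:
        return "exact"
    return "near" if jaccard(a, b) >= threshold else "distinct"


def unique_sentences(form_letter, comment):
    """Return sentences of comment that do not occur in form_letter."""
    def split(text):
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def norm(sentence):
        return " ".join(re.findall(r"[a-z0-9']+", sentence.lower()))

    known = {norm(s) for s in split(form_letter)}
    return [s for s in split(comment) if norm(s) not in known]


# Invented example texts, not taken from any real public comment docket.
form = ("Please protect our national forests. Roadless areas must stay wild. "
        "Thank you for considering my comment.")
customized = form + " My family has hiked these trails for thirty years."
unrelated = "The proposed rule ignores mercury emissions data entirely."

print(classify(form, form))
print(classify(form, customized))
print(classify(form, unrelated))
print(unique_sentences(form, customized))
```

In this sketch the customized letter is grouped with the form letter, while the added sentence is surfaced separately, which is the kind of substantive text an agency reviewer would want to read. Pairwise comparison like this does not scale to large collections; a real system would need indexing or sketching on top of it.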