DURIAN: a demo for near-duplicate detection

Authors:
Hui Yang;Jamie Callan;Stuart Shulman
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;University of Pittsburgh
Venue:
dg.o '06 Proceedings of the 2006 international conference on Digital government research
Year:
2006

Abstract

The move from paper to electronic public comments has made it much easier for individuals to customize form letters, and much harder for agencies to identify substantive information, because many near-duplicate comments express the same viewpoint in slightly different language. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking. This brief paper describes a demonstration of DURIAN (DUplicate Removal In lArge collectioN), a near-duplicate detection system that identifies and organizes near-duplicates for eRulemaking applications.