Next steps in near-duplicate detection for eRulemaking
dg.o '06 Proceedings of the 2006 international conference on Digital government research
The move from paper to electronic public comments has made it much easier for individuals to customize form letters and harder for agencies to identify substantive information, because many near-duplicate comments express the same viewpoint in slightly different language. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, are important components of data cleaning and integration processes for eRulemaking. This brief paper describes a demonstration of a near-duplicate detection system, DURIAN (DUplicate Removal In lArge collectioN), that identifies and organizes near-duplicates for eRulemaking applications.
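To make the three tasks named above concrete (exact duplicates, near-duplicates, and unique text inside a near-duplicate), the following is a minimal sketch using word shingles, Jaccard similarity, and a word-level diff. This is a generic textbook approach chosen for illustration, not DURIAN's method, which the abstract does not describe. The function names, the shingle size `k=3`, and the `threshold=0.5` cutoff are all assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text, k=3):
    """Return the set of k-word shingles (assumed size k=3) in the text."""
    words = tokenize(text)
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_pair(text_a, text_b, threshold=0.5, k=3):
    """Label a pair as 'exact', 'near-duplicate', or 'distinct'.

    'exact' here means identical after tokenization, which is an
    assumption of this sketch. The threshold is illustrative only.
    """
    if tokenize(text_a) == tokenize(text_b):
        return "exact"
    sim = jaccard(shingles(text_a, k), shingles(text_b, k))
    return "near-duplicate" if sim >= threshold else "distinct"


def unique_text(reference, comment):
    """Return word spans that appear in comment but not in reference."""
    ref_words, com_words = tokenize(reference), tokenize(comment)
    matcher = SequenceMatcher(a=ref_words, b=com_words, autojunk=False)
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            spans.append(" ".join(com_words[j1:j2]))
    return spans


# Hypothetical form letter and a customized copy of it.
form = "I urge the agency to protect clean air standards for our children"
custom = form + " and my asthmatic son"

print(classify_pair(form, custom))
print(unique_text(form, custom))
```

Comparing every pair this way scales quadratically with the number of comments, so a large collection would need some form of indexing or hashing to limit the candidate pairs; that choice is outside what the abstract specifies.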