Near-duplicate detection for eRulemaking

Authors: Hui Yang; Jamie Callan
Affiliations: Carnegie Mellon University, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA
Venue: dg.o '05 Proceedings of the 2005 national conference on Digital government research
Year: 2005

Abstract

U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing regulations. In recent years, agencies have begun to accept comments via both email and Web forms. The transition from paper to electronic comments makes it much easier for individuals to customize "form" letters, which they do, creating "near-duplicate" comments that express the same viewpoint in slightly different language. This paper explores the use of simple text clustering and retrieval algorithms for identifying near-duplicate public comments. Experiments with public comments about a recent regulation proposed by the Environmental Protection Agency (EPA) demonstrate the effectiveness of the algorithms.
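
The abstract does not say which clustering or retrieval algorithms the paper uses, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes word-level shingles, Jaccard similarity, and a greedy single-pass clustering; the function names, the shingle size `k`, the `threshold` value, and the toy comments are all invented for the example.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(
    comments: List[str], threshold: float = 0.5, k: int = 3
) -> List[List[int]]:
    """Greedy single-pass clustering (an assumption of this sketch).

    Each comment joins the first cluster whose seed (first member) it
    resembles at or above `threshold`; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `comments`.
    """
    seeds: List[Set[Shingle]] = []
    clusters: List[List[int]] = []
    for idx, text in enumerate(comments):
        sh = shingles(text, k)
        for c, seed in enumerate(seeds):
            if jaccard(sh, seed) >= threshold:
                clusters[c].append(idx)
                break
        else:
            seeds.append(sh)
            clusters.append([idx])
    return clusters


# Made-up toy comments: the first two mimic a lightly customized form letter.
toy = [
    "Please strengthen the proposed emissions rule to protect public health",
    "Please strengthen the proposed emissions rule to protect our children",
    "I oppose this regulation because it will raise energy costs",
]
print(cluster_near_duplicates(toy))  # [[0, 1], [2]]
```

The first two toy comments share six of ten distinct three-word shingles (Jaccard 0.6), so they fall in one cluster, while the third shares none and forms its own. Comparing only against each cluster's seed keeps the sketch simple but makes the result depend on input order.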