Active learning for e-rulemaking: public comment categorization

  • Authors:
  • Stephen Purpura, Claire Cardie, Jesse Simons

  • Affiliations:
  • Cornell University, Ithaca, NY (all authors)

  • Venue:
  • dg.o '08 Proceedings of the 2008 international conference on Digital government research
  • Year:
  • 2008

Abstract

We address the e-rulemaking problem of reducing the manual labor required to analyze public comment sets. In current and previous work, for example, text categorization techniques have been used to speed up the comment analysis phase of e-rulemaking by classifying sentences automatically according to the rule-specific issues [2] or general topics [7, 8] that they address. Manually annotated data, however, is still required to train the supervised inductive learning algorithms that perform the categorization. This paper therefore investigates the application of active learning methods to public comment categorization: we develop two new, general-purpose active learning techniques that selectively sample from the available training data for human labeling when building the sentence-level classifiers employed in public comment categorization. Using an e-rulemaking corpus developed for this purpose [2], we compare our methods to the well-known query-by-committee (QBC) active learning algorithm [5] and to a baseline that randomly selects instances for labeling in each round of active learning. Our methods exceed the performance of both the random-selection active learner and the QBC variant by a statistically significant margin, requiring many fewer training examples to reach the same levels of accuracy on a held-out test set. This provides promising evidence that automated text categorization methods can effectively support public comment analysis.
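The query-by-committee baseline discussed in the abstract can be sketched in miniature. The toy below is not the paper's actual setup (which uses sentence-level text classifiers): it assumes one-dimensional numeric features, a committee of bootstrap-trained threshold classifiers, and vote-entropy disagreement; all function names here are illustrative.

```python
import math
import random

def train_threshold(data):
    """Fit a 1-D threshold classifier: predict 1 iff x >= t.

    data is a list of (x, y) pairs with y in {0, 1}; the threshold is
    chosen from the observed x values to minimize training errors.
    """
    best_t, best_err = 0.0, float("inf")
    for t in (x for x, _ in data):
        err = sum((x >= t) != bool(y) for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def vote_entropy(votes_for_1, k):
    """Entropy of the committee's vote split: 0 when unanimous,
    maximal when the k members split evenly."""
    ent = 0.0
    for c in (votes_for_1, k - votes_for_1):
        if c:
            p = c / k
            ent -= p * math.log(p)
    return ent

def qbc_select(labeled, unlabeled, k=5, seed=0):
    """One QBC round: train a committee of k classifiers on bootstrap
    resamples of the labeled pool, then pick the unlabeled instance
    the committee disagrees on most (highest vote entropy)."""
    rng = random.Random(seed)
    committee = [
        train_threshold([rng.choice(labeled) for _ in labeled])
        for _ in range(k)
    ]

    def disagreement(x):
        votes = sum(x >= t for t in committee)
        return vote_entropy(votes, k)

    return max(unlabeled, key=disagreement)

labeled = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
unlabeled = [0.5, 2.5, 4.5]
query = qbc_select(labeled, unlabeled)  # instance sent to the human annotator
```

In a full active-learning loop, the selected instance would be labeled by a human, moved into the labeled pool, and the committee retrained; the random-selection baseline simply replaces `qbc_select` with a uniform draw from the unlabeled pool.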