Using language modeling to select useful annotation data

  • Authors: Dmitriy Dligach; Martha Palmer
  • Affiliations: University of Colorado at Boulder; University of Colorado at Boulder
  • Venue: SRWS '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium
  • Year: 2009

Abstract

An annotation project typically has an abundant supply of unlabeled data that can be drawn from some corpus, but because labeling is expensive, it helps to pre-screen the pool of candidate instances by some criterion of future usefulness. In many cases, that criterion is to increase the representation of rare classes in the data to be annotated. We propose a novel method for solving this problem and show that it compares favorably to a random sampling baseline and a clustering algorithm.