Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing

  • Authors:
  • Michele Banko; Eric Brill

  • Affiliations:
  • Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA

  • Venue:
  • HLT '01: Proceedings of the First International Conference on Human Language Technology Research
  • Year:
  • 2001

Abstract

In this paper, we discuss experiments applying machine learning techniques to the task of confusion set disambiguation, using three orders of magnitude more training data than has previously been used for any disambiguation-in-string-context problem. In an attempt to determine when current learning methods will cease to benefit from additional training data, we analyze residual errors made by learners when issues of sparse data have been significantly mitigated. Finally, in the context of our results, we discuss possible directions for the empirical natural language research community.