Automatic Training Data Cleaning for Text Classification

  • Authors:
  • Hassan H. Malik; Vikas S. Bhardwaj

  • Venue:
  • ICDMW '11 Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops
  • Year:
  • 2011

Abstract

Supervised text classification algorithms rely on the availability of large quantities of quality training data to achieve their optimal performance. However, not all training data is created equal, and the quality of class labels assigned by human experts may vary greatly with their levels of experience, domain knowledge, and the time available to label each document. In our experiments, focused label validation and correction by expert journalists improved the Micro-F1 and Macro-F1 scores achieved by Linear SVMs by as much as 14.5% and 30%, respectively, on a corpus of professionally labeled news stories. Manual label correction is an expensive and time-consuming process, and classification quality may not improve linearly with the amount of time spent, making it increasingly expensive to reach higher classification quality targets. We propose ATDC, a novel evidence-based training data cleaning method that uses training examples with high-quality class labels to automatically validate and correct the labels of noisy training data. A subset of the validated and corrected instances is then selected to augment the original training set. On a large noisy dataset of about two million news stories, our method improved the baseline Micro-F1 and Macro-F1 scores by 9% and 13%, respectively, without requiring any further human intervention.
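The abstract does not spell out ATDC's evidence mechanism, so the sketch below is only a minimal illustration of the general idea it describes: train a model on the trusted, high-quality labels, use it to validate or correct the noisy labels, and keep a confidently handled subset to augment the training set. The confidence threshold `margin`, the TF-IDF features, and the use of a linear SVM as the validator are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch only: stands in for an evidence-based label
# validation/correction step; it is NOT the ATDC algorithm itself.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def clean_noisy_labels(clean_texts, clean_labels, noisy_texts, margin=1.0):
    """Validate/correct noisy labels with a model trained on trusted data.

    Returns the noisy examples whose predicted label the trusted model
    supports with a decision margin >= `margin` (a hypothetical knob),
    relabeled with that prediction, for augmenting the training set.
    """
    vec = TfidfVectorizer(sublinear_tf=True)
    X_clean = vec.fit_transform(clean_texts)
    X_noisy = vec.transform(noisy_texts)

    # Model trained only on the high-quality labels serves as the "evidence".
    svm = LinearSVC().fit(X_clean, clean_labels)
    scores = svm.decision_function(X_noisy)
    if scores.ndim == 1:                       # binary case: make it (n, 2)
        scores = np.column_stack([-scores, scores])

    predicted = svm.classes_[scores.argmax(axis=1)]
    confidence = scores.max(axis=1)

    keep = confidence >= margin
    kept_texts = [t for t, k in zip(noisy_texts, keep) if k]
    kept_labels = predicted[keep]              # corrected where the model disagrees
    return kept_texts, kept_labels
```

In this toy version, the selected subset would simply be appended to the original training set before retraining the final classifier; the paper's actual selection criteria are not reproduced here.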