Improving text categorization methods for event tracking

  • Authors:
  • Yiming Yang;Tom Ault;Thomas Pierce;Charles W. Lattimer

  • Affiliations:
  • Language Technologies Institute and Computer Science Department, Newell Simon Hall 3612D, Carnegie Mellon University, Pittsburgh, PA (all four authors)

  • Venue:
  • SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Year:
  • 2000

Abstract

Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events have a short duration in time. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, making kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different data collections, suggesting a robust solution for parameter optimization.
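
To make the kNN-based tracking idea concrete, here is a minimal sketch of a conventional kNN decision rule applied to a single event. It is not one of the paper's new variants, which the abstract does not specify. The function names (`vectorize`, `cosine`, `knn_score`, `track`), the raw term-frequency weighting, the values of `k` and `threshold`, and the toy training stories are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumption: no IDF or stemming)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_score(doc, training, k=3):
    """Score a document against one event.

    training is a list of (text, is_on_event) pairs. The score adds the
    similarity of each on-event neighbour among the k nearest training
    documents and subtracts the similarity of each off-event neighbour.
    """
    query = vectorize(doc)
    sims = [(cosine(query, vectorize(text)), label) for text, label in training]
    nearest = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(sim if label else -sim for sim, label in nearest)


def track(doc, training, k=3, threshold=0.0):
    """Flag the document as on-event when its kNN score exceeds the threshold."""
    return knn_score(doc, training, k) > threshold


# Hypothetical training stories: very few positives, as in the tracking setting.
training = [
    ("earthquake strikes coastal city", True),
    ("rescue teams search earthquake rubble", True),
    ("stock markets close higher", False),
    ("football team wins championship final", False),
    ("central bank raises interest rates", False),
]

print(track("earthquake survivors found in rubble", training))
print(track("bank raises rates again", training))
```

With so few positive examples per event, the top-k neighbourhood is usually dominated by off-event documents, which is the kind of difficulty the abstract says the adapted methods are meant to address.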