Exploring classification concept drift on a large news text corpus

  • Authors:
  • Artur Šilić;Bojana Dalbelo Bašić

  • Affiliations:
  • Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia;Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

  • Venue:
  • CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Concept drift has regained research interest during recent years as many applications use data sources that are changing over time. We study the classification task using logistic regression on a large news collection of 248K texts during a period of seven years. We present extrinsic methods of concept drift detection and quantification using training set formation with different windowing techniques. We characterize concept drift on a seven-year-long Le Monde news corpus and show the overestimation of classifier performance if it is neglected. We lay out paths for future work where we plan to refine extrinsic characterization methods and investigate the drifting of learning parameters when few examples are available.