Exploring classification concept drift on a large news text corpus

Authors:
Artur Šilić; Bojana Dalbelo Bašić
Affiliations:
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Venue:
CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Year:
2012

Abstract

Concept drift has regained research interest in recent years, as many applications rely on data sources that change over time. We study the classification task using logistic regression on a large collection of 248K news texts spanning seven years. We present extrinsic methods for detecting and quantifying concept drift, based on forming training sets with different windowing techniques. We characterize concept drift on a seven-year Le Monde news corpus and show that classifier performance is overestimated when drift is neglected. Finally, we lay out paths for future work, in which we plan to refine the extrinsic characterization methods and investigate the drift of learning parameters when few examples are available.
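
The abstract mentions forming training sets with different windowing techniques but does not spell them out. The following is a minimal sketch of two common variants (a growing window and a fixed-size sliding window), not the authors' implementation. The `Doc` record layout, the function name `window_splits`, and the toy data are all assumptions made for illustration.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

# Hypothetical document record: (period, text, label), where `period` is an
# integer time index such as a month number. This layout is an assumption.
Doc = Tuple[int, str, str]


def window_splits(
    docs: Sequence[Doc], window: Optional[int] = None
) -> Iterator[Tuple[int, List[Doc], List[Doc]]]:
    """Yield (test_period, train_docs, test_docs) in chronological order.

    window=None -> growing window: train on every earlier period.
    window=k    -> sliding window: train on the k periods before the test period.
    """
    periods = sorted({d[0] for d in docs})
    for t in periods[1:]:  # the first period has no earlier training data
        start = periods[0] if window is None else t - window
        train = [d for d in docs if start <= d[0] < t]
        test = [d for d in docs if d[0] == t]
        if train:
            yield t, train, test


# Toy data, invented for illustration only.
docs = [(0, "a", "sport"), (1, "b", "politics"), (2, "c", "sport"), (3, "d", "culture")]
sliding = list(window_splits(docs, window=2))
growing = list(window_splits(docs))
```

In such a setup, a classifier would be trained on each `train` list and scored on the matching `test` list. Comparing those scores with the score from a randomly shuffled split, which ignores time, is one plausible way to expose the overestimation that the abstract describes.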