Suppressing model overfitting in mining concept-drifting data streams

Authors:
Haixun Wang;Jian Yin;Jian Pei;Philip S. Yu;Jeffrey Xu Yu
Affiliations:
IBM T. J. Watson Research;IBM T. J. Watson Research;Simon Fraser University, Canada;IBM T. J. Watson Research;Chinese University of Hong Kong
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

Mining data streams with changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution, which poses a serious challenge. On the one hand, relying on historical data may increase the chance of learning obsolete models. On the other hand, learning only from the latest data may lead to biased classifiers, because the latest data is often an unrepresentative sample of the current class distribution. The problem is particularly acute in classifying rare events, when, for example, instances of the rare class do not appear at all in the most recent training data. In this paper, we use a stochastic model to describe concept-shifting patterns and formulate the problem as an optimization problem: given the historical and current training data observed so far, find the most likely current distribution and learn a classifier based on it. We derive an analytic solution and approximate it with an efficient algorithm that carefully calibrates the influence of historical data to create an accurate classifier. We evaluate the algorithm on both synthetic and real-world datasets. The results show that it classifies accurately and efficiently.
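The trade-off between obsolete history and an unrepresentative latest batch can be illustrated with a small sketch. The abstract does not give the paper's analytic solution, so the code below is not the authors' algorithm. It swaps in plain exponential decay as a stand-in for "calibrating the influence of historical data", and the function name `decayed_class_prior` and the parameter `decay` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def decayed_class_prior(
    batches: Sequence[Sequence[int]],
    decay: float = 0.5,
    n_classes: int = 2,
) -> List[float]:
    """Estimate the current class prior from time-ordered label batches.

    ASSUMPTION: exponential decay is an illustrative stand-in, not the
    method from the paper. `batches` is ordered oldest first; each batch
    is a sequence of integer class labels in range(n_classes).
    decay = 0 uses only the newest batch; decay = 1 weights all equally.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")

    weighted_counts = [0.0] * n_classes
    newest = len(batches) - 1
    for index, labels in enumerate(batches):
        # Older batches receive geometrically smaller weight.
        weight = decay ** (newest - index)
        for label in labels:
            weighted_counts[label] += weight

    total = sum(weighted_counts)
    if total == 0.0:
        # No usable data: fall back to a uniform prior.
        return [1.0 / n_classes] * n_classes
    return [count / total for count in weighted_counts]


if __name__ == "__main__":
    # Class 1 is rare and is absent from the newest batch.
    stream = [[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
    for d in (0.0, 0.5, 1.0):
        print(d, decayed_class_prior(stream, decay=d))
```

With `decay=0.0` the estimate relies on the newest batch alone, so the rare class gets probability zero, which is the failure mode the abstract describes. Larger values of `decay` bring historical evidence back in, at the risk of reflecting an outdated concept.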