On Appropriate Assumptions to Mine Data Streams: Analysis and Practice

Authors:
Jing Gao; Wei Fan; Jiawei Han
Venue:
ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Year:
2007

Citing: 0
Cited by: 19

Knowledge transfer via multiple model local structure mapping
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Categorizing and mining concept drifting data streams
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Peer to peer botnet detection for cyber-security: a data mining approach
Proceedings of the 4th annual workshop on Cyber security and information intelligence research: developing strategies to meet the cyber security and information intelligence challenges ahead

A Multi-partition Multi-chunk Ensemble Technique to Classify Concept-Drifting Data Streams
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining

Parameter Estimation in Semi-Random Decision Tree Ensembling on Streaming Data
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining

An Aggregate Ensemble for Mining Concept Drifting Data Streams with Noise
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining

Lacking Labels in the Stream: Classifying Evolving Stream Data with Few Labels
ISMIS '09 Proceedings of the 18th International Symposium on Foundations of Intelligent Systems

SERA: selectively recursive approach towards nonstationary imbalanced stream data mining
IJCNN'09 Proceedings of the 2009 international joint conference on Neural Networks

The impact of latency on online classification learning with concept drift
KSEM'10 Proceedings of the 4th international conference on Knowledge science, engineering and management

Robust ensemble learning for mining noisy data streams
Decision Support Systems

Building a new classifier in an ensemble using streaming unlabeled data
IEA/AIE'10 Proceedings of the 23rd international conference on Industrial engineering and other applications of applied intelligent systems - Volume Part II

Cloud-based malware detection for evolving data streams
ACM Transactions on Management Information Systems (TMIS)

Enabling fast prediction for ensemble models on data streams
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Time stamping in the presence of latency and drift
ICAIS'11 Proceedings of the Second international conference on Adaptive and intelligent systems

Detecting change via competence model
ICCBR'10 Proceedings of the 18th international conference on Case-Based Reasoning Research and Development

A framework for application-driven classification of data streams
Neurocomputing

Automated Anomaly Detector Adaptation using Adaptive Threshold Tuning
ACM Transactions on Information and System Security (TISSEC)

Concept drift detection via competence models
Artificial Intelligence

The CART decision tree for mining data streams
Information Sciences: an International Journal

Abstract

Recent years have witnessed an increasing number of studies in stream mining, which aim at building an accurate model for continuously arriving data. Most existing work, however, makes the implicit assumption that the training data and the yet-to-come testing data are always sampled from the "same distribution", and yet this "same distribution" evolves over time. We demonstrate that this may not be true, and that one may actually never know either "how" or "when" the distribution changes. Thus, a model that fits the observed distribution well can have unsatisfactory accuracy on the incoming data. Practically, one can assume only the bare minimum: that learning from observed data is better than both random guessing and always predicting the same class label. Importantly, we formally and experimentally demonstrate the robustness of a framework based on model averaging and simple voting for data streams, particularly when incoming data "continuously" follows significantly different distributions. On a real streaming data set, this framework reduces the expected error of baseline models by 60% and remains the most accurate compared to those baseline models.
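As a rough illustration of the model averaging and simple voting idea named in the abstract, the sketch below combines class-probability estimates from several base models with uniform weights, and contrasts that with a plain majority vote. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`average_posteriors`, `predict_by_averaging`, `predict_by_voting`), the dictionary representation of each model's output, and the example numbers are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def average_posteriors(posteriors: List[Dict[str, float]]) -> Dict[str, float]:
    """Uniformly average class-probability estimates from several base models.

    Each dict maps a class label to one model's estimated probability
    (hypothetical representation, chosen only for this sketch).
    """
    if not posteriors:
        raise ValueError("need at least one base model output")
    classes = set().union(*posteriors)  # every label seen by any model
    k = len(posteriors)
    return {c: sum(p.get(c, 0.0) for p in posteriors) / k for c in sorted(classes)}


def predict_by_averaging(posteriors: List[Dict[str, float]]) -> str:
    """Pick the class with the highest averaged probability."""
    avg = average_posteriors(posteriors)
    return max(avg, key=avg.get)


def predict_by_voting(posteriors: List[Dict[str, float]]) -> str:
    """Pick the class that most base models rank first (simple voting)."""
    votes = Counter(max(p, key=p.get) for p in posteriors)
    return votes.most_common(1)[0][0]


# Made-up outputs of three base models for a single incoming example.
outputs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.4, "b": 0.6},
    {"a": 0.3, "b": 0.7},
]
avg = average_posteriors(outputs)
by_averaging = predict_by_averaging(outputs)
by_voting = predict_by_voting(outputs)
```

In this made-up example the two combination rules disagree: averaging favors "a" because one model is very confident, while two of the three models vote for "b". Neither rule relies on knowing how or when the distribution changes, which is the property the abstract emphasizes.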