SKIF: a data imputation framework for concept drifting data streams

Authors:
Peng Zhang;Xingquan Zhu;Jianlong Tan;Li Guo
Affiliations:
Chinese Academy of Sciences, Beijing, China;QCIS Center, & University of Technology, Sydney, Sydney, Australia;Chinese Academy of Sciences, Beijing, China;Chinese Academy of Sciences, Beijing, China
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Abstract

Missing data are common in many applications. Many data imputation methods exist for handling missing data in large-scale databases, but when applied to concept drifting data streams they face four difficulties. First, because stream data are large in volume and arrive continuously, it is infeasible to keep all stream records as a candidate pool for estimating missing values, as most existing methods do. Second, even if all complete stream records could be maintained in a summary structure, concept drift would make some of that information obsolete and thus degrade imputation accuracy. Third, data streams require an algorithm that is both fast and accurate at finding the most similar data for imputation. Fourth, because data collection environments are dynamic and complex, the missing rate of most stream data may be much higher than that of generic static databases, so the imputation method should be able to accommodate a high missing rate. To tackle these challenges, we propose a Streaming k-Nearest-Neighbors Imputation Framework (SKIF) for concept drifting data streams. To handle the concept drift and large-volume problems, SKIF first summarizes historical complete records in micro-resources (high-level statistical data structures) and maintains these micro-resources in a candidate pool as benchmark data. SKIF then employs a novel hybrid-kNN imputation procedure, which uses a hybrid similarity search mechanism to find the most similar micro-resources in the large-scale candidate pool efficiently. Experimental results demonstrate the effectiveness of the proposed SKIF framework for data stream imputation tasks.
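The general idea in the abstract (summarize complete records, keep the summaries in a candidate pool, impute from the k most similar summaries) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the contents of a micro-resource or the hybrid similarity search, so everything below is an assumption made for illustration. The names `MicroResource`, `CandidatePool`, `partial_distance`, `radius`, and `max_age` are hypothetical, a micro-resource is assumed to store only a count and per-attribute sums, stale summaries are simply evicted as a crude stand-in for drift handling, and a plain linear scan replaces the paper's hybrid search.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroResource:
    """Hypothetical summary of a group of complete records."""
    n: int               # number of records absorbed
    linear_sum: list     # per-attribute sum of absorbed records
    last_update: int     # timestamp of the most recent absorbed record

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, record, t):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, record)]
        self.last_update = t


def partial_distance(record, centroid):
    """Euclidean distance over observed attributes only (None marks missing)."""
    observed = [(x, c) for x, c in zip(record, centroid) if x is not None]
    if not observed:
        return math.inf
    return math.sqrt(sum((x - c) ** 2 for x, c in observed))


class CandidatePool:
    """Assumed candidate pool: a flat list of summaries, scanned linearly."""

    def __init__(self, radius, max_age):
        self.radius = radius    # assumed absorb threshold
        self.max_age = max_age  # assumed staleness limit for drift handling
        self.pool = []

    def add_complete(self, record, t):
        # Evict summaries not updated recently (assumed drift heuristic).
        self.pool = [m for m in self.pool if t - m.last_update <= self.max_age]
        nearest = min(
            self.pool,
            key=lambda m: partial_distance(record, m.centroid()),
            default=None,
        )
        if nearest is not None and partial_distance(record, nearest.centroid()) <= self.radius:
            nearest.absorb(record, t)
        else:
            self.pool.append(MicroResource(1, list(record), t))

    def impute(self, record, k):
        if not self.pool:
            raise ValueError("candidate pool is empty")
        # Linear-scan kNN over summaries; the paper's hybrid search is not reproduced.
        neighbours = sorted(
            self.pool, key=lambda m: partial_distance(record, m.centroid())
        )[:k]
        total = sum(m.n for m in neighbours)
        filled = list(record)
        for i, x in enumerate(record):
            if x is None:
                # Count-weighted mean of the neighbours' centroids.
                filled[i] = sum(m.n * m.centroid()[i] for m in neighbours) / total
        return filled
```

As a usage example, after calling `add_complete` on `[1.0, 2.0]` and `[1.2, 2.2]`, `impute([1.0, None], k=1)` fills the missing attribute from the single summary that absorbed both records. Storing only counts and sums keeps memory bounded by the number of summaries rather than the number of stream records, which is the motivation the abstract gives for summarizing.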