Enabling fast prediction for ensemble models on data streams

Authors:
Peng Zhang;Jun Li;Peng Wang;Byron J. Gao;Xingquan Zhu;Li Guo
Affiliations:
Chinese Academy of Sciences, Beijing, China;Chinese Academy of Sciences, Beijing, China;Chinese Academy of Sciences, Beijing, China;Texas State University - San Marcos, San Marcos, TX, USA;University of Technology, Sydney, Sydney, Australia;Chinese Academy of Sciences, Beijing, China
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Abstract

Ensemble learning has become a common tool for data stream classification because it can handle large volumes of stream data and concept drift. Previous studies focus on building accurate prediction models from stream data. However, linearly scanning a large number of base classifiers in the ensemble during prediction incurs significant response-time costs, which makes ensemble learning impractical for many real-world time-critical data stream applications, such as Web traffic stream monitoring, spam detection, and intrusion detection. In these applications, data streams usually arrive at rates on the order of gigabytes per second, and each stream record must be classified in a timely manner. To address this problem, we propose a novel Ensemble-tree (E-tree for short) indexing structure that organizes all base classifiers in an ensemble for fast prediction. On the one hand, E-trees treat ensembles as spatial databases and employ an R-tree-like height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the other hand, E-trees update themselves automatically by continuously integrating new classifiers and discarding outdated ones, thereby adapting to new trends and patterns in the underlying data streams. Experiments on both synthetic and real-world data streams demonstrate the performance of our approach.
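
The contrast between a linear scan and an indexed lookup can be illustrated with a minimal Python sketch. It is not the E-tree algorithm itself. The abstract does not say how base classifiers become spatial objects, so the sketch assumes each base classifier has been decomposed into axis-aligned rectangular decision rules, and it builds a static, bulk-packed bounding-box tree rather than a dynamic R-tree with insertion and deletion. All names here (`Rule`, `Node`, `build_tree`, `predict_indexed`, `fanout`) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Assumption: a decision rule covers an axis-aligned box, one (low, high) per feature.
Box = Tuple[Tuple[float, float], ...]


@dataclass
class Rule:
    box: Box
    label: int
    classifier_id: int  # which base classifier the rule came from


@dataclass
class Node:
    box: Box         # bounding box of everything below this node
    children: list   # Rule objects if leaf is True, otherwise Node objects
    leaf: bool


def contains(box: Box, x: Sequence[float]) -> bool:
    """True if point x lies inside the (closed) box."""
    return all(lo <= v <= hi for (lo, hi), v in zip(box, x))


def bounding_box(boxes: Sequence[Box]) -> Box:
    """Smallest box enclosing all given boxes."""
    dims = len(boxes[0])
    return tuple(
        (min(b[d][0] for b in boxes), max(b[d][1] for b in boxes))
        for d in range(dims)
    )


def pack(items: list, fanout: int, leaf: bool) -> List[Node]:
    """Group items (sorted by lower bound of the first feature) into nodes."""
    ordered = sorted(items, key=lambda it: it.box[0][0])
    groups = [ordered[i:i + fanout] for i in range(0, len(ordered), fanout)]
    return [Node(bounding_box([it.box for it in g]), g, leaf) for g in groups]


def build_tree(rules: List[Rule], fanout: int = 2) -> Node:
    """Bottom-up packing into a height-balanced tree (static, no updates)."""
    if not rules:
        raise ValueError("need at least one rule")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = pack(rules, fanout, leaf=True)
    while len(level) > 1:
        level = pack(level, fanout, leaf=False)
    return level[0]


def search(node: Node, x: Sequence[float], hits: List[Rule]) -> None:
    """Collect rules covering x, skipping subtrees whose box excludes x."""
    if not contains(node.box, x):
        return
    for child in node.children:
        if node.leaf:
            if contains(child.box, x):
                hits.append(child)
        else:
            search(child, x, hits)


def majority(hits: List[Rule]) -> Optional[int]:
    """Majority vote over the labels of matching rules; None if no rule matches."""
    if not hits:
        return None
    return Counter(r.label for r in hits).most_common(1)[0][0]


def predict_linear(rules: List[Rule], x: Sequence[float]) -> Optional[int]:
    """Baseline: test every rule of every base classifier."""
    return majority([r for r in rules if contains(r.box, x)])


def predict_indexed(root: Node, x: Sequence[float]) -> Optional[int]:
    """Indexed lookup: only descend into subtrees whose box contains x."""
    hits: List[Rule] = []
    search(root, x, hits)
    return majority(hits)
```

Both functions return the same vote, because the tree only prunes subtrees that cannot contain a matching rule. How much work the index saves depends on how much the rule boxes overlap, and the sketch makes no claim about the complexity bounds stated in the abstract.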