The ClusTree: indexing micro-clusters for anytime stream mining

Authors:
Philipp Kranen;Ira Assent;Corinna Baldauf;Thomas Seidl
Affiliations:
RWTH Aachen University, Aachen, Germany;Aarhus University, Aarhus, Denmark;RWTH Aachen University, Aachen, Germany;RWTH Aachen University, Aachen, Germany
Venue:
Knowledge and Information Systems
Year:
2011

Citing 0
Cited 10

Enabling fast prediction for ensemble models on data streams
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Memory-less unsupervised clustering for data streaming by versatile ellipsoidal function
Proceedings of the 20th ACM international conference on Information and knowledge management

A weightless neural network-based approach for stream data clustering
IDEAL'12 Proceedings of the 13th international conference on Intelligent Data Engineering and Automated Learning

A single pass trellis-based algorithm for clustering evolving data streams
DaWaK'12 Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery

Weighted Fuzzy-Possibilistic C-Means Over Large Data Sets
International Journal of Data Warehousing and Mining

Warped K-Means: An algorithm to cluster sequentially-distributed data
Information Sciences: an International Journal

Data stream clustering: A survey
ACM Computing Surveys (CSUR)

Energy-based function to evaluate data stream clustering
Advances in Data Analysis and Classification

Online fuzzy medoid based clustering algorithms
Neurocomputing

Mining top-k frequent patterns over data streams sliding window
Journal of Intelligent Information Systems

Abstract

Clustering streaming data requires algorithms that are capable of updating clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data and within the possibly varying inter-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. In this work, we propose a parameter-free algorithm that automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. Our approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, we introduce the ClusTree, a compact and self-adaptive index structure for maintaining stream summaries. Additionally, we present solutions to handle very fast streams through aggregation mechanisms and propose novel descent strategies that improve the clustering result on slower streams as long as time permits. Our experiments show that our approach is capable of handling a multitude of different stream characteristics for accurate and scalable anytime stream clustering.
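
The abstract's idea of weighting objects by age inside a compact stream summary can be illustrated with a small sketch. The Python below is not the authors' implementation. It assumes a micro-cluster stored as a cluster feature (count, linear sum, squared sum) and an exponential fading weight of the form 2^(-lambda * dt). The class name `DecayedMicroCluster`, the `decay_rate` parameter, and its default value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class DecayedMicroCluster:
    """Hypothetical time-faded cluster feature (n, LS, SS) for d-dimensional points."""

    dim: int
    decay_rate: float = 0.1  # assumed value for lambda, not taken from the paper
    n: float = 0.0  # faded number of absorbed objects
    ls: List[float] = field(default_factory=list)  # faded linear sum per dimension
    ss: List[float] = field(default_factory=list)  # faded squared sum per dimension
    last_update: float = 0.0  # timestamp of the most recent fade

    def __post_init__(self) -> None:
        if not self.ls:
            self.ls = [0.0] * self.dim
        if not self.ss:
            self.ss = [0.0] * self.dim

    def _fade(self, now: float) -> None:
        # Down-weight everything stored so far by 2^(-lambda * elapsed time).
        elapsed = max(0.0, now - self.last_update)
        weight = 2.0 ** (-self.decay_rate * elapsed)
        self.n *= weight
        self.ls = [v * weight for v in self.ls]
        self.ss = [v * weight for v in self.ss]
        self.last_update = now

    def insert(self, point: Sequence[float], now: float) -> None:
        # Fade the old content first, then absorb the new object with weight 1.
        self._fade(now)
        self.n += 1.0
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss = [a + x * x for a, x in zip(self.ss, point)]

    def mean(self) -> List[float]:
        if self.n == 0.0:
            raise ValueError("empty micro-cluster has no mean")
        return [v / self.n for v in self.ls]


if __name__ == "__main__":
    mc = DecayedMicroCluster(dim=1, decay_rate=1.0)
    mc.insert([0.0], now=0.0)
    mc.insert([3.0], now=1.0)
    # The older point now counts half as much, so the mean leans toward 3.0.
    print(mc.n, mc.mean())
```

Because the summary stores only sums, each insertion costs time linear in the dimensionality and never requires revisiting earlier objects, which matches the single-pass, limited-memory setting described above.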