Improving the offline clustering stage of data stream algorithms in scenarios with variable number of clusters

Authors:
Elaine R. Faria;Rodrigo C. Barros;João Gama;André C. P. L. F. Carvalho
Affiliations:
University of São Paulo and Fed. University of Uberlândia, São Carlos/Uberlândia, Brazil;University of São Paulo, São Carlos, Brazil;University of Porto, Porto, Portugal;University of São Paulo, São Carlos, Brazil
Venue:
Proceedings of the 27th Annual ACM Symposium on Applied Computing
Year:
2012

Citing 4

k-means++: the advantages of careful seeding
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms

A framework for clustering evolving data streams
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

On the efficiency of evolutionary fuzzy clustering
Journal of Heuristics

Efficiency issues of evolutionary k-means
Applied Soft Computing

Cited 1

Data stream clustering: A survey
ACM Computing Surveys (CSUR)

Abstract

Many data stream clustering algorithms operate in two well-defined stages: (i) an online statistical data collection stage; and (ii) an offline macro-clustering stage. The well-known k-means algorithm is often employed for the offline macro-clustering stage. Conventional k-means assumes that the number of clusters (k) is defined a priori by the user. Given the difficulty of defining k a priori in real-world problems, we describe a new approach that estimates k dynamically from streams with a variable number of clusters, a common scenario in data with a non-stationary distribution. In addition, we combine our dynamic approach with two different strategies for initializing the centroids during the offline clustering. Analysis of the results suggests that, with the dynamic approach, the k-means++ method for centroid initialization yields better results.
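
To make the offline stage concrete, the following is a minimal Python sketch, not the authors' implementation. It runs k-means with k-means++ seeding over a set of micro-cluster centers and picks k from a range of candidates. The abstract does not state which criterion the approach uses to estimate k, so the centroid-based simplified silhouette used here is a stand-in assumption. All function names (`kmeanspp_seeds`, `kmeans`, `simplified_silhouette`, `estimate_k`) and the synthetic `micro_centers` data are hypothetical.

```python
import numpy as np


def pairwise_dist(points, centroids):
    # Euclidean distance from every point to every centroid, shape (n, k).
    return np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)


def kmeanspp_seeds(points, k, rng):
    # k-means++ seeding: the first seed is uniform at random, each later seed
    # is drawn with probability proportional to the squared distance to the
    # nearest seed chosen so far.
    n = len(points)
    seeds = [points[rng.integers(n)]]
    d2 = np.sum((points - seeds[0]) ** 2, axis=1)
    for _ in range(1, k):
        total = d2.sum()
        idx = rng.choice(n, p=d2 / total) if total > 0 else rng.integers(n)
        seeds.append(points[idx])
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(seeds)


def kmeans(points, k, rng, n_iter=100):
    # Plain Lloyd iterations started from k-means++ seeds.
    points = np.asarray(points, dtype=float)
    centroids = kmeanspp_seeds(points, k, rng)
    for _ in range(n_iter):
        labels = pairwise_dist(points, centroids).argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, pairwise_dist(points, centroids).argmin(axis=1)


def simplified_silhouette(points, centroids, labels):
    # ASSUMPTION: stand-in validity index, not taken from the paper.
    # a = distance to own centroid, b = distance to nearest other centroid.
    d = pairwise_dist(points, centroids)
    rows = np.arange(len(points))
    a = d[rows, labels]
    d[rows, labels] = np.inf
    b = d.min(axis=1)
    return float(np.mean((b - a) / np.maximum(np.maximum(a, b), 1e-12)))


def estimate_k(points, candidates, rng):
    # Run k-means for each candidate k (k >= 2) and keep the best-scoring one.
    points = np.asarray(points, dtype=float)
    best = None
    for k in candidates:
        centroids, labels = kmeans(points, k, rng)
        score = simplified_silhouette(points, centroids, labels)
        if best is None or score > best[0]:
            best = (score, k, centroids, labels)
    return best[1], best[2], best[3]


# Hypothetical usage: synthetic micro-cluster centers around three locations.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
micro_centers = np.vstack(
    [c + 0.1 * rng.standard_normal((30, 2)) for c in blob_centers]
)
best_k, centroids, labels = estimate_k(micro_centers, range(2, 7), rng)
```

In a streaming setting, `estimate_k` would be called again each time the offline stage is triggered, so the chosen k can change as the distribution drifts. Swapping `kmeanspp_seeds` for another seeding routine is one way to compare initialization strategies under the same selection loop.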