Tracking clusters in evolving data streams over sliding windows

  • Authors:
  • Aoying Zhou; Feng Cao; Weining Qian; Cheqing Jin

  • Affiliations:
  • Fudan University, Department of Computer Science and Engineering, 200433, Shanghai, P.R. China; IBM China Research Lab, 100094, Beijing, P.R. China; Fudan University, Department of Computer Science and Engineering, 200433, Shanghai, P.R. China; Fudan University, Department of Computer Science and Engineering, 200433, Shanghai, P.R. China and East China University of Science and Technology, Department of Computer Science, 200237, Shanghai ...

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2008

Abstract

Mining data streams poses great challenges because memory is limited and queries must be answered in real time. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behavior of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. Our SWClustering algorithm combines the exponential histogram with temporal cluster features in a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram handles the in-cluster evolution, and the temporal cluster features represent the change in the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism that adaptively maintains the in-cluster synopsis tracks cluster evolution better while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing cluster evolution and for tracking a specific cluster efficiently without interfering with other clusters, which reduces the computing resources needed for data stream clustering. Both theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method.
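
The abstract names the EHCF but does not give its layout, so the following is only a minimal Python sketch of how an exponential histogram of temporal cluster features might be maintained for one cluster. Everything in it is an assumption made for illustration: the class names `TCF` and `EHCF`, the fields (count, linear sum, squared sum, newest timestamp), the parameters `eps` and `window`, and the merge rule (at most ceil(1/eps) + 1 buckets of equal size, borrowed from the generic exponential histogram technique). It should not be read as the authors' SWClustering implementation.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Sequence


@dataclass
class TCF:
    """Hypothetical temporal cluster feature: a summary of n records."""
    n: int              # number of records summarized
    ls: List[float]     # per-dimension linear sum
    ss: List[float]     # per-dimension sum of squares
    t: int              # timestamp of the newest record summarized

    def merge(self, other: "TCF") -> "TCF":
        # Cluster features are additive; keep the newer timestamp.
        return TCF(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
            max(self.t, other.t),
        )


class EHCF:
    """Sketch of an exponential histogram of TCFs for a single cluster."""

    def __init__(self, eps: float, window: int):
        # Assumed cap on the number of buckets that share the same size.
        self.max_same = ceil(1 / eps) + 1
        self.window = window
        self.buckets: List[TCF] = []  # ordered oldest first

    def insert(self, point: Sequence[float], t: int) -> None:
        # Drop buckets whose newest record has left the sliding window.
        self.buckets = [b for b in self.buckets if b.t > t - self.window]
        # Each new record starts as its own bucket of size 1.
        self.buckets.append(TCF(1, list(point), [x * x for x in point], t))
        # Cascade: while too many buckets share a size, merge the two oldest.
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b.n == size]
            if len(same) <= self.max_same:
                break
            i = same[0]  # equal-sized buckets are contiguous
            merged = self.buckets[i].merge(self.buckets[i + 1])
            self.buckets[i:i + 2] = [merged]
            size *= 2

    def centroid(self) -> List[float]:
        # Combine all live buckets to summarize the cluster in the window.
        total = self.buckets[0]
        for b in self.buckets[1:]:
            total = total.merge(b)
        return [s / total.n for s in total.ls]
```

In this sketch older records end up in progressively larger buckets, so recent data is kept at fine granularity while memory grows only logarithmically with the window size. Each cluster owns its own histogram, which is one plausible reading of how a single cluster could be tracked without touching the others.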