Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research

  • Authors:
  • Eamonn Keogh; Jessica Lin; Wagner Truppel

  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003

Abstract

Time series data is perhaps the most frequently encountered type of data examined by the data mining community. Clustering is perhaps the most frequently used data mining algorithm, being useful in its own right as an exploratory technique, and also as a subroutine in more complex data mining algorithms such as rule discovery, indexing, summarization, anomaly detection, and classification. Given these two facts, it is hardly surprising that time series clustering has attracted much attention. The data to be clustered can be in one of two formats: many individual time series, or a single time series, from which individual time series are extracted with a sliding window. Given the recent explosion of interest in streaming data and online algorithms, the latter case has received much attention.

In this work we make an amazing claim. Clustering of streaming time series is completely meaningless. More concretely, clusters extracted from streaming time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature.

We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work.
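
The second data format named in the abstract, subsequences extracted from a single time series with a sliding window, can be illustrated with a minimal sketch. The function name `sliding_window_subsequences`, the window length `w`, and the toy series are assumptions made for illustration only; they are not taken from the paper, and the sketch shows only the extraction step, not any particular clustering algorithm or the constraint the authors prove.

```python
from typing import List, Sequence


def sliding_window_subsequences(series: Sequence[float], w: int) -> List[List[float]]:
    """Return every contiguous length-w subsequence of `series`, in order.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if w < 1 or w > len(series):
        raise ValueError("window length w must satisfy 1 <= w <= len(series)")
    # A series of length m yields m - w + 1 overlapping windows,
    # each shifted by one time step from the previous one.
    return [list(series[i:i + w]) for i in range(len(series) - w + 1)]


# Toy series (made-up values) and a window of length 3.
series = [1.0, 2.0, 4.0, 3.0, 5.0]
subsequences = sliding_window_subsequences(series, w=3)
for s in subsequences:
    print(s)
```

Adjacent windows share all but one of their values, so the extracted subsequences overlap heavily. It is these overlapping subsequences, rather than separate whole series, that a clustering algorithm receives in the streaming setting the abstract discusses.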