Clustering of streaming time series is meaningless

Authors:
Jessica Lin;Eamonn Keogh;Wagner Truppel
Affiliations:
University of California - Riverside, Riverside, CA;University of California - Riverside, Riverside, CA;University of California - Riverside, Riverside, CA
Venue:
DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Year:
2003



Abstract

Time series data is perhaps the most frequently encountered type of data examined by the data mining community. Clustering is perhaps the most frequently used data mining algorithm, being useful in its own right as an exploratory technique, and also as a subroutine in more complex data mining algorithms such as rule discovery, indexing, summarization, anomaly detection, and classification. Given these two facts, it is hardly surprising that time series clustering has attracted much attention. The data to be clustered can be in one of two formats: many individual time series, or a single time series from which individual time series are extracted with a sliding window. Given the recent explosion of interest in streaming data and online algorithms, the latter case has received much attention.

In this work we make a surprising claim: clustering of streaming time series is completely meaningless. More concretely, clusters extracted from streaming time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature.

We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method, based on the concept of time series motifs, that is able to meaningfully cluster some streaming time series datasets.
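
The second data format described in the abstract, subsequences extracted from a single time series with a sliding window, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sliding_windows`, the parameter `w`, and the toy `stream` values are assumptions introduced here, and the sketch covers only the extraction step that would precede any clustering.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], w: int) -> List[List[float]]:
    """Return every length-w subsequence of `series`, one per start position.

    Hypothetical helper for illustration; `w` is the window length.
    """
    if w < 1 or w > len(series):
        raise ValueError("window length must satisfy 1 <= w <= len(series)")
    # A series of length n yields n - w + 1 windows, each shifted by one point.
    return [list(series[i:i + w]) for i in range(len(series) - w + 1)]


# Toy stream (made-up values, not data from the paper).
stream = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
windows = sliding_windows(stream, w=3)

for start, window in enumerate(windows):
    print(start, window)
```

Consecutive windows produced this way share `w - 1` points, so the extracted subsequences overlap heavily instead of being independent samples.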