CID: an efficient complexity-invariant distance for time series

Authors:
Gustavo E. Batista;Eamonn J. Keogh;Oben Moses Tataw;Vinícius M. Souza
Affiliations:
University of California, Riverside, Riverside, USA 92521;University of California, Riverside, Riverside, USA 92521;University of California, Riverside, Riverside, USA 92521;Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, Brazil 13560-970
Venue:
Data Mining and Knowledge Discovery
Year:
2014

Abstract

The ubiquity of time series data across almost all human endeavors has produced great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure.

In this work we make a somewhat surprising claim: there is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those that subjectively seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, in clustering this effect can introduce errors by "suggesting" to the clustering algorithm that subjectively similar but complex objects belong in a sparser, larger-diameter cluster than is truly warranted.

We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangle inequality, and can therefore make use of most existing indexing and data mining algorithms.

We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
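
The abstract does not state the measure itself. As a rough illustration only, the sketch below follows the form in which the complexity-invariant distance is usually described: the Euclidean distance scaled by a correction factor, where the factor is the ratio of the larger to the smaller complexity estimate and each estimate is the length of the series when "stretched out" (the root of the summed squared differences between consecutive points). The function names, the equal-length requirement, and the handling of perfectly flat series are assumptions made for this example, not details taken from the text above.

```python
import math


def euclidean(q, c):
    """Plain Euclidean distance between two equal-length series."""
    if len(q) != len(c):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def complexity_estimate(t):
    """Length of the series when stretched into a line: the square root
    of the summed squared differences between consecutive points."""
    return math.sqrt(sum((t[i + 1] - t[i]) ** 2 for i in range(len(t) - 1)))


def cid(q, c):
    """Euclidean distance multiplied by a complexity correction factor.

    The factor is max(CE) / min(CE), so it is always >= 1 and grows as
    the two series differ more in estimated complexity.
    """
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    lo, hi = min(ce_q, ce_c), max(ce_q, ce_c)
    if lo == 0.0:
        # Assumption for this sketch: two flat series get factor 1,
        # a flat series against a non-flat one is infinitely far.
        factor = 1.0 if hi == 0.0 else math.inf
    else:
        factor = hi / lo
    return euclidean(q, c) * factor


if __name__ == "__main__":
    simple = [0.0, 1.0, 0.0]
    spiky = [0.0, 2.0, 0.0]
    print("ED :", euclidean(simple, spiky))
    print("CID:", cid(simple, spiky))
```

Because the correction factor is never below 1, the Euclidean distance is a lower bound on this quantity, which is consistent with the abstract's remark that the measure can be lower bounded for use with existing indexing methods.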