Time series similarity measures and time series indexing (abstract only)

  • Authors:
  • Dimitrios Gunopulos; Gautam Das

  • Affiliations:
  • Univ. of California, Riverside; Microsoft Research

  • Venue:
  • SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
  • Year:
  • 2001

Abstract

Time series is the simplest form of temporal data. A time series is a sequence of real numbers collected regularly in time, where each number represents a value. Time series data come up in a variety of domains, including stock market analysis, environmental data, telecommunications data, medical data and financial data. Web data that count the number of clicks on given sites, or that model the usage of different pages, are also modeled as time series. Time series therefore account for a large fraction of the data stored in commercial databases. This fact is increasingly recognized, and support for time series as a distinct data type in commercial database management systems is growing. IBM DB2, for example, implements support for time series using data-blades.

The pervasiveness and importance of time series data have sparked a lot of research on the topic. While the statistics literature on time series is vast, it has not studied methods appropriate for the time series similarity and indexing problems we discuss here; much of the relevant work on these problems has been done by the computer science community.

One interesting problem with time series data is finding whether different time series display similar behavior. More formally, the problem can be stated as: given two time series X and Y, determine whether they are similar or not (in other words, define and compute a distance function dist(X, Y)). Typically each time series describes the evolution of an object, for example the price of a stock, or the levels of pollution as a function of time at a given data collection station. The objective can be to cluster the different objects into similar groups, or to classify an object based on a set of known object examples. The problem is hard because the similarity model should allow for imprecise matches.
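
To make the distance function dist(X, Y) concrete, here is a minimal sketch of two common choices: an Lp norm, which compares points at the same time index, and dynamic time warping, which tolerates shifts and stretches along the time axis and is therefore one way to allow imprecise matches. The function names `lp_distance` and `dtw_distance`, the squared point cost, and the final square root are assumptions made for illustration; they are not taken from the tutorial.

```python
import math


def lp_distance(x, y, p=2.0):
    """L_p distance between two series of equal length (p=2 is Euclidean)."""
    if len(x) != len(y):
        raise ValueError("L_p distance needs series of equal length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def dtw_distance(x, y):
    """Dynamic time warping distance via the classic O(len(x) * len(y)) recurrence.

    Illustrative choices: squared difference as the point cost, no warping
    window, and a square root applied to the accumulated cost.
    """
    n, m = len(x), len(y)
    # prev[j] holds the best accumulated cost aligning x[:i-1] with y[:j].
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 1.0, 0.0]
    y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # x with its first value repeated
    print(dtw_distance(x, y))        # series of different length are allowed
    print(lp_distance(x, y[1:]))     # L_p needs equal length
```

The Lp distance is cheap but requires equal lengths and point-by-point alignment; the warping distance accepts series of different lengths at quadratic cost, which hints at the efficiency versus accuracy trade-off between similarity models.
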
One interesting variation is the subsequence similarity problem: given two time series X and Y, determine those subsequences of X that are similar to the pattern Y. To answer these problems, different notions of similarity between time series have been proposed in data mining research.

In the tutorial we examine the different time series similarity models that have been proposed, in terms of efficiency and accuracy. The solutions encompass techniques from a wide variety of disciplines, such as databases, signal processing, speech recognition, pattern matching, combinatorics and statistics. We survey proposed similarity techniques, including the Lp norms, time warping, longest common subsequence measures, baselines, moving averaging, and deformable Markov model templates.

Another problem that comes up in applications is the indexing problem: given a time series X and a set of time series S = {Y1,…,YN}, find the time series in S that are most similar to the query X. A variation is the subsequence indexing problem: given a set of sequences S and a query sequence (pattern) X, find the sequences in S that contain subsequences similar to X. To solve these problems efficiently, appropriate indexing techniques have to be used. Typically, the similarity problem is related to the indexing problem: simple (and possibly inaccurate) similarity measures are usually easy to build indexes for, while more sophisticated similarity measures make the indexing problem hard and interesting.

We examine the indexing techniques that can be used for different models, and the dimensionality reduction techniques that have been proposed to improve indexing performance. A time series of length n can be considered as a tuple in an n-dimensional space. Indexing this space directly is inefficient because of the very high dimensionality.
The main idea for improving on direct indexing is to use a dimensionality reduction technique that takes the time series of n items and maps it to a lower dimensional space with k dimensions (hopefully, k ≪ n).

We give a detailed description of the most important techniques used for dimensionality reduction. These include: the SVD decomposition, the Fourier transform (and the similar Discrete Cosine transform), the Wavelet decomposition, Multidimensional Scaling, random projection techniques, FastMap (and variants), and Linear partitioning. These techniques have specific strengths and weaknesses, making some of them better suited for specific applications and settings.

Finally we consider extensions to the problem of indexing subsequences, as well as to the problem of finding similar high-dimensional sequences, such as trajectories or video frame sequences.
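
As an illustration of mapping a series of length n to k dimensions, the sketch below keeps the first k coefficients of an orthonormal discrete Fourier transform. With that scaling, Parseval's theorem implies that the Euclidean distance between the reduced vectors never exceeds the Euclidean distance between the original series, so an index built on the k coefficients can discard candidates without missing true matches. The function names (`dft_features`, `feature_distance`, `euclidean`) are hypothetical, and this is only one common way to use the Fourier transform for this purpose, not necessarily the formulation presented in the tutorial.

```python
import cmath
import math


def dft_features(x, k):
    """First k coefficients of the orthonormal DFT of series x (k <= len(x))."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)  # orthonormal scaling, so Parseval holds exactly
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(k)
    ]


def feature_distance(fx, fy):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fx, fy)))


def euclidean(x, y):
    """Euclidean distance between two real series of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
    y = [1.5, 2.5, 2.5, 3.5, 3.0, 1.0, 0.5, 0.5]
    k = 3  # keep 3 of the 8 dimensions
    reduced = feature_distance(dft_features(x, k), dft_features(y, k))
    # The reduced distance is a lower bound on the true distance.
    print(reduced, "<=", euclidean(x, y))
```

In a filter-and-refine search, the reduced distance would be used to prune candidates in the index, and the full distance would then be computed only for the series that survive the filter.
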