SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets

Authors:
Patrick Schäfer;Mikael Högqvist
Affiliations:
Zuse Institute Berlin, Berlin, Germany;Zuse Institute Berlin, Berlin, Germany
Venue:
Proceedings of the 15th International Conference on Extending Database Technology
Year:
2012

Citing 21
Cited 0

The R*-tree: an efficient and robust access method for points and rectangles

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Fast subsequence matching in time-series databases

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Efficiently supporting ad hoc queries in large datasets of time sequences

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Multidimensional access methods

ACM Computing Surveys (CSUR)
Locally adaptive dimensionality reduction for indexing large time series databases

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Efficient Similarity Search In Sequence Databases

FODO '93 Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms
A Simple Dimensionality Reduction Technique for Fast Similarity Search in Large Time Series Databases

PADKK '00 Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications
A large database of graphs and its use for benchmarking graph isomorphism algorithms

Pattern Recognition Letters - Special issue: Graph-based representations in pattern recognition
Haar Wavelets for Efficient Similarity Search of Time-Series: With and Without Time Warping

IEEE Transactions on Knowledge and Data Engineering
Efficient Time Series Matching by Wavelets

ICDE '99 Proceedings of the 15th International Conference on Data Engineering
Similarity Search Over Time-Series Data Using Wavelets

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
A symbolic representation of time series, with implications for streaming algorithms

DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Online Amnesic Approximation of Streaming Time Series

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Indexing spatio-temporal trajectories with Chebyshev polynomials

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Experiencing SAX: a novel symbolic representation of time series

Data Mining and Knowledge Discovery
Indexable PLA for efficient similarity search

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
The TS-tree: efficient time series search and retrieval

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
iSAX: indexing and mining terabyte sized time series

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Biological network comparison using graphlet degree distribution

Bioinformatics
iSAX 2.0: Indexing and Mining One Billion Time Series

ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Time series analysis, as an application for high dimensional data mining, is a common task in biochemistry, meteorology, climate research, bio-medicine or marketing. Similarity search in data with increasing dimensionality results in an exponential growth of the search space, referred to as Curse of Dimensionality. A common approach to postpone this effect is to apply approximation to reduce the dimensionality of the original data prior to indexing. However, approximation involves loss of information, which also leads to an exponential growth of the search space. Therefore, indexing an approximation with a high dimensionality, i. e. high quality, is desirable. We introduce Symbolic Fourier Approximation (SFA) and the SFA trie which allows for indexing of not only large datasets but also high dimensional approximations. This is done by exploiting the trade-off between the quality of the approximation and the degeneration of the index by using a variable number of dimensions to represent each approximation. Our experiments show that SFA combined with the SFA trie can scale up to a factor of 5--10 more indexed dimensions than previous approaches. Thus, it provides lower page accesses and CPU costs by a factor of 2--25 respectively 2--11 for exact similarity search using real world and synthetic data.