SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets

  • Authors:
  • Patrick Schäfer;Mikael Högqvist

  • Affiliations:
  • Zuse Institute Berlin, Berlin, Germany;Zuse Institute Berlin, Berlin, Germany

  • Venue:
  • Proceedings of the 15th International Conference on Extending Database Technology
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Time series analysis, as an application for high dimensional data mining, is a common task in biochemistry, meteorology, climate research, bio-medicine or marketing. Similarity search in data with increasing dimensionality results in an exponential growth of the search space, referred to as Curse of Dimensionality. A common approach to postpone this effect is to apply approximation to reduce the dimensionality of the original data prior to indexing. However, approximation involves loss of information, which also leads to an exponential growth of the search space. Therefore, indexing an approximation with a high dimensionality, i. e. high quality, is desirable. We introduce Symbolic Fourier Approximation (SFA) and the SFA trie which allows for indexing of not only large datasets but also high dimensional approximations. This is done by exploiting the trade-off between the quality of the approximation and the degeneration of the index by using a variable number of dimensions to represent each approximation. Our experiments show that SFA combined with the SFA trie can scale up to a factor of 5--10 more indexed dimensions than previous approaches. Thus, it provides lower page accesses and CPU costs by a factor of 2--25 respectively 2--11 for exact similarity search using real world and synthetic data.