Using multiple indexes for efficient subsequence matching in time-series databases

  • Authors:
  • Seung-Hwan Lim; Hee-Jin Park; Sang-Wook Kim

  • Affiliations:
  • College of Information and Communications, Hanyang University, Korea; College of Information and Communications, Hanyang University, Korea; College of Information and Communications, Hanyang University, Korea

  • Venue:
  • DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications
  • Year:
  • 2006

Abstract

Time-series subsequence matching searches a time-series database for data subsequences whose changing patterns are similar to a given query sequence. This paper addresses a performance issue in time-series subsequence matching. First, we quantitatively examine the performance degradation caused by the window size effect and show that the performance of subsequence matching with a single index is not satisfactory in real applications. We claim that index interpolation is an effective tool for resolving this problem. Index interpolation performs subsequence matching by selecting the most appropriate index from multiple indexes, each built on windows of a distinct size. Index interpolation requires deciding the window sizes on which the multiple indexes are built. In this paper, we solve the problem of selecting optimal window sizes from the perspective of physical database design. To this end, given a set of 〈length, frequency〉 pairs describing the query sequences to be performed in a target application and a set of window sizes for building multiple indexes, we devise a formula that estimates the overall cost of all the subsequence matchings. Using this formula, we propose an algorithm that determines the optimal window sizes for maximizing the overall performance of subsequence matching. We formally prove the optimality as well as the effectiveness of the algorithm. Finally, we perform a series of experiments with a real-life stock data set and a large synthetic data set to show the superiority of our approach.
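The idea of choosing one index per query and choosing window sizes for a known workload can be illustrated with a small Python sketch. It is not the formula or the algorithm from the paper, which the abstract does not give. The sketch assumes that a usable index has a window no longer than the query, that the index with the largest such window is preferred, and that the per-query cost is the ratio of query length to window size. The names `pick_window`, `toy_cost`, `total_cost`, and `best_window_sizes` are hypothetical, and the search is plain brute force over candidate sets rather than a provably optimal procedure.

```python
from itertools import combinations


def pick_window(query_len, window_sizes):
    """Return the largest window size not exceeding the query length, or None.

    Assumption: an index is usable only if its window fits inside the query.
    """
    usable = [w for w in window_sizes if w <= query_len]
    return max(usable) if usable else None


def toy_cost(query_len, window):
    """Placeholder cost (an assumption): grows as the query outgrows the window."""
    return query_len / window


def total_cost(workload, window_sizes, cost=toy_cost):
    """Sum frequency-weighted costs over (length, frequency) pairs."""
    total = 0.0
    for length, freq in workload:
        window = pick_window(length, window_sizes)
        if window is None:
            # No index can serve this query length.
            return float("inf")
        total += freq * cost(length, window)
    return total


def best_window_sizes(workload, candidates, k, cost=toy_cost):
    """Brute-force search for the k candidate window sizes with the lowest total cost."""
    return min(
        combinations(sorted(candidates), k),
        key=lambda sizes: total_cost(workload, sizes, cost),
    )


if __name__ == "__main__":
    # Hypothetical workload: (query length, frequency) pairs.
    workload = [(64, 10), (128, 5), (512, 1)]
    candidates = [32, 64, 128, 256, 512]
    chosen = best_window_sizes(workload, candidates, k=2)
    print(chosen, total_cost(workload, chosen))
```

Because `cost` is a parameter, a real cost estimate could be swapped in for `toy_cost` without changing the selection logic. Brute force is only practical for small candidate sets.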