A generic framework for efficient and effective subsequence retrieval

Authors:
Haohan Zhu;George Kollios;Vassilis Athitsos
Affiliations:
Boston University;Boston University;University of Texas at Arlington
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

This paper proposes a general framework for matching similar subsequences in both time series and string databases. The matching results are pairs of query subsequences and database subsequences. The framework finds all possible pairs of similar subsequences if the distance measure satisfies the "consistency" property, which is introduced in this paper. We show that most popular distance functions, such as the Euclidean distance, DTW, ERP, and the Fréchet distance for time series, and the Hamming distance and the Levenshtein distance for strings, are all "consistent". We also propose a generic index structure for metric spaces named "reference net". The reference net occupies O(n) space, where n is the size of the dataset, and is optimized to work well with our framework. The experiments demonstrate the ability of our method to improve retrieval performance when combined with diverse distance measures. The experiments also illustrate that the reference net scales well in terms of space overhead and query time.
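
To make the problem statement concrete, the following is a minimal brute-force sketch of the task the abstract describes: reporting pairs of query subsequences and database subsequences whose distance is within a threshold. It is not the paper's framework and does not use the reference net; it only illustrates the expected output for strings under the Levenshtein distance. The function names, the minimum length `min_len`, and the threshold `eps` are assumptions made for illustration.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def substrings(s: str, min_len: int):
    """Yield (start, substring) for every contiguous substring of length >= min_len."""
    for start in range(len(s) - min_len + 1):
        for end in range(start + min_len, len(s) + 1):
            yield start, s[start:end]


def similar_pairs(query: str, target: str, min_len: int, eps: int):
    """Naive baseline (hypothetical helper): compare every query substring
    with every database substring and keep pairs within distance eps."""
    pairs = []
    for (qs, q), (ts, t) in product(substrings(query, min_len),
                                    substrings(target, min_len)):
        if levenshtein(q, t) <= eps:
            pairs.append((qs, q, ts, t))
    return pairs


if __name__ == "__main__":
    for pair in similar_pairs("abcd", "xbcdy", min_len=3, eps=0):
        print(pair)
```

This baseline compares every pair of substrings, so its cost grows quickly with sequence length; an index over the database subsequences, which is the role the abstract assigns to the reference net, is meant to avoid most of these distance computations.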