Discovering longest-lasting correlation in sequence databases

Authors:
Yuhong Li; Leong Hou U; Man Lung Yiu; Zhiguo Gong
Affiliations:
Department of Computer and Information Science, University of Macau, Macau; Department of Computer and Information Science, University of Macau, Macau; Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong; Department of Computer and Information Science, University of Macau, Macau
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

Most existing work on sequence databases uses a correlation measure (e.g., Euclidean distance or Pearson correlation) as a core function for various analytical tasks. Typically, users must specify a length for the similarity queries. However, there is no reliable way to choose a proper length across different application needs. In this work we focus on discovering the longest-lasting highly correlated subsequences in sequence databases, which is particularly useful for analyses that lack prior knowledge of the query length. Surprisingly, there has been limited work on this problem. A baseline solution is to calculate the correlation for every possible subsequence combination, but this brute-force approach does not scale to large datasets. In this work we study a space-constrained index that gives a tight correlation bound for subsequences of similar length and offset by means of intra-object grouping and inter-object grouping techniques. To the best of our knowledge, this is the first index to support a normalized distance metric over subsequences of arbitrary length. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.
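
To make the brute-force baseline concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that the abstract does not state: the two sequences are the same length and are compared at the same offset, "highly correlated" means a Pearson correlation at or above a user-supplied threshold, and the function names `pearson` and `longest_correlated_window` are hypothetical. It enumerates candidate windows from longest to shortest and returns the first one that passes, which takes cubic time in the sequence length and illustrates why the baseline does not scale.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat undefined correlation as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def longest_correlated_window(x, y, threshold, min_len=2):
    """Hypothetical brute-force baseline: try every aligned window of x and y,
    longest first, and return (start, length) of the first window whose
    Pearson correlation is >= threshold, or None if no window qualifies."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    n = len(x)
    for length in range(n, min_len - 1, -1):      # longest windows first
        for start in range(0, n - length + 1):    # every offset for this length
            wx = x[start:start + length]
            wy = y[start:start + length]
            if pearson(wx, wy) >= threshold:
                return start, length
    return None


if __name__ == "__main__":
    # The first five points are perfectly linearly related; the tail is not.
    x = [1, 2, 3, 4, 5, 9, 1, 7]
    y = [2, 4, 6, 8, 10, 0, 5, 1]
    print(longest_correlated_window(x, y, threshold=0.999))
```

Each call to `pearson` recomputes its sums from scratch, so the sketch does redundant work across overlapping windows. An index that bounds the correlation for groups of subsequences with similar length and offset, as the abstract describes, is aimed at avoiding exactly this exhaustive enumeration.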