Fast approximate correlation for massive time-series data

Authors:
Abdullah Mueen; Suman Nath; Jie Liu
Affiliations:
University of California, Riverside, Riverside, CA, USA; Microsoft Research, Redmond, WA, USA; Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

We consider the problem of computing all-pair correlations in a warehouse containing a large number (e.g., tens of thousands) of time series (or signals). The problem arises in the automatic discovery of patterns and anomalies in data-intensive applications such as data center management, environmental monitoring, and scientific experiments. With existing techniques, however, solving the problem for a large stream warehouse is extremely expensive because of its inherently quadratic I/O and CPU complexity. We propose novel algorithms, based on the Discrete Fourier Transform (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. To minimize I/O cost, we partition a massive set of input signals into smaller batches such that caching the signals one batch at a time maximizes data reuse and minimizes disk I/O. To reduce CPU cost, we propose two approximation algorithms. The first efficiently computes approximate correlation coefficients of similar signal pairs within a given error bound. The second efficiently identifies, without any false positives or false negatives, all signal pairs with correlations above a given threshold. For many real applications, our approximate solutions are as useful as the corresponding exact solutions because of their strict error guarantees. Moreover, compared to state-of-the-art exact algorithms, our algorithms are up to 17x faster on several real datasets.
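
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea that makes DFT-based correlation pruning possible; it is not the authors' method. It relies on two standard facts: for signals normalized to zero mean and unit norm, the Pearson correlation equals 1 - d^2/2, where d is their Euclidean distance, and a unitary DFT preserves that distance. Keeping only the first m coefficients therefore underestimates d^2 and yields an upper bound on the correlation, which can be used to discard pairs that cannot reach a threshold before computing them exactly. All function names (`normalize`, `dft_prefix`, `upper_bound`, `correlated_pairs`) and the parameter `m` are illustrative assumptions, and the sketch assumes equal-length, non-constant signals with m at most n/2.

```python
import cmath
import math


def normalize(x):
    """Shift to zero mean and scale to unit Euclidean norm (assumes x is not constant)."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def dft_prefix(x, m):
    """Coefficients 1..m of the unitary DFT. Naive O(n*m); a real system would use an FFT."""
    n = len(x)
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(1, m + 1)
    ]


def exact_correlation(x, y):
    """Pearson correlation as the dot product of the normalized signals."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))


def upper_bound(cx, cy, n):
    """Upper bound on the correlation from DFT prefixes cx, cy of normalized signals."""
    d2 = 0.0
    for k, (a, b) in enumerate(zip(cx, cy), start=1):
        # For real input, coefficient n-k mirrors coefficient k, so it counts twice,
        # except at the Nyquist index k == n/2, which is its own mirror.
        weight = 1 if 2 * k == n else 2
        d2 += weight * abs(a - b) ** 2
    return 1 - d2 / 2


def correlated_pairs(signals, threshold, m):
    """Return (i, j, r) for all pairs with correlation r >= threshold."""
    n = len(signals[0])
    prefixes = [dft_prefix(normalize(s), m) for s in signals]
    result = []
    for i in range(len(signals)):
        for j in range(i + 1, len(signals)):
            if upper_bound(prefixes[i], prefixes[j], n) < threshold:
                continue  # the bound rules this pair out, so skip the exact computation
            r = exact_correlation(signals[i], signals[j])
            if r >= threshold:
                result.append((i, j, r))
    return result


n = 16
a = [math.sin(2 * math.pi * t / n) for t in range(n)]
b = [math.sin(2 * math.pi * t / n + 0.1) + 0.05 * math.cos(6 * math.pi * t / n) for t in range(n)]
c = [math.cos(2 * math.pi * t / n) for t in range(n)]
pairs = correlated_pairs([a, b, c], threshold=0.9, m=2)
```

In this toy run only the pair (a, b) survives; the two pairs involving c are pruned from their DFT prefixes alone. Because surviving candidates are verified exactly, the filter reports no false positives, and the bound guarantees no qualifying pair is pruned up to floating-point rounding near the threshold.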