Shared execution strategy for neighbor-based pattern mining requests over streaming windows

Authors:
Di Yang; Elke A. Rundensteiner; Matthew O. Ward
Affiliations:
WPI; WPI; WPI
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2012

Abstract

In applications ranging from stock trading to traffic monitoring, multiple analysts continuously monitor data streams to extract patterns of interest in real time. These analysts often submit similar pattern mining requests that are customized with different parameter settings. In this work, we present shared execution strategies for processing a large number of neighbor-based pattern mining requests of the same type but with arbitrary parameter settings. Such requests cover a broad range of popular mining query types, including the detection of clusters, outliers, and nearest neighbors. Given the high algorithmic complexity of the mining process, serving multiple such queries in a single system is extremely resource intensive. The naive method of detecting and maintaining patterns for each query independently is often infeasible in practice, because its demands on system resources grow dramatically with the cardinality of the query workload. To maximize the efficiency of system resource utilization when executing multiple queries simultaneously, we analyze the commonalities of neighbor-based pattern mining queries and identify several general optimization principles that lead to significant resource sharing among queries. First, as a preliminary sharing effort, we observe that the computation needed for range query searches (the process of finding the neighbors of each object) can be shared among multiple queries, which reduces CPU consumption. We then analyze the interrelations between the patterns identified by queries with different parameter settings, covering both pattern-specific and window-specific parameters. For pattern-specific parameters, we introduce an incremental pattern representation, which captures the patterns identified by queries with different pattern-specific parameters in a single compact structure. This enables integrated pattern maintenance for multiple queries. For window-specific parameters, we leverage the potential overlaps among sliding windows and propose a metaquery strategy that uses a single query to answer multiple queries with different window-specific parameters. By combining these three techniques, namely range query search sharing, integrated pattern maintenance, and the metaquery strategy, our framework realizes fully shared execution of multiple queries with arbitrary parameter settings and achieves significant savings in computational and memory resources. Our comprehensive experimental study, using real data streams from the domains of stock trading and moving object monitoring, demonstrates that our solution is significantly faster than independent execution while using only a small fraction of its memory. We also show that our solution scales to large numbers of queries, on the order of hundreds or even thousands, under high input data rates.
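
The idea behind range query search sharing can be illustrated with a small sketch. It is not the authors' algorithm: the function name shared_range_search, the brute-force pairwise scan, and the static point list (no sliding window) are all assumptions made for illustration. The sketch only shows that when several queries differ in their neighbor radius, each pairwise distance needs to be computed once and can then serve every query.

```python
from math import dist


def shared_range_search(points, radii):
    """Hypothetical helper: build neighbor lists for every radius in `radii`
    while computing each pairwise distance only once.

    Returns a dict mapping radius -> list of neighbor-index lists,
    one list per point.
    """
    radii_sorted = sorted(set(radii))
    max_r = radii_sorted[-1]
    neighbors = {r: [[] for _ in points] for r in radii_sorted}

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])  # computed once per pair
            if d > max_r:
                continue  # too far apart for every query
            # Reuse the same distance for each query radius it satisfies.
            for r in radii_sorted:
                if d <= r:
                    neighbors[r][i].append(j)
                    neighbors[r][j].append(i)
    return neighbors


# Two queries over the same three points, with radii 1.0 and 2.5.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
result = shared_range_search(pts, [1.0, 2.5])
print(result[1.0])  # [[1], [0], []]
print(result[2.5])  # [[1], [0, 2], [1]]
```

Running the two queries independently would compute every pairwise distance twice. A real system would presumably replace the quadratic scan with an index and maintain the neighbor lists incrementally as the window slides, which this sketch does not attempt.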