Approximate Processing of Massive Continuous Quantile Queries over High-Speed Data Streams

Authors:
Xuemin Lin;Jian Xu;Qing Zhang;Hongjun Lu;Jeffrey Xu Yu;Xiaofang Zhou;Yidong Yuan
Affiliations:
-;-;-;-;IEEE Computer Society;-;-
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2006

Citing 30
Cited 7

Computational geometry: an introduction

Computational geometry: an introduction
Introduction to algorithms

Introduction to algorithms
Improved histograms for selectivity estimation of range predicates

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Approximate medians and other quantiles in one pass and with limited memory

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Random sampling techniques for space efficient online computation of order statistics of large datasets

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
On Finding the Maxima of a Set of Vectors

Journal of the ACM (JACM)
NiagaraCQ: a scalable continuous query system for Internet databases

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Fast stabbing of boxes in high dimensions

Theoretical Computer Science
Space-efficient online computation of quantile summaries

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Data-streams and histograms

STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Models and issues in data stream systems

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Processing complex aggregate queries over data streams

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Mining database structure; or, how to build a data quality browser

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Continual Queries for Internet Scale Event-Driven Information Delivery

IEEE Transactions on Knowledge and Data Engineering
Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Correlating XML data streams using tree-edit distance embeddings

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Stream processing of XPath queries with predicates

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive filters for continuous queries over distributed data streams

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Effective Computation of Biased Quantiles over Data Streams

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Approximate counts and quantiles over sliding windows

PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
An improved data stream summary: the count-min sketch and its applications

Journal of Algorithms
Multi-dimensional regression analysis of time-series data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
StatStream: statistical monitoring of thousands of data streams in real time

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
How to summarize the universe: dynamic maintenance of quantiles

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Maximizing the output rate of multi-way join queries over streaming information sources

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Scheduling for shared window joins over data streams

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Finding hierarchical heavy hitters in data streams

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Continuously maintaining order statistics over data streams: extended abstract

ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Efficient temporal counting with bounded error

The VLDB Journal — The International Journal on Very Large Data Bases
Probabilistic Inverse Ranking Queries over Uncertain Data

DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Aggregate computation over data streams

APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
An Ω(1/ε log 1/ε) space lower bound for finding ε-approximate quantiles in a data stream

FAW'10 Proceedings of the 4th international conference on Frontiers in algorithmics
Experimental study on fighters behaviors mining

Expert Systems with Applications: An International Journal
Quality and efficiency for kernel density estimates in large data

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Quantified Score

Hi-index	0.00

Visualization

Abstract

Quantile computation has many applications including data mining and financial data analysis. It has been shown that an \epsilon{\hbox{-}}{\rm{approximate}} summary can be maintained so that, given a quantile query (\phi, \epsilon), the data item at rank \lceil \phi N \rceil may be approximately obtained within the rank error precision \epsilon N over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different \phi and \epsilon poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.