Processing top k queries from samples

Authors:
Edith Cohen;Nadav Grossaug;Haim Kaplan
Affiliations:
AT&T Labs--Research, Florham Park, NJ;Tel Aviv University, Tel Aviv, Israel;Tel Aviv University, Tel Aviv, Israel
Venue:
CoNEXT '06 Proceedings of the 2006 ACM CoNEXT conference
Year:
2006


Abstract

Top-k queries are desired aggregation operations on data sets. Examples of queries on network data include the top 100 source ASes, top 100 ports, or top domain names over IP packets or over IP flow records. Since the complete data set is often not available or not feasible to examine, we are interested in processing top-k queries from samples. If all records can be processed, the top-k items can be obtained by counting the frequency of each item. Even when the full data set is observed, however, resources are often insufficient for such counting, and techniques were developed to overcome this issue. When we can observe only a random sample of the records, an orthogonal complication arises: the top frequencies in the sample are biased estimates of the actual top-k frequencies. This bias depends on the distribution and must be accounted for when seeking the actual value. We address this by designing and evaluating several schemes that derive rigorous confidence bounds for top-k estimates. Simulations on various data sets, including IP flow data, show that schemes that exploit more of the structure of the sample distribution produce much tighter confidence intervals with an order of magnitude fewer samples than simpler schemes that utilize only the sampled top-k frequencies. The simpler schemes, however, are more efficient in terms of computation. Our work is basic and is widely applicable to all applications that process top-k and heavy hitters queries over a random sample of the actual records.
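
The bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not one of the paper's schemes; it only shows, under assumed conditions, that the largest frequencies seen in a random sample tend to overestimate the true top-k frequencies. The population (100 equally frequent items), the sample size, the number of trials, and the function names `top_k_frequencies` and `mean_sampled_top_k` are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def top_k_frequencies(records, k):
    """Return the k largest relative frequencies in records, in decreasing order."""
    n = len(records)
    counts = Counter(records)
    return [count / n for _, count in counts.most_common(k)]


def mean_sampled_top_k(population, k, sample_size, trials, seed=0):
    """Average the sampled top-k frequencies over repeated random samples.

    Sampling is done with replacement (an assumption of this sketch).
    """
    rng = random.Random(seed)
    totals = [0.0] * k
    for _ in range(trials):
        sample = [rng.choice(population) for _ in range(sample_size)]
        freqs = top_k_frequencies(sample, k)
        freqs += [0.0] * (k - len(freqs))  # pad if the sample has fewer than k items
        for i, f in enumerate(freqs):
            totals[i] += f
    return [t / trials for t in totals]


# Hypothetical population: 100 distinct items, each with true frequency 0.01.
population = [item for item in range(100) for _ in range(50)]

true_top = top_k_frequencies(population, 3)
est_top = mean_sampled_top_k(population, 3, sample_size=200, trials=200)

print("true top-3 frequencies:        ", true_top)
print("mean sampled top-3 frequencies:", est_top)
```

In this assumed setting every item is equally frequent, so any item that appears "on top" in a sample does so by chance alone, and the sampled top frequencies sit above the true value of 0.01. On skewed distributions the gap can be much smaller, which is consistent with the abstract's remark that the bias depends on the distribution.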