Distributed threshold querying of general functions by a difference of monotonic representation

Authors:
Guy Sagy;Daniel Keren;Izchak Sharfman;Assaf Schuster
Affiliations:
Technion, Haifa, Israel;Haifa University, Haifa, Israel;Technion, Haifa, Israel;Technion, Haifa, Israel
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

The goal of a threshold query is to detect all objects whose score exceeds a given threshold. This type of query is used in many settings, such as data mining, event triggering, and top-k selection. Often, threshold queries are performed over distributed data. Given database relations that are distributed over many nodes, an object's score is computed by aggregating the value of each attribute across the nodes, applying a given scoring function to the aggregated values, and comparing the function's value with the threshold. However, joining all the distributed relations in a central database might incur prohibitive overheads in bandwidth, CPU, and storage accesses. Efficient algorithms that reduce these costs exist only for monotonic aggregation threshold queries and for certain specific scoring functions. We present a novel approach for efficiently performing general distributed threshold queries. To the best of our knowledge, this is the first solution to the problem of performing such queries with general scoring functions. We first present a solution for monotonic functions, and then introduce a technique that handles other functions by representing them as a difference of monotonic functions. Experiments with real-world data demonstrate the method's effectiveness in achieving low communication and access costs.
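
To make the difference-of-monotonic idea concrete, the following is a minimal Python sketch and not the algorithm from the paper. It assumes sum aggregation, a hypothetical scoring function f(x, y) = x - y^2 on non-negative inputs (written as g - h with g(x, y) = x and h(x, y) = y^2, both non-decreasing there), and that per-object bounds lo <= v <= hi on the aggregated vector are somehow available. All names (`aggregate`, `classify`, `g`, `h`, `f`) are illustrative. Under these assumptions, g(lo) - h(hi) <= f(v) <= g(hi) - h(lo), so some objects can be reported or pruned without fetching their exact values.

```python
from typing import Callable, Dict, Sequence, Tuple

Vector = Tuple[float, ...]
Score = Callable[[Vector], float]


def aggregate(nodes: Sequence[Dict[str, Vector]]) -> Dict[str, Vector]:
    """Centralized baseline: sum each object's attribute vector over all nodes.

    Sum aggregation is an assumption made for this sketch.
    """
    totals: Dict[str, Vector] = {}
    for local in nodes:
        for obj, vec in local.items():
            prev = totals.get(obj, (0.0,) * len(vec))
            totals[obj] = tuple(p + v for p, v in zip(prev, vec))
    return totals


def classify(g: Score, h: Score, lo: Vector, hi: Vector, tau: float) -> str:
    """Decide an object's status from bounds lo <= v <= hi on its aggregate.

    Requires g and h to be non-decreasing in every coordinate, so that
    g(lo) - h(hi) <= g(v) - h(v) <= g(hi) - h(lo).
    """
    lower = g(lo) - h(hi)
    upper = g(hi) - h(lo)
    if lower > tau:
        return "report"   # score certainly exceeds the threshold
    if upper <= tau:
        return "prune"    # score certainly does not exceed the threshold
    return "resolve"      # bounds are inconclusive; more data is needed


# Hypothetical non-monotonic score f(x, y) = x - y^2 on x, y >= 0.
def g(v: Vector) -> float:
    return v[0]


def h(v: Vector) -> float:
    return v[1] ** 2


def f(v: Vector) -> float:
    return g(v) - h(v)


# Illustrative data: two nodes, each holding partial values per object.
nodes = [
    {"a": (5.0, 1.0), "b": (1.0, 2.0)},
    {"a": (4.0, 0.0), "b": (2.0, 1.0)},
]
totals = aggregate(nodes)
exact_answer = sorted(obj for obj, vec in totals.items() if f(vec) > 5.0)
```

For instance, with threshold 5.0, bounds lo = (8.0, 0.0) and hi = (10.0, 1.0) give a lower bound of 7.0, so the object can be reported directly, while lo = (4.0, 0.0) and hi = (8.0, 1.0) give the interval [3.0, 8.0], which is inconclusive. How such bounds are obtained cheaply over distributed relations is the subject of the paper and is not modeled here.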