Efficient processing of exact top-k queries over disk-resident sorted lists

Authors:
Hweehwa Pang;Xuhua Ding;Baihua Zheng
Affiliations:
School of Information Systems, Singapore Management University, Singapore, Singapore;School of Information Systems, Singapore Management University, Singapore, Singapore;School of Information Systems, Singapore Management University, Singapore, Singapore
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2010

Abstract

The top-k query is employed in a wide range of applications to generate a ranked list of the records that have the highest aggregate scores over certain attributes. As the pool of attributes from which individual queries select may be large, the data are indexed with per-attribute sorted lists, and a threshold algorithm (TA) is applied on the lists involved in each query. The TA executes in two phases: it first finds a cut-off threshold for the top-k result scores, then evaluates all the records that could score above the threshold. In this paper, we focus on exact top-k queries that involve monotonic linear scoring functions over disk-resident sorted lists. We introduce a model for estimating the depths to which each sorted list needs to be processed in the two phases, so that (most of) the required records can be fetched efficiently through sequential or batched I/Os. We also devise a mechanism to quickly rank the data that qualify for the query answer and to eliminate those that do not, in order to reduce the computation demand of the query processor. Extensive experiments with four different datasets confirm that our schemes achieve speed-ups ranging from a factor of two to two orders of magnitude over existing TAs, at the expense of a memory overhead of 4.8 bits per attribute value. Moreover, our schemes are robust to different data distributions and query characteristics.
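For readers unfamiliar with the baseline, the following is a minimal sketch of a textbook threshold algorithm over per-attribute sorted lists with a monotonic linear scoring function. It is not the scheme proposed in the paper: it has no depth-estimation model and no batched I/O, and it keeps the lists in memory as a stand-in for disk-resident ones. The function name `threshold_topk`, the record layout, and the sample data are all assumptions made for illustration.

```python
import heapq


def threshold_topk(records, weights, k):
    """Textbook TA sketch (illustrative, not the paper's scheme).

    records: dict mapping record id -> tuple of attribute values
    weights: non-negative weights of a linear scoring function
    Returns (top-k list of (score, id) in descending order, depth scanned).
    """
    m = len(weights)
    # One list per attribute, highest value first. In-memory stand-in
    # for the disk-resident sorted lists (an assumption of this sketch).
    lists = [
        sorted(records, key=lambda rid, j=j: records[rid][j], reverse=True)
        for j in range(m)
    ]

    def score(rid):
        return sum(w * v for w, v in zip(weights, records[rid]))

    seen = set()
    best = []  # min-heap holding the k best (score, id) pairs seen so far
    depth = 0
    for pos in range(len(records)):
        depth = pos + 1
        for j in range(m):
            rid = lists[j][pos]  # sorted access on list j
            if rid in seen:
                continue
            seen.add(rid)
            item = (score(rid), rid)  # "random access" to the full record
            if len(best) < k:
                heapq.heappush(best, item)
            elif item > best[0]:
                heapq.heapreplace(best, item)
        # Upper bound on the score of any record not yet seen in any list.
        threshold = sum(
            w * records[lists[j][pos]][j] for j, w in enumerate(weights)
        )
        if len(best) == k and best[0][0] >= threshold:
            break  # no unseen record can beat the current k-th score
    return sorted(best, reverse=True), depth


# Hypothetical sample data: five records with two integer attributes.
sample = {
    "a": (90, 80),
    "b": (80, 90),
    "c": (10, 20),
    "d": (30, 10),
    "e": (50, 50),
}
top2, depth = threshold_topk(sample, weights=(2, 1), k=2)
print(top2, depth)
```

In this sketch the scan stops as soon as the k-th best score reaches the threshold, so `depth` reports how far each list had to be read. The abstract's depth-estimation model targets that quantity, with the aim of fetching the needed list prefixes through sequential or batched I/Os.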