RanKloud: a scalable ranked query processing framework on hadoop

Authors:
K. Selçuk Candan;Parth Nagarkar;Mithila Nagendra;Renwei Yu
Affiliations:
CIDSE, Arizona State University, Tempe, AZ;CIDSE, Arizona State University, Tempe, AZ;CIDSE, Arizona State University, Tempe, AZ;CIDSE, Arizona State University, Tempe, AZ
Venue:
Proceedings of the 14th International Conference on Extending Database Technology
Year:
2011

Citing 10
Cited 2

Closest pair queries in spatial databases

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Optimal aggregation algorithms for middleware

Journal of Computer and System Sciences - Special issu on PODS 2001
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Angle-based space partitioning for efficient parallel skyline computation

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Pairwise document similarity in large collections with MapReduce

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

RanKloud: scalable multimedia and social media retrieval and analysis in the cloud

Proceedings of the 9th workshop on Large-scale and distributed informational retrieval
Efficient parallel kNN joins for large data in MapReduce

Proceedings of the 15th International Conference on Extending Database Technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

The popularity of batch-oriented cluster architectures like Hadoop is on the rise. These batch-based systems successfully achieve high degrees of scalability by carefully allocating resources and leveraging opportunities to parallelize basic processing tasks. However, they are known to fall short in certain application domains such as large scale media analysis. In these applications, the utility of a given data element plays a vital role in a particular analysis task, and this utility most often depends on the way the data is collected or interpreted. However, existing batch data processing frameworks do not consider data utility in allocating resources, and hence fail to optimize for ranked/top-k query processing in which the user is interested in obtaining a relatively small subset of the best result instances. A naïve implementation of these operations on an existing system would need to enumerate more candidates than needed, before it can filter out the k best results. We note that such waste can be avoided by utilizing utility-aware task partitioning and resource allocation strategies that can prune unpromising objects from consideration. In this demonstration, we introduce RanKloud, an efficient and scalable utility-aware parallel processing system built for the analysis of large media datasets. RanKloud extends Hadoop's MapReduce paradigm to provide support for ranked query operations, such as k-nearest neighbor and k-closest pair search, skylines, skyline-joins, and top-k join processing.