Autonomously improving query evaluations over multidimensional data in distributed hash tables

Authors:
Matthew Malensek;Sangmi Pallickara;Shrideep Pallickara
Affiliations:
Colorado State University, Fort Collins, CO;Colorado State University, Fort Collins, CO;Colorado State University, Fort Collins, CO
Venue:
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
Year:
2013

References (17):

Chord: A scalable peer-to-peer lookup service for internet applications
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications

A scalable content-addressable network
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications

Data Management: NetCDF: an Interface for Scientific Data Access
IEEE Computer Graphics and Applications

Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems
Middleware '01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg

The Statistical Properties of Host Load
LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers

Canon in G Major: Designing DHTs with Hierarchical Structure
ICDCS '04 Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS'04)

Brief announcement: prefix hash tree
Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing

Mercury: supporting scalable multi-attribute range queries
Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications

Workload-Aware Load Balancing for Clustered Web Servers
IEEE Transactions on Parallel and Distributed Systems

Dynamo: Amazon's highly available key-value store
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles

A demonstration of SciDB: a science-oriented DBMS
Proceedings of the VLDB Endowment

Cassandra: a decentralized structured storage system
ACM SIGOPS Operating Systems Review

Overview of SciDB: large scale array storage, processing and analysis
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Replication, load balancing and efficient range query processing in DHTs
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology

Galileo: A Framework for Distributed Storage of High-Throughput Data Streams
UCC '11 Proceedings of the 2011 Fourth IEEE International Conference on Utility and Cloud Computing

Expressive Query Support for Multidimensional Data in Distributed Hash Tables
UCC '12 Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing

Exploiting geospatial and chronological characteristics in data streams to enable efficient storage and retrievals
Future Generation Computer Systems

Cited by (1):

Polygon-Based Query Evaluation over Geospatial Data Using Distributed Hash Tables
UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing

Abstract

The proliferation of observational devices and sensors with networking capabilities has led to growth in both the rates and sources of data that ultimately contribute to extreme-scale data volumes. Datasets generated in such settings are often multidimensional, with each dimension accounting for a feature of interest. We posit that efficient evaluation of queries over such datasets must account for both the distribution of data values and the patterns in the queries themselves. Configuring query evaluation by hand is infeasible given the data volumes, dimensionality, and the rates at which new data and queries arrive. In this paper, we describe our algorithm to autonomously improve query evaluations over voluminous, distributed datasets. Our approach autonomously tunes for the most dominant query patterns and distribution of values across a dimension. We evaluate our algorithm in the context of our system, Galileo, which is a hierarchical distributed hash table used for managing geospatial, time-series data. Our system strikes a balance between memory utilization, fast evaluations, and search space reductions. Empirical evaluations reported here are performed on a dataset that is multidimensional and comprises a billion files. The schemes described in this work are broadly applicable to any system that leverages distributed hash tables as a storage mechanism.