Autonomously improving query evaluations over multidimensional data in distributed hash tables

  • Authors: Matthew Malensek, Sangmi Pallickara, Shrideep Pallickara

  • Affiliations: Colorado State University, Fort Collins, CO (all authors)

  • Venue: Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference

  • Year: 2013

Abstract

The proliferation of observational devices and sensors with networking capabilities has led to growth in both the rates and sources of data that ultimately contribute to extreme-scale data volumes. Datasets generated in such settings are often multidimensional, with each dimension accounting for a feature of interest. We posit that efficient evaluation of queries over such datasets must account for both the distribution of data values and the patterns in the queries themselves. Configuring query evaluation by hand is infeasible given the data volumes, dimensionality, and the rates at which new data and queries arrive. In this paper, we describe our algorithm to autonomously improve query evaluations over voluminous, distributed datasets. Our approach autonomously tunes for the most dominant query patterns and the distribution of values across a dimension. We evaluate our algorithm in the context of our system, Galileo, a hierarchical distributed hash table used for managing geospatial, time-series data. Our system strikes a balance between memory utilization, fast evaluations, and search space reductions. The empirical evaluations reported here are performed on a multidimensional dataset comprising a billion files. The schemes described in this work are broadly applicable to any system that leverages distributed hash tables as a storage mechanism.