DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems

  • Authors:
  • Jiang Du

  • Affiliations:
  • University of Toronto, Toronto, ON, Canada

  • Venue:
  • Proceedings of the 2013 SIGMOD/PODS Ph.D. Symposium
  • Year:
  • 2013

Abstract

While originally proposed to provide fault tolerance and scalability for data analysis queries on unstructured data over massive clusters, MapReduce systems are now used to analyze rich combinations of unstructured, semi-structured, and structured data. To perform well on these new workloads, MapReduce systems (and the distributed file systems on which they are built) can no longer rely on static data placement strategies. In this thesis, we propose new physical data independence and adaptive data tuning solutions that can greatly improve the performance of analysis queries in systems where workloads are not static and may include complex queries with overlapping or related computations (subqueries). Building on the work on physical data independence in relational systems, we propose novel strategies that recognize the central role of data partitioning (and co-partitioning) in shared-nothing distributed file systems.