DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems

Authors:
Jiang Du
Affiliations:
University of Toronto, Toronto, ON, Canada
Venue:
Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
Year:
2013

Citing 21
Cited 0

Multiple-query optimization

ACM Transactions on Database Systems (TODS)
Optimizing queries using materialized views: a practical, scalable solution

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
The Skyline Operator

Proceedings of the 17th International Conference on Data Engineering
The GMAP: a versatile tool for physical data independence

The VLDB Journal — The International Journal on Very Large Data Bases
Answering queries using views: A survey

The VLDB Journal — The International Journal on Very Large Data Bases
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
AutoPart: Automating Schema Design for Large Scientific Databases Using Data Partitioning

SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management
The making of TPC-DS

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
DBToaster: a SQL compiler for high-performance delta processing in main-memory databases

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
An architecture for recycling intermediates in a column-store

ACM Transactions on Database Systems (TODS)
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
Nectar: automatic management of data and computation in datacenters

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
ReStore: reusing results of MapReduce jobs

Proceedings of the VLDB Endowment
DBToaster: higher-order delta processing for dynamic, frequently fresh views

Proceedings of the VLDB Endowment
BigBench: towards an industry standard benchmark for big data analytics

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Quantified Score

Hi-index	0.00

Visualization

Abstract

While originally proposed to provide fault-tolerance and scalability for data analysis queries on unstructured data over massive clusters, MapReduce systems today are being used for analysis of rich combinations of unstructured, semi-structured and structured data. To achieve performance on these new workloads, MapReduce systems (and the distributed file systems on which they are built) can no longer rely on static data placement strategies. In this thesis, we propose new physical data independence and adaptive data tuning solutions that can greatly improve the performance of analysis queries in systems where workloads are not static and where workloads may include complex queries with overlapping or related computations (subqueries). While profiting from the work on physical data independence in relational systems, we propose novel strategies that recognize the central role of data partitioning (and co-partitioning) in shared-nothing distributed file systems.