CoScan: cooperative scan sharing in the cloud

Authors:
Xiaodan Wang; Christopher Olston; Anish Das Sarma; Randal Burns
Affiliations:
Johns Hopkins University, Baltimore, MD; Yahoo! Research, Sunnyvale, CA; Google Research, Mountain View, CA; Johns Hopkins University, Baltimore, MD
Venue:
Proceedings of the 2nd ACM Symposium on Cloud Computing
Year:
2011

Abstract

We present CoScan, a scheduling framework that eliminates redundant processing in workflows that scan large batches of data in a map-reduce computing environment. CoScan merges Pig programs from multiple users at runtime to reduce I/O contention while adhering to soft deadline requirements in scheduling. This includes support for join workflows that operate on multiple data sources. Our solution maps well to workflows at many Internet companies which reuse data from a common set of inputs. Experiments on the PigMix data analytics benchmark exhibit orders of magnitude reduction in resource contention with minimal impact on latency.
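
The idea of merging programs that read the same input, subject to soft deadlines, can be illustrated with a small sketch. This is not CoScan's scheduling algorithm, which the abstract does not specify; it is a simple greedy batching heuristic written for illustration only. The `Job` record, the `plan_shared_scans` function, and the interpretation of a deadline as the latest acceptable start time of a job's scan are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; the field names are illustrative assumptions."""
    name: str
    dataset: str     # input that the job scans
    arrival: float   # submission time
    deadline: float  # assumed meaning: latest acceptable start of the scan


def plan_shared_scans(jobs):
    """Greedily batch jobs on the same dataset into one shared scan.

    Each batch starts at the earliest deadline among its members. A job joins
    the open batch only if it arrives no later than that start time; otherwise
    the open batch is closed and a new one begins. Assumes that every job has
    deadline >= arrival.
    """
    by_dataset = defaultdict(list)
    for job in jobs:
        by_dataset[job.dataset].append(job)

    batches = []  # list of (dataset, start_time, [jobs sharing the scan])
    for dataset, group in by_dataset.items():
        group.sort(key=lambda j: j.arrival)
        current, start = [], None
        for job in group:
            if current and job.arrival <= start:
                current.append(job)
                start = min(start, job.deadline)
            else:
                if current:
                    batches.append((dataset, start, current))
                current, start = [job], job.deadline
        if current:
            batches.append((dataset, start, current))
    return batches


if __name__ == "__main__":
    demo = [
        Job("a", "clicks", arrival=0, deadline=10),
        Job("b", "clicks", arrival=5, deadline=20),
        Job("c", "clicks", arrival=15, deadline=30),
        Job("d", "users", arrival=1, deadline=4),
    ]
    for dataset, start, members in plan_shared_scans(demo):
        print(dataset, start, [j.name for j in members])
```

In this toy setting, jobs that scan the same dataset and overlap in time share a single pass over the input, so the number of scans drops below the number of jobs. A real scheduler would also have to account for scan duration, cluster capacity, and multi-input join workflows, none of which this sketch models.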