We study how best to schedule scans of large data files in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily over a shared set of files. As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share I/O work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.
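The intuition behind scheduling nonsharable scans first can be illustrated with a toy simulation. The sketch below is not the family of policies derived in the paper. It assumes a single scanner, that one scan serves every job already waiting on that file when the scan starts, and that each file has a known expected arrival rate. The names `simulate`, `scan_time`, `rate`, and the "least sharable first" rule are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def simulate(jobs, scan_time, pick):
    """Run one scan at a time. A scan of file f serves every job that was
    already waiting on f when the scan started (a shared scan)."""
    todo = sorted(jobs)                      # (arrival time, file name) pairs
    now, scans, i = 0.0, 0, 0
    waiting = defaultdict(list)              # file -> arrival times of waiting jobs
    responses = []
    while i < len(todo) or waiting:
        # Admit every job that has arrived by the current time.
        while i < len(todo) and todo[i][0] <= now:
            waiting[todo[i][1]].append(todo[i][0])
            i += 1
        if not waiting:
            now = todo[i][0]                 # idle until the next arrival
            continue
        f = pick(sorted(waiting))            # the policy chooses the next file
        batch = waiting.pop(f)
        now += scan_time[f]
        scans += 1
        responses.extend(now - arrival for arrival in batch)
    return {"scans": scans, "makespan": now,
            "mean_response": sum(responses) / len(responses)}

# Assumed toy workload: file A is requested once, file B is requested repeatedly.
scan_time = {"A": 4, "B": 3}
rate = {"A": 0.0, "B": 0.5}                  # hypothetical expected arrival rates
jobs = [(0, "A"), (0, "B"), (3, "B")]

# Baseline: shortest-job-first, which ignores sharing opportunities.
sjf = simulate(jobs, scan_time, lambda files: min(files, key=scan_time.get))

# Illustrative rule: scan the file with the lowest expected arrival rate first,
# so that requests for the popular file accumulate and share a single scan.
aware = simulate(jobs, scan_time,
                 lambda files: min(files, key=lambda f: (rate[f], scan_time[f])))

print("shortest-job-first:  ", sjf)
print("least-sharable-first:", aware)
```

In this assumed workload, shortest-job-first scans file B twice, while the sharing-aware rule delays B long enough for both requests to be served by one scan. It therefore finishes the same work with fewer scans and a shorter makespan. A realistic policy would also have to bound the wait imposed on jobs for popular files, which this sketch does not attempt.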