Evaluating the performance of large compute clusters requires benchmarks with representative workloads. At Google, performance benchmarks are used to obtain metrics such as task scheduling delays and machine resource utilizations in order to assess changes in application code, machine configurations, and scheduling algorithms. Existing approaches to workload characterization for high performance computing and grids focus on task resource requirements for CPU, memory, disk, I/O, network, etc. Such resource requirements capture how much of each resource a task consumes. However, in addition to resource requirements, Google workloads commonly include task placement constraints that determine which machine resources a task may consume. Task placement constraints arise from task dependencies on machine properties such as hardware architecture and kernel version. This paper develops methodologies for incorporating task placement constraints and machine properties into performance benchmarks of large compute clusters. Our studies of Google compute clusters show that constraints increase average task scheduling delays by a factor of 2 to 6, which often results in tens of minutes of additional task wait time. To understand why, we extend the concept of resource utilization to include constraints by introducing a new metric, the Utilization Multiplier (UM). UM is the ratio of the resource utilization seen by tasks with a constraint to the average utilization of the resource. UM provides a simple model of the performance impact of constraints: task scheduling delays increase with UM. Finally, we describe how to synthesize representative task constraints and machine properties, and how to incorporate this synthesis into existing performance benchmarks.
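The UM metric described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the machine representation (a list of dicts with hypothetical `props`, `used`, and `capacity` fields) and a constraint expressed as required property values are assumptions made for the example.

```python
def utilization_multiplier(machines, constraint, resource):
    """Utilization Multiplier (UM) for one constraint and one resource.

    UM is the utilization of `resource` on the machines that satisfy
    `constraint`, divided by the average utilization of `resource`
    across all machines. UM == 1 means the constraint is performance-
    neutral; larger values indicate higher contention for eligible
    machines, and hence longer task scheduling delays.

    `machines` is a list of dicts shaped like (illustrative schema):
      {"props": {"arch": "x86"}, "used": {"cpu": 0.9}, "capacity": {"cpu": 1.0}}
    `constraint` maps required machine-property names to values.
    """
    def utilization(subset):
        used = sum(m["used"][resource] for m in subset)
        cap = sum(m["capacity"][resource] for m in subset)
        return used / cap if cap else 0.0

    eligible = [m for m in machines
                if all(m["props"].get(k) == v for k, v in constraint.items())]
    average = utilization(machines)
    return utilization(eligible) / average if average else float("inf")
```

For example, if machines with `arch=x86` run at 90% CPU utilization while the cluster average is 60%, tasks constrained to `arch=x86` see a UM of 1.5 for CPU.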
Using synthetic task constraints and machine properties generated by our methodology, we accurately reproduce performance metrics for benchmarks of Google compute clusters with a discrepancy of only 13% in task scheduling delay and 5% in resource utilization.
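One simple way to synthesize representative task constraints, as a hedged sketch rather than the paper's actual procedure, is to sample constraints for benchmark tasks from the empirical frequency distribution observed in a production trace. The constraint names and the single-constraint-per-task simplification below are illustrative assumptions.

```python
import random


def synthesize_constraints(observed_freqs, n_tasks, seed=0):
    """Draw a synthetic constraint for each of `n_tasks` benchmark tasks.

    `observed_freqs` maps a constraint label (illustrative, e.g.
    "arch=x86") to its relative frequency in a production trace.
    Sampling with these weights makes the synthetic workload reproduce
    the observed constraint mix in expectation.
    """
    rng = random.Random(seed)  # fixed seed for reproducible benchmarks
    labels = list(observed_freqs)
    weights = [observed_freqs[c] for c in labels]
    return [rng.choices(labels, weights=weights)[0] for _ in range(n_tasks)]
```

Pairing constraints synthesized this way with machine properties sampled from the same trace is what lets a benchmark approach the reported fidelity (13% discrepancy in task scheduling delay, 5% in resource utilization).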