Analyzing throughput and utilization on Trestles

Authors:
Richard L. Moore;Adam Jundt;Leonard K. Carson;Kenneth Yoshimoto;Amin Ghadersohi;William S. Young
Affiliations:
San Diego Supercomputer Center, U California San Diego, La Jolla, CA;San Diego Supercomputer Center, U California San Diego, La Jolla, CA;San Diego Supercomputer Center, U California San Diego, La Jolla, CA;San Diego Supercomputer Center, U California San Diego, La Jolla, CA;Ctr for Computational Research, SUNY Buffalo, Buffalo, New York;San Diego Supercomputer Center, U California San Diego, La Jolla, CA
Venue:
Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Year:
2012

Abstract

The Trestles system is targeted to modest-scale and gateway users, and is designed to enhance users' productivity by maintaining good turnaround time as well as other user-friendly features such as long run times and user reservations. However, the goal of maintaining good throughput competes with the goal of high system utilization. This paper analyzes one year of Trestles operations to characterize the empirical relationship between utilization and throughput, with the objectives of understanding their relationship, and informing allocations and scheduling policies to optimize their tradeoff. There is considerable scatter in the correlation between utilization and throughput, as measured by expansion factor. There are periods of good throughput at both low and high utilizations, while there are other periods when throughput degrades significantly not only at high utilization but even at low utilization. However, throughput consistently degrades above ~90% utilization. User behavior clearly impacts the expansion factor metrics: the great majority of jobs with extreme expansion factors are associated with a very small fraction of users who either (1) flood the queue with many jobs or (2) request job run times far in excess of actual run times. While the former is a user workflow choice, the latter clearly demonstrates the benefit of matching requested time to actual run time. Utilization and throughput metrics derived from XDMoD are compared for Trestles with two other XSEDE systems, Ranger and Kraken, with different sizes and allocation/scheduling policies. Both Ranger and Kraken have generally higher utilization and, not surprisingly, higher expansion factors than Trestles over the analysis period. As a result of this analysis, we intend to increase the target allocation fraction from the current 70% to ~75-80%, and strongly advise users to reasonably match requested run times to actual run times.
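The abstract uses expansion factor as its throughput metric without defining it. The following is a minimal sketch, assuming the common scheduling definition (wait time plus actual run time, divided by actual run time); the paper and XDMoD may compute it differently. The `Job` class, its field names, and the two example jobs are hypothetical and are not data from the study. The sketch illustrates why a job that requests far more time than it uses can end up with an extreme expansion factor: a long requested time tends to lengthen the queue wait, while the short actual run time sits in the denominator.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions for illustration."""
    wait_hours: float       # time spent queued before the job started
    run_hours: float        # actual run time
    requested_hours: float  # wall time the user asked for


def expansion_factor(job: Job) -> float:
    """Assumed definition: (wait + run) / run. A value of 1.0 means no queue delay."""
    return (job.wait_hours + job.run_hours) / job.run_hours


def overrequest_ratio(job: Job) -> float:
    """Ratio of requested wall time to actual run time."""
    return job.requested_hours / job.run_hours


# Illustrative jobs only, not measurements from Trestles.
well_matched = Job(wait_hours=2.0, run_hours=8.0, requested_hours=10.0)
overrequested = Job(wait_hours=12.0, run_hours=0.5, requested_hours=48.0)

print(expansion_factor(well_matched))    # 1.25
print(expansion_factor(overrequested))   # 25.0
print(overrequest_ratio(overrequested))  # 96.0
```

Under this assumed definition, the same queue wait produces a larger expansion factor for a shorter job, so matching requested time to actual run time helps in two ways: it gives the scheduler a job it can place sooner, and it avoids a long wait being measured against a very short run.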