Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. Because these machines are typically "space-shared," each job must wait in a queue until sufficient processor resources become available to service it. In production computing settings, the queuing delay (experienced by users as the time between when a job is submitted and when it begins execution) is highly variable. Users often find that this variability hinders productivity, because it makes planning difficult and intellectual continuity hard to maintain. In this work, we introduce an on-line system for predicting batch-queue delay and show that it generates correct and accurate bounds on queuing delay for batch jobs from 11 machines over a 9-year period. Our system comprises four novel, interacting components: a predictor based on nonparametric inference; an automated change-point detector; machine-learned, model-based clustering of jobs having similar characteristics; and an automatic downtime detector that identifies systemic failures affecting job queuing delay. We compare the correctness and accuracy of our system against various previously used prediction techniques and show that our new method outperforms them on every machine available to us for study.
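The abstract does not spell out how the nonparametric predictor computes its bounds, so the following is only an illustrative sketch of one standard distribution-free approach: an upper confidence bound on a quantile of queuing delay, taken from the order statistics of past wait times using the binomial distribution. It assumes the historical waits behave like independent samples from one unchanging distribution (the kind of assumption that change-point detection and job clustering would help justify). The function names `binom_cdf` and `quantile_upper_bound`, and the default quantile and confidence levels, are hypothetical and not taken from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_upper_bound(waits, quantile=0.95, confidence=0.95):
    """Distribution-free upper bound on a quantile of queuing delay.

    Returns the smallest observed wait that is at least as large as the
    true `quantile` of the delay distribution with probability at least
    `confidence`, or None when the history is too short for such a bound.
    Assumes the waits are i.i.d. samples (an assumption of this sketch).
    """
    n = len(waits)
    ordered = sorted(waits)
    for k in range(1, n + 1):
        # The k-th smallest wait is >= the true quantile exactly when at
        # most k-1 samples fall below that quantile; that count follows
        # a Binomial(n, quantile) distribution.
        if binom_cdf(k - 1, n, quantile) >= confidence:
            return ordered[k - 1]
    return None


if __name__ == "__main__":
    history = list(range(1, 60))  # 59 hypothetical past waits, in seconds
    print(quantile_upper_bound(history))
```

Under these assumptions, at least 59 past observations are needed before any observed wait can serve as a 95%-confidence bound on the 0.95 quantile; with fewer, the function returns None rather than guess. A "correct" predictor in the abstract's sense would be one whose bound is exceeded by no more than the stated fraction of subsequent jobs.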