Adaptive Memory Allocations in Clusters to Handle Unexpectedly Large Data-Intensive Jobs

Authors:
Li Xiao; Songqing Chen; Xiaodong Zhang
Affiliations:
Dept. of Comput. Sci., Michigan State Univ., East Lansing, MI, USA;-;-
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2004

Citing 19
Cited 7

The limited performance benefits of migrating active processes for load sharing

SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Implementing global memory management in a workstation cluster

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Exploiting process lifetime distributions for dynamic load balancing

ACM Transactions on Computer Systems (TOCS)
Availability and utility of idle memory in workstation clusters

SIGMETRICS '99 Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
The impact of job memory requirements on gang-scheduling performance

ACM SIGMETRICS Performance Evaluation Review
The impact of job arrival patterns on parallel scheduling

ACM SIGMETRICS Performance Evaluation Review
An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster

IEEE Transactions on Parallel and Distributed Systems
A hierarchical load-balancing framework for dynamic multithreaded computations

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Operating System Concepts, 4th Ed.
Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory Demands

IEEE Transactions on Parallel and Distributed Systems
The Interaction between Memory Allocation and Adaptive Partitioning in Message-Passing Multicomputers

IPPS '95 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC/860

IPPS '95 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
TPF: a dynamic system thrashing protection facility

Software—Practice & Experience
Effects of clock resolution on the scheduling of interactive and soft real-time processes

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Classifying scheduling policies with respect to unfairness in an M/GI/1

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Incorporating Job Migration and Network RAM to Share Cluster Memory Resources

HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing
Gang Scheduling with Memory Considerations

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Improving Distributed Workload Performance by Sharing Both CPU and Memory Resources

ICDCS '00 Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS 2000)
Adaptive and Virtual Reconfigurations for Effective Dynamic Job Scheduling in Cluster Systems

ICDCS '02 Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02)

System Support to Balance the Resource Supply and Demand in High-end Computing

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
A general framework to understand parallel performance in heterogeneous clusters: analysis of a new adaptive parallel genetic algorithm

Journal of Parallel and Distributed Computing
Effectively Utilizing Global Cluster Memory for Large Data-Intensive Parallel Programs

IEEE Transactions on Parallel and Distributed Systems
Winner Price Monotonicity for Approximated Combinatorial Auctions

WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
Fine-grained efficient resource allocation using approximated combinatorial auctions: A parallel greedy winner approximation for large-scale problems

Web Intelligence and Agent Systems
A novel adaptive fuzzy load balancer for heterogeneous LAM/MPI clusters applied to evolutionary learning in neuro-fuzzy systems

FUZZ-IEEE'09 Proceedings of the 18th international conference on Fuzzy Systems
An experimental analysis of biased parallel greedy approximation for combinatorial auctions

International Journal of Intelligent Information and Database Systems

Abstract

In a cluster system with dynamic load sharing support, whether a job is submitted or migrated to a workstation is determined by the CPU and memory resources available on that workstation at the time [21]. In such a system, a small number of running jobs with unexpectedly large memory allocation requirements can significantly increase the queuing delays of the remaining jobs with normal memory requirements, slowing the execution of each individual job and reducing system throughput. We call this phenomenon the job blocking problem because the large jobs hold back the execution of the majority of jobs in the cluster. Because the memory demands of jobs may not be known in advance and may change dynamically, unsuitable job submissions or migrations are likely to cause the blocking problem, and existing load sharing schemes cannot handle it effectively. We propose two schemes to address this problem. The first scheme, network RAM supported load sharing, combines job migration with network RAM: it uses remote execution to initially allocate a job to the most lightly loaded workstation and, if necessary, uses network RAM to give the job a global memory space larger than would otherwise be available. This scheme has the merits of both job migration and network RAM, and our experiments show its effectiveness and scalability. However, it requires a network RAM facility in the cluster, which may add overhead and increase cluster network traffic. To address this limitation, we propose a second scheme, memory reservation, incorporated with dynamic load sharing, which adaptively reserves a small set of workstations to provide special services to jobs demanding large memory allocations. As soon as the memory reservation scheme resolves the blocking problem, the system adaptively switches back to the normal load sharing state.
Both schemes target large data-intensive jobs in clusters and are mutually complementary. The network RAM supported load sharing scheme can fully utilize the cluster's global memory space, while the memory reservation scheme has the advantages of simple implementation and low overhead. Thus, both can be effective alternatives and can be practically deployed in cluster computing under different system conditions.
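
The two schemes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the function names, the use of free memory as the sole measure of "most lightly loaded", the `blocked` flag, and the parameter `k` are all assumptions introduced here for illustration. The abstract does not say how blocking is detected or how reserved workstations are chosen.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical workstation model; memory sizes are in MB."""
    name: str
    mem_total: int
    mem_used: int = 0
    reserved: bool = False  # True while set aside for large-memory jobs

    @property
    def mem_free(self) -> int:
        return self.mem_total - self.mem_used


def place_with_network_ram(nodes: List[Node],
                           demand: int) -> Optional[Tuple[str, int, int]]:
    """Scheme 1 (sketch): run the job on the node with the most free memory
    and borrow idle memory from other nodes if local memory is not enough.

    Returns (node_name, local_mb, remote_mb), or None if the idle memory of
    the whole cluster cannot cover the demand (nothing is allocated then).
    """
    # Assumption: "most lightly loaded" is approximated by free memory only.
    target = max(nodes, key=lambda n: n.mem_free)
    others = [n for n in nodes if n is not target]
    local = min(demand, target.mem_free)
    remote = demand - local
    if remote > sum(n.mem_free for n in others):
        return None
    target.mem_used += local
    # Borrow remote memory from the nodes with the most idle memory first.
    left = remote
    for n in sorted(others, key=lambda n: n.mem_free, reverse=True):
        if left == 0:
            break
        take = min(left, n.mem_free)
        n.mem_used += take
        left -= take
    return (target.name, local, remote)


def update_reservation(nodes: List[Node], blocked: bool, k: int = 1) -> None:
    """Scheme 2 (sketch): while blocking is observed, reserve the k nodes
    with the most free memory; otherwise return to normal load sharing."""
    for n in nodes:
        n.reserved = False
    if blocked:
        for n in sorted(nodes, key=lambda n: n.mem_free, reverse=True)[:k]:
            n.reserved = True


def eligible_nodes(nodes: List[Node], is_large: bool) -> List[Node]:
    """Large jobs go to reserved nodes when any exist; all other jobs
    (and large jobs when nothing is reserved) use the unreserved nodes."""
    reserved = [n for n in nodes if n.reserved]
    if is_large and reserved:
        return reserved
    return [n for n in nodes if not n.reserved]
```

In this sketch, a caller would invoke `update_reservation` whenever its own (unspecified) blocking detector changes state, and then restrict placement to `eligible_nodes(...)`. The trade-off the abstract describes shows up directly: the first function touches memory on several nodes, while the second only flips a flag per node.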