Middleware support for many-task computing

  • Authors:
  • Ioan Raicu;Ian Foster;Mike Wilde;Zhao Zhang;Kamil Iskra;Peter Beckman;Yong Zhao;Alex Szalay;Alok Choudhary;Philip Little;Christopher Moretti;Amitabh Chaudhary;Douglas Thain

  • Affiliations:
  • Northwestern University, Evanston, USA;University of Chicago, Chicago, USA and Argonne National Laboratory, Argonne, USA;University of Chicago, Chicago, USA and Argonne National Laboratory, Argonne, USA;University of Chicago, Chicago, USA;University of Chicago, Chicago, USA and Argonne National Laboratory, Argonne, USA;University of Chicago, Chicago, USA and Argonne National Laboratory, Argonne, USA;Microsoft, Redmond, USA;Johns Hopkins University, Baltimore, USA;Northwestern University, Evanston, USA;University of Notre Dame, Notre Dame, USA;University of Notre Dame, Notre Dame, USA;University of Notre Dame, Notre Dame, USA;University of Notre Dame, Notre Dame, USA

  • Venue:
  • Cluster Computing
  • Year:
  • 2010

Abstract

Many-task computing aims to bridge the gap between two computing paradigms: high-throughput computing and high-performance computing. Many-task computing denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Traditional techniques used in production systems in the scientific community to support many-task computing do not scale to today's largest systems, due to issues in local resource manager scalability and granularity, efficient utilization of the raw hardware, long wait queue times, and shared/parallel file system contention and scalability. To address these limitations, we adopted a "top-down" approach to building a middleware called Falkon, to support the most demanding many-task computing applications at the largest scales. Falkon (Fast and Light-weight tasK executiON framework) integrates (1) multi-level scheduling to enable dynamic resource provisioning and minimize wait queue times, (2) a streamlined task dispatcher able to achieve orders-of-magnitude higher task dispatch rates than conventional schedulers, and (3) data diffusion, which performs data caching and uses a data-aware scheduler to co-locate computational and storage resources. Micro-benchmarks have shown Falkon to achieve throughput of over 15K tasks/s, scale to hundreds of thousands of processors and to millions of queued tasks, and execute billions of tasks per day. Data diffusion has also been shown to improve application scalability and performance, with its ability to achieve hundreds of Gb/s I/O rates on modest-sized clusters, with Tb/s I/O rates on the horizon.
Falkon has shown orders-of-magnitude improvements in performance and scalability over traditional approaches to resource management across many diverse workloads and applications, at scales of billions of tasks on hundreds of thousands of processors across clusters, specialized systems, Grids, and supercomputers. Falkon's performance and scalability have enabled a new class of applications, called Many-Task Computing, to operate with high efficiency at scales previously believed impossible.
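
The idea behind data diffusion, caching data on workers and dispatching tasks to where their input already resides, can be illustrated with a small sketch. This is not Falkon's implementation, which the abstract does not detail; the class `DataAwareScheduler`, its methods, and the worker and file names are all hypothetical, and the placement policy (first idle worker holding the file, else first idle worker) is an assumption chosen for brevity.

```python
from collections import defaultdict


class DataAwareScheduler:
    """Toy scheduler (hypothetical, not Falkon's code): prefer an idle
    worker that already caches the task's input file."""

    def __init__(self, workers):
        self.idle = list(workers)       # idle worker names, in arrival order
        self.cache = defaultdict(set)   # worker name -> set of cached file names

    def dispatch(self, input_file):
        """Choose a worker for a task that reads input_file.

        Returns (worker, cache_hit).
        """
        if not self.idle:
            raise RuntimeError("no idle workers")
        # First choice: an idle worker that already holds the file locally.
        for worker in self.idle:
            if input_file in self.cache[worker]:
                self.idle.remove(worker)
                return worker, True
        # Fallback: the first idle worker, which would fetch the file from
        # shared storage and keep a local copy for later tasks.
        worker = self.idle.pop(0)
        self.cache[worker].add(input_file)
        return worker, False

    def release(self, worker):
        """Mark a worker as idle again once its task has finished."""
        self.idle.append(worker)
```

In this sketch, a second task that reads the same file is routed to the worker that cached it on the first run, so the shared file system is touched once per file per worker rather than once per task. A real system would also need cache eviction and a policy for when the preferred worker is busy, neither of which is modeled here.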