Easing the management of data-parallel systems via adaptation

  • Authors:
  • David Petrou; Khalil Amiri; Gregory R. Ganger; Garth A. Gibson

  • Affiliations:
  • Carnegie Mellon University (all authors)

  • Venue:
  • EW 9: Proceedings of the 9th ACM SIGOPS European Workshop: Beyond the PC: New Challenges for the Operating System
  • Year:
  • 2000

Abstract

In recent years we have seen enormous growth in the size and prevalence of data processing workloads [Fayyad 1998, Gray 1997]. The picture that is becoming increasingly common is depicted in Figure 1: organizations or resourceful individuals provide services via a set of loosely coupled workstation nodes. The service is usually some form of data mining, such as searching, filtering, or image recognition. Clients, which could be machines running web browsers, not only initiate requests but also partake in the processing, with the goal of reducing request turnaround; that is, when the servers are overloaded, clients with spare cycles take on some of the computational burden. Naturally, many aspects of such a system cannot be determined at design time. For example, exactly how much work a client should do depends on the computational resources available at the client and the server cluster, the unused network bandwidth between them, and the workload demand. This position paper is concerned with this and other aspects that must be divined at run time to provide high performance and availability in data-parallel systems.

What makes system tuning especially hard is that it is not possible to find the right knob settings once and for all. A system upgrade or component failure may change the appropriate degree of data parallelism. Changes in usable bandwidth may call for a different partitioning of code between the client and the server cluster. Moreover, an application may go through distinct phases during its execution; for fault tolerance, we should checkpoint the application less often during those phases in which checkpointing takes longer. Finally, the system needs to allocate resources effectively among concurrent applications, which can start at any time and which benefit differently from those resources. In summary, we argue that in the future a significant fraction of computing will happen on architectures like that of Figure 1, and that, due to their inherent complexity, high availability and fast turnaround can be realized only by dynamically tuning a number of system parameters.

Our position is that this tuning should be provided automatically by the system. The contrasting, application-specific view contends that, to the extent possible, policies should be set by applications, since they can make more informed optimizations. However, that approach demands a great deal of sophistication from the programmer, and it consumes programmer time, one of the scarcest resources in systems building today.

Toward our goal, we contribute a framework that is rich enough to express a variety of interesting data-parallel applications, yet restricted enough that the system can tune itself. These applications are built atop the ABACUS migration system, whose object placement algorithms are extended to reason about how many nodes should participate in a data-parallel computation, how to split application objects between a client and a server cluster, how often program state should be checkpointed, and the (sometimes conflicting) interactions among these questions. By automatically determining a number of critical parameters at run time, we minimize the management costs that have in recent years given system administrators the howling fantods [Satyanarayanan 1999].
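
To make the partitioning decision above concrete, the sketch below chooses a client's share of the records so that the client and the server cluster finish at roughly the same time. It is a minimal illustration, not the ABACUS placement algorithm: it assumes the system can measure the server's and client's processing rates and the spare bandwidth between them, and every name in it (client_share, server_rate, and so on) is hypothetical.

    def client_share(server_rate, client_rate, bandwidth, bytes_per_record):
        """Fraction of records to ship to the client.

        server_rate      -- records/second the server cluster can process
        client_rate      -- records/second the client can process
        bandwidth        -- spare network bandwidth, in bytes/second
        bytes_per_record -- size of one unprocessed record, in bytes

        The client's effective rate is capped by the network, since it
        cannot process records faster than they arrive.  Equating the
        two sides' completion times, f / eff_client == (1 - f) / server_rate,
        gives f = eff_client / (eff_client + server_rate).
        """
        eff_client = min(client_rate, bandwidth / bytes_per_record)
        return eff_client / (eff_client + server_rate)

    # Hypothetical numbers: a cluster filtering 50,000 records/s, a client
    # able to filter 8,000 records/s, 10 MB/s of spare bandwidth, and 1 KB
    # records.  The network would cap the client at 10,000 records/s, so
    # here the client's own CPU is the binding constraint.
    f = client_share(server_rate=50_000, client_rate=8_000,
                     bandwidth=10_000_000, bytes_per_record=1_000)
    print(f"ship {f:.1%} of the records to the client")  # ~13.8%

Every input to this calculation drifts as machines are upgraded, components fail, and workloads shift, so the share cannot be fixed at design time; that is precisely the paper's argument for run-time adaptation.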
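Similarly, the claim that phases with expensive checkpoints should be checkpointed less often can be grounded with a classic first-order rule of thumb, Young's approximation for the checkpoint interval that minimizes expected overhead plus lost work. This formula is not from the paper; it is offered only to show how such a knob depends on measured quantities.

    import math

    def checkpoint_interval(checkpoint_cost, mtbf):
        """Young's first-order approximation of the optimal interval
        between checkpoints: sqrt(2 * C * M), where C is the time to
        take one checkpoint and M is the mean time between failures.
        (Illustrative rule of thumb; not the mechanism in the paper.)
        """
        return math.sqrt(2.0 * checkpoint_cost * mtbf)

    # A phase where checkpoints take 30 s, on nodes failing about once
    # a day, is best checkpointed every ~38 minutes:
    print(checkpoint_interval(30, 24 * 3600))   # ~2,277 s
    # A phase whose checkpoints take 4x as long should be checkpointed
    # only half as often (the interval scales with sqrt(C)):
    print(checkpoint_interval(120, 24 * 3600))  # ~4,554 s

The square-root dependence is the point: as an application moves between phases with cheap and expensive checkpoints, the right interval changes, which is another parameter the system must retune on the fly.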