Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Run-time tools are crucial to program development. In desktop computing environments, we take for granted the availability of tools for debugging, profiling, tracing, checkpointing, and visualization. When programs move into distributed or Grid environments, such tools are hard to find. The difficulty stems from the complex interactions required among the application program, the operating system, and the layers of job scheduling and process management software. As a result, each run-time tool must be individually ported to run under each job management system: for m tools and n environments, the work becomes an m × n effort rather than the hoped-for m + n effort. Variations in the underlying operating systems can make the problem even worse. The consequence is a paucity of tools in distributed and Grid computing environments. In response, we analyzed a variety of job scheduling environments and run-time tools to better understand their interactions. From this analysis, we isolated what we believe are the essential interactions among the run-time tool, the job scheduler and resource manager, and the application program. We propose a standard interface, called the Tool Dæmon Protocol (TDP), that codifies these interactions and provides the necessary communication functions. We have implemented a pilot TDP library and experimented with Parador, a prototype that uses the Paradyn Parallel Performance Tools to profile jobs running under the Condor batch-scheduling environment.