Performance metrics and auditing framework for high performance computer systems

  • Authors:
  • Thomas R. Furlani;Matthew D. Jones;Steven M. Gallo;Andrew E. Bruno;Charng-Da Lu;Amin Ghadersohi;Ryan J. Gentner;Abani K. Patra;Robert L. DeLeon;Gregor von Laszewski;Lizhe Wang;Ann Zimmerman

  • Affiliations:
  • Center for Computational Research, SUNY at Buffalo, Buffalo, NY;Center for Computational Research, SUNY at Buffalo, Buffalo, NY;Center for Computational Research, SUNY at Buffalo, Buffalo, NY;Center for Computational Research, SUNY at Buffalo, Buffalo, NY;Center for Computational Research, SUNY at Buffalo, Buffalo, NY;Center for Computational Research, SUNY at Buffalo, Buffalo, NY;Center for Computational Research, SUNY at Buffalo, Buffalo, NY;SUNY at Buffalo, Buffalo, NY;Center for Computational Research, SUNY at Buffalo, Buffalo, NY;Pervasive Technology Institute, Bloomington, IN;Pervasive Technology Institute, Bloomington, IN;University of Michigan, Ann Arbor, MI

  • Venue:
  • Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery
  • Year:
  • 2011

Abstract

This paper describes XDMoD, a comprehensive auditing framework that high performance computing centers can use to readily provide metrics on resource utilization (CPU hours, job size, wait time, etc.), resource performance, and the center's impact in terms of scholarship and research. This role-based auditing framework is designed to meet the following objectives: (1) provide the user community with an easy-to-use tool to oversee their allocations and optimize their use of resources, (2) provide staff with easy access to performance metrics and diagnostics to monitor and tune resource performance for the benefit of the users, (3) provide senior management with a tool to easily monitor utilization, user base, and performance of resources, and (4) help ensure that the resources are effectively enabling research and scholarship. XDMoD is initially focused on the NSF TeraGrid (TG) and follow-on XSEDE (XD) program, where it will become a key component of the TG/XSEDE User Portal. However, the auditing system is intended to be generally applicable to any HPC system or center. The XDMoD auditing system is architected as a set of modular components that facilitate the use of community-contributed components and information. It includes an active and reactive (as opposed to passive) service set accessible through a variety of endpoints, such as a web-based user interface, RESTful web services, and supplied development tools. One component also provides a computationally lightweight and flexible application kernel auditing system that uses best-in-class performance kernels to measure overall system performance with respect to the applications that users actually run. This allows continuous resource auditing to monitor all aspects of system performance, most critically from a completely user-centric point of view.
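As a rough illustration of the utilization metrics named in the abstract (CPU hours, job size, wait time), the sketch below aggregates them from a list of job records. It is not XDMoD code: the `JobRecord` fields, the `summarize` function, and the sample data are all hypothetical, and it assumes job size means the number of cores requested.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class JobRecord:
    # Hypothetical accounting record; field names are assumptions,
    # not taken from XDMoD or any particular resource manager.
    user: str
    cores: int
    submit: datetime
    start: datetime
    end: datetime


def cpu_hours(job: JobRecord) -> float:
    """Cores multiplied by wall-clock run time in hours."""
    return job.cores * (job.end - job.start).total_seconds() / 3600.0


def wait_hours(job: JobRecord) -> float:
    """Time spent queued between submission and start, in hours."""
    return (job.start - job.submit).total_seconds() / 3600.0


def summarize(jobs: list[JobRecord]) -> dict[str, float]:
    """Aggregate the three utilization metrics over a set of jobs."""
    if not jobs:
        return {"jobs": 0, "cpu_hours": 0.0,
                "mean_job_size": 0.0, "mean_wait_hours": 0.0}
    return {
        "jobs": len(jobs),
        "cpu_hours": sum(cpu_hours(j) for j in jobs),
        "mean_job_size": mean(j.cores for j in jobs),
        "mean_wait_hours": mean(wait_hours(j) for j in jobs),
    }


# Made-up sample data for illustration only.
t0 = datetime(2011, 1, 1, 0, 0)
jobs = [
    JobRecord("alice", 8, t0, t0 + timedelta(hours=1), t0 + timedelta(hours=3)),
    JobRecord("bob", 16, t0, t0 + timedelta(hours=3), t0 + timedelta(hours=4)),
]
summary = summarize(jobs)
print(summary)
```

A role-based view in the sense of the abstract could be approximated by filtering `jobs` by `user` before calling `summarize`, so that a user sees only their own jobs while staff see the whole set.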