Cell broadband engine processor performance optimization: tracing tools implementation and use

Authors:
M. Biberstein;S. Dori-Hacohen;Y. Harel;A. Heilper;B. Mendelson;U. Shvadron;E. Treister;J. Turek;M. S. Chang
Affiliations:
IBM Haifa Research Laboratory, Haifa, Israel;IBM Haifa Research Laboratory, Haifa, Israel;IBM Haifa Research Laboratory, Haifa, Israel;IBM Haifa Research Laboratory, Haifa, Israel;IBM Haifa Research Laboratory, Haifa, Israel;IBM Haifa Research Laboratory, Haifa, Israel;IBM Haifa Research Laboratory, Haifa, Israel;IBM Haifa Research Laboratory, Haifa, Israel;VMware Inc., Research & Development Performance, Palo Alto, California
Venue:
IBM Journal of Research and Development
Year:
2009


Abstract

Optimizing performance on multicore processors is a daunting task because of the increased importance of such factors as thread communication, memory contention, and memory access latency. This paper presents two tools that programmers and performance analysts can use to understand application performance on the Cell Broadband Engine® (Cell/B.E.) processor: the Performance Debugging Tool (PDT) and the Trace Analyzer (TA). The PDT traces user-space events and augments them with scheduling data from the operating system; the TA then reads, analyzes, and visually presents those traces. This paper describes the implementation issues that arise because the Cell/B.E. processor lacks a common low-overhead clock shared by all cores, which is essential for analysis and visualization. The TA employs an offline analysis to align the collected events to a common time based only on thread-local timestamps, event order, and context switch information. We also discuss the overhead of tracing and its impact on execution and performance analysis. We illustrate the use of the PDT and TA by analyzing several significant Cell/B.E. processor workloads, including native code and higher-level abstractions offered by the Data Communication and Synchronization services. We show how trace analysis can help identify performance issues in these workloads and how programmers can use it to spot performance antipatterns (common programming practices that lead to suboptimal performance).
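
The abstract does not spell out how the TA aligns events, so the following is only a minimal sketch of the general idea, not the TA's actual algorithm. It assumes that each thread has its own local clock, that global time equals local time plus an unknown per-thread offset, and that the trace yields "happened before" pairs across threads (for example, a message send and its matching receive). Under those assumptions the offsets form a system of difference constraints, which this sketch solves with Bellman-Ford relaxation. The function name `align_offsets`, the tuple layout, and the thread labels are hypothetical, and context switch information is ignored here.

```python
def align_offsets(threads, constraints):
    """Return one clock offset per thread so that every ordering holds.

    constraints is a list of (thread_a, t_a, thread_b, t_b) tuples, meaning
    the event at local time t_a on thread_a happened before the event at
    local time t_b on thread_b (hypothetical input format).
    """
    # t_a + off[a] <= t_b + off[b]  is the difference constraint
    # off[a] - off[b] <= t_b - t_a, i.e. an edge b -> a of weight t_b - t_a.
    edges = [(b, a, t_b - t_a) for (a, t_a, b, t_b) in constraints]

    # Starting every offset at 0 is equivalent to a virtual source node
    # connected to each thread by a zero-weight edge.
    offset = {t: 0 for t in threads}

    for _ in range(len(threads) + 1):
        changed = False
        for src, dst, weight in edges:
            if offset[src] + weight < offset[dst]:
                offset[dst] = offset[src] + weight
                changed = True
        if not changed:
            return offset

    # Still relaxing after this many passes means a negative cycle:
    # the recorded orderings contradict each other.
    raise ValueError("ordering constraints are inconsistent")


# Illustrative trace: "ppe" sends at local time 100 and "spe0" receives at
# local time 40; later "spe0" sends at 90 and "ppe" receives at 200.
example = [("ppe", 100, "spe0", 40), ("spe0", 90, "ppe", 200)]
offsets = align_offsets(["ppe", "spe0"], example)
for a, t_a, b, t_b in example:
    print(a, t_a + offsets[a], "<=", b, t_b + offsets[b])
```

Any solution of the constraints preserves the observed event order; a real tool would also need to choose among the valid solutions and to account for clock drift and tracing overhead, which this sketch leaves out.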