Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation

Authors:
Harish Patil;Robert Cohn;Mark Charney;Rajiv Kapoor;Andrew Sun;Anand Karunanidhi
Affiliations:
Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation
Venue:
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2004

Citing 12
Cited 60

ATOM: a system for building customized program analysis tools

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
RecPlay: a fully integrated practical record/replay system

ACM Transactions on Computer Systems (TOCS)
Measuring Experimental Error in Microprocessor Simulation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Automatically characterizing large scale program behavior

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Complete Computer System Simulation: The SimOS Approach

IEEE Parallel & Distributed Technology: Systems & Technology
SPEC CPU2000: Measuring CPU Performance in the New Millennium

Computer
Simics: A Full System Simulation Platform

Computer
Asim: A Performance Model Framework

Computer
Reducing State Loss For Effective Trace Sampling of Superscalar Processors

ICCD '96 Proceedings of the 1996 International Conference on Computer Design, VLSI in Computers and Processors
Large System Performance of SPEC OMP2001 Benchmarks

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling

Proceedings of the 30th annual international symposium on Computer architecture
Picking Statistically Valid and Early Simulation Points

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques

Toward an Evaluation Infrastructure for Power and Energy Optimizations

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 11 - Volume 12
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Computing Architectural Vulnerability Factors for Address-Based Structures

Proceedings of the 32nd annual international symposium on Computer Architecture
A hybrid hardware/software approach to efficiently determine cache coherence Bottlenecks

Proceedings of the 19th annual international conference on Supercomputing
Simulating Commercial Java Throughput Workloads: A Case Study

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Automatic logging of operating system effects to guide application-level architecture simulation

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Phase guided sampling for efficient parallel application simulation

CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Selective predicate prediction for out-of-order processors

Proceedings of the 20th annual international conference on Supercomputing
Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques

ACM Transactions on Architecture and Code Optimization (TACO)
Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization

Proceedings of the 34th annual international symposium on Computer architecture
Applying Statistical Sampling for Fast and Efficient Simulation of Commercial Workloads

IEEE Transactions on Computers
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Distributed order scheduling and its application to multi-core dram controllers

Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing
Prefetch-Aware DRAM Controllers

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
A case for bufferless routing in on-chip networks

Proceedings of the 36th annual international symposium on Computer architecture
Thread motion: fine-grained power management for multi-core systems

Proceedings of the 36th annual international symposium on Computer architecture
Architecture Design for Soft Errors

Architecture Design for Soft Errors
Application-aware prioritization mechanisms for on-chip networks

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Coordinated control of multiple prefetchers in multi-core systems

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Improving memory bank-level parallelism in the presence of prefetching

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
PinPlay: a framework for deterministic replay and reproducible analysis of parallel programs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Using dynamic binary instrumentation to generate multi-platform SimPoints: methodology and accuracy

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Phase complexity surfaces: characterizing time-varying program behavior

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
The design and architecture of MAQAOAdvisor: a live tuning guide

HiPC'08 Proceedings of the 15th international conference on High performance computing
High performance cache replacement using re-reference interval prediction (RRIP)

Proceedings of the 37th annual international symposium on Computer architecture
Aérgia: exploiting packet latency slack in on-chip networks

Proceedings of the 37th annual international symposium on Computer architecture
QuaLe: A Quantum-Leap Inspired Model for Non-stationary Analysis of NoC Traffic in Chip Multi-processors

NOCS '10 Proceedings of the 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip
Detecting phases in parallel applications on shared memory architectures

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
F2BFLY: an on-chip free-space optical network with wavelength-switching

Proceedings of the international conference on Supercomputing
FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations

NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
CRIB: consolidated rename, issue, and bypass

Proceedings of the 38th annual international symposium on Computer architecture
Prefetch-aware shared resource management for multi-core systems

Proceedings of the 38th annual international symposium on Computer architecture
Efficient sampling startup for sampled processor simulation

HiPEAC'05 Proceedings of the First international conference on High Performance Embedded Architectures and Compilers
When Prefetching Works, When It Doesn’t, and Why

ACM Transactions on Architecture and Code Optimization (TACO)
CRUISE: cache replacement and utility-aware scheduling

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
CRAM: coded registers for amplified multiporting

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A register-file approach for row buffer caches in die-stacked DRAMs

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SHiP: signature-based hit predictor for high performance caching

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
PACMan: prefetch-aware cache management for high performance caching

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multicore Memory Systems

ACM Transactions on Computer Systems (TOCS)
Characterizing time-varying program behavior using phase complexity surfaces

Transactions on High-Performance Embedded Architectures and Compilers IV
Trace-driven simulation of memory system scheduling in multithread application

Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
XIOSim: power-performance modeling of mobile x86 cores

Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design
Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)

Proceedings of the 39th Annual International Symposium on Computer Architecture
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems

Proceedings of the 39th Annual International Symposium on Computer Architecture
On-chip networks from a networking perspective: congestion and scalability in many-core interconnects

Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Optimal bypass monitor for high performance last-level caches

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
On-chip networks from a networking perspective: congestion and scalability in many-core interconnects

ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12
Software data-triggered threads

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Predicting Performance Impact of DVFS for Realistic Memory Systems

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Catnap: energy proportional multiple network-on-chip

Proceedings of the 40th Annual International Symposium on Computer Architecture
Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Fairness-aware scheduling on single-ISA heterogeneous multi-cores

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture

Journal of Parallel and Distributed Computing
Linearly compressed pages: a low-complexity, low-latency main memory compression framework

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Detailed modeling of the performance of commercial applications is difficult. The applications can take a very long time to run on real hardware and it is impractical to simulate them to completion on performance models. Furthermore, these applications have complex execution environments that cannot easily be reproduced on a simulator, making porting the applications to simulators difficult. We attack these problems using the well-known SimPoint methodology to find representative portions of an application to simulate, and a dynamic instrumentation framework called Pin to avoid porting altogether. Our system uses dynamic instrumentation instead of simulation to find representative portions - called Pin-Points - for simulation. Wehave developed a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs. We compared SimPoint-based selection to random selection of simulation points. We found for 95% of the SPEC2000 programs we tested, the PinPoints prediction was within 8% of the actual whole-program CPI, as opposed to 18% for random selection. We measure the end-to-end error, comparing real hardware to a performance model, and have a simple and efficient methodology to determine the step that introduced the error. Finally, we evaluate the system in the context of multiple configurations of real hardware, commercial applications, and industrial-strength performance models to understand the behavior of a complete and practical workload collection system. We have successfully used our system with many commercial Itanium® programs, some running for trillions of instructions, and have used the resulting traces for predicting performance of those applications on future Itanium processors.