Detailed modeling of the performance of commercial applications is difficult. The applications can take a very long time to run on real hardware, and it is impractical to simulate them to completion on performance models. Furthermore, these applications have complex execution environments that cannot easily be reproduced on a simulator, which makes porting them to simulators difficult. We attack these problems using the well-known SimPoint methodology to find representative portions of an application to simulate, and a dynamic instrumentation framework called Pin to avoid porting altogether. Our system uses dynamic instrumentation instead of simulation to find representative portions (called PinPoints) for simulation. We have developed a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs. We compared SimPoint-based selection to random selection of simulation points. We found that for 95% of the SPEC2000 programs we tested, the PinPoints prediction was within 8% of the actual whole-program CPI, as opposed to 18% for random selection. We measure the end-to-end error, comparing real hardware to a performance model, and provide a simple and efficient methodology for determining which step introduced the error. Finally, we evaluate the system in the context of multiple configurations of real hardware, commercial applications, and industrial-strength performance models to understand the behavior of a complete and practical workload collection system. We have successfully used our system with many commercial Itanium® programs, some running for trillions of instructions, and have used the resulting traces to predict the performance of those applications on future Itanium processors.
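The CPI comparison described above can be illustrated with a minimal sketch. It assumes the usual SimPoint-style estimate, in which each representative region carries a weight (the fraction of execution it stands for) and the whole-program CPI is predicted as the weighted average of the per-region CPIs. The function names and the sample numbers are hypothetical and are not taken from the toolkit or its results.

```python
def predict_cpi(points):
    """Predict whole-program CPI from (weight, cpi) pairs.

    Each pair describes one representative region: its weight and
    the CPI measured (or simulated) for that region. Weights are
    normalized here, so they need not sum to exactly 1.
    """
    total_weight = sum(weight for weight, _ in points)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight * cpi for weight, cpi in points) / total_weight


def relative_error(predicted, actual):
    """Relative error of a predicted CPI against the measured CPI."""
    if actual <= 0:
        raise ValueError("actual CPI must be positive")
    return abs(predicted - actual) / actual


if __name__ == "__main__":
    # Hypothetical regions: (weight, CPI of that region).
    sample_points = [(0.5, 1.0), (0.3, 2.0), (0.2, 1.5)]
    measured_cpi = 1.5  # hypothetical value from hardware counters

    predicted = predict_cpi(sample_points)
    error = relative_error(predicted, measured_cpi)
    print(f"predicted CPI = {predicted:.3f}, relative error = {error:.1%}")
```

In this sketch, a program would count as well predicted when the relative error falls under a chosen threshold, such as the 8% figure quoted above.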