Performance debugging shared memory multiprocessor programs with MTOOL
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
MemSpy: analyzing memory system bottlenecks in programs
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Integrating performance monitoring and communication in parallel computers
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Shared-memory performance profiling
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Continuous profiling: where have all the cycles gone?
Proceedings of the sixteenth ACM symposium on Operating systems principles
ProfileMe: hardware support for instruction-level profiling on out-of-order processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Hardware Support for Flexible Distributed Shared Memory
IEEE Transactions on Computers
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
TEST: a tracer for extracting speculative threads
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
A Programmable Co-processor for Profiling
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Transactional Memory Coherence and Consistency
Proceedings of the 31st annual international symposium on Computer architecture
Programming with transactional coherence and consistency (TCC)
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Unbounded Transactional Memory
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Characterization of TCC on Chip-Multiprocessors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Characterization of TCC on Chip-Multiprocessors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
A practical FPGA-based framework for novel CMP research
Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays
Transactional collection classes
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
ATLAS: a chip-multiprocessor with transactional memory support
Proceedings of the conference on Design, automation and test in Europe
Dependence-aware transactional memory for increased concurrency
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Debugging programs that use atomic blocks and transactional memory
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Discovering and understanding performance bottlenecks in transactional applications
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Transactional event profiling in a best-effort hardware transactional memory system
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Evaluation of two formulations of the conjugate gradients method with transactional memory
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Hi-index | 0.00 |
Transactional Coherence and Consistency (TCC) provides a new parallel programming model that uses transactions as the basic unit of parallel work and communication. TCC simplifies the development of correct parallel code because hardware provides transaction atomicity and ordering. Nevertheless, the programmer or a dynamic compiler must still optimize the parallel code for performance.This paper presents TAPE, a hardware and software infrastructure for profiling in TCC systems. TAPE extends the hardware for transactional execution to identify performance impediments such as dependence violations, buffer overflows, and work imbalance. It filters infrequent events to reduce resource requirements and allows the programmer to focus on the most important bottlenecks. We demonstrate that TAPE introduces minimal die area and performance overhead and can be used continuously, even for production runs. Moreover, we demonstrate how to leverage the profiling information to guide optimization for a set of parallel applications. TAPE accurately identifies the source code location and type of the most important bottlenecks, allowing a programmer to achieve maximum parallel speedup with a few profiling steps.