The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Cache-conscious data placement
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Cache-conscious structure layout
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Cache-conscious structure definition
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Automated data-member layout of heap objects to improve memory-hierarchy performance
ACM Transactions on Programming Languages and Systems (TOPLAS)
An efficient profile-analysis framework for data-layout optimizations
POPL '02 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
HP Caliper: A Framework for Performance Analysis Tools
IEEE Concurrency
False Sharing and Spatial Locality in Multiprocessor Caches
IEEE Transactions on Computers
Cautions, Machine-Independent Performance Tuning for Shared-Memory Multiprocessors
Euro-Par '96 Proceedings of the Second International Euro-Par Conference on Parallel Processing - Volume I
Improving Cache Behavior of Dynamically Allocated Data Structures
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
SYZYGY - A Framework for Scalable Cross-Module IPO
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Array regrouping and structure splitting using whole-program reference affinity
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Practical Structure Layout Optimization and Advice
Proceedings of the International Symposium on Code Generation and Optimization
Cache-conscious coallocation of hot data streams
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Whole-program optimization of global variable layout
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Polymorphing Software by Randomizing Data Structure Layout
DIMVA '09 Proceedings of the 6th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment
A compiler framework for general memory layout optimizations targeting structures
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Tackling cache-line stealing effects using run-time adaptation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Trace-Based data layout optimizations for multi-core processors
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Hi-index | 0.00 |
Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multithreaded applications that optimizes both for improved spatial locality and reduced false sharing, simultaneously. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and outputs the key factors contributing to the layout decisions. We apply this tool on the HP-UX kernel and demonstrate the effects of these transformations for a variety of already highly hand-tuned key structures with different set of properties. We show that na篓ýve heuristics can result in massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline.