Without high-bandwidth broadcast, large numbers of cores require a scalable point-to-point interconnect and a directory protocol. In such cases, a shared, inclusive last-level cache (LLC) can improve data sharing and avoid three-way communication for shared reads. However, if inclusion encompasses thread-private data, two problems arise with the shared LLC. First, current memory allocators align stack bases on page boundaries, which becomes a source of severe conflict misses when data-parallel applications run with large numbers of threads. Second, correctness does not require private data to reside in the shared directory or the LLC. This paper advocates stack-base randomization, which eliminates the major source of conflict misses for large numbers of threads. This alone is not sufficient, however, when capacity becomes a limitation for the directory or the LLC. We therefore propose a non-inclusive, semi-coherent cache organization (NISC) that removes the inclusion requirement for private data and thereby reduces capacity misses. Our data-parallel benchmarks show that these limitations prevent scaling beyond 8 cores, while our techniques allow most benchmarks to scale to at least 32 cores. At 8 cores, stack randomization provides a mean speedup of 1.2X; at 32 cores it provides a speedup of 2.7X over the best baseline configuration. Compared to a conventional design with a 2 MB LLC, our technique achieves similar performance with only a 256 KB LLC, suggesting that LLCs may typically be overprovisioned. When LLC resources are very limited, NISC can further improve system performance by 1.8X.
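To make the first technique concrete, the sketch below shows one plausible realization of stack-base randomization in C under Linux/pthreads. It is illustrative only, not the paper's implementation: the names launch_worker() and worker(), the 64-byte line size, and the one-page skew range are assumptions, and the alignment requirements of pthread_attr_setstack() are implementation-defined.

#define _GNU_SOURCE          /* for MAP_ANONYMOUS / MAP_STACK on Linux */
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

#define STACK_SIZE (1 << 20)            /* 1 MB usable stack per thread */
#define LINE_SIZE  64                   /* assumed cache line size */
#define MAX_SKEW   (64 * LINE_SIZE)     /* randomize over one 4 KB page */

static void *worker(void *arg) {
    /* Hypothetical thread body: its stack-resident locals now map to
       different cache sets in each thread instead of colliding. */
    return NULL;
}

/* Hypothetical helper: start one worker on a stack whose base is offset
   by a random, cache-line-aligned skew instead of being page-aligned. */
static int launch_worker(pthread_t *tid) {
    size_t total = STACK_SIZE + MAX_SKEW;   /* reserve room for the skew */
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Random multiple of the line size, uniform over one page. */
    size_t skew = ((size_t)rand() % (MAX_SKEW / LINE_SIZE)) * LINE_SIZE;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, (char *)base + skew, STACK_SIZE);

    int rc = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    /* In real use, seed rand() once per run and munmap(base, total)
       after joining the thread; omitted here for brevity. */
    return rc;
}

Skewing within a single page leaves the per-thread stack size and TLB behavior unchanged while spreading each thread's stack frames across different cache sets, which is enough to break the pathological page-boundary alignment the abstract describes.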