The effect of sharing on the cache and bus performance of parallel programs
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
MemSpy: analyzing memory system bottlenecks in programs
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Cache Invalidation Patterns in Shared-Memory Multiprocessors
IEEE Transactions on Computers
Global optimizations for parallelism and locality on scalable parallel machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Simulation of multiprocessors: accuracy and performance
Simulation of multiprocessors: accuracy and performance
The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Managing pages in shared virtual memory systems: getting the compiler into the game
ICS '93 Proceedings of the 7th international conference on Supercomputing
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Analyzing and tuning memory performance in sequential and parallel programs
Analyzing and tuning memory performance in sequential and parallel programs
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Evaluating the impact of advanced memory systems on compiler-parallelized codes
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
False Sharing and Spatial Locality in Multiprocessor Caches
IEEE Transactions on Computers
Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs
IEEE Transactions on Parallel and Distributed Systems
The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor
IEEE Transactions on Parallel and Distributed Systems
The effectiveness of caches and data prefetch buffers in large-scale shared memory multiprocessors
The effectiveness of caches and data prefetch buffers in large-scale shared memory multiprocessors
The effectiveness of caches and data prefetch buffers in large-scale shared memory multiprocessors
The effectiveness of caches and data prefetch buffers in large-scale shared memory multiprocessors
Extended design reuse trade-offs in hardware-software architecture mapping
CODES '00 Proceedings of the eighth international workshop on Hardware/software codesign
Flexible hardware acceleration for multimedia oriented microprocessors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Power and Speed-Efficient Code Transformation of Video Compression Algorithms for RISC Processors
Journal of VLSI Signal Processing Systems - Special issue on multimedia signal processing
Power-efficient flexible processor architecture for embedded applications
IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on the 2001 international conference on computer design (ICCD)
Embedded Systems Design
Hi-index | 0.00 |
Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from standard benchmark suites, we measure statistics such as speedups, synchronization and load imbalance, causes of cache misses, cache line utilization, data traffic, and memory costs.This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512 byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers, as well as an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate- scale multiprocessors.