Memory hierarchies are a key component in obtaining high performance on modern microprocessors. To satisfy the ever-increasing demand for higher data access rates, they are also becoming increasingly complex: multilevel caches, non-blocking caches, sophisticated instructions for prefetching and cache control, etc. Although these advanced features promise large performance gains, in some cases they also produce performance "anomalies" (i.e. bad performance triggered by specific code patterns). To locate and understand these anomalies precisely, a new set of microbenchmarks called WBTK is introduced. Systematic experimentation on Alpha 21264, Power4 and Itanium1 shows, first, that this microbenchmark set allowed us to detect most of the anomalies encountered on simple BLAS1-type codes. Second, it led us to demonstrate that vectorization of memory accesses is an efficient workaround for most of these anomalies.
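To make the last point concrete, the following is a minimal sketch of what "vectorization of memory access" could look like on a BLAS1-type kernel, under one reading of the term: the loads of each operand are issued as a group before any arithmetic or store, rather than being interleaved element by element. This is an illustration only, not the WBTK code. The function names `daxpy_scalar`, `daxpy_grouped` and `check_daxpy_variants`, and the block size `BLOCK`, are hypothetical and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size for grouped accesses; not taken from the paper. */
#define BLOCK 8

/* Reference BLAS1 kernel: y[i] += a * x[i].
 * Loads of x and y and the store to y are interleaved per element. */
void daxpy_scalar(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Same computation with memory accesses grouped per block:
 * all loads of x, then all loads of y, then the arithmetic and stores.
 * Assumption: this grouping is what is meant by vectorized memory access. */
void daxpy_grouped(size_t n, double a, const double *x, double *y)
{
    double xb[BLOCK], yb[BLOCK];
    size_t i = 0;

    for (; i + BLOCK <= n; i += BLOCK) {
        for (size_t k = 0; k < BLOCK; k++)   /* grouped loads of x */
            xb[k] = x[i + k];
        for (size_t k = 0; k < BLOCK; k++)   /* grouped loads of y */
            yb[k] = y[i + k];
        for (size_t k = 0; k < BLOCK; k++)   /* compute and grouped stores */
            y[i + k] = yb[k] + a * xb[k];
    }
    for (; i < n; i++)                       /* remainder, element by element */
        y[i] += a * x[i];
}

/* Sanity check: both variants must produce identical results.
 * Integer-valued inputs keep the floating-point results exact. */
void check_daxpy_variants(void)
{
    enum { N = 21 };
    double x[N], y1[N], y2[N];

    for (size_t i = 0; i < N; i++) {
        x[i] = (double)i;
        y1[i] = y2[i] = 1.0;
    }
    daxpy_scalar(N, 2.0, x, y1);
    daxpy_grouped(N, 2.0, x, y2);
    for (size_t i = 0; i < N; i++)
        assert(y1[i] == y2[i]);
}
```

A microbenchmark in this spirit would time both variants over a range of array sizes and alignments on each target machine. Whether the grouped form helps depends on the compiler and the memory system, so the sketch makes no performance claim.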