Optimizing thread throughput for multithreaded workloads on memory constrained CMPs

Authors:
Major Bhadauria;Sally A. McKee
Affiliations:
Cornell University, Ithaca, NY, USA;Cornell University, Ithaca, NY, USA
Venue:
Proceedings of the 5th conference on Computing frontiers
Year:
2008


Abstract

Multi-core designs have become the industry imperative, replacing our reliance on increasingly complicated micro-architectural designs and VLSI improvements to deliver increased performance at lower power budgets. Performance of these multi-core chips will be limited by the DRAM memory system: we demonstrate this by modeling a cycle-accurate DDR2 memory controller with SPLASH-2 workloads. Surprisingly, benchmarks that appear to scale well with the number of processors fail to do so when memory is accurately modeled. We frequently find that the most efficient configuration is not the one with the most threads. By choosing the most efficient number of threads for each benchmark, energy-delay efficiency improves by a factor of 3.39 and performance improves by 19.7%, on average. We also introduce a shadow row of sense amplifiers, an alternative to cached DRAM, to explore potential power/performance impacts. The shadow row works in conjunction with the L2 cache to leverage temporal and spatial locality across memory accesses, attaining average and peak speedups of 13% and 43%, respectively, compared to a state-of-the-art DRAM memory scheduler.
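
The idea of choosing the most efficient thread count per benchmark can be illustrated with a minimal sketch. It assumes the efficiency metric is the plain energy-delay product (energy times execution time); the abstract does not give the exact formula. The function names and the measurement values below are hypothetical and made up for illustration; they are not results from the paper.

```python
from typing import Dict, Tuple


def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Assumed metric: energy multiplied by execution time (lower is better)."""
    return energy_joules * delay_seconds


def best_thread_count(measurements: Dict[int, Tuple[float, float]]) -> int:
    """Return the thread count with the lowest energy-delay product.

    `measurements` maps a thread count to (energy_joules, delay_seconds).
    """
    return min(measurements, key=lambda n: energy_delay_product(*measurements[n]))


def fastest_thread_count(measurements: Dict[int, Tuple[float, float]]) -> int:
    """Return the thread count with the shortest execution time."""
    return min(measurements, key=lambda n: measurements[n][1])


# Hypothetical numbers for one benchmark, for illustration only.
example = {
    1: (100.0, 10.0),
    2: (110.0, 6.0),
    4: (130.0, 4.0),
    8: (180.0, 3.5),
}

print("most efficient thread count:", best_thread_count(example))
print("fastest thread count:", fastest_thread_count(example))
```

With these made-up inputs the most efficient configuration uses fewer threads than the fastest one, which is the kind of outcome the abstract describes when memory contention limits scaling.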