An analytical model to exploit memory task scheduling

Authors:
Hsiang-Yun Cheng;Jian Li;Chia-Lin Yang
Affiliations:
National Taiwan University, Taipei, Taiwan, R.O.C. and IBM Austin Research Laboratory, Austin, TX;IBM Austin Research Laboratory, Austin, TX;National Taiwan University, Taipei, Taiwan, R.O.C.
Venue:
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Year:
2010

Abstract

The memory wall is a well-known obstacle to processor performance improvement, and the advent of many-core processors will further exacerbate the problem. Efficient memory task scheduling is therefore an important means of sustaining performance growth. In this paper, we first develop an analytical model that captures the essence of on-chip computation and off-chip communication as expressed in the stream programming model. The model estimates the potential speedup achievable by restricting the number of simultaneous memory tasks to reduce memory bandwidth contention. We then corroborate the analytical model with experimental results from task scheduling on real hardware. The correlation between the analytical and experimental results offers both insight into the benchmarks running on the hardware and opportunities to extend the analytical model. Our results show that restricting the number of simultaneous memory tasks achieves up to a 60% performance improvement on a pool of synthetic workloads.
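
The abstract does not give the model's equations, so the following is only a toy illustration of the general idea, not the authors' model. It assumes a purely memory-bound workload in which throughput is proportional to aggregate effective bandwidth, and a hypothetical linear contention penalty `alpha` per additional concurrent memory task. All names and parameter values (`per_task_bw`, `peak_bw`, `alpha`, the 4 and 16 GB/s figures) are invented for the sketch.

```python
def effective_bandwidth(k, per_task_bw, peak_bw, alpha):
    """Aggregate bandwidth (GB/s) with k concurrent memory tasks.

    Assumption: demand grows linearly until the peak is reached, and every
    extra concurrent task adds a contention penalty `alpha` (hypothetical).
    """
    raw = min(k * per_task_bw, peak_bw)
    return raw / (1.0 + alpha * (k - 1))


def best_limit(n_tasks, per_task_bw, peak_bw, alpha):
    """Return (k, speedup): the concurrency limit that maximizes effective
    bandwidth, and its speedup over running all n_tasks memory tasks at once."""
    best_k = max(
        range(1, n_tasks + 1),
        key=lambda k: effective_bandwidth(k, per_task_bw, peak_bw, alpha),
    )
    unrestricted = effective_bandwidth(n_tasks, per_task_bw, peak_bw, alpha)
    restricted = effective_bandwidth(best_k, per_task_bw, peak_bw, alpha)
    return best_k, restricted / unrestricted


if __name__ == "__main__":
    # Hypothetical machine: 8 memory tasks, 4 GB/s each, 16 GB/s peak.
    k, speedup = best_limit(n_tasks=8, per_task_bw=4.0, peak_bw=16.0, alpha=0.1)
    print(f"best limit: {k} concurrent memory tasks, speedup {speedup:.2f}x")
```

Under these assumptions the benefit comes entirely from the contention penalty: with `alpha=0` the sketch reports no speedup from restricting concurrency, which is why the choice of contention model matters when comparing against measurements on real hardware.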