The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An optimal memory allocation scheme for scratch-pad-based embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Assigning Program and Data Objects to Scratchpad for Energy Reduction
Proceedings of the conference on Design, automation and test in Europe
MiBench: A free, commercially representative embedded benchmark suite
WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Dynamic allocation for scratch-pad memory using compile-time decisions
ACM Transactions on Embedded Computing Systems (TECS)
Efficient emulation of hardware prefetchers via event-driven helper threading
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Cell broadband engine architecture and its first implementation: a performance view
IBM Journal of Research and Development
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Identifying energy-efficient concurrency levels using machine learning
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
Hi-index | 0.00 |
This paper proposes the concept of performance balancing, and reports its performance impact on a Chip multiprocessor (CMP). Integrating multiple processor cores into a single chip, or CMPs, can achieve higher peak performance by means of exploiting thread level parallelism. However, the off-chip memory bandwidth which does not scale with the number of cores tends to limit the potential of CMPs. To solve this issue, the technique proposed in this paper attempts to make a good balance between computation and memorization. Unlike conventional parallel executions, this approach exploits some cores to improve the memory performance. These cores devote the on-chip memory hardware resources to the remaining cores executing the parallelized threads. In our evaluation, it is observed that our approach can achieve 31% of performance improvement compared to a conventional parallel execution model in the specified program.