Journal of Parallel and Distributed Computing
As heterogeneous multicore processors become more widespread, many options are emerging for producing efficient parallel code for them. Although parallel programming languages are improving, manually partitioning computations and data across heterogeneous processing resources is proving extraordinarily difficult. Further, it is increasingly important to consider locality when producing parallel code, because data transport is a primary source of performance overhead and energy consumption. To address these problems, we present a hierarchical, multi-level heterogeneous processor called the Passive/Active Multicore (PAM) and propose a novel model for extracting parallel computations from sequential code for it. The computations take the form of short, fine-grained threads, which are generated with consideration of locality through cache profiling and can migrate from core to core up through the memory hierarchy based on the location of their operands. Experimental results across integer- and floating-point-intensive standard and scientific workloads show that the architecture, execution model, and computational extraction techniques together offer computational offloads of up to 24% (5.8% on average). Through simulation, we estimate that these offloads may translate into speedups of up to 19% (4.0% on average) and that negative effects on performance are negligible. Floating-point applications appear to benefit most from these techniques.