Models for generating locality-tuned traveling threads for a hierarchical multi-level heterogeneous multicore

Authors:
Patrick Anthony La Fratta; Peter M. Kogge
Affiliations:
University of Notre Dame, Notre Dame, IN, USA; University of Notre Dame, Notre Dame, IN, USA
Venue:
Proceedings of the 7th ACM international conference on Computing frontiers
Year:
2010

Abstract

As heterogeneous multicore processors become more widespread, many options are emerging for producing efficient parallel code for them. Although parallel programming languages are improving, manually partitioning computations and data across heterogeneous processing resources remains extraordinarily difficult. Further, locality is increasingly important when producing parallel code, because data transport is a primary source of performance overhead and energy consumption. To address these problems, we present a hierarchical multi-level heterogeneous processor called the Passive/Active Multicore (PAM) and propose a novel model for extracting parallel computations from sequential code for it. The computations take the form of short, fine-grained threads that are generated with locality in mind through cache profiling and that can migrate from core to core up through the memory hierarchy based on the location of their operands. Experimental results across integer- and floating-point-intensive standard and scientific workloads show that the architecture, execution model, and computational extraction techniques together offer computational offloads of up to 24% (5.8% on average). Through simulation, we estimate that these offloads may translate into speedups of up to 19% (4.0% on average) and that negative effects on performance are negligible. Floating-point applications appear to benefit most from these techniques.
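The migration idea in the abstract, a short thread moving up the memory hierarchy toward wherever its operands reside, can be illustrated with a toy model. The sketch below is not the PAM design or the paper's thread-generation algorithm, neither of which the abstract specifies. The level names (`L1`, `L2`, `DRAM`), the `TravelingThread` class, the `run` function, and the rule that a thread only ever moves upward are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical level names; the abstract only says the hierarchy is multi-level.
LEVELS = ["L1", "L2", "DRAM"]


@dataclass
class TravelingThread:
    operands: list                            # addresses the thread must read, in order
    level: int = 0                            # index into LEVELS where the thread currently runs
    hops: list = field(default_factory=list)  # levels the thread migrated to


def run(thread, resident):
    """Process each operand, migrating upward while the operand is absent.

    `resident` maps a level name to the set of addresses held at that level.
    The last level is assumed to hold every address, so it is never checked.
    In this sketch a thread never moves back down (an assumption).
    """
    top = len(LEVELS) - 1
    for addr in thread.operands:
        while thread.level < top and addr not in resident[LEVELS[thread.level]]:
            thread.level += 1
            thread.hops.append(LEVELS[thread.level])
    return LEVELS[thread.level]


if __name__ == "__main__":
    resident = {"L1": {1, 2}, "L2": {1, 2, 3}, "DRAM": set()}
    t = TravelingThread(operands=[1, 3, 2])
    print(run(t, resident), t.hops)
```

In this toy model, a thread whose operands all sit in the nearest level never moves, while one that misses there is carried to the first level that holds its data instead of pulling the data down to the core.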