The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51% versus data-incognizant partitioning, and achieves an average speedup of 30% over a single core processor.
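The profile-then-partition flow described above can be illustrated with a minimal sketch. This is not the paper's algorithm; it is a hypothetical greedy placement where `affinity` stands in for profiled co-access counts between memory operations, `working_set` for their profiled footprints, and each operation is assigned to the data-cache partition with the highest accumulated affinity that still has capacity:

```python
def partition_accesses(affinity, working_set, num_caches, capacity):
    """Greedily assign memory operations to data-cache partitions.

    affinity:    dict[(op_a, op_b)] -> profiled co-access count
    working_set: dict[op] -> estimated footprint in bytes
    (All names and the greedy heuristic are illustrative, not the
    partitioner proposed in the paper.)
    """
    placement = {}              # op -> cache index
    load = [0] * num_caches     # bytes placed per cache
    # Place large footprints first, while capacity is still available.
    for op in sorted(working_set, key=working_set.get, reverse=True):
        # Accumulate this op's affinity toward each partition based on
        # where its profiled partners have already been placed.
        gain = [0] * num_caches
        for (a, b), w in affinity.items():
            if a == op and b in placement:
                gain[placement[b]] += w
            elif b == op and a in placement:
                gain[placement[a]] += w
        # Prefer the highest-affinity partition that still fits.
        for c in sorted(range(num_caches), key=lambda i: gain[i],
                        reverse=True):
            if load[c] + working_set[op] <= capacity:
                placement[op] = c
                load[c] += working_set[op]
                break
        else:
            # Nothing fits: spill to the least-loaded partition.
            c = min(range(num_caches), key=load.__getitem__)
            placement[op] = c
            load[c] += working_set[op]
    return placement
```

For example, two loads with high mutual affinity end up in the same cache partition, while a weakly related store is pushed to another partition once the first one is full, which is the "proactive dispersal" effect the abstract refers to.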