The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51% versus data-incognizant partitioning, and achieves an average speedup of 30% over a single core processor.
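The profile-then-partition flow described above can be illustrated with a minimal sketch. This is not the paper's algorithm; it is a hypothetical greedy placement where `affinity` stands in for profiled co-access counts between memory operations, `working_set` for their profiled footprints, and each operation is assigned to the data-cache partition with the highest accumulated affinity that still has capacity:

```python
def partition_accesses(affinity, working_set, num_caches, capacity):
    """Greedily assign memory operations to data-cache partitions.

    affinity:    dict[(op_a, op_b)] -> profiled co-access count
    working_set: dict[op] -> estimated footprint in bytes
    (All names and the greedy heuristic are illustrative, not the
    partitioner proposed in the paper.)
    """
    placement = {}              # op -> cache index
    load = [0] * num_caches     # bytes placed per cache
    # Place large footprints first, while capacity is still available.
    for op in sorted(working_set, key=working_set.get, reverse=True):
        # Accumulate this op's affinity toward each partition based on
        # where its profiled partners have already been placed.
        gain = [0] * num_caches
        for (a, b), w in affinity.items():
            if a == op and b in placement:
                gain[placement[b]] += w
            elif b == op and a in placement:
                gain[placement[a]] += w
        # Prefer the highest-affinity partition that still fits.
        for c in sorted(range(num_caches), key=lambda i: gain[i],
                        reverse=True):
            if load[c] + working_set[op] <= capacity:
                placement[op] = c
                load[c] += working_set[op]
                break
        else:
            # Nothing fits: spill to the least-loaded partition.
            c = min(range(num_caches), key=load.__getitem__)
            placement[op] = c
            load[c] += working_set[op]
    return placement
```

For example, two loads with high mutual affinity end up in the same cache partition, while a weakly related store is pushed to another partition once the first one is full, which is the "proactive dispersal" effect the abstract refers to.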