The Stanford Dash Multiprocessor
Computer
Cache Invalidation Patterns in Shared-Memory Multiprocessors
IEEE Transactions on Computers
The KSR1: experimentation and modeling of poststore
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Data forwarding in scalable shared-memory multiprocessors
ICS '95 Proceedings of the 9th international conference on Supercomputing
Comparing data forwarding and prefetching for communication-induced misses in shared-memory MPs
ICS '98 Proceedings of the 12th international conference on Supercomputing
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Proceedings of the 32nd annual international symposium on Computer Architecture
Store-Ordered Streaming of Shared Memory
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Design and Implementation of a Compiler Framework for Helper Threading on Multi-core Processors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Thread-associative memory for multicore and multithreaded computing
Proceedings of the 2006 international symposium on Low power electronics and design
Cachier: A Tool for Automatically Inserting CICO Annotations
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 02
CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proactive Use of Shared L3 Caches to Enhance Cache Communications in Multi-Core Processors
IEEE Computer Architecture Letters
Hi-index | 0.00 |
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.