Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Write buffer design for cache-coherent shared-memory multiprocessors
ICCD '95 Proceedings of the 1995 International Conference on Computer Design: VLSI in Computers and Processors
Reducing Memory Traffic Via Redundant Store Instructions
HPCN Europe '99 Proceedings of the 7th International Conference on High-Performance Computing and Networking
Design Issues and Tradeoffs for Write Buffers
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
SCIMA: Software Controlled Integrated Memory Architecture for High Performance Computing
ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
The Memory Bandwidth Bottleneck and its Amelioration by a Compiler
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Eliminating Useless Messages in Write-Update Protocols on Scalable Multiprocessors
Eliminating Useless Messages in Write-Update Protocols on Scalable Multiprocessors
Technology Independent Area and Delay Estimations for MicroprocessorBuilding Blocks
Technology Independent Area and Delay Estimations for MicroprocessorBuilding Blocks
Comparing memory systems for chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Flattened Butterfly Topology for On-Chip Networks
IEEE Computer Architecture Letters
Network Interface Sharing Techniques for Area Optimized NoC Architectures
DSD '08 Proceedings of the 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools
Application-aware prioritization mechanisms for on-chip networks
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 20th symposium on Great lakes symposium on VLSI
WAYPOINT: scaling coherence to thousand-core architectures
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Throughput-Effective On-Chip Networks for Manycore Accelerators
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Memory system performance in a NUMA multicore multiprocessor
Proceedings of the 4th Annual International Conference on Systems and Storage
Memory systems in the many-core era: challenges, opportunities, and solution directions
Proceedings of the international symposium on Memory management
Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees
Proceedings of the 38th annual international symposium on Computer architecture
A Micro Threading Based Concurrency Model for Parallel Computing
IPDPSW '11 Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
SPATL: Honey, I Shrunk the Coherence Directory
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
SCD: A scalable coherence directory with flexible sharer set encoding
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Why on-chip cache coherence is here to stay
Communications of the ACM
A Bandwidth-Optimized Multi-core Architecture for Irregular Applications
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
UTLEON3: Exploring Fine-Grain Multi-Threading in FPGAs
UTLEON3: Exploring Fine-Grain Multi-Threading in FPGAs
DSD '12 Proceedings of the 2012 15th Euromicro Conference on Digital System Design
Apple-CORE: Harnessing general-purpose many-cores with hardware concurrency management
Microprocessors & Microsystems
Hi-index | 0.00 |
When hardware cache coherence scales to many cores on chip, over saturated traffic of the shared memory system may offset the benefit from massive hardware concurrency. In this article, we investigate the cost of a write-update protocol in terms of on-chip memory network traffic and its adverse effects on the system performance based on a multithreaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure. We find that in the context of massive concurrency, by introducing a write-merging buffer with 0.46% area overhead to each core, applications with good locality and concurrency are boosted up by 18.74% in performance on average. Other applications also benefit from this addition and even achieve a throughput increase of 5.93%. In addition, this improvement indicates that higher levels of concurrency per core can be exploited without impacting performance, thus tolerating latency better and giving higher processor efficiencies compared to other solutions.