Power-driven Design of Router Microarchitectures in On-chip Networks
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A NUCA substrate for flexible CMP cache sharing
Proceedings of the 19th annual international conference on Supercomputing
Design tradeoffs for tiled CMP on-chip networks
Proceedings of the 20th annual international conference on Supercomputing
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Flattened Butterfly Topology for On-Chip Networks
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
A case for bufferless routing in on-chip networks
Proceedings of the 36th annual international symposium on Computer architecture
Low-cost router microarchitecture for on-chip networks
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
TurboTag: lookup filtering to reduce coherence directory power
Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design
ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration
Proceedings of the Conference on Design, Automation and Test in Europe
Throughput-Effective On-Chip Networks for Manycore Accelerators
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees
Proceedings of the 38th annual international symposium on Computer architecture
Clearing the clouds: a study of emerging scale-out workloads on modern hardware
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers
NOCS '12 Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip
Proceedings of the 39th Annual International Symposium on Computer Architecture
Proceedings of the 40th Annual International Symposium on Computer Architecture
Designing on-chip networks for throughput accelerators
ACM Transactions on Architecture and Code Optimization (TACO)
Jigsaw: scalable software-defined caches
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
Scale-out server workloads benefit from many-core processor organizations that enable high throughput thanks to abundant request-level parallelism. A key characteristic of these workloads is the large instruction footprint that exceeds the capacity of private caches. While a shared last-level cache (LLC) can capture the instruction working set, it necessitates a low-latency interconnect fabric to minimize the core stall time on instruction fetches serviced by the LLC. Many-core processors with a mesh interconnect sacrifice performance on scale-out workloads due to NOC-induced delays. Low-diameter topologies can overcome the performance limitations of meshes through rich inter-node connectivity, but at a high area expense. To address the drawbacks of existing designs, this work introduces NOC-Out--a many-core processor organization that affords low LLC access delays at a small area cost. NOC-Out is tuned to accommodate the bilateral core-to-cache access pattern, characterized by minimal coherence activity and lack of inter-core communication, that is dominant in scale-out workloads. Optimizing for the bilateral access pattern, NOC-Out segregates cores and LLC banks into distinct network regions and reduces costly network connectivity by eliminating the majority of inter-core links. NOC-Out further simplifies the interconnect through the use of low-complexity tree-based topologies. A detailed evaluation targeting a 64-core CMP and a set of scale-out workloads reveals that NOC-Out improves system performance by 17% and reduces network area by 28% over a tiled mesh-based design. Compared to a design with a richly-connected flattened butterfly topology, NOC-Out reduces network area by 9x while matching the performance.