An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Design methodology for a modular service-driven network processor architecture
Computer Networks: The International Journal of Computer and Telecommunications Networking - Network processors
Interface Design Techniques for Single-Chip Systems
VLSID '03 Proceedings of the 16th International Conference on VLSI Design
Guided region prefetching: a cooperative hardware/software approach
Proceedings of the 30th annual international symposium on Computer architecture
Guaranteeing the quality of services in networks on chip
Networks on chip
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence
Proceedings of the 32nd annual international symposium on Computer Architecture
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors
Proceedings of the 32nd annual international symposium on Computer Architecture
Fast and fair: data-stream quality of service
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Maximizing CMP Throughput with Mediocre Cores
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Design and analysis of an NoC architecture from performance, reliability and energy perspective
Proceedings of the 2005 ACM symposium on Architecture for networking and communications systems
The RASE (Rapid, Accurate Simulation Environment) for chip multiprocessors
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
A chip prototyping substrate: the flexible architecture for simulation and testing (FAST)
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Power-performance considerations of parallel computing on chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Core architecture optimization for heterogeneous chip multiprocessors
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A flexible data to L2 cache mapping approach for future multicore processors
Proceedings of the 2006 workshop on Memory system performance and correctness
Supporting microthread scheduling and synchronisation in CMPs
International Journal of Parallel Programming
Design space exploration for multicore architectures: a power/performance/thermal view
Proceedings of the 20th annual international conference on Supercomputing
Coherence Ordering for Ring-based Chip Multiprocessors
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread
Microprocessors & Microsystems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Tradeoffs in designing accelerator architectures for visual computing
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Optimized Pipelined Parallel Merge Sort on the Cell BE
Euro-Par 2008 Workshops - Parallel Processing
Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Proceedings of the 36th annual international symposium on Computer architecture
Area-efficiency in CMP core design: co-optimization of microarchitecture and physical design
ACM SIGARCH Computer Architecture News
Reusability-aware cache memory sharing for chip multiprocessors with private L2 caches
Journal of Systems Architecture: the EUROMICRO Journal
The SKB: a semi-completely-connected bus for on-chip systems
NPC'07 Proceedings of the 2007 IFIP international conference on Network and parallel computing
Moguls: a model to explore the memory hierarchy for bandwidth improvements
Proceedings of the 38th annual international symposium on Computer architecture
L2-Cache hierarchical organizations for multi-core architectures
ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Bandwidth-aware reconfigurable cache design with hybrid memory technologies
Proceedings of the International Conference on Computer-Aided Design
A hybrid hardware/software generated prefetching thread mechanism on chip multiprocessors
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
A memory bandwidth effective cache store miss policy
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
A high performance adaptive miss handling architecture for chip multiprocessors
Transactions on High-Performance Embedded Architectures and Compilers IV
Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
The dynamic granularity memory system
Proceedings of the 39th Annual International Symposium on Computer Architecture
Scalability-based manycore partitioning
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
ACM Transactions on Computer Systems (TOCS)
Studying multicore processor scaling via reuse distance analysis
Proceedings of the 40th Annual International Symposium on Computer Architecture
Journal of Parallel and Distributed Computing
Asymmetric scaling on network packet processors in the dark silicon era
ANCS '13 Proceedings of the ninth ACM/IEEE symposium on Architectures for networking and communications systems
Hi-index | 0.01 |
Abstract: In this paper, we study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and 35nm, reducing the number of cores that would otherwise be possible.