Maximizing CMP Throughput with Mediocre Cores
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Store Memory-Level Parallelism Optimizations for Commercial Applications
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
The RASE (Rapid, Accurate Simulation Environment) for chip multiprocessors
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Exploring the cache design space for large scale CMPs
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Overlapping dependent loads with addressless preload
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Long-latency branches: how much do they matter?
ACM SIGARCH Computer Architecture News
RMIS: middleware for transparent object-oriented modeling in multi-simulator systems
WSC '05 Proceedings of the 37th conference on Winter simulation
Software—Practice & Experience
Supporting microthread scheduling and synchronisation in CMPs
International Journal of Parallel Programming
Fairness and Throughput in Switch on Event Multithreading
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Fairness enforcement in switch on event multithreading
ACM Transactions on Architecture and Code Optimization (TACO)
Evaluating design tradeoffs in on-chip power management for CMPs
ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Hiding the misprediction penalty of a resource-efficient high-performance processor
ACM Transactions on Architecture and Code Optimization (TACO)
The shared-thread multiprocessor
Proceedings of the 22nd annual international conference on Supercomputing
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Improving error tolerance for multithreaded register files
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Evaluation of the SUN UltraSparc T2+ Processor for Computational Science
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Transparent multi-core cryptographic support on Niagara CMT Processors
IWMSE '09 Proceedings of the 2009 ICSE Workshop on Multicore Software Engineering
Allocation wall: a limiting factor of Java applications on emerging multi-core platforms
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Real-time 3-D ultrasound scan conversion using a multicore processor
IEEE Transactions on Information Technology in Biomedicine - Special section on biomedical informatics
Finding representative workloads for computer system design
Finding representative workloads for computer system design
The Journal of Supercomputing
Scaling power/ground solvers on multi-core with memory bandwidth awareness
Proceedings of the 20th symposium on Great lakes symposium on VLSI
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
SCMP architecture: an asymmetric multiprocessor system-on-chip for dynamic applications
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies
Evaluating OpenMP on chip multithreading platforms
IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Coterminous locality and coterminous group data prefetching on chip-multiprocessors
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Performance evaluation of a chip-multithreading server for high performance computing applications
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
PMA: Pixel-based multi-anchor algorithm for image recognition on multi-core systems
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Performance and power aware CMP thread allocation modeling
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Trends and challenges in operating systems---from parallel computing to cloud computing
Concurrency and Computation: Practice & Experience
Return data interleaving for multi-channel embedded CMPs systems
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Apple-CORE: Harnessing general-purpose many-cores with hardware concurrency management
Microprocessors & Microsystems
Accelerating sequential programs on commodity multi-core processors
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Chip Multi-Threaded (CMT) processors provide support for many simultaneous hardware threads of execution in various ways, including Simultaneous Multithreading (SMT) and Chip Multiprocessing (CMP). CMT processors are especially suited to server workloads, which generally have high levels of Thread-Level Parallelism (TLP). In this paper, we describe the evolution of CMT chips in industry and highlight the pervasiveness of CMT designs in upcoming general-purpose processors. The CMT design space accommodates a range of designs between the extremes represented by the SMT and CMP designs and a variety of attractive design options are currently unexplored. Though there has been extensive research on utilizing multiple hardware threads to speed up single-threaded applications via speculative parallelization, there are many challenges in designing CMT processors, even when sufficient TLP is present. This paper describes some of these challenges including, hot sets, hot banks, speculative prefetching strategies, request prioritization and off-chip bandwidth reduction.