Performance evaluation of memory consistency models for shared-memory multiprocessors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
An architecture for software-controlled data prefetching
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
Hiding memory latency using dynamic scheduling in shared-memory multiprocessors
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Simultaneous subordinate microthreading (SSMT)
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Delaying physical register allocation through virtual-physical registers
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
A new switch chip for IBM RS/6000 SP systems
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
ACM Computing Surveys (CSUR)
Execution-based prediction using speculative slices
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Using a user-level memory thread for correlation prefetching
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic speculative precomputation
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Journal of Parallel and Distributed Computing
Spider: A High-Speed Network Interconnect
IEEE Micro
Cost-Effective Compiler Directed Memory Prefetching and Bypassing
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
On the Design of a High-Performance Adaptive Router for CC-NUMA Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Computer Architecture: A Quantitative Approach
Computer Architecture: A Quantitative Approach
The Alpha 21364 Network Architecture
HOTI '01 Proceedings of the Ninth Symposium on High Performance Interconnects
Speculative Data-Driven Multithreading
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
A Flow Control Mechanism to Avoid Message Deadlock in k-ary n-cube Networks
HIPC '97 Proceedings of the Fourth International Conference on High-Performance Computing
Scalable Hardware Memory Disambiguation for High ILP Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Out-of-Order Commit Processors
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
A case for resource-conscious out-of-order processors
IEEE Computer Architecture Letters
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Toward kilo-instruction processors
ACM Transactions on Architecture and Code Optimization (TACO)
Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Chip multi-processor scalability for single-threaded applications
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Implicit transactional memory in kilo-instruction multiprocessors
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
The ever-increasing gap between processor and memory speed, sometimes referred to as the Memory Wall problem [42], has a very negative impact on performance. This mismatch will become more severe in future processor generations, and modern cache organizations and prefetching techniques will not be able to solve it. A novel and promising technique for dealing with the Memory Wall consists of designing processors able to maintain thousands of in-flight instructions. Processors of this kind have been called Kilo-instruction processors [8]. When running numerical applications, Kilo-instruction processors have demonstrated their ability to maintain high IPC values as memory latencies increase.

In this paper we study, for the first time, the influence of Kilo-instruction processors on the performance of small-scale CC-NUMA multiprocessors. Our first results, using an ideal network, show the enormous potential of Kilo-instruction processors used as computing nodes, hiding not only local DRAM latencies but also remote ones. A deeper analysis, using realistic networks, reveals that each node places heavy demands on packet throughput, since larger re-order buffers translate into a higher density of remote accesses. We then show that current interconnection networks cannot cope with these high traffic levels, so newer and faster networks have to be designed. In short, our results show dramatic performance gains over multiprocessors based on current microprocessors and point to a possible way to build future shared-memory multiprocessor systems.