Evaluating kilo-instruction multiprocessors

Authors:
Marco Galluzzi; Ramón Beivide; Valentin Puente; José-Ángel Gregorio; Adrian Cristal; Mateo Valero
Affiliations:
DAC, UPC, Barcelona, Spain; ATC, UC, Santander, Spain; ATC, UC, Santander, Spain; ATC, UC, Santander, Spain; DAC, UPC, Barcelona, Spain; DAC, UPC, Barcelona, Spain
Venue:
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Year:
2004

Abstract

The ever-increasing gap between processor and memory speeds has a severe negative impact on performance. One possible solution to this problem is the Kilo-instruction processor, a recently proposed architecture able to hide large memory latencies by keeping thousands of instructions in flight. Current multiprocessor systems must also deal with this growing memory latency while facing another source of latency: communication among processors. In this paper, we propose using Kilo-instruction processors as computing nodes for small-scale CC-NUMA multiprocessors, and we evaluate the resulting systems, which we call Kilo-instruction Multiprocessors. These systems appear to achieve very good performance while exhibiting two interesting behaviours. First, the large number of in-flight instructions allows the system to hide not only the latencies of memory accesses but also the communication latencies inherent in remote memory accesses. Second, the significant pressure imposed by so many in-flight instructions translates into very high contention in the interconnection network, which indicates that more effort is needed in designing routers capable of handling high traffic levels.
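Why "thousands" of in-flight instructions? A rough answer comes from Little's law (items in flight = throughput × residence time): to keep retiring instructions at full rate while a miss is outstanding, the window must hold about IPC × latency instructions. The minimal Python sketch below makes that arithmetic concrete. The helper `required_in_flight`, the IPC of 4, and the latencies of 500 and 1000 cycles are illustrative assumptions, not figures from the paper. The sketch also ignores dependences and assumes the whole latency must be overlapped.

```python
"""Back-of-the-envelope instruction-window sizing via Little's law.

Illustrative assumptions (not from the paper): a core that would sustain
`ipc` instructions per cycle without stalls, and a memory latency of
`latency_cycles` cycles that must be fully overlapped.
"""


def required_in_flight(ipc: float, latency_cycles: int) -> int:
    """Hypothetical helper: in-flight instructions = throughput x latency.

    To keep retiring `ipc` instructions per cycle while an instruction
    waits `latency_cycles` cycles for memory, roughly ipc * latency_cycles
    instructions must be in flight at once.
    """
    if ipc <= 0 or latency_cycles < 0:
        raise ValueError("ipc must be positive and latency non-negative")
    return round(ipc * latency_cycles)


if __name__ == "__main__":
    # Hypothetical latencies: a local memory access, and a slower remote
    # (CC-NUMA) access that must also cross the interconnection network.
    for label, latency in [("local memory", 500), ("remote memory", 1000)]:
        window = required_in_flight(ipc=4, latency_cycles=latency)
        print(f"{label}: {latency} cycles -> about {window} in-flight instructions")
```

Under these assumptions, a remote access roughly doubles the required window. The same large window also lets more remote requests be outstanding at once, which is consistent with the higher network load the abstract reports.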