A low cost, multithreaded processing-in-memory system

Authors:
Jay B. Brockman;Shyamkumar Thoziyoor;Shannon K. Kuntz;Peter M. Kogge
Affiliations:
University of Notre Dame, Notre Dame, IN;University of Notre Dame, Notre Dame, IN;University of Notre Dame, Notre Dame, IN;University of Notre Dame, Notre Dame, IN
Venue:
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Year:
2004

Citing 18
Cited 5

Can dataflow subsume von Neumann computing?

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Active messages: a mechanism for integrated communication and computation

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
TAM—a compiler controlled threaded abstract machine

Journal of Parallel and Distributed Computing - Special issue on dataflow and multithreaded architectures
The J-machine multicomputer: an architectural evaluation

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Missing the memory wall: the case for processor/memory integration

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Active pages: a computation model for intelligent memory

Proceedings of the 25th annual international symposium on Computer architecture
Comparison of Single- and Dual-Pass Multiply-Add Fused Floating-Point Units

IEEE Transactions on Computers
A design analysis of a hybrid technology multithreaded architecture for petaflops scale computation3

ICS '99 Proceedings of the 13th international conference on Supercomputing
Microservers: a new memory semantics for massively parallel computing

ICS '99 Proceedings of the 13th international conference on Supercomputing
Monsoon: an explicit token-store architecture

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The architecture of the DIVA processing-in-memory chip

ICS '02 Proceedings of the 16th international conference on Supercomputing
Processing in Memory: The Terasys Massively Parallel PIM Array

Computer
The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms

IEEE Micro
The Message Driven Processor: An Integrated Multicomputer Processing Element

ICCD '92 Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI in Computer & Processors
Gilgamesh: a multithreaded processor-in-memory architecture for petaflops computing

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Programming the FlexRAM parallel intelligent memory system

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
The Dynamic Associative Access Memory Chip and Its Application to SIMD Processing and Full-Text Database Retrieval

MTDT '99 Proceedings of the 1999 IEEE International Workshop on Memory Technology, Design, and Testing

Enhancing NIC Performance for MPI using Processing-in-Memory

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 9 - Volume 10
PIM lite: a multithreaded processor-in-memory prototype

GLSVLSI '05 Proceedings of the 15th ACM Great Lakes symposium on VLSI
Embedded floating-point units in FPGAs

Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays
Architectural modifications to enhance the floating-point performance of FPGAs

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
The challenges of massive on-chip concurrency

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper discusses die cost vs. performance tradeoffs for a PIM system that could serve as the memory system of a host processor. For an increase of less than twice the cost of a commodity DRAM part, it is possible to realize a performance speedup of nearly a factor of 4 on irregular applications. This cost efficiency derives from developing a custom multithreaded processor architecture and implementation style that is well-suited for embedding in a memory. Specifically, it takes advantage of the low latency and high row bandwidth to both simplify processor design --- reducing area --- as well as to improve processing throughput. To support our claims of cost and performance, we have used simulation, analysis of existing chips, and also designed and fully implemented a prototype chip, PIM Lite.