Active messages: a mechanism for integrated communication and computation
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Missing the memory wall: the case for processor/memory integration
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Active pages: a computation model for intelligent memory
Proceedings of the 25th annual international symposium on Computer architecture
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Performance of image and video processing with general-purpose processors and media ISA extensions
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Microservers: a new memory semantics for massively parallel computing
ICS '99 Proceedings of the 13th international conference on Supercomputing
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Computational RAM: Implementing Processors in Memory
IEEE Design & Test
Subword Parallelism with MAX-2
IEEE Micro
IEEE Micro
Parallelizing Applications into Silicon
FCCM '99 Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Impulse: Building a Smarter Memory Controller
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
FlexRAM: Toward an Advanced Intelligent Memory System
ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design
Gilgamesh: a multithreaded processor-in-memory architecture for petaflops computing
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems
IEEE Transactions on Computers
Design and Optimization of Large Size and Low Overhead Off-Chip Caches
IEEE Transactions on Computers
Data forwarding through in-memory precomputation threads
Proceedings of the 18th annual international conference on Supercomputing
Memory in processor: a novel design paradigm for supercomputing architectures
MEDEA '03 Proceedings of the 2003 workshop on MEmory performance: DEaling with Applications , systems and architecture
Superword-Level Parallelism in the Presence of Control Flow
Proceedings of the international symposium on Code generation and optimization
A Prototype Processing-In-Memory (PIM) Chip for the Data-Intensive Architecture (DIVA) System
Journal of VLSI Signal Processing Systems
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 14 - Volume 15
A low cost, multithreaded processing-in-memory system
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
SCIMA-SMP: on-chip memory processor architecture for SMP
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
PIM lite: a multithreaded processor-in-memory prototype
GLSVLSI '05 Proceedings of the 15th ACM Great Lakes symposium on VLSI
2 Gbps SerDes design based on IBM Cu-11 (130nm) standard cell technology
GLSVLSI '06 Proceedings of the 16th ACM Great Lakes symposium on VLSI
Processing-in-memory technology for knowledge discovery algorithms
DaMoN '06 Proceedings of the 2nd international workshop on Data management on new hardware
ALP: Efficient support for all levels of parallelism for complex media applications
ACM Transactions on Architecture and Code Optimization (TACO)
Future generation supercomputers I: a paradigm for node architecture
ACM SIGARCH Computer Architecture News - Special issue: ALPS '07---advanced low power systems
Communications of the ACM - Web science
Destructive-read in embedded DRAM, impact on power consumption
Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
Evaluating compiler technology for control-flow optimizations for multimedia extension architectures
Microprocessors & Microsystems
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
Empirical study for optimization of power-performance with on-chip memory
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Preliminary design examination of the ParalleX system from a software and hardware perspective
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Cache write-back schemes for embedded destructive-read DRAM
ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
HiPC'05 Proceedings of the 12th international conference on High Performance Computing
A new perspective on processing-in-memory architecture design
Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Hi-index | 0.01 |
The DIVA (Data IntensiVe Architecture) system incorporates a collection of Processing-In-Memory (PIM) chips as smart-memory co-processors to a conventional microprocessor. We have recently fabricated prototype DIVA PIMs. These chips represent the first smart-memory devices designed to support virtual addressing and capable of executing multiple threads of control. In this paper, we describe the prototype PIM architecture. We emphasize three unique features of DIVA PIMs, namely, the memory interface to the host processor, the 256-bit wide datapaths for exploiting on-chip bandwidth, and the address translation unit. We present detailed simulation results on eight benchmark applications. When just a single PIM chip is used, we achieve an average speedup of 3.3X over host-only execution, due to lower memory stall times and increased fine-grain parallelism. These 1-PIM results suggest that a PIM-based architecture with many such chips yields significantly higher performance than a multiprocessor of a similar scale and at a much reduced hardware cost.