Missing the memory wall: the case for processor/memory integration

Authors:
Ashley Saulsbury;Fong Pong;Andreas Nowatzyk
Affiliations:
Swedish Institute of Computer Science;Sun Microsystems Computer Corporation;Sun Microsystems Computer Corporation
Venue:
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Year:
1996

Citing 10
Cited 46

A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems

ACM Transactions on Computer Systems (TOCS)
SPLASH: Stanford parallel applications for shared-memory

ACM SIGARCH Computer Architecture News
The design and analysis of DASH: a scalable directory-based multiprocessor

The design and analysis of DASH: a scalable directory-based multiprocessor
The detection and elimination of useless misses in multiprocessors

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Hitting the memory wall: implications of the obvious

ACM SIGARCH Computer Architecture News
The memory wall and the CMOS end-point

ACM SIGARCH Computer Architecture News
S-connect: from networks of workstations to supercomputer performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Exploiting Parallelism in Cache Coherency Protocol Engines

Euro-Par '95 Proceedings of the First International Euro-Par Conference on Parallel Processing
An argument for simple COMA

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture

An extended addressing mode for low power

ISLPED '97 Proceedings of the 1997 international symposium on Low power electronics and design
CP-PACS: a massively parallel processor for large scale scientific calculations

ICS '97 Proceedings of the 11th international conference on Supercomputing
Designing high bandwidth on-chip caches

Proceedings of the 24th annual international symposium on Computer architecture
The energy efficiency of IRAM architectures

Proceedings of the 24th annual international symposium on Computer architecture
DataScalar architectures

Proceedings of the 24th annual international symposium on Computer architecture
Optimizing the DRAM refresh count for merged DRAM/logic LSIs

ISLPED '98 Proceedings of the 1998 international symposium on Low power electronics and design
Functional Implementation Techniques for CPU Cache Memories

IEEE Transactions on Computers - Special issue on cache memory and related problems
A performance comparison of contemporary DRAM architectures

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Microservers: a new memory semantics for massively parallel computing

ICS '99 Proceedings of the 13th international conference on Supercomputing
Efficient management of memory hierarchies in embedded DRAM systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Data Locality Exploitation in the Decomposition of Regular Domain Problems

IEEE Transactions on Parallel and Distributed Systems
High Bandwidth On-Chip Cache Design

IEEE Transactions on Computers
High-Performance DRAMs in Workstation Environments

IEEE Transactions on Computers
Leveraging cache coherence in active memory systems

ICS '02 Proceedings of the 16th international conference on Supercomputing
The architecture of the DIVA processing-in-memory chip

ICS '02 Proceedings of the 16th international conference on Supercomputing
Avoiding initialization misses to the heap

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Two techniques for reconciling algorithm parallelism with memory constraints

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
Supporting parallel applications on clusters of workstations: The Virtual Communication Machine-based architecture

Cluster Computing
A Case for Intelligent RAM

IEEE Micro
Exploiting Instruction- and Data-Level Parallelism

IEEE Micro
Analytic Evaluation of Shared-Memory Architectures

IEEE Transactions on Parallel and Distributed Systems
Hardware Versus Software Implementation of COMA

ICPP '97 Proceedings of the international Conference on Parallel Processing
A Case Study of Load Distribution in Parallel View Frustum Culling and Collision Detection

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Memory Management in a PIM-Based Architecture

IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Active Memory Clusters: Efficient Multiprocessing on Commodity Clusters

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
The Hierarchical Multi-Bank DRAM: A High-Performance Architecture for Memory Integrated with Processors

ARVLSI '97 Proceedings of the 17th Conference on Advanced Research in VLSI (ARVLSI '97)
The Illinois Aggressive Coma Multiprocessor project (I-ACOMA)

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Hierarchical processors-and-memory architecture for high performance computing

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
A Case for Studying DRAM Issues at the System Level

IEEE Micro
Design and Optimization of Large Size and Low Overhead Off-Chip Caches

IEEE Transactions on Computers
Analysis and Modeling of Advanced PIM Architecture Design Tradeoffs

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A Prototype Processing-In-Memory (PIM) Chip for the Data-Intensive Architecture (DIVA) System

Journal of VLSI Signal Processing Systems
A low cost, multithreaded processing-in-memory system

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Memory-side prefetching for linked data structures for processor-in-memory systems

Journal of Parallel and Distributed Computing
SMP-SoC is the answer if you ask the right questions

SAICSIT '06 Proceedings of the 2006 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
Destructive-read in embedded DRAM, impact on power consumption

Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
A Token-Managed Admission Control System for QoS Provision on a Best-Effort GALS Interconnect

Fundamenta Informaticae - Application of Concurrency to System Design
A pattern based instruction encoding technique for high performance architectures

International Journal of High Performance Systems Architecture
Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Pinned to the walls: impact of packaging and application properties on the memory and power walls

Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
Cache write-back schemes for embedded destructive-read DRAM

ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
Active memory controller

The Journal of Supercomputing

Abstract

Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance.

This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor memory integration can be used to build competitive, scalable and cost-effective MP systems.

We present results from execution driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.