ACM Transactions on Computer Systems (TOCS)
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
The design and analysis of DASH: a scalable directory-based multiprocessor
The design and analysis of DASH: a scalable directory-based multiprocessor
The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
The memory wall and the CMOS end-point
ACM SIGARCH Computer Architecture News
S-connect: from networks of workstations to supercomputer performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Exploiting Parallelism in Cache Coherency Protocol Engines
Euro-Par '95 Proceedings of the First International Euro-Par Conference on Parallel Processing
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
An extended addressing mode for low power
ISLPED '97 Proceedings of the 1997 international symposium on Low power electronics and design
CP-PACS: a massively parallel processor for large scale scientific calculations
ICS '97 Proceedings of the 11th international conference on Supercomputing
Designing high bandwidth on-chip caches
Proceedings of the 24th annual international symposium on Computer architecture
The energy efficiency of IRAM architectures
Proceedings of the 24th annual international symposium on Computer architecture
Proceedings of the 24th annual international symposium on Computer architecture
Optimizing the DRAM refresh count for merged DRAM/logic LSIs
ISLPED '98 Proceedings of the 1998 international symposium on Low power electronics and design
Functional Implementation Techniques for CPU Cache Memories
IEEE Transactions on Computers - Special issue on cache memory and related problems
A performance comparison of contemporary DRAM architectures
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Microservers: a new memory semantics for massively parallel computing
ICS '99 Proceedings of the 13th international conference on Supercomputing
Efficient management of memory hierarchies in embedded DRAM systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Data Locality Exploitation in the Decomposition of Regular Domain Problems
IEEE Transactions on Parallel and Distributed Systems
High Bandwidth On-Chip Cache Design
IEEE Transactions on Computers
High-Performance DRAMs in Workstation Environments
IEEE Transactions on Computers
Leveraging cache coherence in active memory systems
ICS '02 Proceedings of the 16th international conference on Supercomputing
The architecture of the DIVA processing-in-memory chip
ICS '02 Proceedings of the 16th international conference on Supercomputing
Avoiding initialization misses to the heap
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Two techniques for reconciling algorithm parallelism with memory constraints
Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
IEEE Micro
Analytic Evaluation of Shared-Memory Architectures
IEEE Transactions on Parallel and Distributed Systems
Hardware Versus Software Implementation of COMA
ICPP '97 Proceedings of the international Conference on Parallel Processing
A Case Study of Load Distribution in Parallel View Frustum Culling and Collision Detection
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Memory Management in a PIM-Based Architecture
IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Active Memory Clusters: Efficient Multiprocessing on Commodity Clusters
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
ARVLSI '97 Proceedings of the 17th Conference on Advanced Research in VLSI (ARVLSI '97)
The Illinois Aggressive Coma Multiprocessor project (I-ACOMA)
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Hierarchical processors-and-memory architecture for high performance computing
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Design and Optimization of Large Size and Low Overhead Off-Chip Caches
IEEE Transactions on Computers
Analysis and Modeling of Advanced PIM Architecture Design Tradeoffs
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A Prototype Processing-In-Memory (PIM) Chip for the Data-Intensive Architecture (DIVA) System
Journal of VLSI Signal Processing Systems
A low cost, multithreaded processing-in-memory system
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Memory-side prefetching for linked data structures for processor-in-memory systems
Journal of Parallel and Distributed Computing
SMP-SoC is the answer if you ask the right questions
SAICSIT '06 Proceedings of the 2006 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
Destructive-read in embedded DRAM, impact on power consumption
Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
A Token-Managed Admission Control System for QoS Provision on a Best-Effort GALS Interconnect
Fundamenta Informaticae - Application of Concurrency to System Design
A pattern based instruction encoding technique for high performance architectures
International Journal of High Performance Systems Architecture
Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Pinned to the walls: impact of packaging and application properties on the memory and power walls
Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
Cache write-back schemes for embedded destructive-read DRAM
ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
A Token-Managed Admission Control System for QoS Provision on a Best-Effort GALS Interconnect
Fundamenta Informaticae - Application of Concurrency to System Design
The Journal of Supercomputing
Hi-index | 0.01 |
Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance.This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor memory integration can be used to build competitive, scalable and cost-effective MP systems.We present results from execution driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.