A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Software caching and computation migration in Olden
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Active pages: a computation model for intelligent memory
Proceedings of the 25th annual international symposium on Computer architecture
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Memory sharing predictor: the key to a speculative coherent DSM
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Architecture and design of AlphaServer GS320
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
FLASH vs. (Simulated) FLASH: closing the simulation loop
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
IEEE Transactions on Computers
Leveraging cache coherence in active memory systems
ICS '02 Proceedings of the 16th international conference on Supercomputing
The architecture of the DIVA processing-in-memory chip
ICS '02 Proceedings of the 16th international conference on Supercomputing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Architectural Support for Parallel Reductions in Scalable Shared-Memory Multiprocessors
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Cache Coherence Protocol Design for Active Memory Systems
PDPTA '02 Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications - Volume 1
Active Memory Clusters: Efficient Multiprocessing on Commodity Clusters
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Architectural Implications of a Family of Irregular Applications
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
FlexRAM: Toward an Advanced Intelligent Memory System
ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design
Differential FCM: Increasing Value Prediction Accuracy by Improving Table Usage Efficiency
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
SMTp: An Architecture for Next-generation Scalable Multi-threading
Proceedings of the 31st annual international symposium on Computer architecture
Proceedings of the 21st annual international conference on Supercomputing
Creating a transparent, distributed, and resilient computing environment: the OpenRTE project
The Journal of Supercomputing
SAMS multi-layout memory: providing multiple views of data to boost SIMD performance
Proceedings of the 24th ACM International Conference on Supercomputing
Data layout optimization for multi-valued containers in OpenCL
Journal of Parallel and Distributed Computing
The Journal of Supercomputing
Hi-index | 14.98 |
Abstract--We introduce an architectural approach to improve memory system performance in both uniprocessor and multiprocessor systems. The architectural innovation is a flexible active memory controller backed by specialized cache coherence protocols that permit the transparent use of address remapping techniques. The resulting system shows significant performance improvement across a spectrum of machine configurations, from uniprocessors through single-node multiprocessors (SMPs) to distributed shared memory clusters (DSMs). Address remapping techniques exploit the data access patterns of applications to enhance their cache performance. However, they create coherence problems since the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we present a new approach to solve the coherence problem. We leverage and extend the cache coherence protocol so that our techniques work transparently to efficiently support uniprocessor, SMP and DSM active memory systems. We detail the coherence protocol extensions to support our active memory techniques and present simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. We also show remarkable performance improvement on small to medium-scale SMP and DSM multiprocessors, allowing some parallel applications to continue to scale long after their performance levels off on normal systems.