TLB consistency on highly-parallel shared-memory multiprocessors
Proceedings of the Twenty-First Annual Hawaii International Conference on Architecture Track
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
Active messages: a mechanism for integrated communication and computation
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture
Active pages: a computation model for intelligent memory
Proceedings of the 25th annual international symposium on Computer architecture
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
ACM Transactions on Programming Languages and Systems (TOPLAS)
IEEE Transactions on Computers
Benchmark Handbook: For Database and Transaction Processing Systems
Benchmark Handbook: For Database and Transaction Processing Systems
Using a user-level memory thread for correlation prefetching
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
International Journal of Parallel Programming
IEEE Micro
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Database Architecture Optimized for the New Bottleneck: Memory Access
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
DBMSs on a Modern Processor: Where Does Time Go?
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
What Will Have the Greatest Impact in 2010: The Processor, the Memory, or the Interconnect?
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems
IEEE Transactions on Computers
Scatter-Add in Data Parallel Architectures
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Active memory operations
Combinable memory-block transactions
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Atomic Vector Operations on Chip Multiprocessors
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Multiprocessor System-on-Chip designs with active memory processors for higher memory efficiency
Proceedings of the 46th Annual Design Automation Conference
Flexible architectural support for fine-grain scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Proceedings of the 7th ACM international conference on Computing frontiers
Reducing contention through priority updates
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Hi-index | 0.00 |
The performance of modern microprocessors is increasingly limited by their inability to hide main memory latency. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose the use of Active Memory Operations (AMOs), in which select operations can be sent to and executed on the home memory controller of data. AMOs can eliminate significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper we present architectural and programming models for AMOs, and compare its performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster database queries. Based on a standard cell implementation, we predict that the circuitry required to support AMOs is less than 1% of the typical chip area of a high performance microprocessor.