Hitting the memory wall: implications of the obvious

Authors:
Wm. A. Wulf;Sally A. McKee
Affiliations:
Department of Computer Science, University of Virginia;Department of Computer Science, University of Virginia
Venue:
ACM SIGARCH Computer Architecture News
Year:
1995

Citing 2
Cited 199

Increasing memory bandwidth for vector computations

Proceedings of the international conference on Programming languages and system architectures
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach

Missing the memory wall: the case for processor/memory integration

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Evaluation of multithreaded uniprocessors for commercial application environments

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Thread scheduling for cache locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Designing high bandwidth on-chip caches

Proceedings of the 24th annual international symposium on Computer architecture
The energy efficiency of IRAM architectures

Proceedings of the 24th annual international symposium on Computer architecture
Active pages: a computation model for intelligent memory

Proceedings of the 25th annual international symposium on Computer architecture
Development and validation of a hierarchical memory model incorporating CPU- and memory-operation overlap model

Proceedings of the 1st international workshop on Software and performance
Compiler-controlled memory

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Active disks: programming model, algorithms and evaluation

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Efficient management of memory hierarchies in embedded DRAM systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
The processor-memory bottleneck: problems and solutions

Crossroads - Computer architecture
Data Locality Exploitation in the Decomposition of Regular Domain Problems

IEEE Transactions on Parallel and Distributed Systems
Dynamic Access Ordering for Streamed Computations

IEEE Transactions on Computers
Evaluating the impact of memory system performance on software prefetching and locality optimizations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Performance evaluation of the SGI Origin2000: a memory-centric characterization of LANL ASCI applications

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Avoiding initialization misses to the heap

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Early cancellation: an active NIC optimization for time-warp

Proceedings of the sixteenth workshop on Parallel and distributed simulation
Two techniques for reconciling algorithm parallelism with memory constraints

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
Scalable parallel coset enumeration: bulk definition and the memory wall

Journal of Symbolic Computation - Computer algebra: Selected papers from ISSAC 2001
A Case for Intelligent RAM

IEEE Micro
Deep-Submicron Microprocessor Design Issues

IEEE Micro
Overcoming the memory wall in symbolic algebra: a faster permutation multiplication

ACM SIGSAM Bulletin
A Memory Controller for Improved Performance of Streamed Computations on Symmetric Multiprocessors

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Architectural Support for Data-intensive Applications

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
The New DRAM Interfaces: SDRAM, RDRAM and Variants

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Technology Trends and Adaptive Computing

FPL '01 Proceedings of the 11th International Conference on Field-Programmable Logic and Applications
HAGAR: Efficient Multi-context Graph Processors

FPL '02 Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications
A proposal for a new hardware cache monitoring architecture

Proceedings of the 2002 workshop on Memory system performance
Exploring Microprocessor Architectures for Gigascale Integration

ARVLSI '99 Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI
MORPH: a system architecture for robust high performance using customization (an NSF 100 TeraOps point design study)

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Hierarchical processors-and-memory architecture for high performance computing

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Distributed Prefetch-buffer/Cache Design for High Performance Memory Systems

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
A Case for Studying DRAM Issues at the System Level

IEEE Micro
FPGAs vs. CPUs: trends in peak floating-point performance

FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems

IEEE Transactions on Computers
Reflections on the memory wall

Proceedings of the 1st conference on Computing frontiers
A first glance at Kilo-instruction based multiprocessors

Proceedings of the 1st conference on Computing frontiers
Profile guided code positioning

ACM SIGPLAN Notices - Best of PLDI 1979-1999
Optimizing application performance: a case study using LMbench

Crossroads
Inter-reference gap distribution replacement: an improved replacement algorithm for set-associative caches

Proceedings of the 18th annual international conference on Supercomputing
CQoS: a framework for enabling QoS in shared caches of CMP platforms

Proceedings of the 18th annual international conference on Supercomputing
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Proceedings of the 31st annual international symposium on Computer architecture
Fast, predictable and low energy memory references through architecture-aware compilation

Proceedings of the 2004 Asia and South Pacific Design Automation Conference
TCP Onloading for Data Center Servers

Computer
Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software

Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Bandwidth Management with a Reconfigurable Data Cache

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
An Address Dependence Model of Computation for Hierarchical Memories with Pipelined Transfer

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 8 - Volume 09
Enhancing NIC Performance for MPI using Processing-in-Memory

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 9 - Volume 10
Evaluating kilo-instruction multiprocessors

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Statistical geometry representation for efficient transmission and rendering

ACM Transactions on Graphics (TOG)
Kilo-Instruction Processors: Overcoming the Memory Wall

IEEE Micro
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Overcoming the memory wall in packet processing: hammers or ladders?

Proceedings of the 2005 ACM symposium on Architecture for networking and communications systems
The TM3270 Media-Processor Data Cache

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
DRAMsim: a memory system simulator

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Predicting the Performance of a 3D Processor-Memory Chip Stack

IEEE Design & Test
Queue Usage and Memory-Level Parallelism Sensitive Scheduling

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Dynamic memory instruction bypassing

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
SCMP: a single-chip message-passing parallel computer

The Journal of Supercomputing - Special issue: Parallel and distributed processing and applications
Chip multiprocessing and the cell broadband engine

Proceedings of the 3rd conference on Computing frontiers
Kilo-instruction processors, runahead and prefetching

Proceedings of the 3rd conference on Computing frontiers
The bit-reversal SDRAM address mapping

SCOPES '05 Proceedings of the 2005 workshop on Software and compilers for embedded systems
Braids and fibers: language constructs with architectural support for adaptive responses to memory latencies

IBM Journal of Research and Development
A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy

Proceedings of the 43rd annual Design Automation Conference
Memory bandwidth optimization through stream descriptors

MEDEA '05 Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications, systems and architecture
Energy-efficient instruction scheduling utilizing cache miss information

MEDEA '05 Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications, systems and architecture
Characterization of simultaneous multithreading (SMT) efficiency in POWER5

IBM Journal of Research and Development - POWER5 and packaging
Introduction to the cell multiprocessor

IBM Journal of Research and Development - POWER5 and packaging
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
The exigency of benchmark and compiler drift: designing tomorrow's processors with yesterday's tools

Proceedings of the 20th annual international conference on Supercomputing
Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses

IEEE Transactions on Computers
Locality and parallelism optimization for dynamic programming algorithm in bioinformatics

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
SMP-SoC is the answer if you ask the right questions

SAICSIT '06 Proceedings of the 2006 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
Cache oblivious algorithms for nonserial polyadic programming

The Journal of Supercomputing
Design and implementation of power-aware virtual memory

ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
Optimizing software cache performance of packet processing applications

Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications

SCOPES '07 Proceedings of the 10th international workshop on Software & compilers for embedded systems
Light-weight synchronization for inter-processor communication acceleration on embedded MPSoCs

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Mapping streaming architectures on reconfigurable platforms

ACM SIGARCH Computer Architecture News - Special issue on the 2006 reconfigurable and adaptive architecture workshop
Random-Accessible Compressed Triangle Meshes

IEEE Transactions on Visualization and Computer Graphics
Frame shared memory: line-rate networking on commodity hardware

Proceedings of the 3rd ACM/IEEE Symposium on Architecture for networking and communications systems
Configuration and extension of embedded processors to optimize IPSec protocol execution

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Reducing cache misses through programmable decoders

ACM Transactions on Architecture and Code Optimization (TACO)
The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor

International Journal of Parallel Programming
Augmenting priority rule heuristics with justification and rollout to solve the resource-constrained project scheduling problem

Computers and Operations Research
Stochastic rollout and justification to solve the resource-constrained project scheduling problem

Proceedings of the 39th conference on Winter simulation: 40 years! The best is yet to come
Fast indexing for blocked array layouts to reduce cache misses

International Journal of High Performance Computing and Networking
Future ILP processors

International Journal of High Performance Computing and Networking
A genetic algorithms approach to modeling the performance of memory-bound computations

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Memory performance attacks: denial of memory service in multi-core systems

SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
Exploiting program cyclic behavior to reduce memory latency in embedded processors

Proceedings of the 2008 ACM symposium on Applied computing
Optimizing thread throughput for multithreaded workloads on memory constrained CMPs

Proceedings of the 5th conference on Computing frontiers
Exploring power reduction options for a single-chip multiprocessor through system-level modeling

Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
HMTT: a platform independent full-system memory trace monitoring system

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Efficient dynamic heap allocation of scratch-pad memory

Proceedings of the 7th international symposium on Memory management
Server-based data push architecture for multi-processor environments

Journal of Computer Science and Technology
Utilizing shared data in chip multiprocessors with the Nahalal architecture

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
3D-Stacked Memory Architectures for Multi-core Processors

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
PipesFS: fast Linux I/O in the unix tradition

ACM SIGOPS Operating Systems Review - Research and developments in the Linux kernel
CATCH: a mechanism for dynamically detecting Cache-Content-Duplication and its application to instruction caches

Proceedings of the conference on Design, automation and test in Europe
Recognition and Optimization of Loop-Carried Stream Reusing of Scientific Computing Applications on the Stream Processor

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
Interprocedural Speculative Optimization of Memory Accesses to Global Variables

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Journal of Parallel and Distributed Computing
Exploiting loop-dependent stream reuse for stream processors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Efficient implementation of decoupling capacitors in 3D processor-dram integrated computing systems

Proceedings of the 19th ACM Great Lakes symposium on VLSI
Analysis of challenges for on-chip optical interconnects

Proceedings of the 19th ACM Great Lakes symposium on VLSI
Trace-Based Analysis and Optimization for the Semtex CFD Application --- Hidden Remote Memory Accesses and I/O Performance

Euro-Par 2008 Workshops - Parallel Processing
Pattern-based sparse matrix representation for memory-efficient SMVM kernels

Proceedings of the 23rd international conference on Supercomputing
Two memory allocators that use hints to improve locality

Proceedings of the 2009 international symposium on Memory management
On approximating the ideal random access machine by physical machines

Journal of the ACM (JACM)
Multi-target C++ implementation of parallel skeletons

Proceedings of the 8th workshop on Parallel/High-Performance Object-Oriented Scientific Computing
Exploiting Locality on the Cell/B.E. through Bypassing

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
SSE Implementation of Multivariate PKCs on Modern x86 CPUs

CHES '09 Proceedings of the 11th International Workshop on Cryptographic Hardware and Embedded Systems
Allocation wall: a limiting factor of Java applications on emerging multi-core platforms

Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Predictable performance for unpredictable workloads

Proceedings of the VLDB Endowment
Reevaluating Amdahl's law in the multicore era

Journal of Parallel and Distributed Computing
LIRAC: using live range information to optimize memory access

ARCS'07 Proceedings of the 20th international conference on Architecture of computing systems
Optimizing stream organization to improve the performance of scientific computing applications on the stream processor

ICA3PP'07 Proceedings of the 7th international conference on Algorithms and architectures for parallel processing
Stream image processing on a dual-core embedded system

SAMOS'07 Proceedings of the 7th international conference on Embedded computer systems: architectures, modeling, and simulation
Direct coherence: bringing together performance and scalability in shared-memory multiprocessors

HiPC'07 Proceedings of the 14th international conference on High performance computing
Exploiting execution locality with a decoupled Kilo-instruction processor

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Scaling power/ground solvers on multi-core with memory bandwidth awareness

Proceedings of the 20th symposium on Great lakes symposium on VLSI
On-chip COMA cache-coherence protocol for microgrids of microthreaded cores

Euro-Par'07 Proceedings of the 2007 conference on Parallel processing
Timing local streams: improving timeliness in data prefetching

Proceedings of the 24th ACM International Conference on Supercomputing
Fast multiplication of large permutations for disk, flash memory and RAM

Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation
Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Transactions on Architecture and Code Optimization (TACO)
An Adaptive Data Prefetcher for High-Performance Processors

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
High Resolution Program Flow Visualization of Hardware Accelerated Hybrid Multi-core Applications

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Efficient runahead threads

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Parallel search on video cards

HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Mapping scientific applications on a large-scale data-path accelerator implemented by single-flux quantum (SFQ) circuits

Proceedings of the Conference on Design, Automation and Test in Europe
A robust multigrid solver on parallel computers

EURO-PDP'00 Proceedings of the 8th Euromicro conference on Parallel and distributed processing
An experimental study of optimizing bioinformatics applications

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
SigMatch: fast and scalable multi-pattern matching

Proceedings of the VLDB Endowment
Streaming Data Movement for Real-Time Image Analysis

Journal of Signal Processing Systems
Exploitation of multicore systems in a java virtual machine

IBM Journal of Research and Development
Patterns for cache optimizations on multi-processor machines

Proceedings of the 2010 Workshop on Parallel Programming Patterns
Application-Tailored I/O with Streamline

ACM Transactions on Computer Systems (TOCS)
Should we worry about memory loss?

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Tackling cache-line stealing effects using run-time adaptation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

The Journal of Supercomputing
Making the Best of Temporal Locality: Just-in-Time Renaming and Lazy Write-Back on the Cell/B.E.

International Journal of High Performance Computing Applications
Cache injection for parallel applications

Proceedings of the 20th international symposium on High performance distributed computing
Adapt or become extinct!: the case for a unified framework for deployment-time optimization (position paper)

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Pinned to the walls: impact of packaging and application properties on the memory and power walls

Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
CATCH: A mechanism for dynamically detecting cache-content-duplication in instruction caches

ACM Transactions on Architecture and Code Optimization (TACO)
A parallel code for time independent quantum reactive scattering on CPU-GPU platforms

ICCSA'11 Proceedings of the 2011 international conference on Computational science and its applications - Volume Part III
Software implementation of binary elliptic curves: impact of the carry-less multiplier on scalar multiplication

CHES'11 Proceedings of the 13th international conference on Cryptographic hardware and embedded systems
Toward five-dimensional scaling: how density improves efficiency in future computers

IBM Journal of Research and Development
Energy-Efficient Scheduling on Milliclusters with Performance Constraints

GREENCOM '11 Proceedings of the 2011 IEEE/ACM International Conference on Green Computing and Communications
Global-aware and multi-order context-based prefetching for high-performance processors

International Journal of High Performance Computing Applications
Memory access cycle and the measurement of memory systems

Proceedings of the second international workshop on Performance modeling, benchmarking and simulation of high performance computing systems
Scientific computing applications on the imagine stream processor

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Processor directed dynamic page policy

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
The challenges of efficient code-generation for massively parallel architectures

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Compile-Time thread distinguishment algorithm on VIM-Based architecture

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Design and analysis of adaptive processor

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Efficient memory management of a hierarchical and a hybrid main memory for MN-MATE platform

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Collecting and exploiting cache-reuse metrics

ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
A memory bandwidth effective cache store miss policy

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Software–hardware cooperative power management for main memory

PACS'04 Proceedings of the 4th international conference on Power-Aware Computer Systems
Compilation and simulation tool chain for memory aware energy optimizations

SAMOS'06 Proceedings of the 6th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation
Code-based cache partitioning for improving hardware cache performance

Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
WMTools - assessing parallel application memory utilisation at scale

EPEW'11 Proceedings of the 8th European conference on Computer Performance Engineering
KISS-Tree: smart latch-free in-memory indexing on modern architectures

DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
CCDR-PAID: more efficient cache-conscious PAID algorithm by data reconstruction

Proceedings of the 27th Annual ACM Symposium on Applied Computing
An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

Journal of Computational Physics
Mat-core: a decoupled matrix core extension for general-purpose processors

Neural, Parallel & Scientific Computations
Partitioning and multi-core parallelization of multi-equation forecast models

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
Making data prefetch smarter: adaptive prefetching on POWER7

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Shared hardware data structures for hard real-time systems

Proceedings of the tenth ACM international conference on Embedded software
A distributed interleaving scheme for efficient access to WideIO DRAM memory

Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
APC: a performance metric of memory systems

ACM SIGMETRICS Performance Evaluation Review
Toward on-chip datacenters: a perspective on general trends and on-chip particulars

The Journal of Supercomputing
NUMA-aware graph mining techniques for performance and energy efficiency

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Algorithm-level Feedback-controlled Adaptive data prefetcher: Accelerating data access for high-performance processors

Parallel Computing
Interactive visualization for memory reference traces

EuroVis'08 Proceedings of the 10th Joint Eurographics / IEEE - VGTC conference on Visualization
TLM modelling of 3D stacked wide I/O DRAM subsystems: a virtual platform for memory controller design space exploration

Proceedings of the 2013 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools
FPGA programming for the masses

Communications of the ACM
Low power cache architectures with hybrid approach of filtering unnecessary way accesses

Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
Application task and data placement in embedded many-core NUMA architectures

Proceedings of the 10th Workshop on Optimizations for DSP and Embedded Systems
FPGA Programming for the Masses

Queue - Mobile Web Development
Reducing memory access latency with asymmetric DRAM bank organizations

Proceedings of the 40th Annual International Symposium on Computer Architecture
Resilient die-stacked DRAM caches

Proceedings of the 40th Annual International Symposium on Computer Architecture
Return data interleaving for multi-channel embedded CMPs systems

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Design and performance evaluation of NUMA-aware RDMA-based end-to-end data transfer systems

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Memory-centric system interconnect design with hybrid memory cubes

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Toward application-specific memory reconfiguration for energy efficiency

E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Reviewing traffic classification

Data Traffic Monitoring and Analysis
Apple-CORE: Harnessing general-purpose many-cores with hardware concurrency management

Microprocessors & Microsystems
Direct distributed memory access for CMPs

Journal of Parallel and Distributed Computing
HMTT: A hybrid hardware/software tracing system for bridging the DRAM access trace's semantic gap

ACM Transactions on Architecture and Code Optimization (TACO)
Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems

Scientific Programming
Creating robust high-throughput traffic sign detectors using centre-surround HOG statistics

Machine Vision and Applications

Quantified Score

Hi-index	0.03
