Memory speeds have not kept up with processor speeds. More precisely, DRAM latency has not kept pace: processor speeds have been increasing by at least 70 percent per year, while DRAM latency has improved only 7 percent annually. As a result, a contemporary superscalar 300-MHz DEC Alpha system with a 40-ns DRAM can perform at least 24 instructions in the time it takes to access its memory just once. In a few years, if current trends continue, the number of instructions per access could increase to a thousand.

Fortunately, memory bandwidth is another matter. Wider buses, multiple banks, more pins, the integrated-circuit properties of DRAMs (such as static-column mode and on-chip cache), and the newer Rambus and synchronous DRAMs have all contributed to bandwidths that have scaled better than latency. A central problem for memory system designers is how to exploit this bandwidth to achieve lower latencies.

In this article, we describe a technique that can convert more than 90 percent of a memory system's bandwidth into low-latency accesses, at least for a particular class of computations. The scheme nicely complements traditional caching in two ways: it handles frequently occurring memory reference patterns for which caches do not perform well, and, by removing this problematic data from the cache, it reduces pollution, making the cache more effective for the remaining references.
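The "24 instructions per access" figure can be reproduced with simple arithmetic. A minimal sketch follows, using the 300-MHz clock and 40-ns DRAM latency from the text; the 2-wide superscalar issue width is an assumption chosen to match the stated claim, not a figure given in the article.

```python
# Back-of-the-envelope estimate of the processor/memory latency gap.
# Clock rate and DRAM latency are from the text; the issue width is
# an assumed value for a 2-wide superscalar Alpha.

clock_hz = 300e6          # 300-MHz DEC Alpha
dram_latency_s = 40e-9    # 40-ns DRAM access time
issue_width = 2           # assumed instructions issued per cycle

# Cycles the processor could have executed during one DRAM access.
stall_cycles = dram_latency_s * clock_hz          # 40 ns * 300 MHz = 12

# Instructions forgone per memory access on a 2-wide machine.
instructions_lost = stall_cycles * issue_width    # 12 * 2 = 24

print(round(stall_cycles), "cycles per access")
print(round(instructions_lost), "instructions per access")
```

At a 70-percent-per-year processor improvement against 7 percent for DRAM, the same calculation pushed a few years forward yields the "thousand instructions per access" projection.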