Prefetching irregular references for software cache on cell

Authors:
Tong Chen;Tao Zhang;Zehra Sura;Mar Gonzales Tallada
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA;IBM T. J. Watson Research Center, Yorktown Heights, NY, USA;IBM T. J. Watson Research Center, Yorktown Heights, NY, USA;IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Venue:
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Year:
2008

Citing 21
Cited 22

Software-controlled caches in the VMP multiprocessor

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Evaluating stream buffers as a secondary cache replacement

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
SPAID: software prefetching in pointer- and call-intensive environments

Proceedings of the 28th annual international symposium on Microarchitecture
Compiler-based prefetching for recursive data structures

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Dynamic management of scratch-pad memory space

Proceedings of the 38th annual Design Automation Conference
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Data prefetching by dependence graph precomputation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Compiler Analysis for Irregular Problems in Fortran D

Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Slicing Analysis and Indirect Accesses to Distributed Arrays

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Optimizing Compiler for the CELL Processor

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dynamic allocation for scratch-pad memory using compile-time decisions

ACM Transactions on Embedded Computing Systems (TECS)
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
Supporting OpenMP on Cell

IWOMP '07 Proceedings of the 3rd international workshop on OpenMP: A Practical Programming Model for the Multi-Core Era
Optimizing the use of static buffers for DMA on a CELL chip

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing

Hybrid access-specific software cache techniques for the cell BE architecture

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
COMIC: a coherent shared memory interface for cell be

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture

Languages and Compilers for Parallel Computing
DBDB: optimizing DMATransfer for the cell be architecture

Proceedings of the 23rd international conference on Supercomputing
Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching

ACM Transactions on Architecture and Code Optimization (TACO)
An efficient software cache for H.264 motion compensation

SOC'09 Proceedings of the 11th international conference on System-on-chip
LU decomposition on cell broadband engine: an empirical study to exploit heterogeneous chip multiprocessors

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Adaptive line size cache for irregular references on cell multicore processor

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
A study of a software cache implementation of the OpenMP memory model for multicore and manycore architectures

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Compiler-directed memory management for heterogeneous MPSoCs

Journal of Systems Architecture: the EUROMICRO Journal
An instruction to accelerate software caches

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
OpenMP extensions for heterogeneous architectures

IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
A reuse-aware prefetching scheme for scratchpad memory

Proceedings of the 48th Design Automation Conference
Vector class on limited local memory (LLM) multi-core processors

CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
Adaptive and speculative memory consistency support for multi-core architectures with on-chip local memories

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Automatic data distribution for improving data locality on the cell BE architecture

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Analysis of task offloading for accelerators

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Integrating software caches with scratch pad memory

Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
A transactional runtime system for the Cell/BE architecture

Journal of Parallel and Distributed Computing
Hardware-software coherence protocol for the coexistence of caches and local memories

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A Multidimensional Software Cache for Scratchpad-Based Systems

International Journal of Embedded and Real-Time Communication Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

The IBM Single Source Research Compiler for the Cell processor (the SSC Research Compiler) was developed to manage the complexity of programming the heterogeneous multicore Cell processor. The compiler accepts conventional source programs as input, and automatically generates binaries that execute on both the PPU and SPU cores available on a Cell chip. The compiler uses a software cache and direct buffers to manage data in the small local memory of SPUs. However, irregular references, such as a[ind[i]], often become performance bottle-necks. These references are accessed through software cache, usually with high miss rates. To solve this problem, we propose a method to prefetch irregular references accessed through a software cache that is built upon hardware such as Cell. This method includes code transformation in the compiler and a runtime library component for the software cache. Our design simplifies the synchronization required when prefetching into software cache, overlaps DMA operations for misses, and avoids frequent context switching to the miss handler. It also minimizes the cache pollution caused by prefetching, by looking both forwards and backwards through the sequence of addresses to be prefetched. We evaluated our prefetching method using the NAS benchmarks. We found that when applicable, our prefetching can improve the performance of some benchmarks by 2 times on average, and by close to 4 times in the best case. We also present data to show the impact of different configurations and optimizations when prefetching in a software cache.