Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

Authors:
Zheng Zhang;Josep Torrellas
Affiliations:
Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, IL;Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, IL
Venue:
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Year:
1995

Citing 18
Cited 28

Analysis of cache invalidation patterns in multiprocessors

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
An architecture for software-controlled data prefetching

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data prefetching in multiprocessor vector cache memories

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Adjustable block size coherent caches

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Simulation of multiprocessors: accuracy and performance

Simulation of multiprocessors: accuracy and performance
Performance evaluation of hybrid hardware and software distributed shared memory protocols

ICS '94 Proceedings of the 8th international conference on Supercomputing
A performance study of software and hardware data prefetching schemes

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance advantages of integrating block data transfer in cache-coherent multiprocessors

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled data prefetching

Tolerating latency through software-controlled data prefetching
Compiler-directed data prefetching in multiprocessors with memory hierarchies

ICS '90 Proceedings of the 4th international conference on Supercomputing
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
False Sharing and Spatial Locality in Multiprocessor Caches

IEEE Transactions on Computers
SPLASH: Stanford parallel applications for shared-memory

SPLASH: Stanford parallel applications for shared-memory

Compiler-based prefetching for recursive data structures

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
Profetching and memory system behavior of the SPEC95 benchmark suite

IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Prefetching Using Markov Predictors

IEEE Transactions on Computers - Special issue on cache memory and related problems
Automatic Compiler-Inserted Prefetching for Pointer-Based Applications

IEEE Transactions on Computers - Special issue on cache memory and related problems
A high-level abstraction of shared accesses

ACM Transactions on Computer Systems (TOCS)
Push vs. pull: data movement for linked data structures

Proceedings of the 14th international conference on Supercomputing
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Designing a Modern Memory Hierarchy with Hardware Prefetching

IEEE Transactions on Computers
Using a user-level memory thread for correlation prefetching

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A Programmable Memory Hierarchy for Prefetching Linked Data Structures

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
The Illinois Aggressive Coma Multiprocessor project (I-ACOMA)

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
MORPH: a system architecture for robust high performance using customization (an NSF 100 TeraOps point design study)

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Guided region prefetching: a cooperative hardware/software approach

Proceedings of the 30th annual international symposium on Computer architecture
Correlation Prefetching with a User-Level Memory Thread

IEEE Transactions on Parallel and Distributed Systems
Managing Wire Delay in Large Chip-Multiprocessor Caches

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Tolerating memory latency through push prefetching for pointer-intensive applications

ACM Transactions on Architecture and Code Optimization (TACO)
Memory-side prefetching for linked data structures for processor-in-memory systems

Journal of Parallel and Distributed Computing
On the performance of trace locality of reference

Performance Evaluation - Performance modelling and evaluation of high-performance parallel and distributed systems
The STAMPede approach to thread-level speculation

ACM Transactions on Computer Systems (TOCS)
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

Proceedings of the International Symposium on Code Generation and Optimization
Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module

Scientific Programming
Prefetching irregular references for software cache on cell

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture

Languages and Compilers for Parallel Computing
PFetch: software prefetching exploiting temporal predictability of memory access streams

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
A pattern language for parallelizing irregular algorithms

Proceedings of the 2010 Workshop on Parallel Programming Patterns
Analysis and performance results of computing betweenness centrality on IBM Cyclops64

The Journal of Supercomputing
SAD prefetching for MPEG4 using flux caches

SAMOS'06 Proceedings of the 6th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation

Quantified Score

Hi-index	0.00

Visualization

Abstract

While many parallel applications exhibit good spatial locality, other important codes in areas like graph problem-solving or CAD do not. Often, these irregular codes contain small records accessed via pointers. Consequently, while the former applications benefit from long cache lines, the latter prefer short lines. One good solution is to combine short lines with prefetching. In this way, each application can exploit the amount of spatial locality that it has. However, prefetching, if provided, should also work for the irregular codes. This paper presents a new prefetching scheme that, while usable by regular applications, is specifically targeted to irregular ones: Memory Binding and Group Prefetching.The idea is to hardware-bind and prefetch together groups of data that the programmer suggests are strongly related to each other. Examples are the different fields in a record or two records linked by a permanent pointer. This prefetching scheme, combined with short cache lines, results in a memory hierarchy design that can be exploited by both regular and irregular applications. Overall, it is better to use a system with short lines (16-32 bytes) and our prefetching than a system with long lines (128 bytes) with or without our prefetching. The former system runs 6 out of 7 Splash-class applications faster. In particular, some of the most irregular applications run 25-40% faster.