Compiler optimization to improve data locality for processor multithreading

Authors:
Balaram Sinharoy
Affiliations:
IBM Corporation, East Fishkill, NY 12533, USA E-mail: balaram@watson.ibm.com
Venue:
Scientific Programming
Year:
1999



Abstract

Over the last decade, processor speed has increased dramatically, whereas the speed of the memory subsystem has improved at only a modest rate. Because cache miss latency, measured in processor cycles, has grown as a result, processors stall on cache misses for a significant portion of their execution time. Multithreaded processors have been proposed in the literature to reduce the processor stall time due to cache misses. Although multithreading improves processor utilization, it may also increase cache miss rates, because in a multithreaded processor multiple threads share the same cache, which effectively reduces the cache size available to each individual thread. Together, the increased processor utilization and the higher cache miss rate demand higher memory bandwidth. This paper presents a novel compiler optimization method that improves data locality for each of the threads and enhances data sharing among the threads. The method is based on loop transformation theory and optimizes both spatial and temporal data locality. The resulting threads exhibit a high level of intra-thread and inter-thread data locality, which effectively reduces both the data cache miss rate and the total execution time of numerically intensive computations running on a multithreaded processor.
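As a rough illustration of why loop order matters for spatial locality, the sketch below counts misses in a toy direct-mapped cache for two traversals of the same row-major array. It is not the method of the paper: the cache model, its parameters (`num_lines`, `line_words`), and the helper names `count_misses`, `row_order`, and `column_order` are all assumptions chosen for illustration.

```python
def count_misses(addresses, num_lines=64, line_words=8):
    """Count misses in a toy direct-mapped cache (hypothetical parameters).

    addresses  -- iterable of word addresses, in access order
    num_lines  -- number of cache lines (assumed value)
    line_words -- words per cache line (assumed value)
    """
    tags = [None] * num_lines          # block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_words     # memory block containing this word
        index = block % num_lines      # direct-mapped: one candidate line
        if tags[index] != block:
            tags[index] = block        # fill the line on a miss
            misses += 1
    return misses


def row_order(n):
    """Addresses touched when the inner loop walks along a row (stride 1)."""
    return [i * n + j for i in range(n) for j in range(n)]


def column_order(n):
    """Addresses touched when the inner loop walks down a column (stride n)."""
    return [i * n + j for j in range(n) for i in range(n)]


if __name__ == "__main__":
    n = 64  # 64 x 64 array of words, stored row-major
    print("row-major traversal misses:   ", count_misses(row_order(n)))
    print("column-major traversal misses:", count_misses(column_order(n)))
```

Under these assumed parameters, the stride-1 traversal misses once per cache line (512 misses), while the stride-64 traversal misses on every access (4096 misses). Interchanging the two loops is one of the simplest transformations of the kind a locality-optimizing compiler might apply; when several threads share the cache, the line count available to each thread shrinks and the penalty for a poor loop order grows accordingly.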