Dynamic tiling for effective use of shared caches on multithreaded processors

Authors:
Dimitrios S. Nikolopoulos
Affiliations:
Department of Computer Science, College of William and Mary, Williamsburg, VA, USA
Venue:
International Journal of High Performance Computing and Networking
Year:
2004


Abstract

Simultaneous multithreaded (SMT) processors use data caches that are dynamically shared between threads. Depending on the processor workload, sharing the data cache may harm performance due to excessive cache conflicts. One way to overcome this problem is to physically partition the cache between threads. Unfortunately, partitioning the cache requires additional hardware and may lead to lower utilisation of the cache in certain workloads. It is therefore important to consider software mechanisms that implicitly partition the cache between threads by controlling the cache locations into which each thread can load data. This paper proposes standard program transformations that partition the shared data caches of SMT processors if and only if threads conflict in the shared cache at runtime. The transformations are based on dynamic tiling. The key idea is to use two tile sizes in the program, one suited to single-threaded execution mode and one suited to multithreaded execution mode, and to switch between them at runtime. Our transformations combine dynamic tiling with either copying or storing arrays in block layout. The paper presents an implementation of these transformations, along with runtime mechanisms for detecting cache contention between threads and reacting to it on the fly. Our experimental results show that, for regular, perfect loop nests, these transformations provide substantial performance improvements.
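
The two-tile-size idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not describe the contention-detection mechanism, so `contended_fn` below is a hypothetical stand-in that simply returns True when another thread is assumed to be competing for the cache. The tile sizes and the name `tiled_matmul` are likewise illustrative assumptions, and the copying and block-layout parts of the transformations are omitted.

```python
def tiled_matmul(a, b, n, contended_fn, tile_single=4, tile_shared=2):
    """Multiply two n x n matrices (lists of lists) with a tile size
    that is re-selected at every column-stripe boundary.

    contended_fn is a HYPOTHETICAL contention probe (an assumption, not
    the paper's mechanism): it returns True when the cache is assumed
    to be shared with another thread, in which case the smaller tile
    size is used so that the working set fits in part of the cache.
    """
    c = [[0.0] * n for _ in range(n)]
    jj = 0
    while jj < n:
        # Poll the probe once per stripe and pick the tile size.
        t = tile_shared if contended_fn() else tile_single
        j_end = min(jj + t, n)
        kk = 0
        while kk < n:
            k_end = min(kk + t, n)
            # One tile of work: every (i, j, k) triple is still visited
            # exactly once, whatever tile sizes were chosen.
            for i in range(n):
                for k in range(kk, k_end):
                    aik = a[i][k]
                    for j in range(jj, j_end):
                        c[i][j] += aik * b[k][j]
            kk = k_end
        jj = j_end
    return c
```

Because each stripe starts where the previous one ended, changing the tile size between stripes alters only the traversal order, not the result. A real implementation would presumably base the probe on runtime measurements rather than a callback supplied by the caller.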