Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture

Authors:
Nan Zhang
Affiliations:
-
Venue:
Journal of Parallel and Distributed Computing
Year:
2011

Citing 16
Cited 0

Stride directed prefetching in scalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Evaluating performance of prefetching second level caches

ACM SIGMETRICS Performance Evaluation Review
Limitations of cache prefetching on a bus-based multiprocessor

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Evaluating stream buffers as a secondary cache replacement

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Evaluating the performance of multithreading and prefetching in multiprocessors

Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations

IEEE Transactions on Computers
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Summed-area tables for texture mapping

SIGGRAPH '84 Proceedings of the 11th annual conference on Computer graphics and interactive techniques
Efficient emulation of hardware prefetchers via event-driven helper threading

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A PAB-based multi-prefetcher mechanism

International Journal of Parallel Programming
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Data Prefetching and Data Forwarding in Shared Memory Multiprocessors

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 02
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Validity of the single processor approach to achieving large scale computing capabilities

AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
Coordinated control of multiple prefetchers in multi-core systems

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes to this problem. This paper presents a study that shows the L2 DPL (data prefetch logic) in processors based on Intel Core microarchitecture can be a cause to this problem as well. The study through a case of an image integration finds the nonscaling problem on the parallel integration of images whose size exceeds the capacity of the processor's L2 cache. Through an analysis on relevant performance events using Intel VTune(TM)Performance Analyser the L2 DPL prefetch is found less effective over the parallel integration in prefetching needed data than over the serial ones. To resolve the problem a novel parallel image reverse loading is developed with the purpose of reducing the number of memory accesses over the parallel integration and the associated delay. Experimental results demonstrate that the parallel integration after the parallel reverse loading shows significant speedup against the same parallel integration but after serial loading.