Profiler and compiler assisted adaptive I/O prefetching for shared storage caches

Authors:
Seung Woo Son;Sai Prashanth Muralidhara;Ozcan Ozturk;Mahmut Kandemir;Ibrahim Kolcu;Mustafa Karakoy
Affiliations:
Pennsylvania State University, University Park, PA, USA;Pennsylvania State University, University Park, PA, USA;Bilkent University, Ankara, Turkey;Pennsylvania State University, University Park, PA, USA;University of Manchester, Manchester, United Kingdom;Imperial College, London, United Kingdom
Venue:
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques
Year:
2008

Abstract

I/O prefetching has long been used to hide large disk latencies. In parallel applications, however, it becomes problematic when multiple CPUs share the same set of disks, because prefetches issued by different CPUs can interact in complex and unpredictable ways in the shared memory caches at the I/O nodes. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching, originally developed for sequential execution, on shared caches at I/O nodes, and show experimentally that, while I/O prefetching brings benefits, its effectiveness drops significantly as the number of CPUs increases; (ii) identify inter-CPU misses caused by harmful prefetches as one of the main sources of this drop; and (iii) propose and experimentally evaluate a profiler- and compiler-assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme uses profiling to obtain inter-thread data sharing information and, based on the captured sharing patterns, divides the threads into clusters and assigns a separate, customized I/O prefetcher thread to each cluster. The compiler generates the I/O prefetching threads automatically. We implemented the scheme using a compiler and the PVFS file system running on Linux, and the empirical data clearly underline the importance of adapting I/O prefetching to program phases. Specifically, with 8 CPUs, the proposed scheme improves performance on average by 19.9% over no I/O prefetching, by 11.9% over independent I/O prefetching (each CPU performs compiler-directed I/O prefetching independently), and by 10.3% over one-CPU prefetching (one CPU is reserved for prefetching on behalf of the others).
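
The clustering step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes the profile is a map from thread id to the set of disk blocks that thread accessed, measures sharing with Jaccard overlap, and merges threads whose overlap reaches a hypothetical `threshold`. The names `sharing_ratio`, `cluster_threads`, and `prefetch_plan` are invented for illustration.

```python
from itertools import combinations


def sharing_ratio(blocks_a, blocks_b):
    """Jaccard overlap between the block sets two threads touched (assumed metric)."""
    union = blocks_a | blocks_b
    if not union:
        return 0.0
    return len(blocks_a & blocks_b) / len(union)


def cluster_threads(profile, threshold=0.3):
    """Group threads whose profiled block sets overlap by at least `threshold`.

    `profile` maps a thread id to the set of disk block ids it accessed.
    Returns a sorted list of sorted thread-id lists, one per cluster.
    """
    parent = {tid: tid for tid in profile}

    def find(tid):
        # Union-find lookup with path halving.
        while parent[tid] != tid:
            parent[tid] = parent[parent[tid]]
            tid = parent[tid]
        return tid

    # Merge every pair of threads that shares enough data.
    for a, b in combinations(sorted(profile), 2):
        if sharing_ratio(profile[a], profile[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for tid in sorted(profile):
        clusters.setdefault(find(tid), []).append(tid)
    return sorted(clusters.values())


def prefetch_plan(profile, clusters):
    """One prefetcher per cluster: the union of blocks its threads need."""
    return [sorted(set().union(*(profile[t] for t in c))) for c in clusters]


# Hypothetical profile: threads 0 and 1 share blocks, as do threads 2 and 3.
profile = {
    0: {1, 2, 3, 4},
    1: {3, 4, 5, 6},
    2: {100, 101, 102},
    3: {101, 102, 103},
}
clusters = cluster_threads(profile)
plan = prefetch_plan(profile, clusters)
print(clusters)
print(plan)
```

In this toy setup each cluster's block list stands in for the work of one dedicated prefetcher thread, so blocks shared within a cluster are fetched once rather than once per CPU. A real system would also need to account for program phases and cache capacity, which this sketch ignores.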