Computation mapping for multi-level storage cache hierarchies

  • Authors:
  • Mahmut Kandemir (Pennsylvania State University)
  • Sai Prashanth Muralidhara (Pennsylvania State University)
  • Mustafa Karakoy (Imperial College)
  • Seung Woo Son (Argonne National Laboratory)

  • Venue:
  • Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

  • Year:
  • 2010

Abstract

Improving I/O performance is an important issue for many data-intensive, large-scale parallel applications. Although storage caches are used to improve the I/O latencies of parallel applications, most of the prior work has focused on the management and partitioning of cache space. In particular, the compiler's role in taking advantage of multi-level storage caches has been largely unexplored. The main contribution of this paper is a shared-storage-cache-aware loop iteration distribution (iteration-to-processor mapping) scheme for I/O-intensive applications that manipulate disk-resident data sets. The proposed scheme is compiler-directed and can be tuned to target any multi-level storage cache hierarchy. At the core of our scheme lies an iterative strategy that clusters loop iterations based on the underlying storage cache hierarchy and on the way the different storage caches in the hierarchy are shared by different processors. We tested this mapping scheme using a set of eight I/O-intensive application programs. The results collected so far are promising. Our proposed scheme improves the I/O performance of the tested applications by 26.3% on average, and this improvement leads to an average 18.9% reduction in the overall execution latencies of these applications. Moreover, our scheme performs significantly better than a state-of-the-art (but storage-cache-hierarchy-agnostic) data locality optimization scheme. We also present an enhancement to our baseline implementation that performs local scheduling once the loop iterations have been distributed. We observe that applying this enhancement improves I/O latency and total execution time further, by 30.7% and 21.9%, respectively.
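
The abstract describes the clustering only at a high level. The Python sketch below is a minimal illustration of the general idea, not the authors' compiler algorithm: it assumes a hypothetical setting in which each shared storage cache serves a fixed group of processors, summarizes each loop iteration by the set of disk blocks it accesses, and greedily maps iterations with overlapping block footprints to the same cache-sharing group (with a simple load cap for balance). The function map_iterations and its whole interface are invented for this example.

import math

def map_iterations(iteration_blocks, cache_groups):
    # iteration_blocks: dict {iteration_id: set of disk-block ids it accesses}
    # cache_groups: one entry per shared storage cache, listing the processors
    #               that share it (a hypothetical model, not the paper's input)
    # Returns {iteration_id: index of the cache-sharing group it is mapped to}.
    cap = math.ceil(len(iteration_blocks) / len(cache_groups))  # load balance
    footprint = [set() for _ in cache_groups]  # blocks assumed cached per group
    load = [0] * len(cache_groups)             # iterations assigned per group
    mapping = {}
    # Visit iterations with the largest footprints first, so that heavy
    # sharers seed the per-cache footprints.
    for it in sorted(iteration_blocks, key=lambda i: -len(iteration_blocks[i])):
        blocks = iteration_blocks[it]
        candidates = [g for g in range(len(cache_groups)) if load[g] < cap]
        # Prefer the group whose cached footprint overlaps this iteration
        # most, breaking ties toward the least-loaded group.
        best = max(candidates,
                   key=lambda g: (len(blocks & footprint[g]), -load[g]))
        mapping[it] = best
        footprint[best] |= blocks
        load[best] += 1
    return mapping

# Toy example: four iterations over three disk blocks, two storage caches,
# each shared by two processors.
iters = {0: {"b0", "b1"}, 1: {"b1"}, 2: {"b2"}, 3: {"b2", "b0"}}
groups = [["p0", "p1"], ["p2", "p3"]]
print(map_iterations(iters, groups))  # -> {0: 0, 3: 0, 1: 1, 2: 1}

In the paper's actual scheme the clustering is applied iteratively across the levels of the storage cache hierarchy; a fuller version of this sketch would recurse, reclustering each group's iterations over the next cache level down.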