An automatic object inlining optimization and its evaluation

  • Authors: Julian Dolby; Andrew Chien

  • Affiliations: IBM T. J. Watson Research Center, Yorktown Heights, NY; Department of Computer Science and Engineering, University of California, San Diego

  • Venue: PLDI '00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation
  • Year: 2000

Abstract

Automatic object inlining [19, 20] transforms heap data structures by fusing parent and child objects together. It can improve runtime by reducing object allocation and pointer dereference costs. We report continuing work studying object inlining optimizations. In particular, we present a new semantic derivation of the correctness conditions for object inlining, and a program analysis that extends our previous work. We also present an object inlining transformation, focusing on a new algorithm that optimizes class field layout to minimize code expansion. Finally, we detail a fuller evaluation on eleven programs and libraries (including Xpdf, the 25,000-line Portable Document Format (PDF) file browser) that utilizes hardware measures of impact on the memory system. We show that our analysis scales effectively to large programs, finding many inlinable fields (45 in Xpdf) at acceptable cost, and we show that, on some programs, it finds nearly all fields for which object inlining is correct, averaging 40% of such fields across our benchmarks. We implement our analyses in an advanced analysis infrastructure, and we show that, compared to traditional 1-CFA, that infrastructure provides better results at lower and more scalable cost. Across all programs, the analysis identified about 30% of objects as inlinable on average. Our transformation increases code size by only 20% while inlining this 30% of fields. Inlining these objects eliminated on average 28% of field reads, 58% of object creations, and 12% of all loads. Further, the optimized programs have significantly improved memory reference behavior, producing 25% fewer L1 data cache misses and 25% fewer read stalls. On average, runtime improved by 14%.