Automatic parallelization of fine-grained metafunctions on a chip multiprocessor

  • Authors:
  • Sanghoon Lee; James Tuck

  • Affiliations:
  • North Carolina State University, Department of Electrical and Computer Engineering, Raleigh, NC (both authors)

  • Venue:
  • ACM Transactions on Architecture and Code Optimization (TACO)
  • Year:
  • 2013

Abstract

Given the importance of reliability and security, prior studies have proposed inlining metafunctions into applications to detect bugs and security vulnerabilities. However, because these software techniques add frequent, fine-grained instrumentation to programs, they often incur large runtime overheads. In this work, we consider an automatic thread-extraction technique that removes these fine-grained checks from the main application and schedules them on helper threads. In this way, we can leverage the resources available on a CMP to reduce the latency and overhead of fine-grained checking code. Our parallelization strategy extracts metafunctions from a single-threaded application and executes them in customized helper threads, which are constructed to mirror relevant fragments of the main program's behavior in order to keep communication and overhead low. To achieve good performance, we apply optimizations that reduce communication and balance work among many threads. We evaluate our parallelization strategy on Mudflap, a pointer-use checking tool in GCC, and compare it to a manually parallelized version of Mudflap, running our experiments on an architectural simulator with support for fast queueing operations. On a subset of SPECint 2000, our automatically parallelized code with static load balancing is only 19% slower, on average, than the manually parallelized version on a simulated eight-core system, and with dynamic load balancing it is competitive, on average, with the manually parallelized version. Furthermore, all the applications except parser achieve better speedups with our automatic algorithms than with the manual approach. Our approach also introduces very little overhead in the main program: it stays under 100%, more than a 5.3× reduction compared to serial Mudflap.
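
As a rough illustration of the decoupled-checking idea the abstract describes (a sketch only; the paper's actual system uses compiler-generated helper threads and architecturally supported fast queues, and every name below is hypothetical), the following C++ fragment offloads a Mudflap-style pointer-use check from the main thread to a helper thread through a software queue:

    #include <condition_variable>
    #include <cstddef>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Hypothetical check record: the main thread ships the arguments of a
    // fine-grained metafunction (here, a pointer-use check) to a helper.
    struct CheckRequest {
        const void* ptr;    // pointer being accessed
        std::size_t size;   // bytes accessed
    };

    // Simple lock-based stand-in for the fast hardware queue assumed
    // in the paper's simulated CMP.
    class CheckQueue {
        std::queue<CheckRequest> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    public:
        void push(CheckRequest r) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(r); }
            cv_.notify_one();
        }
        bool pop(CheckRequest& r) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty() || done_; });
            if (q_.empty()) return false;       // closed and drained
            r = q_.front(); q_.pop();
            return true;
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
        }
    };

    // Stand-in for a Mudflap-style metafunction; a real check would
    // consult per-object metadata rather than test for null.
    void check_pointer(const void* ptr, std::size_t size) {
        if (ptr == nullptr)
            std::fprintf(stderr, "bad access of %zu bytes\n", size);
    }

    int main() {
        CheckQueue queue;
        // Helper thread: drains check requests asynchronously, keeping
        // the checking work off the main thread's critical path.
        std::thread helper([&] {
            CheckRequest r;
            while (queue.pop(r)) check_pointer(r.ptr, r.size);
        });

        int data[64] = {0};
        for (int i = 0; i < 64; ++i) {
            queue.push({&data[i], sizeof data[i]}); // enqueue check, don't wait
            data[i] = i;                            // main work proceeds
        }
        queue.close();
        helper.join();
        return 0;
    }

In this sketch the main thread only pays the cost of an enqueue per check; the point of the paper's fast queueing support is to make that per-check cost small enough that even very fine-grained instrumentation stays cheap on the main thread.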