OpenMP: An Industry-Standard API for Shared-Memory Programming
IEEE Computational Science & Engineering
(R) Polynomial - Time Nested Loop Fusion with Full Parallelism
ICPP '96 Proceedings of the Proceedings of the 1996 International Conference on Parallel Processing - Volume 3
Interprocedural dependence analysis and parallelization
ACM SIGPLAN Notices - Best of PLDI 1979-1999
Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
International Journal of High Performance Computing Applications
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic parallelization for graphics processing units
PPPJ '09 Proceedings of the 7th International Conference on Principles and Practice of Programming in Java
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
A domain-specific approach to heterogeneous parallelism
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-GPU systems
Proceedings of the 4th International Workshop on Multicore Software Engineering
Automatic performance optimization in ViennaCL for GPUs
Proceedings of the 9th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Automatic C-to-CUDA code generation for affine programs
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Hi-index | 0.00 |
Programming language support for multi-core architectures introduces a fundamentally new mechanism for modularity---a kernel. Though it can be used as a means to separate concerns, a kernel is given a clean slate of memory at execution time. As a consequence, application developers attempting to leverage libraries of kernels often incur substantial unanticipated performance penalties. Currently, the only recourse is to compromise modularity for the sake of optimizing data flow on an application-specific basis. KFusion is our prototype tool for optimizing libraries of kernels according to application-specific needs. Our goal is to shield application developers from loop fusion and deforestation in compositions of low level kernels that share data. Libraries, augmented by domain experts with annotations to ensure correct compositions of kernels, provide application developers with the opportunity to supply hints according to customized data flow needs---keeping modularity intact. In the worst case, an inaccurate hint incurs no penalty. Case studies of applications using general-purpose libraries for linear algebra, image manipulation and physics engines show that KFusion can substantially improve performance associated memory bandwidth bottlenecks.