Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems (TOPLAS)
A global approach to detection of parallelism
A global approach to detection of parallelism
Automatic decomposition of scientific programs for parallel execution
POPL '87 Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
An extended set of FORTRAN basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Data dependence and its application to parallel processing
International Journal of Parallel Programming
Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Selected papers of the second workshop on Languages and compilers for parallel computing
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
A static performance estimator in the Fortran D programming system
Languages, compilers and run-time environments for distributed memory machines
Automatic and interactive parallelization
Automatic and interactive parallelization
Supercomputer performance evaluation and the Perfect Benchmarks
ICS '90 Proceedings of the 4th international conference on Supercomputing
The parallel execution of DO loops
Communications of the ACM
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Optimizing Supercompilers for Supercomputers
Optimizing Supercompilers for Supercomputers
Dependence graphs and compiler optimizations
POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Improving the performance of virtual memory computers.
Improving the performance of virtual memory computers.
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Evaluation of compiler optimizations for Fortran D on MIMD distributed memory machines
ICS '92 Proceedings of the 6th international conference on Supercomputing
Global optimizations for parallelism and locality on scalable parallel machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Managing pages in shared virtual memory systems: getting the compiler into the game
ICS '93 Proceedings of the 7th international conference on Supercomputing
A static parameter based performance prediction tool for parallel programs
ICS '93 Proceedings of the 7th international conference on Supercomputing
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Unified compilation of Fortran 77D and 90D
ACM Letters on Programming Languages and Systems (LOPLAS)
Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors
ICS '94 Proceedings of the 8th international conference on Supercomputing
Using virtual lines to enhance locality exploitation
ICS '94 Proceedings of the 8th international conference on Supercomputing
Compiler optimizations for improving data locality
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Unified compilation techniques for shared and distributed address space machines
ICS '95 Proceedings of the 9th international conference on Supercomputing
A compiler optimization to reduce execution time of loop nest
ACM SIGARCH Computer Architecture News
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
The influence of caches on the performance of heaps
Journal of Experimental Algorithmics (JEA)
Determining the idle time of a tiling
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Maximizing parallelism and minimizing synchronization with affine transforms
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The implementation and evaluation of fusion and contraction in array languages
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
A general algorithm for tiling the register level
ICS '98 Proceedings of the 12th international conference on Supercomputing
A Compiler Optimization Algorithm for Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
An affine partitioning algorithm to maximize parallelism and minimize communication
ICS '99 Proceedings of the 13th international conference on Supercomputing
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests
Proceedings of the 14th international conference on Supercomputing
Hoard: a scalable memory allocator for multithreaded applications
ACM SIGPLAN Notices
A compiler technique for improving whole-program locality
POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Loop optimization for a class of memory-constrained computations
ICS '01 Proceedings of the 15th international conference on Supercomputing
Reducing memory requirements of nested loops for embedded systems
Proceedings of the 38th annual Design Automation Conference
Hoard: a scalable memory allocator for multithreaded applications
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Blocking and array contraction across arbitrarily nested loops using affine partitioning
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Data Relation Vectors: A New Abstraction for Data Optimizations
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests
International Journal of Parallel Programming
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compile Time Barrier Synchronization Minimization
IEEE Transactions on Parallel and Distributed Systems
A Simulation Study of Decoupled Vector Architectures
The Journal of Supercomputing
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
Effects of Loop Fusion and Statement Migration on the Speedup of Vector Multiprocessors
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
On the Parallel Execution Time of Tiled Loops
IEEE Transactions on Parallel and Distributed Systems
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework
IEEE Transactions on Parallel and Distributed Systems
Decoupled vector architectures
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
IEEE Transactions on Parallel and Distributed Systems
Transforming Complex Loop Nests for Locality
The Journal of Supercomputing
Scheduling Reusable Instructions for Power Reduction
Proceedings of the conference on Design, automation and test in Europe - Volume 1
Quasidynamic Layout Optimizations for Improving Data Locality
IEEE Transactions on Parallel and Distributed Systems
Line Size Adaptivity Analysis of Parameterized Loop Nests for Direct Mapped Data Cache
IEEE Transactions on Computers
Maximizing data reuse for minimizing memory space requirements and execution cycles
ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Multi-compilation: capturing interactions among concurrently-executing applications
Proceedings of the 3rd conference on Computing frontiers
An extended OpenMP targeting on the hybrid architecture of SMP-cluster
ACST'06 Proceedings of the 2nd IASTED international conference on Advances in computer science and technology
Single-dimension software pipelining for multidimensional loops
ACM Transactions on Architecture and Code Optimization (TACO)
$P$^$3$$T+$: A performance estimator for distributed and parallel programs
Scientific Programming
Guidance of Loop Ordering for Reduced Memory Usage in Signal Processing Applications
Journal of Signal Processing Systems
Compiler directed parallelization of loops in scale for shared-memory multiprocessors
ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
On the interaction of tiling and automatic parallelization
IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
A programming language interface to describe transformations and code generation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Loop Distribution and Fusion with Timing and Code Size Optimization
Journal of Signal Processing Systems
An ILP-Based approach to locality optimization
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Hi-index | 0.00 |
Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the trade-offs between effectively utilizing parallelism and memory hierarchy on shared-memory multiprocessors. We present a simple, but surprisingly accurate, memory model to determine cache line reuse from both multiple accesses to the same memory location and from consecutive memory access. The model is used in memory optimizing and loop parallelization algorithms that effectively exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results.