It is well known that applying a compiler optimization to a large scope of code (e.g., an entire procedure or function) can bring larger benefits than applying it to a smaller scope (e.g., a nested loop), but code analysis and optimization at larger scopes are also more difficult to manage. Today, the largest scope for a compiler optimization is an entire program source. However, as embedded chip multiprocessor architectures find their way into commercial products, it is becoming important to consider the scenario of multiple applications executing on the same chip multiprocessor. This paper explores a novel technique called multi-compilation, in which multiple applications that are expected to execute simultaneously on the same CMP (chip multiprocessor) are compiled together. A key benefit of this approach is that it captures the interactions among applications due to data sharing. While one can conceive of many optimizations that work in an inter-application fashion by exploiting data sharing across applications, this paper restricts itself to data layout optimization, i.e., the problem of determining the most suitable memory layout for array data. To demonstrate the impact of our contribution, we implemented our approach and performed a simulation-based study with several embedded applications. Our experimental results show that, by selecting the memory layouts of data arrays while considering multiple applications at the same time, we can reduce cache misses by 18.7% and execution cycles by 13.1% on average.