Compile-Time Program Restructuring in Multiprogrammed Virtual Memory Systems
IEEE Transactions on Software Engineering
Program optimization for instruction caches
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Achieving high instruction cache performance with an optimizing compiler
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Profile guided code positioning
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Characterization of alpha AXP performance using TP and SPEC workloads
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The impact of architectural trends on operating system performance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Studies of Windows NT performance using dynamic execution traces
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Using the SimOS machine simulator to study complex computer systems
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Efficient procedure mapping using cache line coloring
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Continuous profiling: where have all the cycles gone?
Proceedings of the sixteenth ACM symposium on Operating systems principles
Procedure placement using temporal ordering information
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads
Proceedings of the 25th annual international symposium on Computer architecture
An analysis of database workload performance on simultaneous multithreaded processors
Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Optimizing alpha executables on Windows NT with spike
Digital Technical Journal
ICS '99 Proceedings of the 13th international conference on Supercomputing
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Improving locality by critical working sets
Communications of the ACM
Optimizing instruction cache performance for operating system intensive workloads
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Temporal-Based Procedure Reordering for Improved Instruction Cache Performance
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Optimization of Instruction Fetch for Decision Support Workloads
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing
Profile-directed restructuring of operating system code
IBM Systems Journal
In-memory Parallelism for Database Workloads
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Effective ahead pipelining of instruction block address generation
Proceedings of the 30th annual international symposium on Computer architecture
Buffering databse operations for enhanced instruction cache performance
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A low-complexity fetch architecture for high-performance superscalar processors
ACM Transactions on Architecture and Code Optimization (TACO)
A case for shared instruction cache on chip multiprocessors running OLTP
MEDEA '03 Proceedings of the 2003 workshop on MEmory performance: DEaling with Applications , systems and architecture
IEEE Transactions on Computers
Compiler Optimizations for Transaction Processing Workloads on Itanium® Linux Systems
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Performance of Runtime Optimization on BLAST
Proceedings of the international symposium on Code generation and optimization
Large scale Itanium® 2 processor OLTP workload characterization and optimization
DaMoN '06 Proceedings of the 2nd international workshop on Data management on new hardware
Improving instruction cache performance in OLTP
ACM Transactions on Database Systems (TODS)
Computation spreading: employing hardware migration to specialize CMP cores on-the-fly
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Steps towards cache-resident transaction processing
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Temporal instruction fetch streaming
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Algorithms for memory hierarchies: advanced lectures
Algorithms for memory hierarchies: advanced lectures
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SINOF: A dynamic-static combined framework for dynamic binary translation
Journal of Systems Architecture: the EUROMICRO Journal
OLTP in wonderland: where do cache misses come from in major OLTP components?
Proceedings of the Ninth International Workshop on Data Management on New Hardware
RDIP: return-address-stack directed instruction prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads.This paper provides a detailed study of profile-driven compiler optimizations to improve the code layout in commercial workloads with large instruction footprints. Our compiler algorithms are implemented in the context of Spike, an executable optimizer for the Alpha architecture. Our experiments use the Oracle commercial database engine running an OLTP workload, with results generated using both full system simulations and actual runs on Alpha multiprocessors. Our results show that code layout optimizations can provide a major improvement in the instruction cache behavior, providing a 55% to 65% reduction in the application misses for 64-128K caches. Our analysis shows that this improvement primarily arises from longer sequences of consecutively executed instructions and more reuse of cache lines before they are replaced. We also show that the majority of application instruction misses are caused by self-interference. However, code layout optimizations significantly reduce the amount of self-interference, thus elevating the relative importance of interference with operating system code. Finally, we show that better code layout can also provide substantial improvements in the behavior of other memory system components such as the instruction TLB and the unified second-level cache. The overall performance impact of our code layout optimizations is an improvement of 1.33 times in the execution time of our workload.