Code layout optimizations for transaction processing workloads

Authors:
Alex Ramirez; Luiz André Barroso; Kourosh Gharachorloo; Robert Cohn; Josep Larriba-Pey; P. Geoffrey Lowney; Mateo Valero
Affiliations:
Computer Architecture Department, Universitat Politecnica de Catalunya; Western Research Laboratory, Compaq Computer Corporation; Western Research Laboratory, Compaq Computer Corporation; Alpha Development Group, Compaq Computer Corporation; Computer Architecture Department, Universitat Politecnica de Catalunya; Alpha Development Group, Compaq Computer Corporation; Computer Architecture Department, Universitat Politecnica de Catalunya
Venue:
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Year:
2001

Abstract

Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads pose a challenging set of requirements for system designs, since they often execute inefficiently, with execution time dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations on the behavior of such workloads.

This paper provides a detailed study of profile-driven compiler optimizations that improve the code layout of commercial workloads with large instruction footprints. Our compiler algorithms are implemented in Spike, an executable optimizer for the Alpha architecture. Our experiments use the Oracle commercial database engine running an OLTP workload, with results generated using both full system simulations and actual runs on Alpha multiprocessors. Our results show that code layout optimizations can provide a major improvement in instruction cache behavior, reducing application misses by 55% to 65% for 64-128 KB caches. Our analysis shows that this improvement arises primarily from longer sequences of consecutively executed instructions and more reuse of cache lines before they are replaced. We also show that the majority of application instruction misses are caused by self-interference. However, code layout optimizations significantly reduce the amount of self-interference, thus elevating the relative importance of interference with operating system code.

Finally, we show that better code layout can also provide substantial improvements in the behavior of other memory system components such as the instruction TLB and the unified second-level cache. The overall performance impact of our code layout optimizations is a 1.33x improvement in the execution time of our workload.
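
To illustrate how a profile-driven layout can lengthen sequences of consecutively executed instructions, the following is a minimal sketch of greedy basic block chaining. It is not the algorithm implemented in Spike, which the abstract does not describe; it is a generic heuristic of the kind commonly used in profile-guided code positioning. The function name `chain_blocks`, the block names, and the edge counts are all hypothetical.

```python
from collections import defaultdict


def chain_blocks(edges, entry):
    """Greedy chaining of basic blocks by profiled edge count (illustrative).

    edges: dict mapping (src, dst) -> execution count from a profile run.
    entry: name of the entry block, which is always placed first.
    Returns the block names in layout order.
    """
    blocks = {entry}
    for src, dst in edges:
        blocks.update((src, dst))

    # Every block starts as its own single-element chain.
    chain_of = {b: [b] for b in blocks}

    # Visit edges from hottest to coldest; ties are broken by name.
    for (src, dst), _count in sorted(edges.items(),
                                     key=lambda kv: (-kv[1], kv[0])):
        a, b = chain_of[src], chain_of[dst]
        # Merge only when src ends one chain and dst starts another,
        # so the hot edge becomes a fall-through in the final layout.
        if a is not b and a[-1] == src and b[0] == dst and dst != entry:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a

    # Collect the distinct chains that remain.
    seen, chains = set(), []
    for blk in sorted(blocks):
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)

    # Heat of a block = total profiled count of edges entering it.
    heat = defaultdict(int)
    for (_src, dst), count in edges.items():
        heat[dst] += count

    # Entry chain first, then hotter chains before colder ones.
    chains.sort(key=lambda c: (c[0] != entry,
                               -sum(heat[x] for x in c),
                               c[0]))
    return [blk for chain in chains for blk in chain]


# Hypothetical profile: the error path is rarely taken.
profile = {
    ("entry", "hot"): 95,
    ("entry", "err"): 5,
    ("hot", "exit"): 95,
    ("err", "exit"): 5,
}
layout = chain_blocks(profile, "entry")
print(layout)  # ['entry', 'hot', 'exit', 'err']
```

In this toy profile the frequently executed path becomes one contiguous run and the rarely executed block is moved to the end, so the hot path occupies fewer cache lines and falls through without taken branches.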