Improving instruction cache performance in OLTP

Authors:
Stavros Harizopoulos;Anastassia Ailamaki
Affiliations:
MIT CSAIL, Cambridge, MA;Carnegie Mellon University, Pittsburgh, PA
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2006

Citing 26
Cited 10

ATOM: a system for building customized program analysis tools

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Shoring up persistent applications

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Contrasting characteristics and cache performance of technical and multi-user commercial workloads

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The impact of architectural trends on operating system performance

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Eraser: a dynamic data race detector for multithreaded programs

ACM Transactions on Computer Systems (TOCS)
Memory system characterization of commercial workloads

Proceedings of the 25th annual international symposium on Computer architecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads

Proceedings of the 25th annual international symposium on Computer architecture
An analysis of database workload performance on simultaneous multithreaded processors

Proceedings of the 25th annual international symposium on Computer architecture
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
Performance of database workloads on shared-memory systems with out-of-order processors

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Code layout optimizations for transaction processing workloads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Making Pointer-Based Data Structures Cache Conscious

Computer
Simics: A Full System Simulation Platform

Computer
B-Tree Indexes and CPU Caches

Proceedings of the 17th International Conference on Data Engineering
Block Oriented Processing of Relational Database Operations in Modern Computer Architectures

Proceedings of the 17th International Conference on Data Engineering
DBMSs on a Modern Processor: Where Does Time Go?

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Weaving Relations for Cache Performance

Proceedings of the 27th International Conference on Very Large Data Bases
Cache Conscious Algorithms for Relational Query Processing

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Instruction prefetching using branch prediction information

ICCD '97 Proceedings of the 1997 International Conference on Computer Design (ICCD '97)
Call graph prefetching for database applications

ACM Transactions on Computer Systems (TOCS)
Buffering databse operations for enhanced instruction cache performance

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture

ACM SIGMETRICS Performance Evaluation Review - Special issue on tools for computer architecture research
DBmbench: fast and accurate database workload representation on modern microarchitecture

CASCON '05 Proceedings of the 2005 conference of the Centre for Advanced Studies on Collaborative research
Instrumentation and optimization of Win32/intel executables using Etch

NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997
Steps towards cache-resident transaction processing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

A general framework for improving query processing performance on multi-level memory hierarchies

DaMoN '07 Proceedings of the 3rd international workshop on Data management on new hardware
Architectural characterization of XQuery workloads on modern processors

DaMoN '07 Proceedings of the 3rd international workshop on Data management on new hardware
Cache-oblivious databases: Limitations and opportunities

ACM Transactions on Database Systems (TODS)
Cache-conscious buffering for database operators with state

Proceedings of the Fifth International Workshop on Data Management on New Hardware
Designing fast architecture-sensitive tree search on modern multicore/many-core processors

ACM Transactions on Database Systems (TODS)
Optimization of query processing with cache conscious buffering operator

DNIS'10 Proceedings of the 6th international conference on Databases in Networked Information Systems
Extrinsic and intrinsic text cloning

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Reducing OLTP instruction misses with thread migration

DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution

Proceedings of the 40th Annual International Symposium on Computer Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Instruction-cache misses account for up to 40% of execution time in online transaction processing (OLTP) database workloads. In contrast to data cache misses, instruction misses cannot be overlapped with out-of-order execution. Chip design limitations do not allow increases in the size or associativity of instruction caches that would help reduce misses. On the contrary, the effective instruction cache size is expected to further decrease with the adoption of multicore and multithreading chip designs (multiple on-chip processor cores and multiple simultaneous threads per core). Different concurrent database threads, however, execute similar instruction sequences over their lifetime, too long to be captured and exploited in hardware. The challenge, from a software designer's point of view, is to identify and exploit common code paths across threads executing arbitrary operations, thereby eliminating extraneous instruction misses.In this article, we describe Synchronized Threads through Explicit Processor Scheduling (STEPS), a methodology and tool to increase instruction locality in database servers executing transaction processing workloads. STEPS works at two levels to increase reusability of instructions brought in the cache. At a higher level, synchronization barriers form teams of threads that execute the same system component. Within a team, STEPS schedules special fast context-switches at very fine granularity to reuse sets of instructions across team members. To find points in the code where context-switches should occur, we develop autoSTEPS, a code profiling tool that runs directly on the DBMS binary. STEPS can minimize both capacity and conflict instruction cache misses for arbitrarily long code paths.We demonstrate the effectiveness of our approach on Shore, a research prototype database system shown to be governed by similar bottlenecks as commercial systems. Using microbenchmarks on real and simulated processors, we observe that STEPS eliminates up to 96% of instruction-cache misses for each additional team thread and at the same time eliminates up to 64% of mispredicted branches by providing a repetitive execution pattern to the processor. When performing a full-system evaluation on real hardware using TPC-C, the industry-standard transactional benchmark, STEPS eliminates two-thirds of instruction-cache misses and provides up to 1.4 overall speedup.