Instruction-cache misses account for up to 40% of execution time in online transaction processing (OLTP) database workloads. In contrast to data-cache misses, instruction misses cannot be overlapped with out-of-order execution. Chip design limitations do not allow the increases in instruction-cache size or associativity that would help reduce misses. On the contrary, the effective instruction-cache size is expected to decrease further with the adoption of multicore and multithreading chip designs (multiple on-chip processor cores and multiple simultaneous threads per core). Different concurrent database threads, however, execute similar instruction sequences over their lifetime, sequences that are too long to be captured and exploited in hardware. The challenge, from a software designer's point of view, is to identify and exploit common code paths across threads executing arbitrary operations, thereby eliminating extraneous instruction misses.

In this article, we describe Synchronized Threads through Explicit Processor Scheduling (STEPS), a methodology and tool to increase instruction locality in database servers executing transaction processing workloads. STEPS works at two levels to increase the reuse of instructions brought into the cache. At the higher level, synchronization barriers form teams of threads that execute the same system component. Within a team, STEPS schedules special fast context switches at very fine granularity to reuse sets of instructions across team members. To find the points in the code where context switches should occur, we develop autoSTEPS, a code profiling tool that runs directly on the DBMS binary. STEPS can minimize both capacity and conflict instruction-cache misses for arbitrarily long code paths.

We demonstrate the effectiveness of our approach on Shore, a research prototype database system shown to be governed by bottlenecks similar to those of commercial systems. Using microbenchmarks on real and simulated processors, we observe that STEPS eliminates up to 96% of instruction-cache misses for each additional team thread and, at the same time, eliminates up to 64% of mispredicted branches by providing a repetitive execution pattern to the processor. In a full-system evaluation on real hardware using TPC-C, the industry-standard transactional benchmark, STEPS eliminates two-thirds of instruction-cache misses and provides an overall speedup of up to 1.4x.
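
The intuition behind team scheduling can be illustrated with a toy model. The sketch below is not the STEPS implementation. It assumes a fully associative LRU instruction cache, a single code path shared by all threads, and context-switch points placed every `chunk` cache lines; every name in it (`LRUCache`, `run_to_completion`, `team_interleaved`) is hypothetical. It compares the miss count when each thread runs the whole path before the next one starts against the miss count when all team members execute one cache-sized chunk before anyone moves on.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU instruction cache (an assumption of this sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            # Hit: mark the line as most recently used.
            self.lines.move_to_end(line)
        else:
            # Miss: bring the line in and evict the least recently used one.
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)


def run_to_completion(path_len, threads, capacity):
    """Each thread executes the entire code path before the next thread runs."""
    cache = LRUCache(capacity)
    for _ in range(threads):
        for line in range(path_len):
            cache.access(line)
    return cache.misses


def team_interleaved(path_len, threads, capacity, chunk):
    """All team members execute one chunk of the path before any moves on."""
    cache = LRUCache(capacity)
    for start in range(0, path_len, chunk):
        end = min(start + chunk, path_len)
        for _ in range(threads):  # hypothetical context switch after each chunk
            for line in range(start, end):
                cache.access(line)
    return cache.misses


if __name__ == "__main__":
    # Hypothetical sizes: a 1000-line code path, a 256-line cache, 10 threads.
    print(run_to_completion(1000, 10, 256))
    print(team_interleaved(1000, 10, 256, 256))
```

In this model, a path longer than the cache makes every access miss under run-to-completion, while the interleaved schedule lets only the first team member miss on each chunk. Real instruction caches are set-associative and real threads diverge at branches, so the numbers here show the mechanism only and say nothing about measured results.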