Profile guided code positioning
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
AlphaSort: a RISC machine sort
SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Shoring up persistent applications
SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Characterization of alpha AXP performance using TP and SPEC workloads
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Commercial workload performance in the IBM POWER2 RISC System/6000 processor
IBM Journal of Research and Development
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Evaluation of multithreaded uniprocessors for commercial application environments
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Wrong-path instruction prefetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Efficient procedure mapping using cache line coloring
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
Procedure placement using temporal ordering information
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
An analysis of database workload performance on simultaneous multithreaded processors
Proceedings of the 25th annual international symposium on Computer architecture
A Performance Study of Instruction Cache Prefetching Methods
IEEE Transactions on Computers
The Asilomar report on database research
ACM SIGMOD Record
Fetch directed instruction prefetching
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Architectural and compiler support for effective instruction prefetching: a cooperative approach
ACM Transactions on Computer Systems (TOCS)
Data prefetching by dependence graph precomputation
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
DBMSs on a Modern Processor: Where Does Time Go?
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Benchmarking Database Systems A Systematic Approach
VLDB '83 Proceedings of the 9th International Conference on Very Large Data Bases
Cache Conscious Algorithms for Relational Query Processing
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Temporal-Based Procedure Reordering for Improved Instruction Cache Performance
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Instruction prefetching using branch prediction information
ICCD '97 Proceedings of the 1997 International Conference on Computer Design (ICCD '97)
Call Graph Prefetching for Database Applications
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Branch History Guided Instruction Prefetching
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Prefetch mechanisms that acquire and exploit application specific knowledge
Prefetch mechanisms that acquire and exploit application specific knowledge
IEEE Transactions on Computers
Instrumentation and optimization of Win32/intel executables using Etch
NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997
Improving instruction cache performance in OLTP
ACM Transactions on Database Systems (TODS)
Steps towards cache-resident transaction processing
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Temporal instruction fetch streaming
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Global-aware and multi-order context-based prefetching for high-performance processors
International Journal of High Performance Computing Applications
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
RDIP: return-address-stack directed instruction prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
With the continuing technological trend of ever cheaper and larger memory, most data sets in database servers will soon be able to reside in main memory. In this configuration, the performance bottleneck is likely to be the gap between the processing speed of the CPU and the memory access latency. Previous work has shown that database applications have large instruction and data footprints and hence do not use processor caches effectively. In this paper, we propose Call Graph Prefetching (CGP), an N instruction prefetching technique that analyzes the call graph of a database system and prefetches instructions from the function that is deemed likely to be called next. CGP capitalizes on the highly predictable function call sequences that are typical of database systems. CGP can be implemented either in software or in hardware. The software-based CGP (CGP_S) uses profile information to build a call graph, and uses the predictable call sequences in the call graph to determine which function to prefetch next. The hardware-based CGP(CGP_H) uses a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences of functions invoked during program execution, and uses that stored history when choosing which functions to prefetch.We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on CPU-2000 benchmarks. For most CPU-2000 applications the number of instruction cache (I-cache) misses were very few even without any prefetching, obviating the need for CGP. On the other hand, the database workloads do suffer a significant number of I-cache misses; CGP_S improves their performance by 23% and CGP_H by 26% over a baseline system that has already been highly tuned for efficient I-cache usage by using the OM tool. CGP, with or without OM, reduces the I-cache miss stall time by about 50% relative to O5+OM, taking us about half way from an already highly tuned baseline system toward perfect I-cache performance.