Memory system characterization of commercial workloads

Authors:
Luiz André Barroso;Kourosh Gharachorloo;Edouard Bugnion
Affiliations:
Western Research Laboratory, Digital Equipment Corporation;Western Research Laboratory, Digital Equipment Corporation;Western Research Laboratory, Digital Equipment Corporation
Venue:
Proceedings of the 25th annual international symposium on Computer architecture
Year:
1998

Abstract

Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever-increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. This problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers.

This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches.