Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems
IEEE Transactions on Computers
The cedar system and an initial performance study
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Execution-driven simulation of multiprocessors: address and timing analysis
ACM Transactions on Modeling and Computer Simulation (TOMACS)
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads
Proceedings of the 25th annual international symposium on Computer architecture
Accurate indirect branch prediction
Proceedings of the 25th annual international symposium on Computer architecture
The YAGS branch prediction scheme
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
HLS: combining statistical and symbolic simulation to guide microprocessor designs
Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Timestamp snooping: an approach for extending SMPs
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
MemorIES3: a programmable, real-time hardware emulation tool for multiprocessor server design
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
FLASH vs. (Simulated) FLASH: closing the simulation loop
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Full-system timing-first simulation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Measuring Experimental Error in Microprocessor Simulation
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Automatically characterizing large scale program behavior
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Starfire: Extending the SMP Envelope
IEEE Micro
A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches
IEEE Transactions on Computers
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
An Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
The Effects of Mispredicted-Path Execution on Branch Prediction Structures
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
A multithreaded PowerPC processor for commercial servers
IBM Journal of Research and Development
Run-time modeling and estimation of operating system power consumption
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Memory System Behavior of Java-Based Middleware
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Scaling and Charact rizing Database Workloads: Bridging the Gap between Research and Practice
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Memory Ordering: A Value-Based Approach
Proceedings of the 31st annual international symposium on Computer architecture
Adaptive Cache Compression for High-Performance Processors
Proceedings of the 31st annual international symposium on Computer architecture
A case for shared instruction cache on chip multiprocessors running OLTP
MEDEA '03 Proceedings of the 2003 workshop on MEmory performance: DEaling with Applications , systems and architecture
Coherence decoupling: making use of incoherence
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Memory Ordering: A Value-Based Approach
IEEE Micro
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Proceedings of the 32nd annual international symposium on Computer Architecture
Evaluating IA-32 web servers through simics: a practical experience
Journal of Systems Architecture: the EUROMICRO Journal
Maximizing CMP Throughput with Mediocre Cores
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Simulating Commercial Java Throughput Workloads: A Case Study
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
The RASE (Rapid, Accurate Simulation Environment) for chip multiprocessors
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Simulation of Computer Architectures: Simulators, Benchmarks, Methodologies, and Recommendations
IEEE Transactions on Computers
Hardware support for spin management in overcommitted virtual machines
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Computation spreading: employing hardware migration to specialize CMP cores on-the-fly
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Coherence Ordering for Ring-based Chip Multiprocessors
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Determining output uncertainty of computer system models
Performance Evaluation
Virtual hierarchies to support server consolidation
Proceedings of the 34th annual international symposium on Computer architecture
Performance pathologies in hardware transactional memory
Proceedings of the 34th annual international symposium on Computer architecture
MetaTM/TxLinux: transactional memory for an operating system
Proceedings of the 34th annual international symposium on Computer architecture
TxLinux: using and managing hardware transactional memory in an operating system
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Applying Statistical Sampling for Fast and Efficient Simulation of Commercial Workloads
IEEE Transactions on Computers
NOCS '08 Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Producing wrong data without doing anything obviously wrong!
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Phantom-BTB: a virtualized branch target buffer design
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Modeling transactional memory workload performance
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Reducing performance non-determinism via cache-aware page allocation strategies
Proceedings of the first joint WOSP/SIPEW international conference on Performance engineering
Flexible architectural support for fine-grain scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An analysis of on-chip interconnection networks for large-scale chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Timetraveler: exploiting acyclic races for optimizing memory race recording
Proceedings of the 37th annual international symposium on Computer architecture
Proximity coherence for chip multiprocessors
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SWEL: hardware cache coherence protocols to map shared data onto shared caches
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Netrace: dependency-driven trace-based network-on-chip simulation
Proceedings of the Third International Workshop on Network on Chip Architectures
Fractal Coherence: Scalably Verifiable Cache Coherence
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Transactional conflict decoupling and value prediction
Proceedings of the international conference on Supercomputing
Multiset signatures for transactional memory
Proceedings of the international conference on Supercomputing
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
Proceedings of the 38th annual international symposium on Computer architecture
Filtering directory lookups in CMPs with write-through caches
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Filtering directory lookups in CMPs
Microprocessors & Microsystems
Thread Tranquilizer: Dynamically reducing performance variation
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Switch-based packing technique to reduce traffic and latency in token coherence
Journal of Parallel and Distributed Computing
Trace-driven simulation of memory system scheduling in multithread application
Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Proceedings of the 26th ACM international conference on Supercomputing
Something old and something new: P-states can borrow microarchitecture techniques too
Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design
XPoint cache: scaling existing bus-based coherence protocols for 2D and 3D many-core systems
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
STABILIZER: statistically sound performance evaluation
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Wait-n-GoTM: improving HTM performance by serializing cyclic dependencies
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
MapReduce with communication overlap (MaRCO)
Journal of Parallel and Distributed Computing
A study of performance variations in the Mozilla Firefox web browser
ACSC '13 Proceedings of the Thirty-Sixth Australasian Computer Science Conference - Volume 135
Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
High-performance fractal coherence
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
PCantorSim: Accelerating parallel architecture simulation through fractal-based sampling
ACM Transactions on Architecture and Code Optimization (TACO)
The case of using multiple streams in streaming
International Journal of Automation and Computing
Removal of Conflicts in Hardware Transactional Memory Systems
International Journal of Parallel Programming
Hi-index | 0.01 |
Multi-threaded commercial workloads implement many important internet services. Consequently, these workloads are increasingly used to evaluate the performance of uniprocessor and multiprocessor system designs. This paper identifies performance variability as a potentially major challenge for architectural simulation studies using these workloads. Variability refers to the differences between multiple estimates of a work-load's performance. Time variability occurs when a workload exhibits different characteristics during different phases of a single run. Space variability occurs when small variations in timing cause runs starting from the same initial condition to follow widely different execution paths.Variability is a well-known phenomenon in real systems, but is nearly universally ignored in simulation experiments. In a central result of this paper, we show that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment). We propose a methodology, based on multiple simulations and standard statistical techniques, to compensate for variability. Our methodology greatly reduces the probability of reaching incorrect conclusions, while enabling simulations to finish within reasonable time limits.