Multi-execution: multicore caching for data-similar executions

Authors:
Susmit Biswas;Diana Franklin;Alan Savage;Ryan Dixon;Timothy Sherwood;Frederic T. Chong
Affiliations:
University of California, Santa Barbara, Santa Barbara, CA, USA;University of California, Santa Barbara, Santa Barbara, CA, USA;University of California, Santa Barbara, Santa Barbara, CA, USA;University of California, Santa Barbara, Santa Barbara, CA, USA;University of California, Santa Barbara, Santa Barbara, CA, USA;University of California, Santa Barbara, Santa Barbara, CA, USA
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009

Abstract

As microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify several practical scenarios that use this model of execution and term it "multi-execution." Often, the instances operate on very similar data. In conventional cache hierarchies, each instance caches its own data independently. We propose the Mergeable cache architecture, which detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This in turn reduces off-chip memory accesses and overall power usage and increases application performance. We present cycle-accurate simulation results for 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache achieves an average speedup of 2.5x (ranging from 0.93x to 6.92x), while incurring an overhead of only 4.28% in cache area and 5.21% in power when it is in use.
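The storage saving that comes from merging identical blocks can be illustrated with a toy software model. The abstract gives no implementation detail, so everything in this sketch is an assumption made for illustration: the class name `MergingStore`, the choice to key blocks by their full contents, and the un-merge-on-write behavior are hypothetical and do not describe the hardware design of the Mergeable cache.

```python
from dataclasses import dataclass, field


@dataclass
class MergingStore:
    """Toy block store (hypothetical): identical contents share one stored copy."""

    # content -> set of (instance, addr) pairs that currently share that copy
    blocks: dict = field(default_factory=dict)
    # (instance, addr) -> content currently visible at that location
    index: dict = field(default_factory=dict)

    def write(self, instance: int, addr: int, data: bytes) -> None:
        key = (instance, addr)
        old = self.index.get(key)
        if old is not None:
            # The writer stops sharing its previous contents.
            sharers = self.blocks[old]
            sharers.discard(key)
            if not sharers:
                del self.blocks[old]
        self.index[key] = data
        # Merge with an existing copy when the contents match, else store a new one.
        self.blocks.setdefault(data, set()).add(key)

    def read(self, instance: int, addr: int) -> bytes:
        return self.index[(instance, addr)]

    def logical_blocks(self) -> int:
        """Number of blocks as seen by the instances."""
        return len(self.index)

    def stored_blocks(self) -> int:
        """Number of distinct copies actually kept."""
        return len(self.blocks)


# Eight instances each write the same four 64-byte blocks.
store = MergingStore()
for instance in range(8):
    for addr in range(4):
        store.write(instance, addr, bytes([addr]) * 64)

# One instance diverges on one block; only that block gets a private copy.
store.write(3, 0, b"\xff" * 64)
print(store.logical_blocks(), store.stored_blocks())
```

In this model the gap between `logical_blocks()` and `stored_blocks()` stands in for the capacity that similar executions could free up. A real cache would also have to handle lookup latency, coherence, and replacement, none of which this sketch attempts.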