The Strength of Weak Learnability
Machine Learning
Functional Implementation Techniques for CPU Cache Memories
IEEE Transactions on Computers - Special issue on cache memory and related problems
Dual use of superscalar datapath for transient-fault detection and recovery
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
VPR: A new packing, placement and routing tool for FPGA research
FPL '97 Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications
Monte Carlo algorithms for stationary device simulations
Mathematics and Computers in Simulation - Special issue: 3rd IMACS seminar on Monte Carlo methods - MCM 2001
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Memory resource management in VMware ESX server
ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
Adaptive Cache Compression for High-Performance Processors
Proceedings of the 31st annual international symposium on Computer architecture
The STAMPede approach to thread-level speculation
ACM Transactions on Computer Systems (TOCS)
DRAMsim: a memory system simulator
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research
IEEE Computer Architecture Letters
Algorithms for parallel boosting
ICMLA '05 Proceedings of the Fourth International Conference on Machine Learning and Applications
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Valgrind: a framework for heavyweight dynamic binary instrumentation
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Revisiting the Sequential Programming Model for Multi-Core
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the conference on Design, automation and test in Europe
PSMalloc: content based memory management for MPI applications
Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
Orthrus: efficient software integrity protection on multi-cores
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Virtually cool ternary content addressable memory
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
CATCH: A mechanism for dynamically detecting cache-content-duplication in instruction caches
ACM Transactions on Architecture and Code Optimization (TACO)
HICAMP: architectural support for efficient concurrency-safe shared structured data access
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Dependable and Historic Computing
Extrinsic and intrinsic text cloning
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Rhythm: harnessing data parallel hardware for server workloads
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
While microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it as "multi-execution." Often, each such instance utilizes very similar data. In conventional cache hierarchies, each instance would cache its own data independently. We propose the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance. We present cycle-accurate simulation results of 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache shows a speedup in execution by 2.5x on average (ranging from 0.93x to 6.92x), while posing an overhead of only 4.28% on cache area and 5.21% on power when it is used.