Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Scalable high speed IP routing lookups
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
The interaction of software prefetching with ILP processors in shared-memory systems
Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
25 years of the international symposia on Computer architecture (selected papers)
Push vs. pull: data movement for linked data structures
Proceedings of the 14th international conference on Supercomputing
Implementing remote procedure calls
ACM Transactions on Computer Systems (TOCS)
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A stateless, content-directed data prefetching mechanism
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Using Cohort-Scheduling to Enhance Server Performance
ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
StreamIt: A Language for Streaming Applications
CC '02 Proceedings of the 11th International Conference on Compiler Construction
The OpenMP Source Code Repository
PDP '05 Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing
Mitigating Amdahl's Law through EPI Throttling
Proceedings of the 32nd annual international symposium on Computer Architecture
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors
Proceedings of the 32nd annual international symposium on Computer Architecture
Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors
IEEE Computer Architecture Letters
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Computation spreading: employing hardware migration to specialize CMP cores on-the-fly
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
The shared-thread multiprocessor
Proceedings of the 22nd annual international conference on Supercomputing
Amdahl's Law in the Multicore Era
Computer
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Cache coherence techniques for multicore processors
Cache coherence techniques for multicore processors
Accelerating critical section execution with asymmetric multi-core architectures
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Fast switching of threads between cores
ACM SIGOPS Operating Systems Review
Spatio-temporal memory streaming
Proceedings of the 36th annual international symposium on Computer architecture
Proceedings of the 36th annual international symposium on Computer architecture
Thread motion: fine-grained power management for multi-core systems
Proceedings of the 36th annual international symposium on Computer architecture
DDCache: Decoupled and Delegable Cache Data and Metadata
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
POWER4 system microarchitecture
IBM Journal of Research and Development
Reinventing scheduling for multicore systems
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Bottleneck identification and scheduling in multithreaded applications
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Utility-based acceleration of multithreaded applications on asymmetric CMPs
Proceedings of the 40th Annual International Symposium on Computer Architecture
Interprocedural strength reduction of critical sections in explicitly-parallel programs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
The benefit of SMT in the multi-core era: flexibility towards degrees of thread-level parallelism
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Previous research has shown that Staged Execution (SE), i.e., dividing a program into segments and executing each segment at the core that has the data and/or functionality to best run that segment, can improve performance and save power. However, SE's benefit is limited because most segments access inter-segment data, i.e., data generated by the previous segment. When consecutive segments run on different cores, accesses to inter-segment data incur cache misses, thereby reducing performance. This paper proposes Data Marshaling (DM), a new technique to eliminate cache misses to inter-segment data. DM uses profiling to identify instructions that generate inter-segment data, and adds only 96 bytes/core of storage overhead. We show that DM significantly improves the performance of two promising Staged Execution models, Accelerated Critical Sections and producer-consumer pipeline parallelism, on both homogeneous and heterogeneous multi-core systems. In both models, DM can achieve almost all of the potential of ideally eliminating cache misses to inter-segment data. DM's performance benefit increases with the number of cores.