Transactional memory: architectural support for lock-free data structures
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Speculative synchronization: applying thread-level speculation to explicitly parallel applications
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A stream compiler for communication-exposed architectures
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
StreamIt: A Language for Streaming Applications
CC '02 Proceedings of the 11th International Conference on Compiler Construction
Using cache memory to reduce processor-memory traffic
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Computer Architecture: A Quantitative Approach
Computer Architecture: A Quantitative Approach
Internetworking with TCP/IP, Vol 2: Design, Implementation, and Internals (4th Edition)
Internetworking with TCP/IP, Vol 2: Design, Implementation, and Internals (4th Edition)
Transactional Memory Coherence and Consistency
Proceedings of the 31st annual international symposium on Computer architecture
Automatically partitioning packet processing applications for pipelined architectures
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Automatic multithreading and multiprocessing of C programs for IXP
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor
Languages and Compilers for Parallel Computing
Hi-index | 0.00 |
To keep up with the explosive internet packet processing demands, modern network processors (NPs) employ a highly parallel, multi-threaded and multi-core architecture. In such a parallel paradigm, accesses to the shared variables in the external memory (and the associated memory latency) are contained in the critical sections, so that they can be executed atomically and sequentially by different threads in the network processor. In this paper, we present a novel program transformation that is used in the Intel Auto-partitioning C Compiler for IXP to exploit the inherent finer-grained parallelism of those critical sections, using the software-controlled caching mechanism available in the NPs. Consequently, those critical sections can be executed in a pipelined fashion by different threads, thereby effectively hiding the memory latency and improving the performance of network applications. Experimental results show that the proposed transformation provides impressive speedup (up-to 9.94 and scalability (up-to 80 threads) of the performance for the real-world network application (a l0Gbps Efhernet Core/Metro Router).