Optimizing software cache performance of packet processing applications

Authors:
Qin Wang; Junpu Chen; Weihua Zhang; Min Yang; Binyu Zang
Affiliations:
Fudan University, Shanghai, China; Fudan University, Shanghai, China; Fudan University, Shanghai, China; Fudan University, Shanghai, China; Fudan University, Shanghai, China
Venue:
Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Year:
2007

Abstract

Network processors (NPs) are widely used in many types of networking equipment because of their high performance and flexibility. Most NPs use a software cache instead of a hardware cache because of chip area, cost, and power constraints. Programmers must therefore take full responsibility for software cache management, which is neither intuitive nor easy for most of them. Without effective use of the software cache, long memory access latency becomes a critical limiting factor for overall application performance. Prior techniques such as hardware multi-threading, wide-word accesses, and packet access combination for caching have been applied to help programmers overcome this bottleneck. However, most of them do not fully exploit the characteristics of packet processing applications and often perform intraprocedural optimizations only. As a result, the binary code they generate often performs worse than hand-tuned assembly for some applications. In this paper, we propose an algorithm comprising two techniques, Critical Path Based Analysis (CPBA) and Global Adaptive Localization (GAL), to optimize the software cache performance of packet processing applications. Packet processing applications usually have several hot paths, and CPBA tries to insert localization instructions according to their execution frequencies. For further optimization, GAL eliminates some redundant localization instructions through interprocedural analysis and optimization. We apply our algorithm to several representative applications. Experimental results show an average speedup by a factor of 1.974.
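
The abstract does not give the details of CPBA, so the following is only a rough sketch of the general idea of frequency-guided localization, not the authors' algorithm. It assumes a toy cost model in which copying a packet field into local memory once at entry pays off only if the frequency-weighted number of later reads is high enough. The latency constants, the `Path` type, and the function names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical cycle costs; real values depend on the target NP.
REMOTE_LATENCY = 100  # one read from off-chip memory
LOCAL_LATENCY = 3     # one read from local (software-cached) memory
COPY_COST = 110       # one-time cost of localizing a field at entry


@dataclass
class Path:
    frequency: float          # fraction of packets that take this path
    accesses: Dict[str, int]  # field name -> reads of that field on this path


def expected_benefit(field: str, paths: List[Path]) -> float:
    """Expected cycles saved per packet if `field` is localized once at entry."""
    saved = 0.0
    for p in paths:
        reads = p.accesses.get(field, 0)
        saved += p.frequency * reads * (REMOTE_LATENCY - LOCAL_LATENCY)
    return saved - COPY_COST


def choose_fields_to_localize(paths: List[Path]) -> List[str]:
    """Return the fields whose expected benefit is positive, sorted by name."""
    fields = {f for p in paths for f in p.accesses}
    return sorted(f for f in fields if expected_benefit(f, paths) > 0)


# A hot path that reads the destination address three times,
# and a cold path that reads IP options four times.
example_paths = [
    Path(frequency=0.9, accesses={"ip_dst": 3, "ttl": 1}),
    Path(frequency=0.1, accesses={"options": 4}),
]
chosen = choose_fields_to_localize(example_paths)
```

Under these assumed costs only `ip_dst` is worth localizing: it is read repeatedly on the hot path, whereas `ttl` is read once and `options` is read only on the cold path, so the one-time copy would cost more than it saves on average.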