Challenges in addressing the memory bottleneck have made it difficult to design a packet processing platform that achieves both ease of programming and high performance. Today's commercial processors support two architectural mechanisms, hardware multithreading and caching, to overcome the memory bottleneck. The configurations of these mechanisms (e.g., cache capacity, number of threads per processor core) are fixed at processor design time. The relative effectiveness of these mechanisms, however, varies significantly with application, traffic, and system characteristics. Thus, programmers often struggle to achieve high performance from a processor that is not well suited to a particular deployment. To address this challenge, we first make a case for, and then develop, a malleable processor architecture that allows cache capacity and the number of threads to be reconfigured dynamically to best suit the needs of each deployment. We then present an algorithm that determines the optimal thread-cache balance at run time. Together, the two allow us to achieve both ease of programming and high performance. We demonstrate that our processor outperforms a processor similar to Intel's IXP2800, a state-of-the-art commercial network processor, in about 89% of the deployments we consider. Further, in about 30% of the deployments, our platform improves throughput by as much as 300%.
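To make the thread-cache trade-off concrete, the following is a minimal sketch of how a fixed on-chip storage budget might be divided between thread contexts and cache. It is not the algorithm presented in the paper: the throughput model, the exhaustive search, and every name and parameter (`throughput`, `best_split`, `kb_per_thread`, `working_set_kb`, and so on) are illustrative assumptions.

```python
def throughput(threads, cache_kb, compute_cycles=100.0, mem_refs=10,
               miss_latency=200.0, working_set_kb=64.0):
    """Packets per cycle under a toy model; all parameters are hypothetical."""
    # Assumption: hit rate grows linearly with cache size up to the working set.
    hit_rate = min(1.0, cache_kb / working_set_kb)
    # Cycles one packet spends stalled on cache misses.
    stall = mem_refs * (1.0 - hit_rate) * miss_latency
    # Extra threads hide stalls until the core is fully busy.
    utilization = min(1.0, threads * compute_cycles / (compute_cycles + stall))
    return utilization / compute_cycles


def best_split(budget_kb, kb_per_thread, **model):
    """Try every thread count that fits the budget; the rest becomes cache."""
    max_threads = budget_kb // kb_per_thread
    if max_threads < 1:
        raise ValueError("budget too small for a single thread context")
    best = None
    for threads in range(1, max_threads + 1):
        cache_kb = budget_kb - threads * kb_per_thread
        rate = throughput(threads, cache_kb, **model)
        # Strict comparison keeps the smallest thread count on ties.
        if best is None or rate > best[2]:
            best = (threads, cache_kb, rate)
    return best


if __name__ == "__main__":
    threads, cache_kb, rate = best_split(budget_kb=64, kb_per_thread=4)
    print(f"threads={threads} cache={cache_kb}KB rate={rate:.5f} packets/cycle")
```

Under this toy model, a workload with no memory references gains nothing from extra threads, while one whose working set far exceeds the cache is best served by giving all storage to thread contexts. This mirrors the abstract's point that the better mechanism depends on the deployment.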