Code generation for streaming: an access/execute mechanism
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data cache with multiple caching strategies tuned to different types of locality
ICS '95 Proceedings of the 9th international conference on Supercomputing
Predictability of load/store instruction latencies
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Managing data caches using selective cache line replacement
International Journal of Parallel Programming - Special issue on instruction-level parallel processing—part II
Run-time adaptive cache hierarchy management via reference analysis
Proceedings of the 24th annual international symposium on Computer architecture
The filter cache: an energy efficient memory structure
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
ISLPED '98 Proceedings of the 1998 international symposium on Low power electronics and design
25 years of the international symposia on Computer architecture (selected papers)
Power and performance tradeoffs using various caching strategies
ISLPED '98 Proceedings of the 1998 international symposium on Low power electronics and design
Selective cache ways: on-demand cache resource allocation
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories
ISLPED '00 Proceedings of the 2000 international symposium on Low power electronics and design
NetBench: a benchmarking suite for network processors
Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design
IEEE Micro
Towards a Programming Environment for a Computer with Intelligent Memory
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Combined DRAM and logic chip for massively parallel systems
ARVLSI '95 Proceedings of the 16th Conference on Advanced Research in VLSI (ARVLSI'95)
Reducing Address Bus Transitions for Low Power Memory Mapping
EDTC '96 Proceedings of the 1996 European conference on Design and Test
Energy-Delay Analysis for On-Chip Interconnect at the System Level
WVLSI '99 Proceedings of the IEEE Computer Society Workshop on VLSI'99
Compiler-directed proactive power management for networks
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Conserving network processor power consumption by exploiting traffic variability
ACM Transactions on Architecture and Code Optimization (TACO)
E-AHRW: An Energy-Efficient Adaptive Hash Scheduler for Stream Processing on Multi-core Servers
Proceedings of the 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems
Exploiting a computation reuse cache to reduce energy in network processors
HiPEAC'05 Proceedings of the First international conference on High Performance Embedded Architectures and Compilers
Efficient traffic aware power management in multicore communications processors
Proceedings of the eighth ACM/IEEE symposium on Architectures for networking and communications systems
Hi-index | 0.00 |
We propose and evaluate a data filtering method to reduce the power consumption of high-end processors with multiple execution cores. Although the proposed method can be applied to a wide variety of multi-processor systems including MPPs, SMPs and any type of single-chip multiprocessor, we concentrate on Network Processors. The proposed method uses an execution unit called Data Filtering Engine that processes data with low temporal locality before it is placed on the system bus. The execution cores use locality to decide which load instructions have low temporal locality and which portion of the surrounding code should be off-loaded to the data filtering engine.Our technique reduces the power consumption, because a) the low temporal data is processed on the data filtering engine before it is placed onto the high capacitance system bus, and b) the conflict misses caused by low temporal data are reduced resulting in fewer accesses to the L2 cache. Specifically, we show that our technique reduces the bus accesses in representative applications by as much as 46.8% (26.5% on average) and reduces the overall power by as much as 15.6% (8.6% on average) on a single-core processor. It also improves the performance by as much as 76.7% (29.7% on average) for a processor with 16 execution cores.