The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. In this paper, we describe technology-driven models for wire capacitance, wire delay, and microarchitectural component delay. Using the results of these models, we measure the simulated performance (estimating both clock rate and IPC) of an aggressive out-of-order microarchitecture as it is scaled from a 250nm technology to a 35nm technology. We perform this analysis for three clock scaling targets and two microarchitecture scaling strategies: pipeline scaling and capacity scaling. We find that no scaling strategy permits annual performance improvements of better than 12.5%, which is far worse than the annual 50-60% to which we have grown accustomed.
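To give a feel for the gap between a 12.5% and a 50-60% annual improvement, the short sketch below compounds each rate over a fixed horizon. It is only an illustration: the function names and the 10-year horizon are assumptions chosen here, not values or methods from the paper, and the `performance` helper simply takes performance to be the product of clock rate and IPC.

```python
def performance(clock_hz: float, ipc: float) -> float:
    """Instructions per second, taken as clock rate times IPC."""
    return clock_hz * ipc


def cumulative_speedup(annual_rate: float, years: int) -> float:
    """Total speedup after compounding annual_rate for the given number of years."""
    return (1.0 + annual_rate) ** years


if __name__ == "__main__":
    horizon_years = 10  # hypothetical horizon, chosen only for illustration
    for label, rate in [("12.5%", 0.125), ("50%", 0.50), ("60%", 0.60)]:
        speedup = cumulative_speedup(rate, horizon_years)
        print(f"{label} per year over {horizon_years} years: {speedup:.1f}x")
```

Under these assumptions the 12.5% rate compounds to roughly a threefold gain over ten years, while the 50% rate compounds to more than fiftyfold, which is the scale of slowdown the abstract describes.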