Niagara: A 32-Way Multithreaded Sparc Processor

Authors:
Poonacha Kongetira; Kathirgamar Aingaran; Kunle Olukotun
Affiliations:
Sun Microsystems; Sun Microsystems; Sun Microsystems
Venue:
IEEE Micro
Year:
2005

References (6):

Interleaving: a multithreading technique targeting multiprocessors and workstations

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The case for a single-chip multiprocessor

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Web Search for a Planet: The Google Cluster Architecture

IEEE Micro
A Chip Multithreaded Processor for Network-Facing Workloads

IEEE Micro
A performance methodology for commercial servers

IBM Journal of Research and Development

Cited by (259):

High-Performance Throughput Computing

IEEE Micro
Maximizing CMP Throughput with Mediocre Cores

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
The Future of Microprocessors

Queue - Multiprocessors
Extreme Software Scaling

Queue - Multiprocessors
The Price of Performance

Queue - Multiprocessors
How to Fake 1000 Registers

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Performance/Watt: the new server focus

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Exploring the cache design space for large scale CMPs

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Fast synchronization for chip multiprocessors

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Performance implications of single thread migration on a chip multi-core

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Power-Efficient Error Tolerance in Chip Multiprocessors

IEEE Micro
Power/performance hardware optimization for synchronization intensive applications in MPSoCs

Proceedings of the conference on Design, automation and test in Europe: Proceedings
The Atomos transactional programming language

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory

Proceedings of the 33rd annual international symposium on Computer Architecture
Cooperative Caching for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
An efficient synchronization technique for multiprocessor systems on-chip

MEDEA '05 Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications, systems and architecture
Architectural support for operating system-driven CMP cache management

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
IPC Considered Harmful for Multiprocessor Workloads

IEEE Micro
A case for chip multiprocessors based on the data-driven multithreading model

International Journal of Parallel Programming
Reducing power through compiler-directed barrier synchronization elimination

Proceedings of the 2006 international symposium on Low power electronics and design
A regulated transitive reduction (RTR) for longer memory race recording

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
A multiprocessing approach to accelerate retargetable and portable dynamic-compiled instruction-set simulation

CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
FlashCache: a NAND flash memory file cache for low power web servers

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
High-level power analysis for multi-core chips

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Deconstructing process isolation

Proceedings of the 2006 workshop on Memory system performance and correctness
A flexible data to L2 cache mapping approach for future multicore processors

Proceedings of the 2006 workshop on Memory system performance and correctness
Supporting microthread scheduling and synchronisation in CMPs

International Journal of Parallel Programming
Multigrid and Gauss-Seidel smoothers revisited: parallelization on chip multiprocessors

Proceedings of the 20th annual international conference on Supercomputing
Online power-performance adaptation of multithreaded programs using hardware event-based prediction

Proceedings of the 20th annual international conference on Supercomputing
Design space exploration for multicore architectures: a power/performance/thermal view

Proceedings of the 20th annual international conference on Supercomputing
WormTerminator: an effective containment of unknown and polymorphic fast spreading worms

Proceedings of the 2006 ACM/IEEE symposium on Architecture for networking and communications systems
Fairness and Throughput in Switch on Event Multithreading

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
SMP-SoC is the answer if you ask the right questions

SAICSIT '06 Proceedings of the 2006 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
Executing Java programs with transactional memory

Science of Computer Programming - Special issue: Synchronization and concurrency in object-oriented languages
Multi-processor operating system emulation framework with thermal feedback for systems-on-chip

Proceedings of the 17th ACM Great Lakes symposium on VLSI
Transactional collection classes

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Evaluating the potential of multithreaded platforms for irregular scientific computations

Proceedings of the 4th international conference on Computing frontiers
Fuce: the continuation-based multithreading processor

Proceedings of the 4th international conference on Computing frontiers
Scalability of continuation-based fine-grained multithreading in handling multiple I/O requests on FUCE

Proceedings of the 4th international conference on Computing frontiers
Singularity: rethinking the software stack

ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research
Proximity-aware directory-based coherence for multi-core processor architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Virtual private caches

Proceedings of the 34th annual international symposium on Computer architecture
Rotary router: an efficient architecture for CMP interconnection networks

Proceedings of the 34th annual international symposium on Computer architecture
Configurable isolation: building high availability systems with commodity multi-core processors

Proceedings of the 34th annual international symposium on Computer architecture
A study of thread migration in temperature-constrained multicores

ACM Transactions on Architecture and Code Optimization (TACO)
Temperature aware task scheduling in MPSoCs

Proceedings of the conference on Design, automation and test in Europe
Isolation in Commodity Multicore Processors

Computer
Enabling scalability and performance in a large scale CMP environment

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Fairness enforcement in switch on event multithreading

ACM Transactions on Architecture and Code Optimization (TACO)
Thermal-aware scheduling for future chip multiprocessors

EURASIP Journal on Embedded Systems
Towards a Java multiprocessor

JTRES '07 Proceedings of the 5th international workshop on Java technologies for real-time and embedded systems
Exploring Large-Scale CMP Architectures Using ManySim

IEEE Micro
A 5-GHz Mesh Interconnect for a Teraflops Processor

IEEE Micro
Data locality enhancement for CMPs

Proceedings of the 2007 IEEE/ACM international conference on Computer-aided design
Improving disk bandwidth-bound applications through main memory compression

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Characterization of Apache web server with Specweb2005

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Efficient architectural design space exploration via predictive modeling

ACM Transactions on Architecture and Code Optimization (TACO)
Hiding the misprediction penalty of a resource-efficient high-performance processor

ACM Transactions on Architecture and Code Optimization (TACO)
The coming wave of multithreaded chip multiprocessors

International Journal of Parallel Programming
Toward high performance nonblocking software transactional memory

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Adaptive set pinning: managing shared caches in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Larrabee: a many-core x86 architecture for visual computing

ACM SIGGRAPH 2008 papers
Thread scheduling for multi-core platforms

HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
A case for low-complexity MP architectures

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Scaling performance of interior-point method on large-scale chip multiprocessor system

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
SAMS: single-affiliation multiple-stride parallel memory scheme

Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Reducing the impact of intra-core process variability with criticality-based resource allocation and prefetching

Proceedings of the 5th conference on Computing frontiers
Software-directed combined CPU/link voltage scaling for NoC-based CMPs

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
The shared-thread multiprocessor

Proceedings of the 22nd annual international conference on Supercomputing
Orchestrating the execution of stream programs on multicore platforms

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Operational analysis of processor speed scaling

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Early detection and bypassing of trivial operations to improve energy efficiency of processors

Microprocessors & Microsystems
A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Technology-Driven, Highly-Scalable Dragonfly Topology

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Corona: System Implications of Emerging Nanophotonic Technology

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
MIRA: A Multi-layered On-Chip Interconnect Router Architecture

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Many-core design from a thermal perspective

Proceedings of the 45th annual Design Automation Conference
On-chip cache device scaling limits and effective fault repair techniques in future nanoscale technology

Microprocessors & Microsystems
Thread fusion

Proceedings of the 13th international symposium on Low power electronics and design
Thermal monitoring mechanisms for chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
Temperature control of high-performance multi-core platforms using convex optimization

Proceedings of the conference on Design, automation and test in Europe
MCjammer: adaptive verification for multi-core designs

Proceedings of the conference on Design, automation and test in Europe
Aspects in hardware: what do they look like?

Proceedings of the 2008 AOSD workshop on Aspects, components, and patterns for infrastructure software
PicoServer: Using 3D stacking technology to build energy efficient servers

ACM Journal on Emerging Technologies in Computing Systems (JETC)
A novel migration-based NUCA design for chip multiprocessors

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Adaptive Loop Tiling for a Multi-cluster CMP

ICA3PP '08 Proceedings of the 8th international conference on Algorithms and Architectures for Parallel Processing
Techniques for Efficient Software Checking

Languages and Compilers for Parallel Computing
Core cannibalization architecture: improving lifetime chip performance for multicore processors in the presence of hard faults

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Improving support for locality and fine-grain sharing in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Adaptive insertion policies for managing shared caches

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Meeting points: using thread criticality to adapt multicore hardware to parallel regions

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Critical sections: re-emerging scalability concerns for database storage engines

Proceedings of the 4th international workshop on Data management on new hardware
A comparative evaluation of hybrid distributed shared-memory systems

Journal of Systems Architecture: the EUROMICRO Journal
GRAMPS: A programming model for graphics pipelines

ACM Transactions on Graphics (TOG)
Controlling chaos: on safe side-effects in data-parallel operations

Proceedings of the 4th workshop on Declarative aspects of multicore programming
A continuation-based noninterruptible multithreading processor architecture

The Journal of Supercomputing
An approach on distributed and shared dynamic cache partition

DNCOCO'08 Proceedings of the 7th conference on Data networks, communications, computers
A compiler-directed data prefetching scheme for chip multiprocessors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Improving Search Engines Performance on Multithreading Processors

High Performance Computing for Computational Science - VECPAR 2008
Generation, Validation and Analysis of SPEC CPU2006 Simulation Points Based on Branch, Memory and TLB Characteristics

Proceedings of the 2009 SPEC Benchmark Workshop on Computer Performance Evaluation and Benchmarking
Accelerating critical section execution with asymmetric multi-core architectures

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Zero loads: canceling load requests by tracking zero values

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
MC-Sim: an efficient simulation tool for MPSoC designs

Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design
Dynamically reconfigurable on-chip communication architectures for multi use-case chip multiprocessor applications

Proceedings of the 2009 Asia and South Pacific Design Automation Conference
A control theory approach for thermal balancing of MPSoC

Proceedings of the 2009 Asia and South Pacific Design Automation Conference
Static and dynamic temperature-aware scheduling for multiprocessor SoCs

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Shore-MT: a scalable storage manager for the multicore era

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Implementing high availability memory with a duplication cache

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
The StageNet fabric for constructing resilient multicore systems

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Notary: Hardware techniques to enhance signatures

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Strategies for mapping dataflow blocks to distributed hardware

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Optimizing Memory Access Latencies on a Reconfigurable Multimedia Accelerator: A Case of a Turbo Product Codes Decoder

ARC '09 Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications
Rate-based QoS techniques for cache/memory in CMP platforms

Proceedings of the 23rd international conference on Supercomputing
Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Service level agreement for multithreaded processors

ACM Transactions on Architecture and Code Optimization (TACO)
ESoftCheck: Removal of Non-vital Checks for Fault Tolerance

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
A multigrain Delaunay mesh generation method for multicore SMT-based architectures

Journal of Parallel and Distributed Computing
Algorithm, software, and hardware optimizations for Delaunay mesh generation on simultaneous multithreaded architectures

Journal of Parallel and Distributed Computing
Time-predictable computer architecture

EURASIP Journal on Embedded Systems - FPGA supercomputing platforms, architectures, and techniques for accelerating computationally complex algorithms
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
A memory system design framework: creating smart memories

Proceedings of the 36th annual international symposium on Computer architecture
Simultaneous speculative threading: a novel pipeline architecture implemented in Sun's Rock processor

Proceedings of the 36th annual international symposium on Computer architecture
Scalable directory architecture for distributed shared memory chip multiprocessors

ACM SIGARCH Computer Architecture News
Type-Directed Compilation for Multicore Programming

Electronic Notes in Theoretical Computer Science (ENTCS)
Triplet-based topology for on-chip networks

WSEAS Transactions on Computers
SPARTAN: A software tool for Parallelization Bottleneck Analysis

IWMSE '09 Proceedings of the 2009 ICSE Workshop on Multicore Software Engineering
FlexCore: Utilizing Exposed Datapath Control for Efficient Computing

Journal of Signal Processing Systems
Way-tagged cache: an energy-efficient L2 cache architecture under write-through policy

Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Combining Coarse-Grained Software Pipelining with DVS for Scheduling Real-Time Periodic Dependent Tasks on Multi-Core Embedded Systems

Journal of Signal Processing Systems
The multikernel: a new OS architecture for scalable multicore systems

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Allocation wall: a limiting factor of Java applications on emerging multi-core platforms

Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
A performance evaluation of 2D-mesh, ring, and crossbar interconnects for chip multi-processors

Proceedings of the 2nd International Workshop on Network on Chip Architectures
Reconsidering algorithms for iterative solvers in the multicore era

International Journal of Computational Science and Engineering
Enabling software management for multicore caches with a lightweight hardware support

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Future scaling of processor-memory interfaces

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Finding representative workloads for computer system design

Finding representative workloads for computer system design
Algorithm/architecture co-exploration of visual computing on emergent platforms: overview and future prospects

IEEE Transactions on Circuits and Systems for Video Technology
VisoMT: a collaborative multithreading multicore processor for multimedia applications with a fast data switching mechanism

IEEE Transactions on Circuits and Systems for Video Technology
Fast and accurate prediction of the steady-state throughput of multicore processors under thermal constraints

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
TRaX: a multicore hardware architecture for real-time ray tracing

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Iterative induced dipoles computation for molecular mechanics on GPUs

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Compiler-based data classification for hybrid caching

Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Two-phase trace-driven simulation (TPTS): a fast multicore processor architecture simulation approach

Software—Practice & Experience
Compiler directed network-on-chip reliability enhancement for chip multiprocessors

Proceedings of the ACM SIGPLAN/SIGBED 2010 conference on Languages, compilers, and tools for embedded systems
Minimizing communication in rate-optimal software pipelining for stream programs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
PIRATE: QoS and performance management in CMP architectures

ACM SIGMETRICS Performance Evaluation Review
Online convex optimization-based algorithm for thermal management of MPSoCs

Proceedings of the 20th symposium on Great lakes symposium on VLSI
Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
The auction: optimizing banks usage in Non-Uniform Cache Architectures

Proceedings of the 24th ACM International Conference on Supercomputing
An approach to resource-aware co-scheduling for CMPs

Proceedings of the 24th ACM International Conference on Supercomputing
A real-time Java chip-multiprocessor

ACM Transactions on Embedded Computing Systems (TECS)
WiDGET: Wisconsin decoupled grid execution tiles

Proceedings of the 37th annual international symposium on Computer architecture
Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing

Proceedings of the 37th annual international symposium on Computer architecture
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
Modeling soft errors for data caches and alleviating their effects on data reliability

Microprocessors & Microsystems
Trace-driven optimization of networks-on-chip configurations

Proceedings of the 47th Design Automation Conference
Design of a cache hierarchy for LogN and LogN+1 model for multi-level cache system for multi-core processors

Proceedings of the 7th International Conference on Frontiers of Information Technology
Applied inference: Case studies in microarchitectural design

ACM Transactions on Architecture and Code Optimization (TACO)
Thread-management techniques to maximize efficiency in multicore and simultaneous multithreaded microprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
Understanding throughput-oriented architectures

Communications of the ACM
Introduction to the wire-speed processor and architecture

IBM Journal of Research and Development
VoIP performance on multicore platforms

IBM Journal of Research and Development
WAYPOINT: scaling coherence to thousand-core architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Compilation of stream programs for multicore processors that incorporate scratchpad memories

Proceedings of the Conference on Design, Automation and Test in Europe
ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration

Proceedings of the Conference on Design, Automation and Test in Europe
Group-caching for NoC based multicore cache coherent systems

Proceedings of the Conference on Design, Automation and Test in Europe
Adaptive prefetching for shared cache based chip multiprocessors

Proceedings of the Conference on Design, Automation and Test in Europe
Process variation aware thread mapping for chip multiprocessors

Proceedings of the Conference on Design, Automation and Test in Europe
Characterization and exploitation of narrow-width loads: the narrow-width cache approach

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Federation: Boosting per-thread performance of throughput-oriented manycore architectures

ACM Transactions on Architecture and Code Optimization (TACO)
Automatic granularity selection and OpenMP directive generation via extended machine descriptors in the PROMIS parallelizing compiler

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Online strategies for high-performance power-aware thread execution on emerging multiprocessors

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Enhancing L2 organization for CMPs with a center cell

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Compatible phase co-scheduling on a CMP of multi-threaded processors

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Architectural Support for Fair Reader-Writer Locking

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Erasing Core Boundaries for Robust and Configurable Performance

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Throughput-Effective On-Chip Networks for Manycore Accelerators

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive Flow Control for Robust Performance and Energy

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Real-Time Adaptive Background Modeling for Multicore Embedded Systems

Journal of Signal Processing Systems
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Run-time adaptable on-chip thermal triggers

Proceedings of the 16th Asia and South Pacific Design Automation Conference
Low Power Design for a Multi-core Multi-thread Microprocessor

GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Efficient synchronization for embedded on-chip multiprocessors

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Analysis of execution efficiency in the microthreaded processor UTLEON3

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Memory-, bandwidth-, and power-aware multi-core for a graph database workload

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Emulation-based transient thermal modeling of 2D/3D systems-on-chip with active cooling

Microelectronics Journal
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors

Journal of Parallel and Distributed Computing
Robust adaptation to available parallelism in transactional memory applications

Transactions on high-performance embedded architectures and compilers III
Dual-thread speculation: a simple approach to uncover thread-level parallelism on a simultaneous multithreaded processor

International Journal of Parallel Programming
METE: meeting end-to-end QoS in multicores through system-wide resource management

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
CRIB: consolidated rename, issue, and bypass

Proceedings of the 38th annual international symposium on Computer architecture
CPPC: correctable parity protected cache

Proceedings of the 38th annual international symposium on Computer architecture
Energy-efficient mechanisms for managing thread context in throughput processors

Proceedings of the 38th annual international symposium on Computer architecture
SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading

Proceedings of the 38th annual international symposium on Computer architecture
Moguls: a model to explore the memory hierarchy for bandwidth improvements

Proceedings of the 38th annual international symposium on Computer architecture
Mobile processors for energy-efficient web search

ACM Transactions on Computer Systems (TOCS)
METE: meeting end-to-end QoS in multicores through system-wide resource management

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Branch penalty reduction on IBM cell SPUs via software branch hinting

CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Processing (multiple) spatio-temporal range queries in multicore settings

ADBIS'11 Proceedings of the 15th international conference on Advances in databases and information systems
Convex-based thermal management for 3D MPSoCs using DVFS and variable-flow liquid cooling

PATMOS'11 Proceedings of the 21st international conference on Integrated circuit and system design: power and timing modeling, optimization, and simulation
A study of 3D Network-on-Chip design for data parallel H.264 coding

Microprocessors & Microsystems
Loop selection for thread-level speculation

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
DAPSCO: Distance-aware partially shared cache organization

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Throughput computing with chip multithreading and clusters

HiPC'05 Proceedings of the 12th international conference on High Performance Computing
CACTI-P: architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques

Proceedings of the International Conference on Computer-Aided Design
Improving System Energy Efficiency with Memory Rank Subsetting

ACM Transactions on Architecture and Code Optimization (TACO)
Efficient trace-driven metaheuristics for optimization of networks-on-chip configurations

Proceedings of the International Conference on Computer-Aided Design
A hierarchical CLH queue lock

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
CRUISE: cache replacement and utility-aware scheduling

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Performance/Thermal-Aware Design of 3D-Stacked L2 Caches for CMPs

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Preventing denial-of-service attacks in shared CMP caches

SAMOS'06 Proceedings of the 6th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation
Tiled multi-core stream architecture

Transactions on High-Performance Embedded Architectures and Compilers IV
Improving performance by reducing aborts in hardware transactional memory

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Can manycores support the memory requirements of scientific applications?

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Machine learning based performance prediction for multi-core simulation

MIWAI'11 Proceedings of the 5th international conference on Multi-Disciplinary Trends in Artificial Intelligence
A weighted-fair-queuing (WFQ)-based dynamic request scheduling approach in a multi-core system

Future Generation Computer Systems
High radix self-arbitrating switch fabric with multiple arbitration schemes and quality of service

Proceedings of the 49th Annual Design Automation Conference
Courteous cache sharing: being nice to others in capacity management

Proceedings of the 49th Annual Design Automation Conference
Load balancing for processing spatio-temporal queries in multi-core settings

MobiDE '12 Proceedings of the Eleventh ACM International Workshop on Data Engineering for Wireless and Mobile Access
Practically private: enabling high performance CMPs through compiler-assisted data classification

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
DIMSim: a rapid two-level cache simulation approach for deadline-based MPSoCs

Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
APC: a performance metric of memory systems

ACM SIGMETRICS Performance Evaluation Review
A combined sensor placement and convex optimization approach for thermal management in 3D-MPSoC with liquid cooling

Integration, the VLSI Journal
MAGE: adaptive granularity and ECC for resilient and power efficient memory systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Online thermal control methods for multiprocessor systems

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special section on adaptive power management for energy and temperature-aware computing systems
Performance evaluation of evolutionary multi-core and aggressively multi-threaded processor architectures

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Synchronization mechanisms on modern multi-core architectures

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Accelerated parallel genetic programming tree evaluation with OpenCL

Journal of Parallel and Distributed Computing
Understanding fundamental design choices in single-ISA heterogeneous multicore architectures

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing

ACM Transactions on Architecture and Code Optimization (TACO)
Automatic parallelization of canonical loops

Science of Computer Programming
An energy-efficient L2 cache architecture using way tag information under write-through policy

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

ACM Transactions on Computer Systems (TOCS)
Exploring the vulnerability of CMPs to soft errors with 3D stacked nonvolatile memory

ACM Journal on Emerging Technologies in Computing Systems (JETC)
Hazard driven test generation for SMT processors

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
CATRA- congestion aware trapezoid-based routing algorithm for on-chip networks

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
A hybrid HW-SW approach for intermittent error mitigation in streaming-based embedded systems

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)
Implementing an interconnection network based on crossbar topology for parallel applications in MPSoC

ICCSA'13 Proceedings of the 13th international conference on Computational Science and Its Applications - Volume 1
P2EST: parallelization philosophies for evaluating spatio-temporal queries

Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data
Market mechanisms for managing datacenters with heterogeneous microarchitectures

ACM Transactions on Computer Systems (TOCS)
Integrated 3D-stacked server designs for increasing physical density of key-value stores

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
A Host-Based Approach for Unknown Fast-Spreading Worm Detection and Containment

ACM Transactions on Autonomous and Adaptive Systems (TAAS) - Special Section on Best Papers from SEAMS 2012
Exploiting replication to improve performances of NUCA-based CMP systems

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
WCET analysis with MRU cache: Challenging LRU for predictability

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.03

Abstract

The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. The hardware supports 32 threads, with a memory subsystem consisting of an on-board crossbar, a level-2 cache, and memory controllers. This highly integrated design exploits the thread-level parallelism inherent to server applications while targeting low power consumption.