Marching cubes: A high resolution 3D surface construction algorithm
SIGGRAPH '87 Proceedings of the 14th annual conference on Computer graphics and interactive techniques
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Routing in communications networks
Routing in communications networks
Route packets, not wires: on-chip inteconnection networks
Proceedings of the 38th annual Design Automation Conference
MPI-The Complete Reference, Volume 1: The MPI Core
MPI-The Complete Reference, Volume 1: The MPI Core
The Design of Rijndael
ASCOMA: An Adaptive Hybrid Shared Memory Architecture
ICPP '98 Proceedings of the 1998 International Conference on Parallel Processing
Parallel Scientific Computing in C++ and MPI
Parallel Scientific Computing in C++ and MPI
A Network on Chip Architecture and Design Methodology
ISVLSI '02 Proceedings of the IEEE Computer Society Annual Symposium on VLSI
Maximizing CMP Throughput with Mediocre Cores
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Implementation analysis of NoC: a MPSoC trace-driven approach
GLSVLSI '06 Proceedings of the 16th ACM Great Lakes symposium on VLSI
Evaluation of on-chip networks using deflection routing
GLSVLSI '06 Proceedings of the 16th ACM Great Lakes symposium on VLSI
Design tradeoffs for tiled CMP on-chip networks
Proceedings of the 20th annual international conference on Supercomputing
Proceedings of the 44th annual Design Automation Conference
DSD '07 Proceedings of the 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools
NOCS '08 Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip
A Quantitative Study of the On-Chip Network and Memory Hierarchy Design for Many-Core Processor
ICPADS '08 Proceedings of the 2008 14th IEEE International Conference on Parallel and Distributed Systems
A case for bufferless routing in on-chip networks
Proceedings of the 36th annual international symposium on Computer architecture
A case study for NoC-based homogeneous MPSoC architectures
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Using a configurable processor generator for computer architecture prototyping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Parallel programming models for a multiprocessor SoC platform applied to networking and multimedia
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
UWB microwave imaging for breast cancer detection: Many-core, GPU, or FPGA?
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
Future chip-multiprocessors (CMP) will integrate many cores interconnected with a high-bandwidth and low-latency scalable network-on-chip (NoC). However, the potential that this approach offers at the transport level needs to be paired with an analogous paradigm shift at the higher levels. In particular, the standard shared-memory programming model fails to address the requirements of scalability of the many-core era. Fast data exchange among the cores and low-latency synchronization are desirable but hard to achieve in practice due to the memory hierarchy. The message-passing paradigm permits instead direct data communication and synchronization between the cores. The shared-memory could still be used for the instruction fetch. Hence, we propose a hybrid approach that combines shared-memory and message passing in a single general-purpose CMP architecture that allows efficient execution of applications developed with both parallel programming approaches. Cores fetch instructions from a hierarchical memory and exchange their data through the same memory, for compatibility with existing software, or directly through the fast NoC. We developed a fast SystemC based cycle-accurate simulator for design space explorations that we used to evaluate the performance with real benchmarks. The various components have been RTL coded and mapped to a CMOS 45nm technology to build a silicon area model that we used to select the best architectural configurations.