Multiprocessor systems-on-a-chip offer a structured approach to managing complexity in chip design. Cyclops is a new family of multithreaded architectures that integrates processing logic, main memory, and communications hardware on a single chip. Its simple, hierarchical design allows the hardware architect to manage a large number of components and meet design constraints in terms of performance, power, or application domain.

This paper evaluates several alternative Cyclops designs with different relative costs and trade-offs. We compare the performance of several scientific kernels running on different configurations of the architecture. We show that increasing the number of threads sharing a floating-point unit hides fairly high cache and memory latencies, that the theoretical peak performance of the chip can be reached, and we identify the optimal balance of components for each application. We also demonstrate that the design is well suited to problems that are difficult to optimize; for example, sparse matrix-vector multiplication achieves 16 GFlops out of a 32 GFlops peak.
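Sparse matrix-vector multiplication is a telling benchmark here because each multiply-add chases an indirect load through a column-index array: a single thread stalls on memory constantly, which is exactly the latency that many threads per floating-point unit can overlap. As a point of reference only (not the paper's code), a minimal sketch of the kernel in C using the standard compressed sparse row (CSR) layout follows; the `csr_t` struct and `spmv` function names are illustrative:

```c
#include <stdio.h>

/* Compressed sparse row (CSR) matrix: the usual layout for
 * sparse matrix-vector multiplication (SpMV). */
typedef struct {
    int    n;        /* number of rows                              */
    int   *row_ptr;  /* n+1 entries; row i spans                    */
                     /* [row_ptr[i], row_ptr[i+1]) in col_idx/val   */
    int   *col_idx;  /* column index of each nonzero                */
    double *val;     /* value of each nonzero                       */
} csr_t;

/* y = A * x. Each row is an independent dot product, so rows can
 * be farmed out to hardware thread units; the data-dependent
 * accesses x[col_idx[k]] are what make SpMV latency-bound. */
void spmv(const csr_t *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* Small 3x3 example: [[2,0,1],[0,3,0],[4,0,5]] */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col_idx[] = {0, 2, 1, 0, 2};
    double val[]     = {2, 1, 3, 4, 5};
    csr_t  A = {3, row_ptr, col_idx, val};

    double x[] = {1, 1, 1}, y[3];
    spmv(&A, x, y);
    for (int i = 0; i < 3; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* 3, 3, 9 */
    return 0;
}
```

On a design like Cyclops, independent rows would be distributed across the threads sharing a floating-point unit, so that outstanding loads from some rows overlap with arithmetic from others; roughly speaking, the more cycles of memory latency there are to cover per flop, the more concurrent threads are needed to keep the unit busy.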