Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Parallel Computing - Special issues on languages and compilers for parallel computers
Task selection for a multiscalar processor
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Maps: a compiler-managed memory system for raw machines
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Improving the performance of speculatively parallel applications on the Hydra CMP
ICS '99 Proceedings of the 13th international conference on Supercomputing
Parallel Language and Compiler Research in Japan
Parallel Language and Compiler Research in Japan
Data Localization Using Loop Aligned Decomposition for Macro-Dataflow Processing
LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Performance Study of a Concurrent Multithreaded Processor
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
FlexRAM: Toward an Advanced Intelligent Memory System
ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design
Coarse grain task parallel processing with cache optimization on shared memory multiprocessor
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Performance of OSCAR multigrain parallelizing compiler on SMP servers
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Hi-index | 0.00 |
With the increase of the number of transistors integrated on a chip, efficient use of transistors and scalable improvement of effective performance of a processor are getting important problems. However, it has been thought that popular superscalar and VLIW would have difficulty to obtain scalable improvement of effective performance in future because of the limitation of instruction level parallelism. To cope with this problem, a single chip multiprocessor (SCM) approach with multi grain parallel processing inside a chip, which hierarchically exploits loop parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism, is thought one of the most promising approaches. This paper evaluates effectiveness of the single chip multiprocessor architectures with a shared cache, global registers, distributed shared memory and/or local memory for near fine grain parallel processing as the first step of research on SCM architecture to support multi grain parallel processing. The evaluation shows OSCAR (Optimally Scheduled Advanced Multiprocessor) architecture having distributed shared memory and local memory in addition to centralized shared memory and attachment of global register gives us significant speed up such as 13.8% to 143.8% for four processors compared with shared cache architecture for applications which have been difficult to extract parallelism effectively.