This paper presents a framework for characterizing the distribution of fine-grained parallelism, data movement, and communication-minimizing code partitions. Understanding the spectrum of parallelism available in applications, and how much data movement would result if that parallelism were exploited, is essential to the hardware design process, because these properties will limit the performance scaling of future computing systems. The framework is applied to characterizing 26 applications and kernels, classified according to their dominant components in the Berkeley dwarf/computational-motif classification. The distributions of ILP and TLP over execution time are studied, showing that although mean ILP is high, the available ILP is significantly smaller for most of the execution. The results from this framework are complemented by hardware performance-counter data from two RISC platforms (IBM Power7 and Freescale P2020) and one CISC platform (Intel Atom D510), spanning a broad range of real-machine characteristics. Employing a combination of these new techniques, and building upon previous proposals, it is demonstrated that available ideal-case parallelism and data movement show only limited similarity within and across the dwarf classes.
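To illustrate the kind of ideal-case ILP measurement such limit studies perform, the following is a minimal, hypothetical sketch: given a dynamic instruction trace, it schedules each instruction one cycle after its latest producer (infinite resources, perfect control flow, only true data dependences constrain issue) and reports ILP as trace length over dataflow critical-path depth. The trace format and register names are illustrative assumptions, not the paper's actual tooling.

```python
def ideal_ilp(trace):
    """Ideal-case ILP of a dynamic trace under the dataflow limit.

    trace: list of (dest_reg, [src_regs]) in dynamic program order.
    Each instruction issues one cycle after its latest producer.
    """
    ready = {}   # register -> cycle at which its value becomes available
    depth = 0    # length of the dataflow critical path, in cycles
    for dest, srcs in trace:
        cycle = 1 + max((ready.get(r, 0) for r in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth if depth else 0.0

# Toy trace: r3 depends on r1 and r2; r4 is independent; r5 joins r3 and r4.
trace = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]),
         ("r4", []), ("r5", ["r3", "r4"])]
print(ideal_ilp(trace))  # 5 instructions / 3-cycle critical path = 1.67
```

A real limit study would additionally track memory dependences and apply windowing to model finite instruction windows, which is what shrinks the high mean ILP down to the much smaller ILP available over most of the execution.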