Interlock collapsing ALU for increased instruction-level parallelism
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Way-predicting set-associative cache for high performance and low energy consumption
ISLPED '99 Proceedings of the 1999 international symposium on Low power electronics and design
Proceedings of the 14th international conference on Supercomputing
On pipelining dynamic instruction scheduling logic
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
L1 data cache decomposition for energy efficiency
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
An instruction set and microarchitecture for instruction level distributed processing
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A scalable instruction queue design using dependence chains
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Automatic application-specific instruction-set extensions under microarchitectural constraints
Proceedings of the 40th annual Design Automation Conference
Predictive sequential associative cache
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Data-Flow Prescheduling for Large Instruction Windows in Out-of-Order Processors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Cost-Effective Hardware Acceleration of Multimedia Applications
ICCD '01 Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors
Estimating Potential Parallelism for Platform Retargeting
WCRE '02 Proceedings of the Ninth Working Conference on Reverse Engineering (WCRE'02)
Automatic generation of application specific processors
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Macro-op Scheduling: Relaxing Scheduling Loop Constraints
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Effective stream-based and execution-based data prefetching
Proceedings of the 18th annual international conference on Supercomputing
What to Adapt in a High-Performance Microprocessor
DSD '04 Proceedings of the Digital System Design, EUROMICRO Systems
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
MiBench: A free, commercially representative embedded benchmark suite
WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Characteristics of workloads used in high performance and technical computing
Proceedings of the 21st annual international conference on Supercomputing
From Silicon to Science: The Long Road to Production Reconfigurable Supercomputing
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Understanding sources of inefficiency in general-purpose chips
Proceedings of the 37th annual international symposium on Computer architecture
Scientific Application Demands on a Reconfigurable Functional Unit Interface
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Improving Floating-Point Performance in Less Area: Fractured Floating Point Units (FFPUs)
Journal of Signal Processing Systems
International Journal of Distributed Systems and Technologies
Determination of performance characteristics of scientific applications on IBM Blue Gene/Q
IBM Journal of Research and Development
Hi-index | 0.00 |
Many modern scientific applications execute on massively parallel collections of microprocessors. Supercomputers such as the Cray XT3 (Red Storm) and Blue Gene/L support thousands to tens of thousands of processors per parallel job. However, individual microprocessor performance remains a critical component of overall performance. Traditional approaches to improve scientific application performance concentrate on floating-point (FP) instructions; however, our studies show that in the scientific applications used at Sandia National Labs, integer instructions constitute a large and critical part of the instruction mix. Although the SPEC-FP benchmark suite is considered representative of FP workloads, it has a much smaller proportion of integer computation instructions than the Sandia scientific applications, with 22.9% as compared to 36.9%. Integer instructions in Sandia applications also behave differently than in SPEC-FP. Integer instruction outputs are reused 8.8x to 13.1x more often in SPEC-FP benchmarks, and integer dataflow in Sandia applications is more complex than in the SPEC-FP suite. In this work, we examine common dataflow and usage patterns of integer instructions---information essential to develop hardware techniques to accelerate critical scientific applications. We present statistics for SPEC-FP and Sandia applications, summarizing integer computation usage and the size, shape and interface (number of inputs/outputs) of dataflow graphs.