Balanced scheduling: instruction scheduling when memory latency is uncertain
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
The multiflow trace scheduling compiler
The Journal of Supercomputing - Special issue on instruction-level parallelism
ATOM: a system for building customized program analysis tools
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Exploiting hardware performance counters with flow and context sensitive profiling
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Continuous profiling: where have all the cycles gone?
ACM Transactions on Computer Systems (TOCS)
ProfileMe: hardware support for instruction-level profiling on out-of-order processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving data-flow analysis with path profiles
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Performance monitoring in a Myrinet-connected SHRIMP cluster
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Modulo Scheduling with Cache Reuse Information
Euro-Par '97 Proceedings of the Third International Euro-Par Conference on Parallel Processing
Combining Optimization for Cache and Instruction-Level Parallelism
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Trace Scheduling: A Technique for Global Microcode Compaction
IEEE Transactions on Computers
Using the Compiler to Improve Cache Replacement Decisions
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Hi-index | 0.00 |
Within the past five years, many manufactures have added hardware performance counters to their microprocessors to generate profile data cheaply. We show how to use Compaq's DCPI tool to determine load latencies which are at a fine, instruction granularity and use them as fodder for improving instruction scheduling. We validate our heuristic for using DCPI latency data to classify loads as hits and misses against simulation numbers. We map our classification into the Multiflow compiler's intermediate representation, and use a locality sensitive Balanced scheduling algorithm. Our experiments illustrate that our algorithm improves run times by 1% on average, but up to 10% on a Compaq Alpha.