Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and Environments for Parallel Programming
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Organizing matrices and matrix operations for paged memory systems
Communications of the ACM
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
LAPACK Working Note 18: Implementation Guide for LAPACK
Software methods for improvement of cache performance on supercomputer applications
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
MemSpy: analyzing memory system bottlenecks in programs
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Page placement algorithms for large real-indexed caches
ACM Transactions on Computer Systems (TOCS)
Comparative performance evaluation of cache-coherent NUMA and COMA architectures
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
A novel cache design for vector processing
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Optimizing for parallelism and data locality
ICS '92 Proceedings of the 6th international conference on Supercomputing
Software support for speculative loads
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Architecture support for single address space operating systems
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cooperative shared memory: software and hardware for scalable multiprocessors
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Compiler blockability of numerical algorithms
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Characterizing the behavior of sparse algorithms on caches
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
ICS '93 Proceedings of the 7th international conference on Supercomputing
A static parameter based performance prediction tool for parallel programs
ICS '93 Proceedings of the 7th international conference on Supercomputing
Effectiveness of trace sampling for performance debugging tools
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Fortran-S: a Fortran interface for shared virtual memory architectures
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Introducing a New Cache Design into Vector Computers
IEEE Transactions on Computers
Compiling for shared-memory and message-passing computers
ACM Letters on Programming Languages and Systems (LOPLAS)
Precise compile-time performance prediction for superscalar-based computers
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Memory access coalescing: a technique for eliminating redundant memory accesses
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Cache performance of garbage-collected programs
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
MOB forms: a class of multilevel block algorithms for dense linear algebra operations
ICS '94 Proceedings of the 8th international conference on Supercomputing
Reducing cache conflicts in data cache prefetching
ACM SIGARCH Computer Architecture News - Special issue on input/output in parallel computer systems
Architectural support for performance tuning: a case study on the SPARCcenter 2000
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Data relocation and prefetching for programs with large data sets
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Compiler optimizations for improving data locality
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
A Memory Interference Model for Regularly Patterned Multiple Stream Vector Accesses
IEEE Transactions on Parallel and Distributed Systems
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Influence of cross-interferences on blocked loops: a case study with matrix-vector multiply
ACM Transactions on Programming Languages and Systems (TOPLAS)
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Design of cache memories for multi-threaded dataflow architecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Skewed associativity enhances performance predictability
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Data forwarding in scalable shared-memory multiprocessors
ICS '95 Proceedings of the 9th international conference on Supercomputing
A data cache with multiple caching strategies tuned to different types of locality
ICS '95 Proceedings of the 9th international conference on Supercomputing
SPAID: software prefetching in pointer- and call-intensive environments
Proceedings of the 28th annual international symposium on Microarchitecture
Cache miss heuristics and preloading techniques for general-purpose programs
Proceedings of the 28th annual international symposium on Microarchitecture
Efficient and language-independent mobile programs
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Thread scheduling for cache locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data prefetching and multilevel blocking for linear algebra operations
ICS '96 Proceedings of the 10th international conference on Supercomputing
Examination of a memory access classification scheme for pointer-intensive and numeric programs
ICS '96 Proceedings of the 10th international conference on Supercomputing
Block algorithms for sparse matrix computations on high performance workstations
ICS '96 Proceedings of the 10th international conference on Supercomputing
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Data Forwarding in Scalable Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Speeding up protocols for small messages
Conference proceedings on Applications, technologies, architectures, and protocols for computer communications
Fusion of Loops for Parallelism and Locality
IEEE Transactions on Parallel and Distributed Systems
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A compiler algorithm for optimizing locality in loop nests
ICS '97 Proceedings of the 11th international conference on Supercomputing
Cache miss equations: an analytical representation of cache misses
ICS '97 Proceedings of the 11th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Determining the idle time of a tiling
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
System support for automatic profiling and optimization
Proceedings of the sixteenth ACM symposium on Operating systems principles
The design and performance of a conflict-avoiding cache
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Tuning compiler optimizations for simultaneous multithreading
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Compiler blockability of dense matrix factorizations
ACM Transactions on Mathematical Software (TOMS)
A general algorithm for tiling the register level
ICS '98 Proceedings of the 12th international conference on Supercomputing
Eliminating conflict misses for high performance architectures
ICS '98 Proceedings of the 12th international conference on Supercomputing
Informing memory operations: memory performance feedback mechanisms and their applications
ACM Transactions on Computer Systems (TOCS)
A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems
IEEE Transactions on Parallel and Distributed Systems
Precise miss analysis for program transformations with caches of arbitrary associativity
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Augmenting Loop Tiling with Data Alignment for Improved Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
Improving Cache Locality by a Combination of Loop and Data Transformations
IEEE Transactions on Computers - Special issue on cache memory and related problems
Randomized Cache Placement for Eliminating Conflicts
IEEE Transactions on Computers - Special issue on cache memory and related problems
Cache optimization in scientific computations
Proceedings of the 1999 ACM symposium on Applied computing
A Comparative Analysis of Cache Designs for Vector Processing
IEEE Transactions on Computers
Cache conscious programming in undergraduate computer science
SIGCSE '99 The proceedings of the thirtieth SIGCSE technical symposium on Computer science education
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Cache-conscious structure layout
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Improving memory hierarchy performance for irregular applications
ICS '99 Proceedings of the 13th international conference on Supercomputing
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
A tile selection algorithm for data locality and cache interference
ICS '99 Proceedings of the 13th international conference on Supercomputing
An integer linear programming approach for optimizing cache locality
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Analytical Modeling of Set-Associative Cache Behavior
IEEE Transactions on Computers
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks
ACM Transactions on Computer Systems (TOCS)
Locality optimizations for multi-level caches
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Towards a theory of cache-efficient algorithms
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
An analytical model of the working-set sizes in decision-support systems
Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Tuning Compiler Optimizations for Simultaneous Multithreading
International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Symbolic Cache Analysis for Real-Time Systems
Real-Time Systems - Special issue on worst-case execution-time analysis
On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Transforming loops to recursion for multi-level memory hierarchies
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
A compiler technique for improving whole-program locality
POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Cache conscious data layout organization for embedded multimedia applications
Proceedings of the conference on Design, automation and test in Europe
Tiling imperfectly-nested loop nests
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tiling optimizations for 3D scientific computations
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Towards effective embedded processors in codesigns: customizable partitioned caches
Proceedings of the ninth international symposium on Hardware/software codesign
A dynamic locality optimization algorithm for linear algebra codes
Proceedings of the 2001 ACM symposium on Applied computing
Loop optimization for a class of memory-constrained computations
ICS '01 Proceedings of the 15th international conference on Supercomputing
ICS '01 Proceedings of the 15th international conference on Supercomputing
Reducing memory requirements of nested loops for embedded systems
Proceedings of the 38th annual Design Automation Conference
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Optimal tiling for minimizing communication in distributed shared-memory multiprocessors
Compiler optimizations for scalable parallel systems
Source code transformation based on software cost analysis
Proceedings of the 14th international symposium on Systems synthesis
Static and Dynamic Locality Optimizations Using Integer Linear Programming
IEEE Transactions on Parallel and Distributed Systems
Data Relation Vectors: A New Abstraction for Data Optimizations
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Efficient Representation Scheme for Multidimensional Array Operations
IEEE Transactions on Computers
IEEE Transactions on Computers
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Loop re-ordering and pre-fetching at run-time
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Design space optimization of embedded memory systems via data remapping
Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Computation regrouping: restructuring programs for temporal data cache locality
ICS '02 Proceedings of the 16th international conference on Supercomputing
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests
International Journal of Parallel Programming
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
Array form representation of idiom recognition system for numerical programs
Proceedings of the 2001 conference on APL: an arrays odyssey
Low-power data memory communication for application-specific embedded processors
Proceedings of the 15th international symposium on System Synthesis
Increasing temporal locality with skewing and recursive blocking
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping
The Journal of Supercomputing
An I/O-Conscious Tiling Strategy for Disk-Resident Data Sets
The Journal of Supercomputing
Precise Data Locality Optimization of Nested Loops
The Journal of Supercomputing
Compilation of Vector Statements of C[] Language for Architectures with Multilevel Memory Hierarchy
Programming and Computer Software
Towards a theory of cache-efficient algorithms
Journal of the ACM (JACM)
Run-time and compile-time support for adaptive irregular problems
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Combining Loop Transformations Considering Caches and Scheduling
International Journal of Parallel Programming
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
International Journal of Parallel Programming
Data-Centric Transformations for Locality Enhancement
International Journal of Parallel Programming
Achieving Scalable Locality with Time Skewing
International Journal of Parallel Programming
Skewed Associativity Improves Program Performance and Enhances Predictability
IEEE Transactions on Computers
A Layout-Conscious Iteration Space Transformation Technique
IEEE Transactions on Computers
Runtime Support and Compilation Methods for User-Specified Irregular Data Distributions
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
On Supernode Transformation with Minimized Total Running Time
IEEE Transactions on Parallel and Distributed Systems
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
On Time Optimal Supernode Shape
IEEE Transactions on Parallel and Distributed Systems
Probabilistic Miss Equations: Evaluating Memory Hierarchy Performance
IEEE Transactions on Computers
Data remapping for design space optimization of embedded memory systems
ACM Transactions on Embedded Computing Systems (TECS)
HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Memory Architectures for Embedded Systems-On-Chip
HiPC '02 Proceedings of the 9th International Conference on High Performance Computing
Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Cache-Efficient Multigrid Algorithms
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
False Sharing Elimination by Selection of Runtime Scheduling Parameters
ICPP '97 Proceedings of the International Conference on Parallel Processing
Improving the Performance of Out-of-Core Computations
ICPP '97 Proceedings of the International Conference on Parallel Processing
A Memory Controller for Improved Performance of Streamed Computations on Symmetric Multiprocessors
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Optimizing Graph Algorithms for Improved Cache Performance
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
An Efficient Technique for Corner-Turn in SAR Image Reconstruction by Improving Cache Access
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Advanced Data Layout Optimization for Multimedia Applications
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Compiler and Run-Time Support for Improving Locality in Scientific Codes
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Experimental Evaluation of Energy Behavior of Iteration Space Tiling
LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation - SAMOS
Cache Remapping to Improve the Performance of Tiled Algorithms
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Data Sequence Locality: A Generalization of Temporal Locality
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Cache Models for Iterative Compilation
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
I/O-Conscious Tiling for Disk-Resident Data Sets
Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
ISCOPE '98 Proceedings of the Second International Symposium on Computing in Object-Oriented Parallel Environments
Efficient Sorting Using Registers and Caches
WAE '00 Proceedings of the 4th International Workshop on Algorithm Engineering
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance
WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Reducing Cache Conflicts by a Parametrized Memory Mapping
ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Software Controlled Reconfigurable On-Chip Memory for High Performance Computing
IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Improving Cache Effectiveness through Array Data Layout Manipulation in SAC
IFL '00 Selected Papers from the 12th International Workshop on Implementation of Functional Languages
Architectures for an Efficient Application Execution in a Collection of HNOWS
Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Cache Line Impact on 3D PDE Solvers
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Memory System Support for Irregular Applications
LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Better tiling and array contraction for compiling scientific programs
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Embedded processor design challenges
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework
IEEE Transactions on Parallel and Distributed Systems
Compiler-directed instruction cache leakage optimization
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Data cache locking for higher program predictability
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A comparison of empirical and model-driven optimization
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness
CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
A compiler approach for reducing data cache energy
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications
EDTC '97 Proceedings of the 1997 European conference on Design and Test
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Access ordering and memory-conscious cache utilization
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Program balance and its impact on high performance RISC architectures
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Distributed Prefetch-buffer/Cache Design for High Performance Memory Systems
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Cache Performance and Algorithm Optimization
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Reference Distance as a Metric for Data Locality
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Performance Improvement for Matrix Calculation on CP-PACS Node Processor
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Automatic exploitation of dual level parallelism on a network of multiprocessors
HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
Address Code and Arithmetic Optimizations for Embedded Systems
ASP-DAC '02 Proceedings of the 2002 Asia and South Pacific Design Automation Conference
SCIMA: A Novel Architecture for High Performance Computing
IWIA '99 Proceedings of the 1999 International Workshop on Innovative Architecture
Highly accurate and efficient evaluation of randomising set index functions
Journal of Systems Architecture: the EUROMICRO Journal
Code Transformations for Low Power Caching in Embedded Multimedia Processors
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
IEEE Transactions on Parallel and Distributed Systems
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
Efficient sorting using registers and caches
Journal of Experimental Algorithmics (JEA)
Array Regrouping and Its Use in Compiling Data-Intensive Embedded Applications
IEEE Transactions on Computers
Data Caches in Multitasking Hard Real-Time Systems
RTSS '03 Proceedings of the 24th IEEE International Real-Time Systems Symposium
Transforming Complex Loop Nests for Locality
The Journal of Supercomputing
A Quantitative Analysis of Tile Size Selection Algorithms
The Journal of Supercomputing
Single Assignment C: efficient support for high-level array operations in a functional setting
Journal of Functional Programming
Analysis and Modeling of Energy Reducing Source Code Transformations
Proceedings of the conference on Design, automation and test in Europe - Volume 3
A fast and accurate framework to analyze and optimize cache memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving effective bandwidth through compiler enhancement of global cache reuse
Journal of Parallel and Distributed Computing
Quantification of memory communication
High performance scientific and engineering computing
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior
IEEE Transactions on Computers
Reducing instruction cache energy consumption using a compiler-based strategy
ACM Transactions on Architecture and Code Optimization (TACO)
A data locality optimizing algorithm
ACM SIGPLAN Notices - Best of PLDI 1979-1999
A compiler tool to predict memory hierarchy performance of scientific codes
Parallel Computing
Restructuring computations for temporal data cache locality
International Journal of Parallel Programming
Optimizing Graph Algorithms for Improved Cache Performance
IEEE Transactions on Parallel and Distributed Systems
Quasidynamic Layout Optimizations for Improving Data Locality
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Computers
Automatic tiling of iterative stencil loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
A Constraint Network Based Approach to Memory Layout Optimization
Proceedings of the conference on Design, Automation and Test in Europe - Volume 2
A Geometric Programming Framework for Optimal Multi-Level Tiling
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Software-Directed Disk Power Management for Scientific Applications
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
SCIMA-SMP: on-chip memory processor architecture for SMP
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
A case for a working-set-based memory hierarchy
Proceedings of the 2nd conference on Computing frontiers
Hyperplane Grouping and Pipelined Schedules: How to Execute Tiled Loops Fast on Clusters of SMPs
The Journal of Supercomputing
Automatic blocking of QR and LU factorizations for locality
MSP '04 Proceedings of the 2004 workshop on Memory system performance
Data space-oriented tiling for enhancing locality
ACM Transactions on Embedded Computing Systems (TECS)
Memory Performance Optimizations For Real-Time Software HDTV Decoding
Journal of VLSI Signal Processing Systems
Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions
Journal of VLSI Signal Processing Systems
Statistical Models for Empirical Search-Based Performance Tuning
International Journal of High Performance Computing Applications
Cache-Efficient Multigrid Algorithms
International Journal of High Performance Computing Applications
Sparsity: Optimization Framework for Sparse Matrix Kernels
International Journal of High Performance Computing Applications
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion
International Journal of High Performance Computing Applications
Visualizing Industrial CT Volume Data for Nondestructive Testing Applications
Proceedings of the 14th IEEE Visualization 2003 (VIS'03)
CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Reducing data cache leakage energy using a compiler-based approach
ACM Transactions on Embedded Computing Systems (TECS)
An accurate cost model for guiding data locality transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction Expressions
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Compiler-directed high-level energy estimation and optimization
ACM Transactions on Embedded Computing Systems (TECS)
Reduction Transformations for Optimization Parameter Selection
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Automatic benchmark generation for cache optimization of matrix operations
ACM-SE 33 Proceedings of the 33rd annual on Southeast regional conference
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
International Journal of High Performance Computing Applications
Multi-compilation: capturing interactions among concurrently-executing applications
Proceedings of the 3rd conference on Computing frontiers
Integrating loop and data optimizations for locality within a constraint network based framework
ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Empirical optimization for a sparse linear solver: a case study
International Journal of Parallel Programming - Special issue: The next generation software program
A metaprogramming approach to generating optimized code for algorithms in linear algebra
Proceedings of the 43rd annual Southeast regional conference - Volume 1
A memory model for scientific algorithms on graphics processors
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Cache-Friendly implementations of transitive closure
Journal of Experimental Algorithmics (JEA)
Message-passing code generation for non-rectangular tiling transformations
Parallel Computing
Cache oblivious algorithms for nonserial polyadic programming
The Journal of Supercomputing
Iterative compilation for energy reduction
Journal of Embedded Computing - Cache exploitation in embedded systems
The rise and fall of High Performance Fortran: an historical object lesson
Proceedings of the third ACM SIGPLAN conference on History of programming languages
Impulse: Memory system support for scientific applications
Scientific Programming
$P^3T+$: A performance estimator for distributed and parallel programs
Scientific Programming
A One's Complement Cache Memory
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
The Nachos instructional operating system
USENIX'93 Proceedings of the USENIX Winter 1993 Conference
Precise automatable analytical modeling of the cache behavior of codes with indirections
ACM Transactions on Architecture and Code Optimization (TACO)
Cache-efficient numerical algorithms using graphics hardware
Parallel Computing
Data cache locking for tight timing calculations
ACM Transactions on Embedded Computing Systems (TECS)
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded GPU
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Fast indexing for blocked array layouts to reduce cache misses
International Journal of High Performance Computing and Networking
Analyzing memory access intensity in parallel programs on multicore
Proceedings of the 22nd annual international conference on Supercomputing
Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputers
CHINA HPC '07 Proceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing
Unfavorable Strides in Cache Memory Systems (RNR Technical Report RNR-92-015)
Scientific Programming
Positivity, posynomials and tile size selection
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Adaptive Loop Tiling for a Multi-cluster CMP
ICA3PP '08 Proceedings of the 8th international conference on Algorithms and Architectures for Parallel Processing
Simultaneous minimization of capacity and conflict misses
Journal of Computer Science and Technology
Enabling software management for multicore caches with a lightweight hardware support
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Optimizing shared cache behavior of chip multiprocessors
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Parallel lattice Boltzmann method with blocked partitioning
International Journal of Parallel Programming - Special issue on the 19th international symposium on computer architecture and high performance computing (SBAC-PAD 2007)
Algorithms for memory hierarchies: advanced lectures
Algorithms for memory hierarchies: advanced lectures
Dependence-based code generation for a CELL processor
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Automatic creation of tile size selection models
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Using non-canonical array layouts in dense matrix operations
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Evaluating ISA support and hardware support for recursive data layouts
HiPC'07 Proceedings of the 14th international conference on High performance computing
Empirical study for optimization of power-performance with on-chip memory
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
New data structures for matrices and specialized inner kernels: low overhead for high performance
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Software data spreading: leveraging distributed caches to improve single thread performance
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Fast field solver for the simulation of large-area OLEDs
Microelectronics Journal
Optimized dense matrix multiplication on a many-core architecture
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Code scheduling for optimizing parallelism and data locality
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
On the interaction of tiling and automatic parallelization
IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Data locality and parallelism optimization using a constraint-based approach
Journal of Parallel and Distributed Computing
A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness
CASCON First Decade High Impact Papers
ULCC: a user-level facility for optimizing shared cache performance on multicores
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
ACM Transactions on Algorithms (TALG)
Combining measures for temporal and spatial locality
ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Applying data copy to improve memory performance of general array computations
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Optimizing matrix multiplication with a classifier learning system
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Using platform-specific performance counters for dynamic compilation
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Optimizing explicit data transfers for data parallel applications on the cell architecture
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Parallelized sigma-point Kalman filtering for structural dynamics
Computers and Structures
Automatic memory optimizations for improving MPI derived datatype performance
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
A study on load imbalance in parallel hypermatrix multiplication using OpenMP
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Adapting linear algebra codes to the memory hierarchy using a hypermatrix scheme
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Tuning blocked array layouts to exploit memory hierarchy in SMT architectures
PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
Optimizing data locality using array tiling
Proceedings of the International Conference on Computer-Aided Design
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
An ILP-Based approach to locality optimization
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A matrix-type for performance–portability
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Efficient execution of scientific computation on geographically distributed clusters
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion
ACM Transactions on Mathematical Software (TOMS)
Implementing P systems parallelism by means of GPUs
WMC'09 Proceedings of the 10th international conference on Membrane Computing
Automated programmable control and parameterization of compiler optimizations
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
POET: a scripting language for applying parameterized source-to-source program transformations
Software—Practice & Experience
Analytical bounds for optimal tile size selection
CC'12 Proceedings of the 21st international conference on Compiler Construction
Cache-sensitive MapReduce DGEMM algorithms for shared memory architectures
Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
Enhancing GPU parallelism in nature-inspired algorithms
The Journal of Supercomputing
Strategies for improving performance and energy efficiency on a many-core
Proceedings of the ACM International Conference on Computing Frontiers
Adaptive parallel tiled code generation and accelerated auto-tuning
International Journal of High Performance Computing Applications
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
ACM Transactions on Architecture and Code Optimization (TACO)
The Journal of Supercomputing