With the advance of integration technology, it has become feasible to implement a microprocessor, a vector unit, and a multimegabyte bank-interleaved L2 cache on a single die. Parallel access to strided vectors in the L2 cache is a major performance issue on such vector microprocessors. The main difficulty is that one would like to interleave the cache at cache-block granularity, in order to benefit from spatial locality and to keep the tag volume low, while strided vector accesses naturally work at word granularity. In this paper, we address this issue. Considering a parallel vector unit with 2^n independent lanes, a 2^n-bank interleaved cache, and a cache line size of 2^k words, we show that any slice of 2^{n+k} consecutive elements of any strided vector with stride 2^r·R, where R is odd and r ≤ k, can be accessed in the L2 cache and routed back to the lanes in 2^k subslices of 2^n elements.
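The counting argument behind this claim can be checked by brute force: with line-interleaved banking, element i of the slice lives at word address base + i·2^r·R, and its bank is given by bits k..k+n-1 of that address. If every one of the 2^n banks is referenced exactly 2^k times across the slice, the references can be distributed into 2^k conflict-free subslices that each touch every bank once. The sketch below (a verification of this necessary counting condition, not the paper's actual routing network) enumerates the bank references for a few parameter choices; the function and parameter names are illustrative.

```python
from collections import Counter

def bank_histogram(n, k, r, R, base=0):
    """Count, per bank, the references made by a slice of 2**(n+k)
    consecutive elements of a vector with stride 2**r * R (R odd,
    r <= k), on a cache with 2**n banks interleaved on 2**k-word lines."""
    assert R % 2 == 1 and r <= k
    stride = (1 << r) * R
    banks = Counter()
    for i in range(1 << (n + k)):
        addr = base + i * stride             # word address of element i
        banks[(addr >> k) % (1 << n)] += 1   # line-interleaved bank index
    return banks

# Every bank is hit exactly 2**k times, for any odd R, any r <= k and
# any base, so the slice can be served as 2**k subslices of 2**n
# elements with one reference per bank in each subslice.
for (n, k, r, R, base) in [(2, 3, 1, 5, 7), (3, 2, 2, 3, 0), (1, 4, 4, 9, 13)]:
    h = bank_histogram(n, k, r, R, base)
    assert len(h) == 1 << n
    assert all(c == 1 << k for c in h.values())
```

Note that the condition r ≤ k matters: the subgroup of addresses generated by a stride of 2^r·R then covers the bank-index bits uniformly, which is exactly what the histogram check confirms.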