A digital signal processor (DSP) is a special-purpose microprocessor designed to achieve higher performance on DSP applications. Because most DSP applications contain many nested loops and permit a very high degree of parallelism, a DSP multiprocessor is a suitable architecture for executing them. Unfortunately, conventional scheduling methods for DSP multiprocessors allocate only one operation to each DSP per time unit, even if the DSP includes several function units that can operate in parallel, so they cannot achieve full function-unit utilization. In this paper, we propose a two-level scheduling method (TSM) to overcome this common failing. TSM contains two approaches, which integrate unimodular transformations, the loop tiling technique, and conventional methods used on a single DSP. Besides introducing the algorithm, we use an analytic model to give a preliminary analysis of its performance. Based on our analyses, TSM can achieve shorter execution time and more scalable speedup. In addition, TSM causes less memory access and synchronization overheads, which are usually negligible in the DSP multiprocessor architecture.