Loop transformations for architectures with partitioned register banks

Document Type

Conference Proceeding

Publication Date



© ACM 2001. Embedded systems require maximum performance from a processor within significant constraints in power consumption and chip cost. Using software pipelining, processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance, at the cost of substantially increasing register requirements. These increasing register requirements, however, make it difficult to build a high-performance embedded processor with a single, multi-ported register file while maintaining clock speed and limiting power consumption. Some digital signal processors, such as the TI C6x, reduce the number of ports required for a register bank by partitioning the register bank into multiple banks. Disjoint subsets of functional units are directly connected to one of the partitioned register banks. Each register bank and its associated functional units is called a cluster. Clustering reduces the number of ports needed on a per-bank basis, allowing an increased clock rate. However, execution speed can be hampered because of the potential need to copy "non-local" operands among register banks in order to make them available to the functional unit performing an operation. The task of the compiler is to both maximize parallelism and minimize the number of remote register accesses needed. Previous work has concentrated on methods to partition virtual registers amongst the target architecture's clusters. In this paper, we show how high-level loop transformations can enhance the partitioning obtained by low-level schemes. In our experiments, loop transformations improved software pipelining by 27% on a machine with 2 clusters, each having 1 floating-point and 1 integer register bank and 4 functional units. We also observed a 20% improvement on a similar machine with 4 clusters of 2 functional units. In fact, by performing the described loop transformations we were able to show improvements of greater than 10% over schedules (for untransformed loops) generated with the unrealistic assumption of a single multi-ported register bank.
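
As a rough illustration of the cost that partitioning tries to minimize, the sketch below counts the inter-cluster copies implied by a given assignment of operations to clusters. It is not the paper's algorithm: the function name `count_intercluster_copies`, the edge-list representation of the data dependence graph, and the assumption that one copy per (value, destination cluster) pair suffices are all hypothetical simplifications.

```python
from typing import Dict, Iterable, Set, Tuple


def count_intercluster_copies(
    edges: Iterable[Tuple[str, str]],
    cluster_of: Dict[str, int],
) -> int:
    """Count copies needed to make non-local operands available.

    edges: (producer, consumer) pairs from a data dependence graph.
    cluster_of: hypothetical mapping from operation name to cluster index.
    Assumption: a value copied once into a cluster can be reused there,
    so each (producer, destination cluster) pair costs one copy.
    """
    copies: Set[Tuple[str, int]] = set()
    for producer, consumer in edges:
        src = cluster_of[producer]
        dst = cluster_of[consumer]
        if src != dst:
            # The consumer's cluster cannot read the producer's register bank.
            copies.add((producer, dst))
    return len(copies)


# A small dependence graph: c uses a and b, d uses a and c.
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "d")]

# Partition A places producers and consumers in different clusters.
partition_a = {"a": 0, "b": 0, "c": 1, "d": 1}
# Partition B keeps most of the dependence chain in one cluster.
partition_b = {"a": 0, "b": 1, "c": 0, "d": 0}

copies_a = count_intercluster_copies(edges, partition_a)
copies_b = count_intercluster_copies(edges, partition_b)
```

Under these assumptions, partition B needs fewer copies than partition A, but it also places three of the four operations on one cluster, which hints at the tension between parallelism and remote register accesses described in the abstract.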

Publication Title

Proceedings of the 2001 ACM SIGPLAN Workshop on Optimization of Middleware and Distributed Systems, OM 2001