Loop transformations for architectures with partitioned register banks

Document Type

Conference Proceeding

Publication Date

8-1-2001

Abstract

© ACM 2001. Embedded systems require maximum performance from a processor within significant constraints in power consumption and chip cost. Using software pipelining, processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance, at the cost of substantially increasing register requirements. These increasing register requirements, however, make it difficult to build a high-performance embedded processor with a single, multi-ported register file while maintaining clock speed and limiting power consumption. Some digital signal processors, such as the TI C6x, reduce the number of ports required for a register bank by partitioning the register bank into multiple banks. Disjoint subsets of functional units are directly connected to one of the partitioned register banks. Each register bank and its associated functional units is called a cluster. Clustering reduces the number of ports needed on a per-bank basis, allowing an increased clock rate. However, execution speed can be hampered because of the potential need to copy "non-local" operands among register banks in order to make them available to the functional unit performing an operation. The task of the compiler is to both maximize parallelism and minimize the number of remote register accesses needed. Previous work has concentrated on methods to partition virtual registers amongst the target architecture's clusters. In this paper, we show how high-level loop transformations can enhance the partitioning obtained by low-level schemes. In our experiments, loop transformations improved software pipelining by 27% on a machine with 2 clusters, each having 1 floating-point and 1 integer register bank and 4 functional units. We also observed a 20% improvement on a similar machine with 4 clusters of 2 functional units. In fact, by performing the described loop transformations we were able to show improvements of greater than 10% over schedules (for un-transformed loops) generated with the unrealistic assumption of a single multi-ported register bank.

Publication Title

LCTES 2001 - Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems
