Eager Scheduling of Dependent Instructions

Modern superscalar processors can potentially issue and execute multiple instructions per cycle. Over the years, several techniques have focused on increasing the Instruction Level Parallelism (ILP) that a processor can exploit. However, many limitations of ILP hinder performance, chief among them being the chains of dependencies between instructions that prevent instructions from being executed in parallel. We propose a new micro-architecture design which extends the superscalar pipeline with a dataflow pipeline, where the dataflow part identifies immediately dependent instructions and executes them early. The dataflow pipeline is able to identify redundant instructions, track changes in the operands of the redundant instructions and execute new instructions early in case of an operand change. Our design helps alleviate some of the main limitations of ILP.


Introduction
Superscalar architectures can potentially issue and execute multiple independent instructions per cycle. The parallelism that machines exploit at the instruction level is called Instruction Level Parallelism (ILP). Register renaming techniques eliminate false dependencies in an out-of-order superscalar and help extract more ILP. A lot of research has focused on maximizing ILP in superscalar processors. However, some fundamental limitations of ILP prevent the processor from exploiting it fully.
One limitation of ILP is related to the size of the issue window. A small issue window cannot hold enough independent instructions at any given time. Increasing the window size, on the other hand, adds hardware complexity, which makes it difficult to maintain a high clock speed. Another limitation of ILP is associated with the control flow of a program. Branch prediction in a processor delays the fetching of instructions until the correct target address becomes known. It is not practical to fetch instructions from more than one target address in a single cycle. This delays the filling of the issue window even on a correct prediction.
Another limitation of ILP is associated with data cache misses. Load instructions are often the first instruction in a dependency chain and they often miss in the data cache. These misses translate to the delay of all instructions that are dependent on the load instruction. A non-blocking cache reduces the impact of a data cache miss on ILP but misses still affect ILP performance.
A further limitation of ILP is the inherently sequential portion of a computation. True dependencies in code can never be parallelized, and this sequential component severely restricts ILP.
Past work has tried to overcome the limitations of ILP by dynamically detecting and eliminating redundant computation. We believe that, along with eliminating redundant instructions, the processor should also try to execute the next group of immediately dependent instructions. Executing dependent instructions earlier reduces the time to process instructions along the critical path and thus achieves a higher ILP.
We propose a new micro-architecture design which extends the superscalar pipeline with a dataflow pipeline, where the dataflow part identifies immediately dependent instructions and executes them early. The dataflow pipeline is able to identify redundant instructions, track changes in the operands of the redundant instructions and execute new instructions early in case of an operand change.
Compiler techniques are able to capture a large amount of static redundancy in computations. However, studies [1] [2] indicate that a large amount of redundancy in programs is actually dynamic.
Sodani and Sohi [3] introduced the concept of dynamic instruction reuse. They store the results of a previously executed instruction in a hardware structure called the Reuse Buffer (RB) (Figure 2.1).

Figure 2.1: Reuse Buffer
After an instruction is decoded, the RB is searched to determine whether a valid result from its previous execution is available. This search is done with the use of the
The second scheme is based on register names (Sn). In this scheme, the operand architectural registers of an instruction are stored in the RB along with its result. Any write to an architectural register is broadcast to the RB for invalidation. The reuse test in this case involves checking whether the corresponding RB entry is valid.
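The register-name scheme can be illustrated with a minimal sketch. It assumes an unbounded buffer with one entry per PC and ignores replacement; the class and method names (ReuseBufferSn, on_register_write, lookup) are hypothetical and are not taken from [3].

```python
from dataclasses import dataclass


@dataclass
class RBEntry:
    src_regs: tuple      # architectural source register names
    result: int          # result of the previous execution
    valid: bool = True


class ReuseBufferSn:
    """Toy reuse buffer indexed by PC, using the register-name scheme."""

    def __init__(self):
        self.entries = {}  # pc -> RBEntry (assumption: no capacity limit)

    def insert(self, pc, src_regs, result):
        self.entries[pc] = RBEntry(tuple(src_regs), result)

    def on_register_write(self, reg):
        # Every architectural register write is broadcast to the buffer;
        # entries that read the written register are invalidated.
        for entry in self.entries.values():
            if reg in entry.src_regs:
                entry.valid = False

    def lookup(self, pc):
        # Reuse test: an entry exists for this PC and is still valid.
        entry = self.entries.get(pc)
        if entry is not None and entry.valid:
            return entry.result
        return None


rb = ReuseBufferSn()
rb.insert(pc=0x40, src_regs=("r1", "r2"), result=7)
before_write = rb.lookup(0x40)   # reusable: no source has been rewritten
rb.on_register_write("r2")
after_write = rb.lookup(0x40)    # no longer reusable
```

In this sketch the reuse test never compares operand values; validity alone decides reuse, which is what distinguishes the scheme from a value-based test.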
The third scheme uses register names and dependence chain information to extend Sn (Sn+d). In this scheme, a source index field is added along with each operand register as in Sn. The source index field stores the RB index of the instruction that produces the source operand. This creates a dependency chain among the instructions in the RB. The reuse test for independent instructions is the same as in Sn. A dependent instruction is valid if the producers recorded in its source index fields are the latest producers of those registers. Invalidation of independent instructions is the same as in Sn. Dependent instructions are invalidated when the producers of their source operands are evicted from the RB.
Richardson [4] observes that long latency operations take up a significant portion of computation time. A result cache is proposed which stores the results of such long latency operations. The result cache is indexed using a hashing of an instruction's source operands. Access to the result cache can be initiated at or before the time of a long latency instruction operation. A hit in the cache makes the result of the instruction available instantly and the already issued instruction can be halted/killed. The execution continues normally on a cache miss and updates the cache after completing execution.
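A result cache of this kind can be sketched as a small direct-mapped table. The hash function (XOR of the operand values), the table size and the opcode tag are assumptions made for illustration only; [4] is not being quoted here.

```python
class ResultCache:
    """Toy direct-mapped result cache indexed by a hash of the source operands."""

    def __init__(self, num_lines=64):
        self.num_lines = num_lines
        self.lines = [None] * num_lines   # each line: (opcode, a, b, result)

    def _index(self, a, b):
        # Hypothetical hash: XOR of the operands folded to the table size.
        return (a ^ b) % self.num_lines

    def lookup(self, opcode, a, b):
        line = self.lines[self._index(a, b)]
        if line is not None and line[:3] == (opcode, a, b):
            return line[3]
        return None

    def update(self, opcode, a, b, result):
        self.lines[self._index(a, b)] = (opcode, a, b, result)


def execute_div(cache, a, b):
    """Return (result, hit). On a hit the long latency divide is skipped."""
    cached = cache.lookup("div", a, b)
    if cached is not None:
        return cached, True
    result = a // b                    # stands in for the long latency unit
    cache.update("div", a, b, result)  # the cache is updated after execution
    return result, False


cache = ResultCache()
first = execute_div(cache, 84, 2)    # miss: executes and fills the cache
second = execute_div(cache, 84, 2)   # hit: the result is available at once
```

Because the table is indexed by operand values rather than by PC, a different static instruction that divides the same two values would hit on the same line.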
There are a number of key differences between the reuse buffer and the result cache.
The foremost difference is that the reuse buffer is indexed with the address (PC) of the instruction while the result cache is indexed with a hash of an instruction's source operands. This difference translates to when the result of a reuse is available to the processor (reuse latency). Access to the reuse buffer can be initiated as soon as the instruction is fetched. On the other hand, the result cache can only be accessed after the instruction has been decoded, renamed and its source operands pass through the hashing algorithm. Thus the reuse buffer makes the result of a reuse available at least one cycle earlier than the result cache. One drawback of the reuse buffer is that since it is indexed with PC, it can only reuse dynamic instances of the same static instruction. A different static instruction with the same source operands (and consequently the same result) will not hit in the reuse buffer. However, since the result cache is indexed with a hash of its source operands, multiple instructions using the same source operands will hit in the result cache.
One advantage of the reuse buffer over the result cache is that the entire fetch block can search the reuse buffer in parallel since it is indexed by PC. The dependencies among instructions can then be checked in the same manner as parallel renaming. In the case of the result cache, parallel lookup of instructions cannot happen since the result of an instruction is needed as the source of another dependent instruction. Thus access to the result cache is inherently sequential and cannot happen in a single cycle.
Molina et al. [5] propose the Redundant Computation Buffer (RCB), which seeks to combine the merits of both the reuse buffer and the result cache. The proposed RCB has the same reuse latency as the reuse buffer while also being able to identify reuse across different static instructions. On average, the RCB is able to reuse around 30% of all dynamic instructions.
Yi et al. [6] observe that some redundant computations in the reuse buffer may be evicted, re-executed and re-stored in the reuse buffer. They go on to say that such computations with a low frequency of execution hurt the effectiveness of storing instructions in a reuse buffer. They introduce a novel approach called instruction precomputation which involves profiling of the program before its execution. Profiling determines the redundant computations with the highest frequencies of execution and stores them in the Precomputation Table (PT) before program execution. The PT is checked during execution to determine a successful instruction reuse. The PT is loaded during the profiling step and does not undergo replacement or eviction of entries. Their approach outperforms similar instruction reuse techniques for similar table sizes while providing a decrease in area, cycle time and port usage.
Several studies have examined the concept of value prediction to achieve a higher ILP. Some of them [7] focus on load value predictions while others [8] extend the concept to predict the values of any instructions that write their result to a register.
Value prediction helps in breaking down true data dependency chains by predicting the result of an instruction and allowing the dependent instruction to speculatively execute using the predicted value. Re-execution is necessary in case of a misprediction, but many value prediction based techniques have shown a significant increase in ILP.
Value prediction and value reuse capture distinct parts of redundancy in a program.
Liao and Shieh [9] propose an architecture which combines the two techniques. They use information from the value prediction table to produce a speculative result from the value reuse table. By combining the two techniques, they are able to achieve a speedup of around 8% over the baseline.
Gellert et al. [10] observe that branches that depend on dynamic values during execution correspond to a majority of branch misspeculations, even in modern state of the art branch predictors. These branches eat up a lot of cycles during misprediction recovery. They observed that more than 30% of branches are dependent on critical load instructions (instructions which miss in the L2 cache) and around 25% of them depend on the result of a multiply or division operation. They postulate that by reducing the latency of these high latency operations, the dependent branches would be executed early and thus reduce the misprediction penalty. To accomplish this, they use a Reuse Buffer for multiply and division instructions. Another table is implemented which serves as a value predictor for loads that miss in the L1 data cache. By eliminating redundant long latency operations and predicting critical loads, they are able to obtain a speedup of 3.5% in Spec integer benchmarks and around 23% in Spec floating point benchmarks.
Golander and Weiss [11] explore the significance of instruction reuse in checkpoint processors. Checkpoint processors are known for their fast misprediction recovery rate. The recovery process involves two steps: bringing the architecture state back to the last safe checkpoint (rollback), and re-executing the instruction sequence between the safe checkpoint and the mispredicted instruction. They discovered that a large number of instructions in integer benchmarks re-execute after the state has been restored to the previous checkpoint and a large fraction (nearly 92%) of these instructions already have a result available by the time the misspeculation is detected. They propose that by reusing the instructions along the re-execution path, a large fraction of redundant computation can be avoided, thus leading to higher ILP.
Huang and Lilja [12] observe that there is a strong correlation between the inputs and outputs of a chain of instructions. They argue that reuse at a basic block level (in contrast to reuse at instruction level) will reduce execution time further while also consuming less hardware. To exploit block reuse, the authors propose a Block History Buffer which stores dynamically determined basic block boundaries along with its inputs and live outputs. The entire basic block is squashed if there is a hit in the block history buffer with a particular series of inputs.
Continuing with the trend to increase reuse granularity, many researchers have explored function level reuse. Kavi and Chen [13] replace the entries in the reuse buffer from that of an instruction to that of a function. The reuse buffer is indexed by the PC of the function call and stores the inputs and result of the function. Access to the reuse buffer is initiated at the same time as fetch and a hit in the reuse buffer skips the entire function by correctly changing the PC.

Chapter 3
Exploiting redundancy and its interaction with the fetch engine
There are many parameters that affect the performance of a superscalar. When dealing with code which is highly independent, two parameters become very important
The solution to increasing ILP with dynamic reuse is to increase the fetch width of the processor. The fetch width was not considered to be an important parameter for increasing performance since a machine without dynamic reuse cannot issue more instructions than the issue width. Since instructions can now be squashed earlier in the pipeline, other instructions from the now-widened fetch group may be able to proceed to the execution units.
Figure 3.3. Since i1 and i3 have valid results in the buffer, they will be squashed before they enter the issue window. This allows i4 and i5 to proceed to the execution units one cycle earlier than they would have in the baseline. At this point, the machine with instruction reuse is 2 instructions ahead of the baseline in its execution. This effect is propagated at every instance of an instruction reuse. In a highly independent program, this machine will be able to achieve an ILP greater than 3 even in a 3-issue superscalar.
The benefits of a wider front end with instruction reuse are twofold. Since some instructions are squashed at decode, they do not occupy the issue window. This makes space in the issue window for the next fetch group, which may contain more independent instructions. The second benefit is that instructions in the same fetch group which would not otherwise have proceeded to execution are now issued, because the reused instructions no longer consume issue slots.
Note that the IPC of the baseline with this configuration will never exceed 3.
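The effect can be reproduced with a one-cycle toy model. It assumes a fetch group of five mutually independent instructions i1 to i5 (the group size is our assumption), a 3-wide issue stage, and that reused instructions are squashed at decode without consuming an issue slot. The function name is hypothetical.

```python
def handled_in_one_cycle(fetch_group, reusable, issue_width):
    """Split one fetch group into reused and issued instructions.

    Reused instructions are squashed at decode; the remaining instructions
    are assumed independent and issue in program order up to issue_width.
    """
    reused = [i for i in fetch_group if i in reusable]
    pending = [i for i in fetch_group if i not in reusable]
    issued = pending[:issue_width]
    return reused, issued


group = ["i1", "i2", "i3", "i4", "i5"]

# Baseline: nothing is reused, so at most three instructions make progress.
base_reused, base_issued = handled_in_one_cycle(group, set(), 3)

# With reuse: i1 and i3 are squashed, freeing issue slots for i4 and i5.
reuse_reused, reuse_issued = handled_in_one_cycle(group, {"i1", "i3"}, 3)

lead = (len(reuse_reused) + len(reuse_issued)) - len(base_issued)
```

Under these assumptions the reuse machine handles all five instructions in the cycle while the baseline issues only i1, i2 and i3, which gives the lead of two instructions described above.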

Eager Execution
Past work has mainly focused on dynamic reuse of instructions, which are stored in a buffer to avoid redundant execution. Ideally, the longer an instruction resides in the buffer, the better its chance of being reused. Several factors limit how long an instruction can reside in the buffer: the size of the buffer (older instructions need to be replaced with newer ones when the buffer is full), the limited number of logical locations that the ISA provides, and the limited number of physical locations that the hardware provides. Since the ISA has a fixed number of logical registers, certain logical names are used multiple times throughout the program. In order to maintain correctness, instructions need to be invalidated in the buffer when any of their sources is rewritten.
Since logical names are repeated all the time in programs, the continuous invalidation of instructions in the buffer leads to the destruction of their invariance. An invalidated instruction can no longer be reused since one or more of its source operands have now changed.
Our aim with eager execution is to make use of the invalidated instructions and not discard them as soon as their invariance ends. Henceforth, we call the buffer which stores instructions the Eager Shelf. Instead of invalidating an instruction in the shelf when its source operand is updated, we treat the updated operand as a new producer of that instruction. Thus, while finding independent ready instructions to dispatch, we also attempt to execute the next set of immediately dependent instructions from the shelf. The vast majority of instructions encountered in typical programs are operations on one or two operands which are stored in other registers.
By eagerly executing instructions whenever their source operand is updated, we increase the ILP as the result of eager execution will be available when the instruction is encountered on the regular path.
The eager execution paradigm combines the power of an out-of-order superscalar with a dataflow style pipeline where the dataflow pipeline makes the capture and early execution of dependent instructions possible. This is accomplished by placing instructions in the eager shelf and re-executing them as soon as any of their producer operands are updated. A speculative dependence graph for the dataflow engine is generated dynamically in the shelf as the superscalar processor keeps fetching new instructions. Instructions are placed in the shelf as they are fetched and their source operands are updated by each new producer instruction writing to the same logical destination. These dependent instructions are dispatched from the shelf as soon as their source operands become ready. The dependence graph built in the shelf is speculative since there is no guarantee that a certain instruction in the shelf will be encountered on the regular path.
In Chapter 1, we discussed the main issues limiting ILP on superscalar processors.
One of them is the limited number of reservation stations associated with functional units where an instruction waits for its source operands to become ready. Fully occupied reservation stations stall the fetch engine, thereby halting the ability of the processor to find independent instructions which can be issued. In this case, the eager shelves act as a second set of reservation stations which store instructions even across control dependencies. An invariant instruction found in an eager shelf does not proceed further in the pipeline. Similarly, an eagerly executed instruction that hits in an eager shelf is not placed in the reservation station since the result of that instruction is already available. Eager shelves decrease the number of fetch engine stalls since instructions found in an eager shelf (whether invariant or eagerly executed) do not occupy a slot in the reservation stations. This approach allows the processor to find independent instructions from future fetch blocks, which increases the utilization of execution units and in turn the ILP of the program.
In some cases, the early availability of results also propagates to branch instructions which are now computed early. Thus eager execution also leads to faster branch computation and as a result, lower branch misprediction delays.
A second factor limiting the ILP is load instructions which miss in the data cache.
Generally, loads precede a number of instructions that are dependent on the result of the load. A data cache miss stalls the load instruction and all the instructions dependent on the load. With eager execution, the dependence chain leading to a load is collapsed more quickly, allowing the load to be issued sooner. Therefore data cache misses are triggered earlier and thus the effect of the miss is reduced for future instructions.

The Eager Shelf
The eager shelf can be thought of as a second reservation station for the dataflow part of the pipeline. Any arithmetic or logical instruction can be placed in the shelf.
We call these instructions Shelvable instructions. The shelf stores the physical source operands and opcode of Shelvable instructions. The shelf also contains the result of the instruction. The result is stored as a physical register number in the shelf.
We design the shelf in such a way that the shelf entry number directly corresponds to the destination physical register of the instruction stored in that entry. Thus an instruction stored in shelf entry 5 will have its result in physical register number 5.
This direct correlation allows the shelf, in some cases, to maintain the validity of instructions even when the same logical destination is being overwritten. This is because the physical mapping is still retained in the shelf.
When the predicate p0 in Figure 4.2 turns true, instruction i8 is searched in the shelf using its physical operands and opcode. The search returns a successful hit at entry 2. Note that instruction i8 has a different logical destination than the instruction placed at shelf entry 2. This distinction does not invalidate the search but instead maps r5 to P2 in the map table. i8 is detected as a redundant instruction and does not execute now. By using physical register identifiers for shelf search, we can map multiple logical registers to the same physical register or the same shelf entry. This is analogous to Global Value Numbering in compiler literature, where underlying equivalence is detected regardless of the usage of the logical name space. Because of the mapping of r5 to P2, instruction i9 will also be found in the shelf and will be considered redundant. The shelf is able to dynamically detect equivalence and eliminate redundant instructions just by updating the map table.
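The renaming behaviour described above can be sketched as follows. The instruction sequence is hypothetical (the code of Figure 4.2 is not reproduced here), as are the class name, the initial map table and the register numbers. The sketch also assumes operands are matched in the order given, so commuted operands would not hit.

```python
class EagerShelfSketch:
    """Toy shelf searched by (opcode, physical source 1, physical source 2)."""

    def __init__(self, num_entries, map_table):
        self.free = list(range(num_entries))  # free shelf entry numbers
        self.by_key = {}                      # search key -> shelf entry
        self.map_table = dict(map_table)      # logical reg -> physical reg

    def rename(self, opcode, dest, src1, src2):
        """Rename one shelvable instruction; return (entry, shelf_hit)."""
        key = (opcode, self.map_table[src1], self.map_table[src2])
        entry = self.by_key.get(key)
        hit = entry is not None
        if not hit:
            # Shelf miss: the entry number doubles as the destination register.
            entry = self.free.pop(0)
            self.by_key[key] = entry
        # On a hit, the logical destination is simply aliased to the entry.
        self.map_table[dest] = entry
        return entry, hit


# r1 and r2 start in ordinary physical registers outside a 4-entry shelf.
shelf = EagerShelfSketch(4, {"r1": 10, "r2": 11})
a = shelf.rename("add", "r3", "r1", "r2")   # placed in a fresh entry
b = shelf.rename("sub", "r4", "r3", "r1")   # placed in a fresh entry
c = shelf.rename("add", "r5", "r1", "r2")   # same computation, new logical dest
d = shelf.rename("sub", "r6", "r5", "r1")   # hits because r5 aliases r3
```

The last instruction is found only because the preceding hit mapped r5 to the same physical register as r3, which mirrors how i9 becomes redundant once r5 is mapped to P2.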

Eager Execution Example
We will now introduce an extra dependence in the example code sequence from Chapter 3 to make the code more realistic and nullify redundancy. The new code sequence and its dependency graph are shown in Figure 4.3. Instruction i2 is made to be dependent on i4 from the previous iteration. This dependence breaks the reusability of i2 since one of its source operands now changes at every iteration of the loop.

Algorithm
This chapter provides a detailed explanation of our eager execution algorithm.
The algorithm makes use of a shelf structure called the eager shelf. The structure is similar to reservation stations in operation, permitting broadcasting, changing the source operands of shelved operations, as well as selecting and issuing ready instructions. In addition to the shelf structure, a free list of available shelf entries called shelf queue is provided. We also utilize definition and use counters for each shelf entry for checking when an entry can be safely released. We begin by describing the organization of the eager shelf.

Eager Shelf
An eager shelf entry, although organized similarly to a reservation station, requires additional functionality not found in traditional reservation stations. A shelf entry needs to be able to accomplish two functions: have a set of identifiers which can uniquely identify an instruction, and have enough information about an instruction to recreate it for eager execution. The former is required to correctly identify whether an instruction has been buffered in the shelf while the latter is required to assemble
Shelf entry 5 will be written into physical register 5.
We now describe the operations that can be performed on the shelf. Each of these operations supports the functioning of the algorithm.
When trying to buffer an instruction in the shelf, obtaining an empty entry and placing the instruction are two key operations.
Get free entry: This operation returns an empty Shelf entry by popping from the shelf queue.
Place instruction: This operation places a new instruction into an empty entry using its physical source operands and opcode.
With the help of these two operations, we can buffer an instruction in a shelf entry.
After an instruction has been placed in the shelf, the result of that instruction is available to any subsequent iterations of that instruction. To utilize the result, the subsequent instruction needs to search the shelf to check whether a previous version of that instruction was buffered. This is accomplished by the following operation.
Shelf Search: Search the shelf using two register identifiers and an opcode. There can either be one unique Shelf hit or no hit. A hit is termed a Shelf hit and the absence of a hit is termed a Shelf miss.
After an instruction has been buffered in the shelf, it remains redundant until there is a change to any of its source operands. The shelf needs to be made aware of this change. This is done by the instruction which is writing to the same location. This
A high level illustration of the pipeline is shown in Figure 5.1. In the next section, we

Fetch Stage
A wider front end is employed in the pipeline to exploit redundancy as mentioned in
Shelvable instructions which do not hit in the shelf look for an empty entry in the shelf. The instruction is placed in the empty entry and its destination is renamed to that entry number. Since this instruction did not hit in the shelf, we do not have a result for this instruction. Thus this instruction needs to be executed and is sent to the issue window after updating the map table. Now that the instruction has been placed in the shelf, the next fetch of this instruction will yield a successful shelf hit.
Non-shelvable instructions are renamed as usual by obtaining a free register from the physical register pool. The instruction is sent to the issue window after updating the map table.
The shelf has to be maintained corresponding to the in-order state of the front end
The definition count monitors the active utilization of a physical register by instructions in the pipeline. We also need to keep track of the number of entries that use the result of one particular shelf entry. To achieve this, we equip each shelf entry with a use count. The use count records how many other shelf entries use the result of a particular entry (that is, the number of times a particular shelf entry serves as a source operand to other entries). Whenever a new instruction is placed in the shelf, the use counts of its source operands are incremented. We again modify the register release mechanism so that it does not release any shelf entry with a use count greater than 0, even if its definition count is 0. This is because, while there are currently no active definitions of this entry in the pipeline, there are still one or more instructions in the shelf which use the result of this entry. The definition count and use count, in conjunction, make sure that a shelf entry can never be released when a use of its result is pending, whether in the pipeline or in the shelf.
We now describe exactly how an instruction flows through this stage.
For each instruction i in the rename block: The source operands of i are renamed from the map table.
If instruction i is shelvable: Search the shelf (Shelf Search) using the physical identifiers and opcode. The instruction will either hit in the shelf or miss in the shelf.
We describe the operations performed in either case.
Stations (previous dest, new dest)).

As mentioned earlier, each instruction broadcasts its previous destination and new allocation to the shelf. The broadcast is associatively performed and each match invalidates that shelf entry. Since the invalidated shelf entry may not be eligible for release (because of non-zero definition or use counts), we allocate a new entry for the instruction to be copied. The invalidated instruction is copied with the new operand into the empty entry and the entry is marked eligible for eager execution.

procedure update-broadcast and send to RS(old,new)
The old source operand is broadcast to the source operand fields of the shelf. At every match, we call get free entry to obtain an entry for the new instance of the instruction. The new operand is written into the empty entry while all other instruction information is copied from the previous entry. The old entry is invalidated since its source operand was updated. Invalidated entries are not checked during Shelf search. The invalidated entry will remain in the shelf until its definition and use counts are decremented to 0, at which point it will be released and added to the shelf queue.
The instruction invoking this procedure is sent to the reservation station, and each new shelf entry where an updated instruction is placed is marked eligible for eager execution by setting its eager bit.
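A minimal sketch of the shelf side of this procedure is given below. It assumes the shelf is a dictionary from entry number to entry, that the free queue is never empty, and that sending the invoking instruction to the reservation station is handled elsewhere. The field and function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class ShelfEntry:
    opcode: str
    src1: int            # physical source operand
    src2: int            # physical source operand
    valid: bool = True
    eager: bool = False  # set when the entry is eligible for eager execution


def update_broadcast(shelf, free_queue, old, new):
    """Copy every valid entry that reads `old` into a fresh entry reading `new`."""
    created = []
    for idx in sorted(shelf):            # snapshot of the entry numbers
        entry = shelf[idx]
        if not entry.valid or old not in (entry.src1, entry.src2):
            continue
        fresh = free_queue.pop(0)        # get free entry (assumed to succeed)
        shelf[fresh] = replace(
            entry,
            src1=new if entry.src1 == old else entry.src1,
            src2=new if entry.src2 == old else entry.src2,
            eager=True,                  # new copy may execute eagerly
        )
        entry.valid = False              # old copy is skipped by Shelf search
        created.append(fresh)
    return created


shelf = {2: ShelfEntry("add", 10, 11), 3: ShelfEntry("sub", 2, 10)}
free_queue = [4, 5]
created = update_broadcast(shelf, free_queue, old=11, new=12)
```

Here only entry 2 reads physical register 11, so a single copy is created in entry 4 with its eager bit set, and entry 3 is left untouched.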

procedure get free entry()
Empty shelf entry numbers are held in the shelf queue. If there are shelf entries available in the queue, the procedure pops an entry and returns it.
Each shelf entry has an extra attribute called age. Every entry starts out with age 0 and its age is incremented by one each cycle until it reaches its max age. In case the shelf queue is empty (i.e. the shelf is full), the shelf is searched for an entry which does not have a successful hit on it and whose age is max age, and that entry is returned by the procedure.
Note that since this entry does not have a successful hit, the definition and use counts of this entry will be 0. We are sacrificing a potential hit on this entry in the future for the ability to place a new instruction in the entry. The max age parameter can be changed accordingly.
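The procedure can be sketched as below. The value of MAX_AGE, the choice of the lowest-numbered victim and the return of None when no entry qualifies are assumptions; the text above does not fix them.

```python
from collections import deque

MAX_AGE = 4  # hypothetical value of the tunable max age parameter


class ShelfAllocator:
    def __init__(self, size):
        self.queue = deque(range(size))  # shelf queue of free entry numbers
        self.age = [0] * size
        self.hit = [False] * size        # has the entry had a successful hit?

    def tick(self):
        # Ages are incremented every cycle and saturate at MAX_AGE.
        self.age = [min(a + 1, MAX_AGE) for a in self.age]

    def get_free_entry(self):
        if self.queue:
            entry = self.queue.popleft()
        else:
            # Shelf full: reclaim an entry at max age that was never hit.
            victims = [e for e, a in enumerate(self.age)
                       if a == MAX_AGE and not self.hit[e]]
            if not victims:
                return None              # assumption: caller skips shelving
            entry = victims[0]
        self.age[entry] = 0
        self.hit[entry] = False
        return entry


alloc = ShelfAllocator(2)
e0 = alloc.get_free_entry()         # taken from the shelf queue
e1 = alloc.get_free_entry()         # taken from the shelf queue
too_young = alloc.get_free_entry()  # shelf full and no entry is old enough
alloc.hit[e0] = True                # entry 0 gets a successful hit
for _ in range(MAX_AGE):
    alloc.tick()
victim = alloc.get_free_entry()     # entry 1: at max age and never hit
```

A larger MAX_AGE keeps unhit entries in the shelf longer before they can be reclaimed, at the cost of turning away more incoming instructions when the shelf is full.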

Select Stage
In the select stage, we check all shelf entries and select entries which are marked for eager execution (eager bit set) and have their source operands ready, and send them to execution units.
Instructions whose source operands are ready in the reservation stations are also selected and sent to the execution units.
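The selection logic can be sketched as a single function. The priority of reservation stations over the shelf and the explicit width limit are assumptions, since the order of selection between the two structures is not specified above. Entries are modelled as dictionaries with hypothetical field names.

```python
def select(shelf, reservation_stations, ready_regs, width):
    """Pick up to `width` ready instructions from the normal path and the shelf."""
    picked = []
    # Assumption: normal-path instructions are considered before eager ones.
    for idx, rs in enumerate(reservation_stations):
        if all(s in ready_regs for s in rs["srcs"]):
            picked.append(("rs", idx))
    for idx, entry in enumerate(shelf):
        # Shelf entries are candidates only when their eager bit is set.
        if entry["eager"] and all(s in ready_regs for s in entry["srcs"]):
            picked.append(("shelf", idx))
    return picked[:width]


shelf = [
    {"srcs": (1, 2), "eager": True},    # eager and ready: selected
    {"srcs": (1, 3), "eager": False},   # ready but not marked eager
    {"srcs": (4, 1), "eager": True},    # eager but source 4 is not ready
]
reservation_stations = [{"srcs": (1,)}, {"srcs": (5,)}]
ready = {1, 2, 3}

wide = select(shelf, reservation_stations, ready, width=4)
narrow = select(shelf, reservation_stations, ready, width=1)
```

With four execution slots both the ready reservation station entry and the ready eager shelf entry are selected; with one slot only the normal-path instruction is.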

Execute Stage
Instructions are executed and their results are written into their destination registers in this stage. A ready-broadcast operation is performed on both the eager shelf and the normal path reservation stations. The ready flags of all sources that match the destination register, in both the eager shelf and the reservation stations, are updated.

Retire Stage
The retire stage in a typical superscalar has two purposes: Update the in-order state of the map table using the destination of the instruction, and release the previous destination of the instruction by adding it to the register pool. All instructions in the eager superscalar follow the first step by updating the in-order map table.
Since some physical registers are now correlated with the eager shelf, we need to modify the register release process. All instructions whose previous destination is independent of the eager shelf (in simple terms, the previous destination number is greater than the eager shelf size) release their previous destination normally, by adding it to the free register pool. All other instructions do not directly release their previous destination. Instead, such an instruction decrements the definition counter of its previous destination. Since the current destination of this instruction has reached the in-order state, the processor can be sure that all other uses of this definition have already retired. However, the eager execution paradigm allows multiple definitions of a single physical register. Thus, we decrement the definition counter instead of directly releasing the previous allocation.
When trying to release a shelf entry (and its corresponding physical register), we look at both the definition and use counts. A definition count of 0 implies that there are no active definitions of that entry in the processor while a use count of 0 implies that no other shelf entry is using the corresponding physical register as its source. The shelf entry can be released only when both of these conditions are satisfied. Upon release, the shelf entry decrements the use count of its physical register.
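The release rule can be sketched with the two counters. Several points are assumptions made for illustration: shelf entries are numbered from 0, so registers numbered shelf_size and above are ordinary registers; the definition count is incremented when an instruction is placed; and on release an entry decrements the use counts of the shelf entries it reads, which is our reading of the last sentence above. All names are hypothetical.

```python
class ShelfRelease:
    def __init__(self, shelf_size):
        self.shelf_size = shelf_size
        self.def_count = [0] * shelf_size
        self.use_count = [0] * shelf_size
        self.sources = {}       # live entry -> shelf entries it reads
        self.free_regs = []     # ordinary free register pool
        self.shelf_queue = []   # released shelf entries

    def place(self, entry, srcs):
        # One definition of `entry`, one use of each source that is a shelf entry.
        self.sources[entry] = [s for s in srcs if s < self.shelf_size]
        self.def_count[entry] += 1
        for s in self.sources[entry]:
            self.use_count[s] += 1

    def retire(self, prev_dest):
        if prev_dest >= self.shelf_size:
            self.free_regs.append(prev_dest)   # normal release
        else:
            self.def_count[prev_dest] -= 1     # deferred release
            self._try_release(prev_dest)

    def _try_release(self, entry):
        if entry not in self.sources:
            return                             # not live, nothing to release
        if self.def_count[entry] > 0 or self.use_count[entry] > 0:
            return                             # a definition or use is pending
        self.shelf_queue.append(entry)
        for s in self.sources.pop(entry):
            self.use_count[s] -= 1             # assumed meaning of the text
            self._try_release(s)


r = ShelfRelease(shelf_size=4)
r.place(1, srcs=(10, 11))   # entry 1 reads ordinary registers only
r.place(2, srcs=(1, 10))    # entry 2 reads the result of entry 1
r.retire(prev_dest=1)       # no definitions left, but entry 2 still uses it
held = list(r.shelf_queue)
r.retire(prev_dest=2)       # releases entry 2, which then frees entry 1
r.retire(prev_dest=9)       # ordinary register, released directly
```

In this run entry 1 stays in the shelf after its definition count reaches 0, and it is released only once the entry that uses its result has itself been released.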

Chapter 6
Results and Analysis

Methodology
We model the eager execution superscalar in ADL [14], an architecture description language which generates cycle accurate simulators. The simulator runs the MIPS ISA. We use the Spec2006 benchmark suite for performance analysis.
The baseline architecture is a conventional 8-wide issue superscalar with a 12-wide fetch engine. A g-share branch predictor is used in the front end. The complete processor configuration is shown in Figure 6.1.
The baseline architecture and eager superscalar are kept as identical as possible. The

Performance Results
We executed Spec2006 benchmarks on the eager superscalar. Figure 6.2 shows the utilization of the shelf as a percentage of total shelvable instructions. On an average, only around 20% shelvable instructions are placed in the shelf. Since physical registers are correlated with shelf entries, we cannot release a shelf entry until its definition and use counts are 0. In case the shelves are full, incoming shelvable instructions will not be placed in a shelf and will be treated as normal path non-shelvable instructions.
The processor will not be able to take advantage of the redundancy of these instructions if they are encountered again.  the immediate next fetch group will not be able to take advantage of eager execution since the shelf will not have had enough time to start eager execution. On the other hand, dependent instructions in different fetch groups will have a better chance of

Analysis and Future Work
We do not see a substantial decrease in the number of cycles for most benchmarks. This is, in part, because of the low utilization of the shelf. Since we average only around 20% utilization, we cannot take advantage of the other 80% of instructions that may have been reused. Because of the low utilization, the shelf may also fail to buffer critical path instructions. Figure 6.4 shows the number of instructions that hit in the shelf as a percentage of the total number of instructions placed in the shelf. We see a hit rate of around 10% at 20% utilization, which suggests that only a very small number of reusable instructions are actually being reused.
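To make the combined effect concrete, the two approximate figures can be multiplied. The short Python calculation below is only a back-of-the-envelope illustration using the rounded averages quoted above. The function name `reused_fraction` is hypothetical, and the result is an estimate, not a measured value.

```python
def reused_fraction(utilization: float, hit_rate: float) -> float:
    """Fraction of shelvable instructions that end up being reused.

    utilization: instructions placed in the shelf / shelvable instructions
    hit_rate:    instructions that hit in the shelf / instructions placed
    """
    return utilization * hit_rate


# Rounded averages from the text: about 20% utilization, about 10% hit rate.
estimate = reused_fraction(0.20, 0.10)
print(f"{estimate:.1%}")  # prints "2.0%"
```

Under these rounded figures, roughly 2% of shelvable instructions are reused, which is consistent with the small change in cycle count.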
There are a number of improvements that can be made on top of the eager execution paradigm. For our simulator, we consider only single-cycle arithmetic and logical instructions as shelvable instructions (instructions that can be placed in the shelf).
By changing the design of the eager shelf, one can easily place multiply and divide instructions in the shelf. These are multi-cycle instructions, which can provide a boost in performance if they are reused or eagerly executed. The downside of eagerly executing these instructions is that every unsuccessful early execution occupies an execution unit for many cycles.
Another category of instructions that can be placed in the shelf is load instructions.
A new mechanism would need to be established in the shelf to keep track of the dependencies between loads and stores. The shelf would be treated as the top-level cache in the case of a load reuse, while early execution of a load would essentially act as a prefetch mechanism.
We can also direct the compiler to use a certain set of logical registers for memory instructions. This will make sure that arithmetic instructions independent of the load do not invalidate the loads in the shelf. Instructions writing to a register belonging to the memory set will be the only instructions responsible for invalidating and eagerly executing load instructions, which will reduce wasteful early executions.
In our design, we effectively clear the shelf at every branch misspeculation. This is not a necessary requirement. We propose two methods that can be explored in the future for maintaining the shelf during a misspeculation. In the first method, in an architecture using a reorder buffer, the instructions from the tail of the reorder buffer to its head can use their physical destination registers to invalidate any matches in the shelf. This will not invalidate any entries that were placed before the misspeculation, and these entries can be reused after the recovery process ends. The second method is motivated by the concept of checkpoint processors. A checkpoint of the eager shelf can be taken at every branch or at a regular interval of cycles. Upon a misspeculation, the last safe checkpoint is copied to the shelf. The ILP immediately following a misprediction is very low in typical programs, and having redundant results available in the shelf will lead to a boost in the ILP.
Certain compiler techniques may also be able to aid eager execution. Consider a simple for loop that runs for a million iterations, as shown in Figure 6.5. The shelf, by virtue of correlating physical register and shelf entry numbers, is able to capture the entire dependency chain between instructions i1, i2 and i3. All three instructions will be considered redundant instructions for all iterations of the loop except the first.
However, instruction i4 serves as a bottleneck for this loop. i4 creates a sequential dependence chain that is a million instructions long. Although the shelf eliminates every other instruction in the loop, we see an almost negligible gain in performance. By using compiler techniques such as loop unrolling, we may be able to reduce the impact of i4 on performance. This loop, unrolled by a factor of 10, will see almost a 10-fold decrease in execution time, since every instruction apart from i4 will be eliminated because of redundancy.
As seen in Figure 6.2, the utilization of the shelf is not even 50%, even for a shelf size of 128. Increasing the shelf size is counterproductive, since searching the shelf will take more time, leading to an increase in the clock period. Increasing the shelf size also leads to an increase in the area of the shelf. Instead of adding more shelf entries, we can provide more physical registers to each shelf entry. Instead of each entry being correlated with one physical register, each entry is now correlated with 2 physical registers, as shown in Figure 6.6. Thus, each entry will have its own set of physical registers. Each set of physical registers has its own definition and use counts. When an instruction is placed in an entry, one of the registers from the set is allocated to that entry. The definition and use counts for that register are updated per the algorithm. When the shelf is full, we empty an entry for the new instruction but do not necessarily release the register associated with that entry. This register will be released only when its definition and use counts reach 0. In the meantime, another register from the entry's set is allocated to the new instruction. In this way, multiple registers can be provided for each shelf entry, and we can increase the utilization of the shelf without increasing the shelf size by a sizeable amount.
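The multi-register entry described above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the proposed hardware: the names `RegSlot`, `MultiRegEntry` and `place` are hypothetical, each register in an entry's set is assumed to carry its own definition and use counts, and an instruction that finds no free register in the set is simply assumed not to be shelved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegSlot:
    def_count: int = 0
    use_count: int = 0

    def is_free(self) -> bool:
        # A register can be reallocated only when both counts are zero.
        return self.def_count == 0 and self.use_count == 0


class MultiRegEntry:
    """One shelf entry backed by a small set of physical registers."""

    def __init__(self, regs_per_entry: int = 2):
        self.slots: List[RegSlot] = [RegSlot() for _ in range(regs_per_entry)]
        self.current: Optional[int] = None  # slot used by the resident instruction

    def place(self) -> Optional[int]:
        """Place a new instruction in this entry, evicting the old one.

        The evicted instruction's register stays allocated until its
        counts reach zero, so the new instruction takes another free
        register from the set. Returns the slot index that was
        allocated, or None if every register in the set is still live.
        """
        for i, slot in enumerate(self.slots):
            if slot.is_free():
                slot.def_count += 1
                self.current = i
                return i
        return None
```

With two registers per entry, an entry can accept one new instruction while the register of the evicted instruction is still waiting for its counts to drain, which is the source of the extra utilization.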

Chapter 7
Conclusion
Eager execution is a novel idea that builds upon the work on instruction reuse. Many instruction reuse techniques have been proposed in the past, but all of them lose invariance when the source operands change. Our work exploits changes in source operands, treating each change as a potential future producer that starts eager execution. Eager execution helps eliminate dynamic redundancy and helps collapse dependency chains earlier, even in highly sequential programs.
Our preliminary work has shown a small but promising improvement in ILP with the help of eager execution. We believe that there is considerable scope for refining this paradigm and for achieving larger improvements in ILP in the future.