Hazards, Forwarding, Branch Prediction
Hazards
Pipeline | Instruction | Stage |
---|---|---|
Immediate Memory | 4 | FETCH (IF) |
Pipeline Register(PR1) | 4 | FETCH (IF) |
Register (READ part) | 3 | DECODE (ID) |
PR2 | 3 | DECODE (ID) |
ALU | 2 | EXECUTE (IDEX) |
PR3 | 2 | EXECUTE (IDEX) |
DM | 1 | Memory (EX/MEM) |
PR4 | 1 | Memory (EX/MEM) |
Register (WRITE part) | 0 | WRITE BACK (WB) |
Example 1
Instruction | Assembly | Data Hazard |
---|---|---|
0 | add t0, t1, t2 | |
1 | sub t4, t0, t3 | RAW |
2 | and t5, t0, t6 | RAW |
3 | or t7, t0, t8 | RAW |
4 | xor t9, t0, t10 |
Data Hazards
- RAW: (0,1), (0,2), (0,3)
- WAR
- WAW
Data hazard types
- RAW (Read After Write)
- WAR (Write After Read)
- WAW (Write After Write)
Example 2
0: li t1, 1
1: li t2, 2
2: addi t1, t1, 1
3: beq t1, t2, crypt
4: and t3, t1, t2
5: j exit
crypt:
6: xori t3, t1, 0xff # xor 0x2 0xff = 0xfd
exit:
7: addi t3, t1, 0x00
Here there is a RAW hazard happening between instruction 0 and 2 as instruction 2 depends on t1, but as instruction 2 executes in EX/MEM instruction 0 is still in FETCH stage.
The solution to this hazard is to insert a NOP between instruction 0 and 2 so instruction 0 is decoded as instruction 2 is executed. (The instructions can execute in parallel)
0: li t1, 1
1: li t2, 2
2: nop # addi 0x, 0x, 0
3: addi t1, t1, 1
...
Another problem is the branching on 3 because as the instruction is in WB, instructions 4 and 5 has been predictably queued and are being FETCHED and DECODED. If these instructions were executed it would throw the program out of whack.
The solution here is pipeline flushing. The ripes simulator does this automatically, seemingly by replacing the flushed instructions with NOPs. Seems like it does this by clearing the pipeline registers containing the instructions of 4 and 5
Branch Prediction
How early can we know whether a branch has been taken?
In the DECODE stage you could have a mini BEQ check, thus moving the Branch Taken from the EXECUTE stage to the DECODE stage, 1 pipeline step earlier.
A loop in a program has low locality, it uses only part of the instructions in a program and only part of memory frequently.
Hardware can optimize for a loop with a 1 bit branch predictor and even optimize for a loop within a loop with a 2 bit branch predictor
Forwarding
Too little time for this
Exam
- Starts with Potpurri section from all of pensum
- 3/4 bigger problems for focus ISA, microarchitecture, digital teknikk
- OS paging