Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. response (or execution) time -- the time between the start and the finish of a task 2. Define throughput. throughput -- total amount of work done in a given time 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide a concrete example using the performance equation to back up your assertion. CPI must also be provided when clock rate is used as a performance metric. CPI “links” clock rate to instructions executed. Machine A is 200MHz, CPI of 1 Machine B is 400MHz, CPI of 4 For a given program... 6 1 200 10 instructioncountMachine Aruntime ×= × 6 4 400 10 instructioncountMachine B runtime ×= × Machine B will be slower for any program, in spite of its higher clock rate. 4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds? 9 6 12 (1.4 10 ) 1344 12.5 10 CPI × ×= =× 5. What is the CPI of a program execution that consists of the following instruction types (classes) and CPI: Instruction class CPI Percentage A 1.4 25% B 2.4 70% C 2 5% CPI = .25(1.4) + .70(2.4) + .05(2) = 2.13 6. Suppose one machine, A, executes a program with an average CPI of 1.9. Suppose another machine, B (with the same instruction set and an enhanced compiler), executes the same program with 20% less instructions and with a CPI of 1.1 at 800MHz. In order for the two machines to have the same performance, what does the clock rate of the first machine need to be? 6 1.9 1.1 (1.0 .20) 800 10 1.7 A A IC IC CR CR GHz × × −= × = 7. Use Amdahl’s Law to compute the new execution time for an architecture that previously required 20 seconds to execute a program, where 20% of the instructions executed were load/stores, if the time required for a load/store operation is reduced by 30% (amount of improvement for load/stores = 1/.70 = 1.44). 20(.20) 20(.80) 1.44 18.8 Timeaffected by improvementTimeafter improvement Timeunaffected by improvement Amount of improvement Time Time s = + = + = 8. What’s the best way to measure performance of a machine? Clock rate, CPI, MIPS, FLOPS (floating-point operations per second), memory latency, or average execution time? Why? Average execution time, since that’s all we care about in the end (performance is based solely on execution time). 9. This problem compares the performance of the single-cycle MIPS architecture, the multi-cycle MIPS architecture, and the pipelined MIPS architecture. Assume it requires 10 ns to perform any of the following operations: main memory access, register file access, and arithmetic operation. Assume the delay for multiplexors, registers (i.e. setup/hold time), and lookup tables is negligible. a. What is the minimum clock period and frequency for each of the three implementations? single-cycle = 10 ns (fetch) + 10 ns (decode) + 10 ns (execute) + 10 ns (memory) + 10 ns write back = 50 ns (20 MHz) multi-cycle and pipelined = 10 ns b. What is the CPI for each of the implementations, assuming a program execution with the following instruction mix: Instruction Execution Frequency r-type arithmetic, logical, comparison 60% branch 15% load 15% store 5% jump 5% Assume the pipelined CPU performs forwarding, that the compiler schedules the code to contain no load hazards, branches have a fixed 3-cycle latency (requires 2 trailing no-ops), and you may disregard the time required to fill the pipeline. single-cycle CPI = 1 multi-cycle CPI = .60 * 4 + .15 * 3 + .15 * 5 + .05 * 4 + .05 * 3 = 3.95 pipelined CPI = .85 * 1 + .15 * 3 = 1.30 c. Assume the program from part b executed 10 million instructions. What is the speedup of the pipelined processor versus both the single-cycle processor and the multi-cycle processor? single-cycle => (1 * 50) / (1.30 * 10) = 3.85 multi-cycle = (3.95 * 10) / (1.30 * 10) = 3.04 Chapter 5: The Processor: Datapath and Control 1. Describe the differences between single-cycle and multi-cycle processor architectures. Single-cycle processor executes each instruction in a single clock cycle, as the only sequential logic portion is fetch (the PC). Multi-cycle processors execute instructions over multiple clock cycles, carrying intermediate values from one cycle to the next using registers. 2. Describe in detail everything that is needed in order to add the addi, subi, andi, and ori instructions to the processor design above. Must provide a way for the ALU controller to interpret I-type instructions. All other data paths already exist. 3. Design a datapath for the R-type sll, sra, and srl instructions. Don’t worry about control signals. Assuming the ALU can perform the necessary shift operations on input B, add a datapath from IR[10..6] (SHAMT) into the ALU and add the necessary rows into the ALU controller truth table. 4. For a multicycle processor design, describe what steps of the instruction execution are performed in each clock cycle for a load instruction (Hint: there are 5 clock cycles). Cycle 1: fetch the instruction Cycle 2: read the contents of the base register rs Cycle 3: compute the effective load address by adding the contents of the base register to the sign-extended version of the immediate field Cycle 4: read word from memory at that address Cycle 5: write the word back to the register file at address rt 5. Provide all control signals for a LW instruction using the following processor design. Assume ALUOp is 10 for interpretation of instruction function code, 00 for add, and 01 for subtract. RegDst __0__ Branch __0__ MemRead __1__ MemtoReg __1__ ALUOp _00__ MemWrite __0__ ALUSrc __1__ RegWrite __1__ 6. Describe briefly why the MIPS designers use a separate ALU controller rather than centralize all control in the main control unit? This allows the main control unit to consolidate all the R-type instructions into a single control state, allowing the ALU controller to interpret the instruction. 7. Draw the design of a simple datapath that multiplies the contents of register A by 3 and latches the value into another register B. This datapath can be represented by the following RTL: B <- (A) * 3 You are provided with a library of the following components: • register • a combinational logic block named “shift-left by one”, that provides an output whose value is the input value shifted to the left by one bit • adder Assume these components may be scaled to any arbitrary bit width, so you may abstract away the size of each of the signals. Chapter 6: Enhancing Performance with Pipelining 1. List and describe the pipeline hazards. Structural hazards: Two operations require a single piece of hardware Structural hazards can be overcome by adding additional hardware Control hazards: Conditional control instructions are not resolved until late in the pipeline, requiring subsequent instruction fetches to be predicted Flushed if prediction does not hold (make sure no state change) Branch hazards can use dynamic prediction/speculation, branch delay slot Data hazards: Instruction from one pipeline stage is “dependant” of data computed in another pipeline stage 2. For the following segment of code, indicate all data dependences for the 5-stage MIPS pipeline, and show the state of the pipeline for each cycle. Assume a fixed 3-cycle branch latency. add $6,$5,$2 lw $7,0($6) addi $7,$7,10 add $6,$4,$2 sw $7,0($6) addi $2,$2,4 blt $2,$3,loop add $6,$5,$2 (see Chapter 6 lecture slide 17) A shift add B 3. Describe the disadvantage of moving branch resolution to the decode stage, thus reducing branch latency (mis-prediction penalty). This would increase the logic latency of the decode stage, as the register file would need to be accessed in series with the branch resolution logic. This may increase the cycle period time. 4. Describe a reason why a 2-bit branch predictor can be more effective than a 1-bit branch predictor. A 1-bit predictor would change its prediction for a particular branch on the last iteration of a loop, causing a mis-prediction after the first iteration when the loop is executed again. 5. Describe why predicting a branch taken still incurs at least one cycle penalty? The branch instruction must be fetched in order to compute the target address (computed in decode). 6. Assume an architecture has a branch predictor that correctly predicts a branch 60% of the time. A correctly predicted branch requires a 2 cycle latency, and a mis-predicted branch requires a 3 cycle latency. Suppose a program requires 100 seconds to run, and 30 seconds are spent executing branches. If you enhance the design of the branch predictor such that it correctly predicts branches 90% of the time, what is the new execution time? old averaged branch penalty = .6 * 2 + .4 * 3 = 2.4 cycles new averaged branch penalty = .9 * 2 + .1 * 3 = 2.1 cycles branch speedup = 2.4 / 2.1 = 1.14 new execution time = 30 / 1.14 + 70 = 96.32 (Amdahl’s Law) 7. For the following segment of code, show the RAW data dependences for the five-state MIPS pipeline. add $2, $3, $4 sub $3, $2, $4 slt $4, $3, $2 sw $3, 0($2) (see Chapter 6 lecture slide 17) 8. Assume the following instructions are in the following pipeline stages: add $2, $3, $4 in WB or $2, $5, $6 in MEM sub $4, $2, $7 in EXECUTE Show or describe which forwarding paths are currently in use. (see Chapter 6 lecture slide 17) 9. For the following program: main: addi $2, $0, 0 loop: lw $3, vals($2) addi $2, $2, 4 beqz $3, done j loop done: sra $3, $2, 2 addi $3, $3, -1 jr $31 Assume we want to run this program on an architecture that has a branch delay slot. How must we modify the code such that it behaves exactly as it would on a machine without a branch delay slot, without introducing any additional pipeline stalls resulting from the new instruction ordering? Move the sra instruction below the beqz instruction, since this will be performed if the branch is taken and will not affect the behavior if it is executed if the branch is not taken.