Chapter 4:  Assessing and Understanding Performance 
 
1. Define response (execution) time. 
 
 response (or execution) time -- the time between the start and the finish of a task 
 
2. Define throughput. 
 
throughput -- total amount of work done in a given time 
 
3. Describe why using the clock rate of a processor is a bad way to measure performance.  
Provide a concrete example using the performance equation to back up your assertion. 
 
 Clock rate alone is not a reliable performance metric; CPI must also be provided, since CPI
“links” the clock rate to the number of instructions executed.
 
 Machine A is 200MHz, CPI of 1 
 Machine B is 400MHz, CPI of 4 
 
 For a given program... 
Machine A runtime = (instruction count × 1) / (200 × 10^6)

Machine B runtime = (instruction count × 4) / (400 × 10^6)
 
Machine B will be twice as slow for any program, in spite of its higher clock rate.
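
A quick numeric check of the comparison above, as a small Python sketch (the instruction count is an arbitrary illustrative value, since it cancels out of the ratio):

instruction_count = 100_000_000             # arbitrary; it cancels in the comparison

runtime_a = instruction_count * 1 / 200e6   # Machine A: 200 MHz, CPI 1
runtime_b = instruction_count * 4 / 400e6   # Machine B: 400 MHz, CPI 4

print(runtime_a, runtime_b)     # 0.5 s vs. 1.0 s
print(runtime_b / runtime_a)    # 2.0: B is twice as slow despite the higher clock rate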
 
4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 
12 seconds? 
 
 
CPI = (12 × (1.4 × 10^9)) / (12.5 × 10^6) = 1344
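
The same calculation in Python:

clock_rate = 1.4e9        # 1.4 GHz
seconds = 12
instructions = 12.5e6     # 12.5 million instructions

cpi = seconds * clock_rate / instructions   # total cycles / instructions executed
print(cpi)                # 1344.0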
 
5. What is the CPI of a program execution that consists of the following instruction types 
(classes) and CPI: 
 
Instruction class    CPI    Percentage
A                    1.4    25%
B                    2.4    70%
C                    2       5%
 
 CPI = .25(1.4) + .70(2.4) + .05(2) = 2.13 
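
Equivalently, as a weighted sum in Python:

mix = {"A": (1.4, 0.25), "B": (2.4, 0.70), "C": (2.0, 0.05)}   # class -> (CPI, fraction)
cpi = sum(class_cpi * fraction for class_cpi, fraction in mix.values())
print(cpi)   # 2.13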
 
6. Suppose one machine, A, executes a program with an average CPI of 1.9.  Suppose 
another machine, B (with the same instruction set and an enhanced compiler), executes 
the same program with 20% fewer instructions and with a CPI of 1.1 at 800 MHz.  In order
for the two machines to have the same performance, what does the clock rate of the first 
machine need to be? 
 
 (IC × 1.9) / CR_A = (IC × (1.0 - .20) × 1.1) / (800 × 10^6)

 CR_A ≈ 1.7 GHz
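
Solving the same equality in Python (the instruction count cancels out, so any value works):

cpi_a, cpi_b = 1.9, 1.1
cr_b = 800e6                 # Machine B clock rate, 800 MHz
ic_a = 1.0                   # arbitrary instruction count for A
ic_b = ic_a * (1.0 - 0.20)   # B executes 20% fewer instructions

cr_a = cpi_a * ic_a * cr_b / (cpi_b * ic_b)   # equate IC * CPI / CR for both machines
print(cr_a / 1e9)            # ~1.73 GHz (about 1.7 GHz)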
 
 
7. Use Amdahl’s Law to compute the new execution time for an architecture that previously 
required 20 seconds to execute a program, where 20% of the instructions executed were 
load/stores, if the time required for a load/store operation is reduced by 30% (amount of 
improvement for load/stores = 1/.70 ≈ 1.43).
 
Time after improvement = (Time affected by improvement / Amount of improvement) + Time unaffected by improvement

Time = 20(.20) / 1.43 + 20(.80)

Time = 18.8 s
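
The same Amdahl's Law arithmetic in Python:

t_total = 20.0
t_affected = t_total * 0.20      # load/store portion of execution time
t_unaffected = t_total * 0.80
improvement = 1 / 0.70           # load/store time reduced by 30%

t_new = t_affected / improvement + t_unaffected
print(t_new)                     # ~18.8 s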
 
 
8. What’s the best way to measure performance of a machine?  Clock rate, CPI, MIPS, 
FLOPS (floating-point operations per second), memory latency, or average execution 
time?  Why? 
 
Average execution time, since that’s all we care about in the end (performance is 
based solely on execution time). 
9. This problem compares the performance of the single-cycle MIPS architecture, the 
multi-cycle MIPS architecture, and the pipelined MIPS architecture. 
 
 Assume it requires 10 ns to perform any of the following operations:  main memory 
access, register file access, and arithmetic operation.  Assume the delay for multiplexors, 
registers (i.e. setup/hold time), and lookup tables is negligible. 
 
a. What is the minimum clock period and frequency for each of the three 
implementations? 
 
 single-cycle = 10 ns (fetch) + 10 ns (decode) + 10 ns (execute) + 10 ns (memory) +
10 ns (write back) = 50 ns (20 MHz)
 
 multi-cycle and pipelined = 10 ns 
 
b. What is the CPI for each of the implementations, assuming a program execution with 
the following instruction mix: 
  
Instruction                               Execution Frequency
r-type arithmetic, logical, comparison    60%
branch                                    15%
load                                      15%
store                                      5%
jump                                       5%
 
Assume the pipelined CPU performs forwarding, that the compiler schedules the
code so that it contains no load hazards, that branches have a fixed 3-cycle latency
(requiring 2 trailing no-ops), and that the time required to fill the pipeline may be
disregarded.
 
single-cycle CPI = 1 
 
multi-cycle CPI = .60 * 4 + .15 * 3 + .15 * 5 + .05 * 4 + .05 * 3 = 3.95 
 
pipelined CPI = .85 * 1 + .15 * 3 = 1.30 
 
c. Assume the program from part b executed 10 million instructions.  What is the 
speedup of the pipelined processor versus both the single-cycle processor and the 
multi-cycle processor? 
   
speedup vs. single-cycle = (1 * 50) / (1.30 * 10) = 3.85

speedup vs. multi-cycle = (3.95 * 10) / (1.30 * 10) = 3.04
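
A small Python check of parts (b) and (c) together (the instruction counts cancel out of the speedups, so only the CPIs and clock periods matter):

# instruction class -> (fraction, multi-cycle CPI, pipelined CPI)
mix = {
    "r-type": (0.60, 4, 1),
    "branch": (0.15, 3, 3),   # fixed 3-cycle branch latency when pipelined
    "load":   (0.15, 5, 1),
    "store":  (0.05, 4, 1),
    "jump":   (0.05, 3, 1),
}
cpi_single = 1.0
cpi_multi = sum(f * c for f, c, _ in mix.values())
cpi_pipe  = sum(f * c for f, _, c in mix.values())

period_single, period_other = 50, 10        # clock periods in ns
print(cpi_multi, cpi_pipe)                                          # 3.95, 1.30
print((cpi_single * period_single) / (cpi_pipe * period_other))     # ~3.85 vs. single-cycle
print((cpi_multi * period_other) / (cpi_pipe * period_other))       # ~3.04 vs. multi-cycle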
 
 
Chapter 5:  The Processor:  Datapath and Control 
 
1. Describe the differences between single-cycle and multi-cycle processor architectures. 
  
A single-cycle processor executes each instruction in a single clock cycle, so the
only sequential (clocked) portion of the design is the instruction fetch (the PC).
A multi-cycle processor executes each instruction over multiple shorter clock
cycles, carrying intermediate values from one cycle to the next in registers.
 
 
2. Describe in detail everything that is needed in order to add the addi, subi, andi, and ori 
instructions to the processor design above. 
 
A way must be provided for the ALU controller to interpret I-type instructions
(selecting the ALU operation from the opcode rather than the function field).  All
other datapaths already exist.
 
3. Design a datapath for the R-type sll, sra, and srl instructions.  Don’t worry about control 
signals. 
 
Assuming the ALU can perform the necessary shift operations on input B, add a 
datapath from IR[10..6] (SHAMT) into the ALU and add the necessary rows into 
the ALU controller truth table. 
 
4. For a multicycle processor design, describe what steps of the instruction execution are 
performed in each clock cycle for a load instruction (Hint: there are 5 clock cycles).  
 
Cycle 1:  fetch the instruction 
Cycle 2:  read the contents of the base register rs 
Cycle 3:  compute the effective load address by adding the contents of the base 
register to the sign-extended version of the immediate field 
Cycle 4:  read word from memory at that address 
Cycle 5:  write the word back to the register file at address rt 
5. Provide all control signals for a LW instruction using the following processor design.  
Assume ALUOp is 10 for interpretation of instruction function code, 00 for add, and 01 for 
subtract. 
 
 
 
RegDst  __0__ 
Branch  __0__ 
MemRead __1__ 
MemtoReg __1__ 
ALUOp  __00__
MemWrite __0__ 
ALUSrc  __1__ 
RegWrite __1__ 
 
 
6. Describe briefly why the MIPS designers use a separate ALU controller rather than
centralizing all control in the main control unit.
 
 This allows the main control unit to consolidate all the R-type instructions into a single
control state; the ALU controller then interprets the instruction's function field to select
the specific ALU operation.
7. Draw the design of a simple datapath that multiplies the contents of register A by 3 and 
latches the value into another register B.  This datapath can be represented by the 
following RTL: 
 
B <- (A) * 3 
 
You are provided with a library of the following components: 
• register 
• a combinational logic block named “shift-left by one”, that provides an output 
whose value is the input value shifted to the left by one bit 
• adder 
 
Assume these components may be scaled to any arbitrary bit width, so you may abstract 
away the size of each of the signals. 
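
The intended datapath: register A feeds both the shift-left-by-one block and one input of the adder; the shifter output (2A) feeds the other adder input, and the adder output (3A) is latched into register B on the next clock edge. A minimal behavioral model of that RTL in Python (illustrative only; the actual answer is a drawn datapath):

def clock_edge(a_reg: int) -> int:
    shifted = a_reg << 1        # shift-left-by-one block: 2 * A
    summed = shifted + a_reg    # adder: 2*A + A = 3*A
    return summed               # value latched into register B

print(clock_edge(5))            # 15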
 
 
 
 
 
 
 
 
 
Chapter 6:  Enhancing Performance with Pipelining 
 
1. List and describe the pipeline hazards. 
 
Structural hazards:  two instructions need the same piece of hardware in the same
cycle.  Structural hazards can be overcome by adding hardware.

Control hazards:  conditional control instructions are not resolved until late in
the pipeline, so subsequent instruction fetches must be predicted.  If the
prediction does not hold, the wrongly fetched instructions are flushed (ensuring
they make no state change).  Branch hazards can be mitigated with dynamic
prediction/speculation or a branch delay slot.

Data hazards:  an instruction in one pipeline stage is dependent on data
computed by an instruction in another pipeline stage.
 
2. For the following segment of code, indicate all data dependences for the 5-stage MIPS 
pipeline, and show the state of the pipeline for each cycle.  Assume a fixed 3-cycle 
branch latency. 
 
add $6,$5,$2 
lw $7,0($6) 
addi $7,$7,10 
add $6,$4,$2 
sw $7,0($6) 
addi $2,$2,4 
blt $2,$3,loop 
add $6,$5,$2 
 
 (see Chapter 6 lecture slide 17) 
3. Describe the disadvantage of moving branch resolution to the decode stage, thus 
reducing branch latency (mis-prediction penalty). 
 
This would increase the logic latency of the decode stage, since the register file
would need to be accessed in series with the branch-resolution logic, and that may
lengthen the clock period.
 
4. Describe a reason why a 2-bit branch predictor can be more effective than a 1-bit branch 
predictor. 
 
A 1-bit predictor flips its prediction when the loop-ending branch finally falls
through, so it mispredicts both the loop exit and the first iteration the next time
the loop is executed.  A 2-bit predictor tolerates the single not-taken outcome and
mispredicts only the loop exit.
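
A small simulation makes the difference concrete. The sketch below (illustrative, not from the text) replays a branch that is taken 9 times and then not taken, over 5 consecutive executions of the loop: the 1-bit predictor mispredicts the exit and then the first iteration of each following run, while the 2-bit saturating counter mispredicts only the exits.

def simulate(outcomes, bits):
    limit = 3 if bits == 2 else 1        # saturating-counter maximum
    state = limit                        # start out predicting "taken"
    mispredictions = 0
    for taken in outcomes:
        predict_taken = state >= (2 if bits == 2 else 1)
        if predict_taken != taken:
            mispredictions += 1
        state = min(limit, state + 1) if taken else max(0, state - 1)
    return mispredictions

outcomes = ([True] * 9 + [False]) * 5    # 10-iteration loop, executed 5 times
print(simulate(outcomes, bits=1))        # 9 mispredictions
print(simulate(outcomes, bits=2))        # 5 mispredictions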
 
5. Describe why predicting a branch taken still incurs at least a one-cycle penalty.

The branch instruction must first be fetched and decoded before its target address
is known (the target is computed in decode), so the instruction at the target cannot
be fetched until at least one cycle after the branch.
 
6. Assume an architecture has a branch predictor that correctly predicts a branch 60% of 
the time.  A correctly predicted branch requires a 2 cycle latency, and a mis-predicted 
branch requires a 3 cycle latency.  Suppose a program requires 100 seconds to run, and 
30 seconds are spent executing branches.  If you enhance the design of the branch 
predictor such that it correctly predicts branches 90% of the time, what is the new 
execution time? 
 
old average branch penalty = .6 * 2 + .4 * 3 = 2.4 cycles
new average branch penalty = .9 * 2 + .1 * 3 = 2.1 cycles

branch speedup = 2.4 / 2.1 ≈ 1.14

new execution time = 30 / (2.4 / 2.1) + 70 = 26.25 + 70 = 96.25 s (Amdahl's Law)
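
The same calculation in Python, keeping full precision:

old_penalty = 0.6 * 2 + 0.4 * 3      # 2.4 cycles per branch
new_penalty = 0.9 * 2 + 0.1 * 3      # 2.1 cycles per branch
speedup = old_penalty / new_penalty  # ~1.14

branch_time, other_time = 30.0, 70.0
print(branch_time / speedup + other_time)   # 96.25 s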
 
7. For the following segment of code, show the RAW data dependences for the five-stage
MIPS pipeline. 
 
add $2, $3, $4 
sub $3, $2, $4 
slt $4, $3, $2 
sw $3, 0($2) 
 
(see Chapter 6 lecture slide 17) 
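
Since the slide is not reproduced here, the RAW dependences can also be listed directly: sub reads $2 produced by add; slt reads $3 produced by sub and $2 produced by add; sw reads $3 produced by sub and $2 produced by add. A small Python sketch (illustrative encoding of the four instructions) that finds them mechanically:

# instruction -> (destination register or None, source registers)
program = [
    ("add $2, $3, $4", "$2", ["$3", "$4"]),
    ("sub $3, $2, $4", "$3", ["$2", "$4"]),
    ("slt $4, $3, $2", "$4", ["$3", "$2"]),
    ("sw  $3, 0($2)",  None, ["$3", "$2"]),   # sw writes memory, not a register
]

last_writer = {}      # register -> most recent instruction that wrote it
for name, dest, sources in program:
    for reg in sources:
        if reg in last_writer:
            print(f"{name}  reads {reg}  written by  {last_writer[reg]}")
    if dest is not None:
        last_writer[dest] = name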
 
8. Assume the following instructions are in the following pipeline stages: 
 
add $2, $3, $4  in WB 
or $2, $5, $6  in MEM 
sub $4, $2, $7  in EXECUTE 
 
Show or describe which forwarding paths are currently in use. 
 
(see Chapter 6 lecture slide 17: $2 is forwarded from the EX/MEM pipeline register,
which holds the or result, to the ALU input of sub; the older add result sitting in
MEM/WB is not used)
9. For the following program: 
 
main: addi $2, $0, 0 
loop: lw $3, vals($2) 
 addi $2, $2, 4 
 beqz $3, done 
 j loop 
done: sra $3, $2, 2 
 addi $3, $3, -1 
 jr $31 
 
Assume we want to run this program on an architecture that has a branch delay slot.  
How must we modify the code such that it behaves exactly as it would on a machine 
without a branch delay slot, without introducing any additional pipeline stalls resulting 
from the new instruction ordering? 
 
Move the sra instruction into the delay slot immediately after the beqz instruction.  If
the branch is taken, sra would have executed next anyway (it is the first instruction at
done); if the branch is not taken, the value it writes into $3 is overwritten by the lw at
the top of the loop before it is ever used, so the program's behavior is unchanged.