Java程序辅导

C C++ Java Python Processing编程在线培训 程序编写 软件开发 视频讲解

客服在线QQ:2653320439 微信:ittutor Email:itutor@qq.com
wx: cjtutor
QQ: 2653320439
1 
Single-Cycle Processors:

Datapath & Control

Arvind

Computer Science & Artificial Intelligence Lab

M.I.T.

Based on the material prepared by

Arvind and Krste Asanovic

6.823 L5- 2 
Instruction Set Architecture (ISA) 
Arvind 
versus Implementation 
• ISA is the hardware/software interface 
– Defines set of programmer visible state 
– Defines instruction format (bit encoding) and instruction 
semantics

– Examples:  MIPS, x86, IBM 360, JVM

• Many possible implementations of one ISA

– 360 implementations: model 30 (c. 1964), z900 (c. 2001) 
– x86 implementations:  8086 (c. 1978), 80186, 286, 386, 486, 
Pentium, Pentium Pro, Pentium-4 (c. 2000),  AMD Athlon, 
Transmeta Crusoe, SoftPC 
– MIPS implementations: R2000, R4000, R10000, ...

– JVM:  HotSpot, PicoJava, ARM Jazelle, ...

September 26, 2005 
6.823 L5- 3 
Arvind 
Processor Performance

Time = Instructions Cycles Time 
Program  Program  * Instruction * Cycle 
– Instructions per program depends on source code, compiler 
technology, and ISA 
– Cycles per instructions (CPI) depends upon the ISA and the 
microarchitecture 
– Time per cycle depends upon the microarchitecture and the 
base technology 
this lecture

Microarchitecture CPI cycle time 
Microcoded >1 short 
Single-cycle unpipelined 1 long 
Pipelined 1 short 
September 26, 2005 
6.823 L5- 4 
Arvind 
Microarchitecture: Implementation of an ISA

Controller 
Data 
path 
control 
pointsstatus 
lines 
Structure: How components are connected. 
Static 
Behavior: How data moves between components 
Dynamic 
September 26, 2005 
Hardware Elements

• Combinational circuits OpSelect 
– 
Result 
Comp? 
A 
B 
ALU 
Sel 
O 
A0 
A1 
An-1 
Mux... 
A 
D
e
m
u
x ... 
O0 
O1 
On-
1 
Sel 
A 
D
e
c
o
d
e
r
... 
O0 
O1 
On-1 
Mux, Demux, Decoder, ALU, ... - Add, Sub, ... 
- And, Or, Xor, Not, ... 
- GT, LT, EQ, Zero, ... 
lg(n) 
lg(n) 
lg(n) 
• Synchronous state elements 
– Flipflop, Register, Register file, SRAM, DRAM 
Clk 
D 
Q 
Enff 
Q 
D 
Clk 
En 
ff 
Q0 
D0 
Clk 
En 
ff 
Q1 
D1 
ff 
Q2 
D2 
ff 
Qn-1 
Dn-1 
... 
... 
... 
register 
Edge-triggered: Data is sampled at the rising edge 
September 26, 2005 
6.823 L5- 6 
Arvind 
Register Files 
Clock WE 
ReadData1ReadSel1 
ReadSel2 
WriteSel 
Register 
file 
2R+1W 
ReadData2 
WriteData 
rd1rs1 
rs2 
ws 
wd 
rd2 
we 
ws clk 
register 1 
wd 
we rs2 
rs1 
rd1 
rd2 
register 0 
… 
… 
… 
… 
32 
32 
32 
32 
32 
325 5 
5 
register 31 
•	 No timing issues in reading a selected register 
•	 Register files with a large number of ports are difficult 
to design 
–	 Intel’s Itanium, GPR File has 128 registers with 8 read ports and 
4 write ports!!! 
September 26, 2005 
6.823 L5- 7 
Arvind 
A Simple Memory Model

WriteEnable 
Clock 
Address 
ReadData 
WriteData 
Reads and writes are always completed in one cycle 
MAGIC 
RAM 
• a Read can be done any time (i.e. combinational) 
• a Write is performed at the rising clock edge 
if it is enabled 
⇒ 	 the write address and data 
must be stable at the clock edge 
Later in the course we will present a more realistic 
model of memory 
September 26, 2005 
6.823 L5- 8 
Arvind 
Implementing MIPS:

Single-cycle per instruction

datapath & control logic

September 26, 2005 
6.823 L5- 9 
Arvind 
The MIPS ISA

Processor State 
32 32-bit GPRs, R0 always contains a 0

32 single precision FPRs, may also be viewed as

16 double precision FPRs 
FP status register, used for FP compares & exceptions 
PC, the program counter 
some other special registers 
Data types

8-bit byte, 16-bit half word 

32-bit word for integers

32-bit word for single precision floating point

64-bit word for double precision floating point

Load/Store style instruction set

data addressing modes- immediate & indexed 
branch addressing modes- PC relative & register indirect 
Byte addressable memory- big endian mode 
All instructions are 32 bits 
September 26, 2005 
6.823 L5- 10 
Arvind 
Instruction Execution 
Execution of an instruction involves 
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back
and the computation of the address of the 
next instruction 
September 26, 2005 
6.823 L5- 11 
Arvind 
Datapath: Reg-Reg ALU Instructions

0x4 
Add 
clk 
addr 
inst 
Inst. 
Memory 
PC 
inst<5:0> 
z 
ALU 
Control 
RegWrite 
clk 
rd1 
rs1 
rs2 
ws 
wd rd2 
weinst<25:21> 
inst<20:16> 
inst<15:11> 
OpCode 
ALU 
GPRs 
RegWrite Timing?
6 5 5  5 5 6 
0 rd 0 funcrs rt  rd ← (rs) func (rt) 
31 26 25 21 20 16 15 11 5 0 
September 26, 2005 
6.823 L5- 12 
Arvind 
Datapath: Reg-Imm ALU Instructions

Imm 
Ext 
inst<15:0> 
0x4 
Add 
clk 
addr 
inst 
Inst. 
Memory 
PC 
z 
ALU 
RegWrite 
clk 
rd1 
rs1 
rs2 
ws 
wd rd2 
we 
ALU 
Control 
GPRs 
inst<25:21> 
inst<20:16> 
inst<31:26> 
OpCode ExtSel 
6 5 5  16  
opcode rs rt immediate rt ← (rs) op immediate 
31 26 25 2120 16 15 0 
September 26, 2005 
6.823 L5- 13 
Arvind 
Conflicts in Merging Datapath

0x4 
Add 
addr 
inst 
Inst. 
Memory 
PC 
clk 
RegWrite Introduce 
Imm 
Ext 
z 
ALU 
clk 
rd1 
rs1 
rs2 
ws 
wd rd2 
we 
inst<15:0> 
ALU 
Controlinst<5:0> 
muxes 
GPRs 
inst<25:21> 
inst<20:16> 
inst<31:26> 
inst<15:11> 
OpCode ExtSel 
6 5 5  5 5 6 
0 rs rt  rd 0 func 
opcode rs rt immediate 
rd ← (rs) func (rt) 
rt ← (rs) op immediate 
September 26, 2005 
6.823 L5- 14 
Arvind 
Datapath for ALU Instructions

0x4 
Add 
addr 
inst 
Inst. 
Memory 
PC 
RegWrite 
clk 
<31:26>, <5:0> 
BSrc 
Imm 
Ext 
z 
ALU 
clk 
rd1 
rs1 
rs2 
ws 
wd rd2 
we<25:21> 
<20:16> 
<15:0> 
ALU 
Control 
<15:11> 
GPRs 
OpCode	 RegDst ExtSel OpSel 
rt / rd Reg / Imm
6 5 5  5 5 6

0 rs rt  rd 0 func 
opcode rs rt immediate 
rd ← (rs) func (rt) 
rt ← (rs) op immediate 
September 26, 2005 
6.823 L5- 15 
Arvind 
Datapath for Memory Instructions

Should program and data memory be separate? 
Harvard style: separate (Aiken and Mark 1 influence) 
- read-only program memory 
- read/write data memory 
at some level the two memories have 
to be the same 
Princeton style: the same (von Neumann’s influence) 
- A Load or Store instruction requires 
accessing the memory more than once 
during its execution 
September 26, 2005 
6.823 L5- 16 
Arvind 
Load/Store Instructions:Harvard Datapath

0x4 
Add 
addr 
inst 
Inst. 
Memory 
PC 
RegWrite MemWrite 
clk 
WBSrc 
ALU / Mem 
“base” 
disp 
ALU 
Control 
z 
ALU 
clk 
rd1 
rs1 
rs2 
ws 
wd rd2 
we 
Imm 
Ext 
clk 
addr 
wdata 
rdata
Data 
Memory 
we 
GPRs 
OpCode RegDst ExtSel OpSel BSrc 
opcode rs rt displacement
6 5 5 16 addressing mode 
(rs) + displacement 
31 26 25 21 20  16 15 0 
rs is the base register 
rt is the destination of a Load or the source for a Store 
September 26, 2005 
6.823 L5- 17 
Arvind 
MIPS Control Instructions 
Conditional (on GPR) PC-relative branch 
6 5 5 16 
opcode rs offset BEQZ, BNEZ 
Unconditional register-indirect jumps

6 5 5  16 
opcode rs JR, JALR 
Unconditional absolute jumps 
6 26 
opcode target J, JAL 
• PC-relative branches add offset×4 to PC+4 to calculate the 
target address (offset is in words): ±128 KB range 
• Absolute jumps append target×4 to PC<31:28> to calculate 
the target address: 256 MB range 
• jump-&-link stores PC+4 into the link register (R31) 
• All Control Transfers are delayed by 1 instruction 
we will worry about the branch delay slot later 
September 26, 2005 
6.823 L5- 18 
Arvind 
Conditional Branches (BEQZ, BNEZ) 
Add 
PCSrc 
clk 
WBSrcMemWrite 
addr 
wdata 
rdata
Data 
Memory 
we 
z 
clk 
clk 
addr 
inst 
Inst. 
Memory 
PC rd1 
rs1 
rs2 
ws 
wd rd2 
we 
Imm 
Ext 
ALU 
ALU 
Control 
Add 
br 
pc+4 
RegWrite 
0x4 
zero? 
GPRs 
OpCode RegDst ExtSel OpSel BSrc 
September 26, 2005 
6.823 L5- 19 
Arvind 
Register-Indirect Jumps (JR) 
RegWrite 
Add 
Add 
clk 
WBSrcMemWrite 
addr 
wdata 
rdata
Data 
Memory 
we 
z 
clk 
clk 
addr 
inst 
Inst. 
Memory 
PC rd1 
rs1 
rs2 
ws 
wd rd2 
we 
Imm 
Ext 
ALU 
ALU 
Control 
PCSrc 
br 
pc+4 
rind 
0x4 
zero? 
GPRs 
OpCode RegDst ExtSel OpSel BSrc 
September 26, 2005 
6.823 L5- 20 
Arvind 
Register-Indirect Jump-&-Link (JALR) 
RegWrite 
Add 
Add 
clk 
WBSrcMemWrite 
addr 
wdata 
rdata
Data 
Memory 
we 
z 
clk 
clk 
addr 
inst 
Inst. 
Memory 
PC rd1 
rs1 
rs2 
ws 
wd rd2 
we 
Imm 
Ext 
ALU 
ALU 
Control 
31 
PCSrc 
br 
pc+4 
rind 
0x4 
zero? 
GPRs 
OpCode RegDst ExtSel OpSel BSrc 
September 26, 2005 
6.823 L5- 21 
Arvind 
Absolute Jumps (J, JAL) 
RegWrite 
Add 
Add 
clk 
WBSrcMemWrite 
addr 
wdata 
rdata
Data 
Memory 
we 
z 
clk 
clk 
addr 
inst 
Inst. 
Memory 
PC rd1 
rs1 
rs2 
ws 
wd rd2 
we 
Imm 
Ext 
ALU 
ALU 
Control 
31 
PCSrc 
br 
pc+4 
rind 
jabs 
0x4 
zero? 
GPRs 
OpCode RegDst ExtSel OpSel BSrc 
September 26, 2005 
6.823 L5- 22 
Arvind 
Harvard-Style Datapath for MIPS 
RegWrite 
Add 
Add 
clk 
WBSrc 
addr 
wdata 
rdata
Data 
Memory 
we 
z 
clk 
zero? 
clk 
addr 
inst 
Inst. 
Memory 
PC rd1 
rs1 
rs2 
ws 
wd rd2 
we 
Imm 
Ext 
ALU 
ALU 
Control 
31 
PCSrc 
br 
rind 
jabs 
pc+4 
0x4 
MemWrite 
GPRs 
OpCode RegDst ExtSel OpSel BSrc 
September 26, 2005 
23 
Five-minute break to stretch your legs

6.823 L5- 24 
Arvind 
Single-Cycle Hardwired Control:

Harvard architecture 
We will assume 
• clock period is sufficiently long for all of 

the following steps to be “completed”:

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time
⇒  tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB 
• At the rising edge of the following clock, the PC,

the register file and the memory are updated

September 26, 2005 
6.823 L5- 25 
Hardwired Control is pure 
Arvind 
Combinational Logic 
combinational 
logic 
ExtSel 
BSrc 
OpSel
op code 
MemWrite 
WBSrc
zero? 
RegDst 
RegWrite 
PCSrc 
September 26, 2005 
6.823 L5- 26 
Arvind 
ALU Control & Immediate Extension

Inst<31:26> (Opcode) 
Decode Map 
Inst<5:0> (Func) 
ALUop 
0? 
+ 
OpSel 
( Func, Op, +, 0? ) 
ExtSel 
( sExt16, uExt16, 
High16) 
September 26, 2005 
6.823 L5- 27 
Arvind 
Hardwired Control Table

Opcode ExtSel BSrc OpSel MemW RegW WBSrc RegDst PCSrc 
ALU * Reg Func no yes ALU rd pc+4 
ALUi sExt16 Imm Op no yes ALU rt pc+4 
ALUiu uExt16 Imm Op no yes ALU rt pc+4 
LW sExt16 Imm + no yes Mem rt pc+4 
SW sExt16 Imm + yes no * * pc+4 
BEQZz=0 sExt16 *  0?  no  no  *  *  br 
BEQZz=1 sExt16 *  0?  no  no  *  *  pc+4 
J * * * no no * * jabs 
JAL * * * no yes PC R31 jabs 
JR *  *  *  no  no  *  *  rind 
JALR *  *  *  no  yes  PC R31 rind  
BSrc = Reg / Imm WBSrc = ALU / Mem / PC    
RegDst = rt / rd / R31 PCSrc = pc+4 / br / rind / jabs 
September 26, 2005 
6.823 L5- 28 
Arvind 
Pipelined MIPS 
To pipeline MIPS: 
•	 First build MIPS without pipelining with CPI=1 
•	 Next, add pipeline registers to reduce cycle 
time while maintaining CPI=1 
September 26, 2005 
6.823 L5- 29 
Arvind 
Pipelined Datapath

0x4 
Add
addrPC 
we 
rs1 
rs2 
rd1 we 
ws addrrdata IR ALUwd rd2 
rdata GPRs Data Inst. 
MemoryImmMemory 
wdataExt 
write
fetch decode & Reg-fetch execute memory -back
phase phase phase phase phase 
Clock period can be reduced by dividing the execution of an 
instruction into multiple cycles 
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) 
However, CPI will increase unless instructions are pipelined 
September 26, 2005 
6.823 L5- 30 
Arvind 
An Ideal Pipeline 
stage 
1 
stage 
2 
stage 
3 
stage 
4 
• All objects go through the same stages 
• No sharing of resources between any two stages 
• Propagation delay through all pipeline stages is equal 
• The scheduling of an object entering the pipeline 
is not affected by the objects in other stages 
These conditions generally hold for industrial 
assembly lines. 
But can an instruction pipeline satisfy the last 
condition? 
September 26, 2005 
6.823 L5- 31 
How to divide the datapath 
Arvind 
into stages 
Suppose memory is significantly slower than 
other stages. In particular, suppose 
= 10 unitstIM

= 10 units
tDM

= 5 units
tALU 
= 1 unittRF

= 1 unit
tRW 
Since the slowest stage determines the clock, it 
may be possible to combine some stages without 
any loss of performance 
September 26, 2005 
tC > max {tIM, tRF, tALU, tDM, tRW}  = tDMI +tALU, tDM, tR    tDM
6.823 L5- 32 
Arvind 
Alternative Pipelining 
write 
-backfetch 
phase 
execute 
phase 
decode & Reg-fetch 
phase 
memory 
phase 
addr 
wdata 
Memory 
we 
ALU 
Imm 
Ext 
0x4 
Add 
addr 
Inst. 
Memory 
rd1 
GPRs 
rs1 
rs2 
ws 
wd rd2 
we 
IR
PC 
rdata 
Data 
rdata 
phase 
t    t , t F + W}  t + tRW 
⇒ increase the critical path by 10% 
Write-back stage takes much less time than other stages. 
Suppose we combined it with the memory phase 
September 26, 2005 
6.823 L5- 33 
Arvind 
Maximum Speedup by Pipelining 
Assumptions 	 Unpipelined  Pipelined Speedup 
tC 	 tC 
t
t
1. tIM = tDM = 10, 

ALU = 5, 

RF = tRW= 1

4-stage pipeline 	 27 10 2.7 
2. tIM = tDM = tALU = tRF = tRW = 5 
4-stage pipeline 25 10 2.5 
3. 	tIM = tDM = tALU = tRF = tRW = 5 
5-stage pipeline 25 5 5.0 
It is possible to achieve higher speedup with more 
stages in the pipeline. 
September 26, 2005 
34 
Thank you !