Feb. 2011 Computer Architecture, Advanced Architectures Slide 1
Part VII
Advanced Architectures
Feb. 2011 Computer Architecture, Advanced Architectures Slide 2
About This Presentation
This presentation is intended to support the use of the textbook
Computer Architecture: From Microprocessors to Supercomputers, 
Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated 
regularly by the author as part of his teaching of the upper-division 
course ECE 154, Introduction to Computer Architecture, at the 
University of California, Santa Barbara. Instructors can use these 
slides freely in classroom teaching and for other educational 
purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition: First.  Released: July 2003.  Revised: July 2004, July 2005, Mar. 2007, Feb. 2011*
* Minimal update, due to this part not being used for lectures in ECE 154 at UCSB
Feb. 2011 Computer Architecture, Advanced Architectures Slide 3
VII  Advanced Architectures
Topics in This Part
Chapter 25   Road to Higher Performance
Chapter 26   Vector and Array Processing
Chapter 27   Shared-Memory Multiprocessing
Chapter 28   Distributed Multicomputing
Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing
Feb. 2011 Computer Architecture, Advanced Architectures Slide 4
25  Road to Higher Performance
Review past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism
Topics in This Chapter
25.1   Past and Current Performance Trends
25.2   Performance-Driven ISA Extensions
25.3   Instruction-Level Parallelism
25.4   Speculation and Value Prediction
25.5   Special-Purpose Hardware Accelerators
25.6   Vector, Array, and Parallel Processing
Feb. 2011 Computer Architecture, Advanced Architectures Slide 5
25.1    Past and Current Performance Trends
[Figure: Intel microprocessor evolution, from the Intel 4004, the first μp (1971), at 0.06 MIPS (4-bit processor), to the Intel Pentium 4, circa 2005, at 10,000 MIPS (32-bit processor). Intermediate steps: 8-bit 8008, 8080, 8084; 16-bit 8086, 8088, 80186, 80188, 80286; 32-bit 80386, 80486, Pentium/MMX, Pentium Pro/II, Pentium III/M, Celeron.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 6
Architectural Innovations for Improved Performance
Architectural method                               Improvement factor
Established methods (previously discussed):
 1. Pipelining (and superpipelining)               3-8
 2. Cache memory, 2-3 levels                       2-5
 3. RISC and related ideas                         2-3
 4. Multiple instruction issue (superscalar)       2-3
 5. ISA extensions (e.g., for multimedia)          1-3
Newer methods (covered in Part VII):
 6. Multithreading (super-, hyper-)                2-5
 7. Speculation and value prediction               2-3
 8. Hardware acceleration                          2-10
 9. Vector and array processing                    2-10
10. Parallel/distributed computing                 2-1000s
Available computing power ca. 2000:
GFLOPS on desktop 
TFLOPS in supercomputer center
PFLOPS on drawing board
Computer performance grew by a factor of about 10,000 between 1980 and 2000:
a factor of 100 due to faster technology and a factor of 100 due to better architecture
Feb. 2011 Computer Architecture, Advanced Architectures Slide 7
Peak Performance of Supercomputers
[Figure: peak performance of supercomputers, 1980-2010, rising from GFLOPS through TFLOPS toward PFLOPS, roughly ×10 every 5 years. Machines plotted include Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and Earth Simulator.]
Dongarra, J., “Trends in High Performance Computing,”
Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 8
Energy Consumption is Getting out of Hand
Figure 25.1    Trend in energy consumption for each MIPS of 
computational power in general-purpose processors and DSPs. 
[Figure 25.1 plot: calendar year (1980-2010) vs. performance (kIPS to TIPS), showing absolute processor performance, GP processor performance per watt, and DSP performance per watt.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 9
25.2    Performance-Driven ISA Extensions
Adding instructions that do more work per cycle
Shift-add: replace two instructions with one (e.g., multiply by 5)
Multiply-add: replace two instructions with one (x := c + a × b)
Multiply-accumulate: reduce round-off error (s := s + a × b)
Conditional copy: to avoid some branches (e.g., in if-then-else)
Subword parallelism (for multimedia applications)
Intel MMX: multimedia extension
64-bit registers can hold multiple integer operands
Intel SSE: Streaming SIMD extension
128-bit registers can hold several floating-point operands
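To make the subword-parallelism idea concrete, here is a minimal C sketch (not Intel's MMX hardware or intrinsics; all names and values are illustrative) that adds eight packed 8-bit operands held in one 64-bit word while inhibiting carries at the 8-bit lane boundaries, i.e., the wraparound variant of a parallel add. A saturating variant would clamp each lane instead of wrapping.

#include <stdint.h>
#include <stdio.h>

/* Add eight packed 8-bit lanes held in one 64-bit word, inhibiting
   carries at lane boundaries (the essence of subword/SWAR parallelism). */
static uint64_t packed_add_u8(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL;   /* lane MSBs        */
    const uint64_t L = 0x7F7F7F7F7F7F7F7FULL;   /* lane low 7 bits  */
    /* Add the low 7 bits of each lane, then patch in the MSBs
       separately so no carry crosses an 8-bit boundary. */
    return ((a & L) + (b & L)) ^ ((a ^ b) & H);
}

int main(void)
{
    uint64_t a = 0x0102030405060708ULL;
    uint64_t b = 0x10FF101010101010ULL;
    printf("%016llx\n", (unsigned long long)packed_add_u8(a, b));
    return 0;
}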
Feb. 2011 Computer Architecture, Advanced Architectures Slide 10
Intel MMX ISA Extension (Table 25.1)

Class          Instruction                   Vector   Op type         Function or results
Copy           Register copy                          32 bits         Integer register ↔ MMX register
               Parallel pack                 4, 2     Saturate        Convert to narrower elements
               Parallel unpack low           8, 4, 2                  Merge lower halves of 2 vectors
               Parallel unpack high          8, 4, 2                  Merge upper halves of 2 vectors
Arithmetic     Parallel add                  8, 4, 2  Wrap/Saturate#  Add; inhibit carry at boundaries
               Parallel subtract             8, 4, 2  Wrap/Saturate#  Subtract with carry inhibition
               Parallel multiply low         4                        Multiply, keep the 4 low halves
               Parallel multiply high        4                        Multiply, keep the 4 high halves
               Parallel multiply-add         4                        Multiply, add adjacent products*
               Parallel compare equal        8, 4, 2                  All 1s where equal, else all 0s
               Parallel compare greater      8, 4, 2                  All 1s where greater, else all 0s
Shift          Parallel left shift logical   4, 2, 1                  Shift left, respect boundaries
               Parallel right shift logical  4, 2, 1                  Shift right, respect boundaries
               Parallel right shift arith    4, 2                     Arith shift within each (half)word
Logic          Parallel AND                  1        Bitwise         dest ← (src1) ∧ (src2)
               Parallel ANDNOT               1        Bitwise         dest ← (src1) ∧ (src2)′
               Parallel OR                   1        Bitwise         dest ← (src1) ∨ (src2)
               Parallel XOR                  1        Bitwise         dest ← (src1) ⊕ (src2)
Memory access  Parallel load MMX reg                  32 or 64 bits   Address given in integer register
               Parallel store MMX reg                 32 or 64 bits   Address given in integer register
Control        Empty FP tag bits                                      Required for compatibility$
Feb. 2011 Computer Architecture, Advanced Architectures Slide 11
MMX Multiplication and Multiply-Add
Figure 25.2    Parallel multiplication and multiply-add in MMX.
[Figure 25.2 diagrams: (a) parallel multiply low, keeping the low halves of the four element-wise products; (b) parallel multiply-add, summing adjacent products pairwise.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 12
MMX Parallel Comparisons
Figure 25.3    Parallel comparisons in MMX.
[Figure 25.3 diagrams: (a) parallel compare equal, where equal element pairs yield all-1s fields (e.g., 65,535 or 255) and unequal pairs yield all 0s; (b) parallel compare greater, with analogous per-element results.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 13
25.3    Instruction-Level Parallelism
Figure 25.4    Available instruction-level parallelism and the speedup 
due to multiple instruction issue in superscalar processors [John91].
[Figure 25.4 plots: (a) fraction of cycles vs. issuable instructions per cycle; (b) speedup attained vs. instruction issue width.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 14
Instruction-Level Parallelism
Figure 25.5    A computation with inherent instruction-level parallelism.
Feb. 2011 Computer Architecture, Advanced Architectures Slide 15
VLIW and EPIC Architectures
Figure 25.6    Hardware organization for IA-64. General and floating-point 
registers are 64 bits wide; predicates are single-bit registers.
VLIW Very long instruction word architecture
EPIC Explicitly parallel instruction computing
[Figure 25.6: memory feeds 128 general registers, 128 floating-point registers, and 64 predicate registers, which in turn supply multiple execution units.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 16
25.4    Speculation and Value Prediction
Figure 25.7    Examples of software speculation in IA-64.
 
[Figure 25.7: (a) control speculation, where a load is hoisted above earlier instructions as a speculative load paired with a later check load; (b) data speculation, where a load is hoisted above a possibly conflicting store, again paired with a check load.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 17
Value Prediction
Figure 25.8    Value prediction for multiplication or 
division via a memo table.
  
[Figure 25.8: the inputs feed both the multiply/divide unit and a memo table; on a memo-table hit, a mux supplies the remembered output and asserts "done" early, otherwise the functional unit's result is used.]
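As a rough software illustration of the memo-table idea (a sketch, not the hardware of Figure 25.8; the table size, hash function, and names are assumptions), the following C code remembers recent divide operands and results so that a repeated operation completes without redoing the division.

#include <stdint.h>
#include <stdio.h>

/* Small direct-mapped memo table for integer division. */
#define MEMO_SIZE 256

struct memo_entry { uint32_t a, b, result; int valid; };
static struct memo_entry memo[MEMO_SIZE];

static uint32_t memo_divide(uint32_t a, uint32_t b, int *hit)
{
    unsigned idx = (a ^ (b * 2654435761u)) % MEMO_SIZE;  /* simple hash */
    if (memo[idx].valid && memo[idx].a == a && memo[idx].b == b) {
        *hit = 1;                        /* result available "early" */
        return memo[idx].result;
    }
    *hit = 0;
    uint32_t q = a / b;                  /* the slow functional unit  */
    memo[idx] = (struct memo_entry){ a, b, q, 1 };
    return q;
}

int main(void)
{
    int hit;
    printf("%u hit=%d\n", memo_divide(1000, 7, &hit), hit);  /* miss */
    printf("%u hit=%d\n", memo_divide(1000, 7, &hit), hit);  /* hit  */
    return 0;
}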
Feb. 2011 Computer Architecture, Advanced Architectures Slide 18
25.5   Special-Purpose Hardware Accelerators
Figure 25.9    General structure of a processor 
with configurable hardware accelerators.
[Figure 25.9: a CPU with data and program memory plus a configuration memory; accelerators (Accel. 1-3) are formed on an FPGA-like unit by loading configuration registers, with some resources left unused.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 19
Graphic Processors, Network Processors, etc.
Figure 25.10    Simplified block diagram of Toaster2, 
Cisco Systems’ network processor.
[Figure 25.10: an input buffer feeds a 4 × 4 grid of processing elements (PE 0-15) with per-column memories, a feedback path, and an output buffer.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 20
25.6   Vector, Array, and Parallel Processing
Figure 25.11    The Flynn-Johnson classification of computer systems.
[Figure 25.11: Flynn's categories cross single vs. multiple instruction streams with single vs. multiple data streams; Johnson's expansion splits MIMD by global vs. distributed memory and shared variables vs. message passing.]

SISD (single instruction, single data stream):        uniprocessors
SIMD (single instruction, multiple data streams):     array or vector processors
MISD (multiple instructions, single data stream):     rarely used
MIMD (multiple instructions, multiple data streams):  multiprocessors or multicomputers
  GMSV (global memory, shared variables):       shared-memory multiprocessors
  GMMP (global memory, message passing):        rarely used
  DMSV (distributed memory, shared variables):  distributed shared memory
  DMMP (distributed memory, message passing):   distributed-memory multicomputers
Feb. 2011 Computer Architecture, Advanced Architectures Slide 21
SIMD Architectures
Data parallelism: executing one operation on multiple data streams
Concurrency in time  – vector processing
Concurrency in space – array processing
Example to provide context
Multiplying a coefficient vector by a data vector (e.g., in filtering)
y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement in vector processing 
(details in the first half of Chapter 26)
One instruction is fetched and decoded for the entire operation
The multiplications are known to be independent (no checking)
Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar (details in the second half of Chapter 26)
Feb. 2011 Computer Architecture, Advanced Architectures Slide 22
MISD Architecture Example
Figure 25.12    Multiple instruction streams 
operating on a single data stream (MISD).
[Figure 25.12: a single data stream flows in, through a cascade of stages driven by instruction streams 1-5, and out.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 23
MIMD Architectures
Control parallelism: executing several instruction streams in parallel
GMSV: Shared global memory – symmetric multiprocessors
DMSV: Shared distributed memory – asymmetric multiprocessors
DMMP: Message passing – multicomputers
Figure 27.1   Centralized shared memory. Figure 28.1   Distributed memory.
[Figure 27.1: p processors and m memory modules connected through a processor-to-memory network, with a processor-to-processor network and parallel I/O (centralized shared memory). Figure 28.1: computing nodes, each with memories, processors, and a router, connected through an interconnection network with parallel I/O (distributed memory).]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 24
Amdahl’s Law Revisited
[Figure 4.4 plot: speedup s (0-50) vs. enhancement factor p (0-50) for f = 0, 0.01, 0.02, 0.05, and 0.1.]
Figure 4.4    Amdahl's law: speedup achieved if a fraction f of a task 
is unaffected and the remaining 1 – f part runs p times as fast.

    s = 1 / [f + (1 – f)/p]  ≤  min(p, 1/f)

where f = sequential fraction and p = speedup of the rest (e.g., with p processors).
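A small C sketch, with example values of f and p chosen only for illustration, evaluates the formula above and reproduces the trend plotted in Figure 4.4.

#include <stdio.h>

/* Evaluate Amdahl's law s = 1 / (f + (1 - f) / p). */
static double amdahl_speedup(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    const double fs[] = { 0.0, 0.01, 0.02, 0.05, 0.1 };
    for (int i = 0; i < 5; i++)                 /* p fixed at 50 for the example */
        printf("f = %.2f, p = 50  ->  s = %.1f\n", fs[i], amdahl_speedup(fs[i], 50.0));
    return 0;
}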
Feb. 2011 Computer Architecture, Advanced Architectures Slide 25
26  Vector and Array Processing
Single instruction stream operating on multiple data streams
• Data parallelism in time = vector processing
• Data parallelism in space = array processing
Topics in This Chapter
26.1   Operations on Vectors
26.2   Vector Processor Implementation
26.3   Vector Processor Performance
26.4   Shared-Control Systems
26.5   Array Processor Implementation
26.6   Array Processor Performance
Feb. 2011 Computer Architecture, Advanced Architectures Slide 26
26.1  Operations on Vectors
Sequential processor:
for i = 0 to 63 do 
P[i] := W[i] × D[i]
endfor
Vector processor:
load W
load D
P := W × D
store P
By contrast, the following loop is unparallelizable, because each iteration
depends on results of the previous one (see the C sketch below):
for i = 0 to 63 do 
  X[i+1] := X[i] + Z[i] 
  Y[i+1] := X[i+1] + Y[i] 
endfor
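For reference, a plain C rendering of the two loops above (array sizes and initial values are illustrative) highlights why the first vectorizes and the second does not.

#include <stdio.h>

#define N 64

int main(void)
{
    double P[N], W[N], D[N];
    double X[N + 1], Y[N + 1], Z[N];

    for (int i = 0; i < N; i++) { W[i] = i; D[i] = 2.0; Z[i] = 1.0; }
    X[0] = 0.0; Y[0] = 0.0;

    /* Vectorizable: every P[i] depends only on W[i] and D[i]. */
    for (int i = 0; i < N; i++)
        P[i] = W[i] * D[i];

    /* Unparallelizable as one vector operation: X[i+1] needs X[i]. */
    for (int i = 0; i < N; i++) {
        X[i + 1] = X[i] + Z[i];
        Y[i + 1] = X[i + 1] + Y[i];
    }
    printf("P[63] = %g, X[64] = %g, Y[64] = %g\n", P[63], X[64], Y[64]);
    return 0;
}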
Feb. 2011 Computer Architecture, Advanced Architectures Slide 27
26.2   Vector Processor Implementation
Figure 26.1    Simplified generic structure of a vector processor.
  
[Figure 26.1: a vector register file, fed from scalar registers and by load units A and B, supplies three pipelined function units through forwarding muxes; a store unit returns results to the memory unit.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 28
Conflict-Free Memory Access
Figure 26.2    Skewed storage of the elements of a 64 × 64 matrix 
for conflict-free memory access in a 64-way interleaved memory. 
Elements of column 0 are highlighted in both diagrams.
[Figure 26.2 diagrams: (a) conventional row-major order, where all elements of a column fall in the same bank; (b) skewed row-major order, where row i is rotated by i positions so that the elements of any column lie in distinct banks 0-63.]
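A minimal C sketch of the skewing rule suggested by Figure 26.2, assuming element (i, j) of the 64 × 64 matrix is placed in bank (i + j) mod 64, verifies that the highlighted column 0 touches all 64 banks exactly once.

#include <stdio.h>

#define BANKS 64

/* Skewed row-major placement: bank of element (i, j). */
static int bank_of(int i, int j) { return (i + j) % BANKS; }

int main(void)
{
    int used[BANKS] = { 0 };
    for (int i = 0; i < BANKS; i++)      /* walk down column 0 */
        used[bank_of(i, 0)]++;

    int conflict_free = 1;
    for (int b = 0; b < BANKS; b++)
        if (used[b] != 1) conflict_free = 0;
    printf("column 0 conflict-free: %s\n", conflict_free ? "yes" : "no");
    return 0;
}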
Feb. 2011 Computer Architecture, Advanced Architectures Slide 29
Overlapped Memory Access and Computation
Figure 26.3    Vector processing via segmented load/store of 
vectors in registers in a double-buffering scheme. Solid (dashed) 
lines show data flow in the current (next) segment.
[Figure 26.3: vector registers 0-5 double-buffer segments of X, Y, and Z between the memory unit (Load X, Load Y, Store Z) and a pipelined adder; solid lines show data flow in the current segment, dashed lines in the next.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 30
26.3   Vector Processor Performance
Figure 26.4    Total latency of the vector computation 
S := X × Y + Z, without and with pipeline chaining.
[Figure 26.4: without chaining, the addition (with its start-up time) begins only after the multiplication completes; with pipeline chaining, the addition overlaps the multiplication and total latency shrinks.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 31
Performance as a Function of Vector Length
Figure 26.5    The per-element execution time in a vector processor 
as a function of the vector length.
[Figure 26.5 plot: clock cycles per vector element (0-5) vs. vector length (0-400).]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 32
26.4   Shared-Control Systems
Figure 26.6    From completely shared control 
to totally separate controls.
[Figure 26.6 diagrams: (a) shared-control array processor, SIMD; (b) multiple shared controls, MSIMD; (c) separate controls, MIMD. Each panel shows control unit(s) driving an array of processing elements, from one shared control to one control per processor.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 33
Example Array Processor
Figure 26.7    Array processor with 2D torus 
interprocessor communication network.
[Figure 26.7: a control unit broadcasts to a 2D grid of processors whose edge connections wrap around through switches to form a torus; the switches also provide parallel I/O.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 34
26.5   Array Processor Implementation
Figure 26.8     Handling of interprocessor communication 
via a mechanism similar to data forwarding.
[Figure 26.8: each PE's ALU and register file connect through a communication buffer, steered by CommunDir/CommunEn, to its N, E, W, S neighbors, alongside the data memory and a PE state flip-flop feeding the array state register.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 35
Configuration Switches
Figure 26.9    I/O switch states in the array processor of Figure 26.7.
[Figure 26.9, based on Figure 26.7: switch settings for (a) torus operation, (b) clockwise I/O, and (c) counterclockwise I/O.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 36
26.6   Array Processor Performance
Array processors perform well for the same class of problems that
are suitable for vector processors; for embarrassingly (pleasantly)
parallel problems, array processors can be faster and more
energy-efficient than vector processors
A criticism of array processing:
For conditional computations, a significant part of the array remains
idle while the “then” part is performed; subsequently, idle and busy
processors reverse roles during the “else” part
However:
Considering array processors inefficient due to idle processors
is like criticizing mass transportation because many seats are 
unoccupied most of the time
It’s the total cost of computation that counts, not hardware utilization! 
Feb. 2011 Computer Architecture, Advanced Architectures Slide 37
27  Shared-Memory Multiprocessing
Multiple processors sharing a memory unit seems naïve
• Didn’t we conclude that memory is the bottleneck?
• How then does it make sense to share the memory?
Topics in This Chapter
27.1   Centralized Shared Memory
27.2   Multiple Caches and Cache Coherence
27.3   Implementing Symmetric Multiprocessors
27.4   Distributed Shared Memory
27.5   Directories to Guide Data Access
27.6   Implementing Asymmetric Multiprocessors
Feb. 2011 Computer Architecture, Advanced Architectures Slide 38
Parallel Processing 
as a Topic of Study
Graduate course ECE 254B:
Adv. Computer Architecture –
Parallel Processing
An important area of study
that allows us to overcome
fundamental speed limits
Our treatment of the topic is
quite brief (Chapters 26-27)
Feb. 2011 Computer Architecture, Advanced Architectures Slide 39
27.1  Centralized Shared Memory
Figure 27.1    Structure of a multiprocessor with 
centralized shared-memory.
[Figure 27.1: p processors connect through a processor-to-memory network to m memory modules with parallel I/O; a separate processor-to-processor network links the processors.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 40
Processor-to-Memory Interconnection Network
Figure 27.2   Butterfly and the related Beneš network as examples 
of processor-to-memory interconnection network in a multiprocessor.
[Figure 27.2 diagrams: (a) butterfly network and (b) Beneš network, each connecting 16 processors to 16 memories through rows 0-7 of switching elements.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 41
Processor-to-Memory Interconnection Network
Figure 27.3   Interconnection of eight processors to 256 memory banks 
in Cray Y-MP, a supercomputer with multiple vector processors.
[Figure 27.3: processors 0-7 reach 256 memory banks through 1 × 8 switches, 8 × 8 section switches, and 4 × 4 subsection switches; e.g., one subsection serves banks 0, 4, 8, 12, ..., 60 and another serves banks 227, 231, 235, ..., 255.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 42
Shared-Memory Programming: Broadcasting
Copy B[0] into all B[i] so that multiple processors
can read its value without memory access conflicts
for k = 0 to ⌈log2 p⌉ – 1 processor j, 0 ≤ j < p, do
  B[j + 2^k] := B[j]
endfor
(a sequential C sketch of this recursive doubling appears after the diagram below)
[Diagram: copies of B[0] spread through array B (entries 0-11) by recursive doubling.]
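The following sequential C sketch simulates the recursive-doubling broadcast above (the array size and the broadcast value are illustrative); after step k, the first 2^(k+1) entries hold a copy of B[0].

#include <stdio.h>

#define P 12

int main(void)
{
    int B[P] = { 42 };                      /* B[0] holds the value to spread */
    for (int k = 1; k < P; k *= 2)          /* k = 2^0, 2^1, ... < p          */
        for (int j = 0; j < k && j + k < P; j++)
            B[j + k] = B[j];                /* "processor j" copies its value */

    for (int j = 0; j < P; j++)
        printf("%d ", B[j]);
    printf("\n");
    return 0;
}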
Feb. 2011 Computer Architecture, Advanced Architectures Slide 43
Shared-Memory Programming: Summation
Sum reduction of vector X (the entries of Z end up holding prefix sums; Z[p – 1] is the total)
processor j, 0 ≤ j < p, do Z[j] := X[j]
s := 1
while s < p processor j, 0 ≤ j < p – s, do
  Z[j + s] := Z[j] + Z[j + s]
  s := 2 × s
endwhile
(a sequential C sketch appears after the diagram below)
[Diagram: sums S over ten elements (0-9) computed by recursive doubling; after successive steps the entries hold ranges 0:0, 0:1, ..., 0:9.]
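A sequential C sketch of the recursive-doubling summation above (input values are illustrative); each pass simulates one lockstep parallel step, so Z[j] ends up holding X[0] + ... + X[j].

#include <stdio.h>

#define P 10

int main(void)
{
    int X[P] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    int Z[P], old[P];

    for (int j = 0; j < P; j++) Z[j] = X[j];
    for (int s = 1; s < P; s *= 2) {
        for (int j = 0; j < P; j++) old[j] = Z[j];   /* lockstep read  */
        for (int j = 0; j < P - s; j++)              /* "processor j"  */
            Z[j + s] = old[j] + old[j + s];          /* lockstep write */
    }
    for (int j = 0; j < P; j++)
        printf("Z[%d] = %d\n", j, Z[j]);             /* prefix sums    */
    return 0;
}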
Feb. 2011 Computer Architecture, Advanced Architectures Slide 44
27.2  Multiple Caches and Cache Coherence
Private processor caches reduce memory access traffic through the 
interconnection network but lead to challenging consistency problems.
[Diagram: the Figure 27.1 organization with a private cache inserted between each processor and the processor-to-memory network.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 45
Status of Data Copies
Figure 27.4   Various types of cached data blocks in a parallel processor 
with centralized main memory and private processor caches.
[Figure 27.4: cached copies of blocks w, x, y, z illustrate the four possible states of a cached data block: multiple consistent, single consistent, single inconsistent, and invalid.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 46
A Snoopy Cache Coherence Protocol
Figure 27.5   Finite-state control mechanism for a bus-based 
snoopy cache coherence protocol with write-back caches.
States: Invalid, Shared (read-only), Exclusive (writable).

Invalid -> Shared:       CPU read miss (signal read miss on bus)
Invalid -> Exclusive:    CPU write miss (signal write miss on bus)
Shared -> Shared:        CPU read hit
Shared -> Exclusive:     CPU write hit (signal write miss on bus)
Shared -> Invalid:       bus write miss
Exclusive -> Exclusive:  CPU read or write hit
Exclusive -> Shared:     bus read miss (write back cache line)
Exclusive -> Invalid:    bus write miss (write back cache line)

[Diagram: several processor/cache pairs share a bus to memory.]
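One way to make the protocol of Figure 27.5 concrete is to encode its states and transitions as a next-state function; the C sketch below is a minimal rendering (the event names and the short driver in main are illustrative, not a full cache controller).

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } event;

/* Next state of one cache line under the snoopy write-back protocol. */
static line_state next_state(line_state s, event e)
{
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;        /* signal read miss on bus  */
        if (e == CPU_WRITE) return EXCLUSIVE;     /* signal write miss on bus */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return EXCLUSIVE; /* signal write miss       */
        if (e == BUS_WRITE_MISS) return INVALID;
        return SHARED;                             /* CPU read hit            */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  return SHARED;    /* write back cache line   */
        if (e == BUS_WRITE_MISS) return INVALID;   /* write back cache line   */
        return EXCLUSIVE;                          /* CPU read or write hit   */
    }
    return INVALID;
}

int main(void)
{
    line_state s = INVALID;
    s = next_state(s, CPU_READ);        /* Invalid   -> Shared    */
    s = next_state(s, CPU_WRITE);       /* Shared    -> Exclusive */
    s = next_state(s, BUS_READ_MISS);   /* Exclusive -> Shared    */
    printf("final state = %d (1 = Shared)\n", s);
    return 0;
}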
Feb. 2011 Computer Architecture, Advanced Architectures Slide 47
27.3   Implementing Symmetric Multiprocessors
Figure 27.6   Structure of a generic bus-based symmetric multiprocessor.
 
[Figure 27.6: computing nodes (typically 1-4 CPUs and caches per node), interleaved memory, and I/O modules with standard interfaces, all attached through bus adapters to a very wide, high-bandwidth bus.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 48
Bus Bandwidth Limits Performance
Example 27.1
Consider a shared-memory multiprocessor built around a single bus with 
a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and 
each instruction requires access to an average of 1.4 memory words 
(including the instruction itself). The combined hit rate for caches is 98%. 
Compute an upper bound on the multiprocessor performance in GIPS. 
Address lines are separate and do not affect the bus data bandwidth.
Solution
Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112 B. 
Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. 
Assuming a bus width of 32 B, no bus cycle or data going to waste, and 
a bus clock rate of y GHz, the performance bound becomes 286y GIPS. 
This bound is highly optimistic. Buses operate in the range 0.1 to 1 GHz. 
Thus, a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is 
beyond reach with this type of architecture.
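The arithmetic of Example 27.1 can be rechecked with a few lines of C; the bus bandwidth x and clock rate y stay symbolic, so only their coefficients are printed.

#include <stdio.h>

int main(void)
{
    double words_per_instr = 1.4;   /* memory words touched per instruction */
    double miss_rate       = 0.02;  /* combined cache miss rate (1 - 0.98)  */
    double word_bytes      = 4.0;   /* bytes per word                       */

    double bytes_per_instr = words_per_instr * miss_rate * word_bytes; /* 0.112 B */
    printf("bus traffic per instruction = %.3f B\n", bytes_per_instr);
    printf("performance bound = %.2f * x GIPS (x = bus bandwidth in GB/s)\n",
           1.0 / bytes_per_instr);                                     /* ~8.93x  */
    printf("with a 32-B-wide bus: %.0f * y GIPS (y = bus clock in GHz)\n",
           32.0 / bytes_per_instr);                                    /* ~286y   */
    return 0;
}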
Feb. 2011 Computer Architecture, Advanced Architectures Slide 49
Implementing Snoopy Caches
Figure 27.7   Main structure for a snoop-based cache coherence algorithm.
[Figure 27.7: the cache data array is flanked by a main tag and state store for the processor-side cache control and a duplicate tag and state store for the snoop-side control; address/command buffers and tag comparators link both sides to the system bus.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 50
27.4   Distributed Shared Memory
Figure 27.8   Structure of a distributed shared-memory multiprocessor.
[Figure 27.8: processors with local memory, holding shared variables such as x, y, and z, connect through routers to an interconnection network with parallel I/O; one node executes "y := -1; z := 1" while another spins on "while z = 0 do x := x + y endwhile", illustrating access to remotely stored variables.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 51
27.5   Directories to Guide Data Access
Figure 27.9   Distributed shared-memory multiprocessor with a cache, 
directory, and memory module associated with each processor.
[Figure 27.9: each node pairs a processor and cache with a memory module, a directory, and a communication and memory interface; the nodes connect through an interconnection network with parallel I/O.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 52
Directory-Based Cache Coherence
Figure 27.10   States and transitions for a directory entry in a directory-
based cache coherence protocol (c is the requesting cache).
States of a directory entry: Uncached, Shared (read-only), Exclusive (writable); c is the requesting cache.

Uncached -> Shared on a read miss:        return value, set sharing set to {c}
Uncached -> Exclusive on a write miss:    return value, set sharing set to {c}
Shared -> Shared on a read miss:          return value, include c in sharing set
Shared -> Exclusive on a write miss:      invalidate all cached copies, set sharing set to {c}, return value
Exclusive -> Shared on a read miss:       fetch data from owner, return value, include c in sharing set
Exclusive -> Exclusive on a write miss:   fetch data from owner, request invalidation, return value, set sharing set to {c}
Exclusive -> Uncached on data write-back: set sharing set to { }
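A minimal C sketch of a directory entry for this protocol, handling only the read-miss transitions as an example; the number of caches and the bit-vector representation of the sharing set are assumptions.

#include <stdio.h>

#define NUM_CACHES 32

typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state;

struct dir_entry {
    dir_state state;
    unsigned  sharing_set;   /* bit c set => cache c holds a copy */
};

/* Requesting cache c suffers a read miss on this block. */
static void read_miss(struct dir_entry *e, int c)
{
    switch (e->state) {
    case UNCACHED:                      /* return value, sharing set = {c} */
        e->sharing_set = 1u << c;
        e->state = SHARED;
        break;
    case SHARED:                        /* return value, include c         */
        e->sharing_set |= 1u << c;
        break;
    case EXCLUSIVE:                     /* fetch from owner, include c     */
        e->sharing_set |= 1u << c;
        e->state = SHARED;
        break;
    }
}

int main(void)
{
    struct dir_entry e = { UNCACHED, 0 };
    read_miss(&e, 3);
    read_miss(&e, 7);
    printf("state = %d, sharing set = 0x%x\n", e.state, e.sharing_set);
    return 0;
}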
Feb. 2011 Computer Architecture, Advanced Architectures Slide 53
27.6   Implementing Asymmetric Multiprocessors
Figure 27.11   Structure of a ring-based distributed-memory multiprocessor.
[Figure 27.11: computing nodes 0-3 (typically 1-4 CPUs and associated memory each) joined by links into a ring network, with connections to memory and to I/O controllers.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 54
Scalable Coherent Interface (SCI)
Figure 27.11   Structure of a ring-based 
distributed-memory multiprocessor.
[Diagram: four nodes (processors and caches 0-3) with associated memories, attached to the interconnection network.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 55
28  Distributed Multicomputing
Computer architects’ dream: connect computers like toy blocks
• Building multicomputers from loosely connected nodes
• Internode communication is done via message passing
Topics in This Chapter
28.1   Communication by Message Passing
28.2   Interconnection Networks
28.3   Message Composition and Routing
28.4   Building and Using Multicomputers
28.5   Network-Based Distributed Computing
28.6   Grid Computing and Beyond
Feb. 2011 Computer Architecture, Advanced Architectures Slide 56
28.1  Communication by Message Passing
Figure 28.1    Structure of a distributed multicomputer.
[Figure 28.1: computing nodes, each containing memories, processors, and a router, connect through an interconnection network with parallel I/O.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 57
Router Design
Figure 28.2   The structure of a generic router.
[Figure 28.2: input channels with link controllers (LC) and input queues (Q) feed a switch under routing-and-arbitration control; output queues and link controllers drive the output channels, and injection/ejection channels with a message queue connect the router to its local node.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 58
Building Networks from Switches
Switch settings: straight through, crossed connection, lower broadcast, upper broadcast.
Figure 28.3   Example 2 × 2 switch with point-to-point 
and broadcast connection capabilities.
[Figure 27.2, repeated: butterfly and Beneš networks built from such 2 × 2 switches, connecting 16 processors to 16 memories.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 59
Interprocess Communication via Messages
Figure 28.4   Use of send and receive message-passing 
primitives to synchronize two processes.
[Figure 28.4: process B reaches "receive x" and is suspended; process A later executes "send x", and B is awakened once the message arrives after the communication latency.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 60
28.2  Interconnection Networks
Figure 28.5    Examples of direct and indirect interconnection networks.
[Figure 28.5: (a) direct network, where each node has its own router; (b) indirect network, where nodes attach to a separate switching fabric.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 61
Direct Interconnection Networks
Figure 28.6   A sampling of common direct interconnection networks. 
Only routers are shown; a computing node is implicit for each router.
(a) 2D torus (b) 4D hypercube 
(c) Chordal ring (d) Ring of rings 
Feb. 2011 Computer Architecture, Advanced Architectures Slide 62
Indirect Interconnection Networks
Figure 28.7   Two commonly used indirect interconnection networks.
[Figure 28.7: (a) hierarchical buses (level-1, level-2, and level-3 buses); (b) Omega network.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 63
28.3  Message Composition and Routing
Figure 28.8    Messages and their parts for message passing.
[Figure 28.8: a message (with padding) is split into packets; each transmitted packet carries a header, data or payload, and a trailer, and is further divided into flow-control digits (flits).]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 64
Wormhole Switching
Figure 28.9   Concepts of wormhole switching.
[Figure 28.9: (a) two worms en route to their respective destinations, one moving and one blocked; (b) deadlock due to circular waiting of four blocked worms, each blocked at the point of an attempted right turn.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 65
28.4  Building and Using Multicomputers
Figure 28.10    A task system and schedules on 1, 2, and 3 computers.
[Figure 28.10: (a) static task graph with tasks A-H (execution times t = 1, 2, or 3) from inputs to outputs; (b) schedules of those tasks on 1, 2, and 3 computers over a time axis of 0-15.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 66
Building Multicomputers from Commodity Nodes
Figure 28.11   Growing clusters using modular nodes.
[Figure 28.11: (a) current racks of modules, each module holding CPU(s), memory, and disks, with expansion slots; (b) futuristic toy-block construction with wireless connection surfaces.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 67
28.5  Network-Based Distributed Computing
Figure 28.12    Network of workstations.
[Figure 28.12: PCs attach via their system or I/O buses to fast network interfaces (NICs) with large memory, joined by a network built of high-speed wormhole switches.]
Feb. 2011 Computer Architecture, Advanced Architectures Slide 68
28.6  Grid Computing and Beyond
Computational grid is analogous to the power grid
Decouples the “production” and “consumption” of computational power
Homes don’t have an electricity generator; why should they have a computer?
Advantages of computational grid:
Near continuous availability of computational and related resources
Resource requirements based on sum of averages, rather than sum of peaks
Paying for services based on actual usage rather than peak demand
Distributed data storage for higher reliability, availability, and security
Universal access to specialized and one-of-a-kind computing resources
Still to be worked out as of late 2000s: How to charge for compute usage
Feb. 2011 Computer Architecture, Advanced Architectures Slide 69
Computing in the Cloud
Image from Wikipedia
Computational resources,
both hardware and software,
are provided by, and 
managed within, the cloud
Users pay a fee for access
Managing / upgrading is 
much more efficient in large, 
centralized facilities 
(warehouse-sized data 
centers or server farms)
This is a natural continuation of the outsourcing trend for special services, 
so that companies can focus their energies on their main business
Feb. 2011 Computer Architecture, Advanced Architectures Slide 70
The Shrinking Supercomputer
Feb. 2011 Computer Architecture, Advanced Architectures Slide 71
Warehouse-Sized Data Centers
Image from IEEE Spectrum, June 2009