Feb. 2011 Computer Architecture, Advanced Architectures Slide 1 Part VII Advanced Architectures Feb. 2011 Computer Architecture, Advanced Architectures Slide 2 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami Edition Released Revised Revised Revised Revised First July 2003 July 2004 July 2005 Mar. 2007 Feb. 2011* * Minimal update, due to this part not being used for lectures in ECE 154 at UCSB Feb. 2011 Computer Architecture, Advanced Architectures Slide 3 VII Advanced Architectures Topics in This Part Chapter 25 Road to Higher Performance Chapter 26 Vector and Array Processing Chapter 27 Shared-Memory Multiprocessing Chapter 28 Distributed Multicomputing Performance enhancement beyond what we have seen: • What else can we do at the instruction execution level? • Data parallelism: vector and array processing • Control parallelism: parallel and distributed processing Feb. 2011 Computer Architecture, Advanced Architectures Slide 4 25 Road to Higher Performance Review past, current, and future architectural trends: • General-purpose and special-purpose acceleration • Introduction to data and control parallelism Topics in This Chapter 25.1 Past and Current Performance Trends 25.2 Performance-Driven ISA Extensions 25.3 Instruction-Level Parallelism 25.4 Speculation and Value Prediction 25.5 Special-Purpose Hardware Accelerators 25.6 Vector, Array, and Parallel Processing Feb. 2011 Computer Architecture, Advanced Architectures Slide 5 25.1 Past and Current Performance Trends 0.06 MIPS (4-bit processor) Intel 4004: The first μp (1971) Intel Pentium 4, circa 2005 10,000 MIPS (32-bit processor) 8008 8080 8084 8-bit 8086 80186 80286 16-bit 8088 80188 80386 Pentium, MMX Pentium Pro, II32-bit 80486 Pentium III, M Celeron Feb. 2011 Computer Architecture, Advanced Architectures Slide 6 Architectural Innovations for Improved Performance Architectural method Improvement factor 1. Pipelining (and superpipelining) 3-8 √ 2. Cache memory, 2-3 levels 2-5 √ 3. RISC and related ideas 2-3 √ 4. Multiple instruction issue (superscalar) 2-3 √ 5. ISA extensions (e.g., for multimedia) 1-3 √ 6. Multithreading (super-, hyper-) 2-5 ? 7. Speculation and value prediction 2-3 ? 8. Hardware acceleration 2-10 ? 9. Vector and array processing 2-10 ? 10. Parallel/distributed computing 2-1000s ? E s t a b l i s h e d m e t h o d s N e w e r m e t h o d s P r e v i o u s l y d i s c u s s e d C o v e r e d i n P a r t V I I Available computing power ca. 2000: GFLOPS on desktop TFLOPS in supercomputer center PFLOPS on drawing board Computer performance grew by a factor of about 10000 between 1980 and 2000 100 due to faster technology 100 due to better architecture Feb. 2011 Computer Architecture, Advanced Architectures Slide 7 Peak Performance of Supercomputers PFLOPS TFLOPS GFLOPS 1980 20001990 2010 Earth Simulator ASCI White Pacific ASCI Red Cray T3DTMC CM-5 TMC CM-2Cray X-MP Cray 2 × 10 / 5 years Dongarra, J., “Trends in High Performance Computing,” Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04] Feb. 2011 Computer Architecture, Advanced Architectures Slide 8 Energy Consumption is Getting out of Hand Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs. 1990 1980 2000 2010 kIPS MIPS GIPS TIPS P e r f o r m a n c e Calendar year Absolute processor performance GP processor performance per watt DSP performance per watt Feb. 2011 Computer Architecture, Advanced Architectures Slide 9 25.2 Performance-Driven ISA Extensions Adding instructions that do more work per cycle Shift-add: replace two instructions with one (e.g., multiply by 5) Multiply-add: replace two instructions with one (x := c + a × b) Multiply-accumulate: reduce round-off error (s := s + a × b) Conditional copy: to avoid some branches (e.g., in if-then-else) Subword parallelism (for multimedia applications) Intel MMX: multimedia extension 64-bit registers can hold multiple integer operands Intel SSE: Streaming SIMD extension 128-bit registers can hold several floating-point operands Feb. 2011 Computer Architecture, Advanced Architectures Slide 10 Intel MMX ISA Exten- sion Table 25.1 Class Instruction Vector Op type Function or results Register copy 32 bits Integer register ↔ MMX register Parallel pack 4, 2 Saturate Convert to narrower elements Parallel unpack low 8, 4, 2 Merge lower halves of 2 vectors Parallel unpack high 8, 4, 2 Merge upper halves of 2 vectors Parallel add 8, 4, 2 Wrap/Saturate# Add; inhibit carry at boundaries Parallel subtract 8, 4, 2 Wrap/Saturate# Subtract with carry inhibition Parallel multiply low 4 Multiply, keep the 4 low halves Parallel multiply high 4 Multiply, keep the 4 high halves Parallel multiply-add 4 Multiply, add adjacent products* Parallel compare equal 8, 4, 2 All 1s where equal, else all 0s Parallel compare greater 8, 4, 2 All 1s where greater, else all 0s Parallel left shift logical 4, 2, 1 Shift left, respect boundaries Parallel right shift logical 4, 2, 1 Shift right, respect boundaries Parallel right shift arith 4, 2 Arith shift within each (half)word Parallel AND 1 Bitwise dest ← (src1) ∧ (src2) Parallel ANDNOT 1 Bitwise dest ← (src1) ∧ (src2)′ Parallel OR 1 Bitwise dest ← (src1) ∨ (src2) Parallel XOR 1 Bitwise dest ← (src1) ⊕ (src2) Parallel load MMX reg 32 or 64 bits Address given in integer register Parallel store MMX reg 32 or 64 bit Address given in integer register Control Empty FP tag bits Required for compatibility$ Memory access Logic Shift Arith- metic Copy Feb. 2011 Computer Architecture, Advanced Architectures Slide 11 MMX Multiplication and Multiply-Add Figure 25.2 Parallel multiplication and multiply-add in MMX. a (a) Parallel multiply low (b) Parallel multiply-add b d e e f g h s t u v e × h d × g b × f a × e z v y u x t w s a b d e e f g h s + t u + v e × h d × g b × f a × e v u t s add add Feb. 2011 Computer Architecture, Advanced Architectures Slide 12 MMX Parallel Comparisons Figure 25.3 Parallel comparisons in MMX. 14 (a) Parallel compare equal (b) Parallel compare greater 3 58 66 79 1 58 65 0 0 0 5 12 3 32 12 3 22 5 12 6 9 12 5 90 17 8 65 535 (all 1s) 0 0 0 0 0 255 (all 1s) Feb. 2011 Computer Architecture, Advanced Architectures Slide 13 25.3 Instruction-Level Parallelism Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91]. 1 F r a c t i o n o f c y c l e s Issuable instructions per cycle 20% 30% 10% 0% 2 3 4 5 6 7 8 0 S p e e d u p a t t a i n e d Instruction issue width 3 2 1 2 4 6 8 0 (a) (b) Feb. 2011 Computer Architecture, Advanced Architectures Slide 14 Instruction-Level Parallelism Figure 25.5 A computation with inherent instruction-level parallelism. Feb. 2011 Computer Architecture, Advanced Architectures Slide 15 VLIW and EPIC Architectures Figure 25.6 Hardware organization for IA-64. General and floating- point registers are 64-bit wide. Predicates are single-bit registers. VLIW Very long instruction word architecture EPIC Explicitly parallel instruction computing Memory General registers (128) Floating-point registers (128) Predi- cates (64) Execution unit Execution unit Execution unit Execution unit Execution unit Execution unit . . . . . . Feb. 2011 Computer Architecture, Advanced Architectures Slide 16 25.4 Speculation and Value Prediction Figure 25.7 Examples of software speculation in IA-64. ---- ---- ---- ---- load ---- ---- spec load ---- ---- ---- ---- check load ---- ---- (a) Control speculation ---- ---- store ---- load ---- ---- spec load ---- ---- store ---- check load ---- ---- (b) Data speculation Feb. 2011 Computer Architecture, Advanced Architectures Slide 17 Value Prediction Figure 25.8 Value prediction for multiplication or division via a memo table. Mult/ Div Memo table Control Mux Inputs Inputs ready Output Output ready 0 1 Miss Done Feb. 2011 Computer Architecture, Advanced Architectures Slide 18 25.5 Special-Purpose Hardware Accelerators Figure 25.9 General structure of a processor with configurable hardware accelerators. CPU Configuration memory Accel. 1 Accel. 2 Accel. 3 Data and program memory FPGA-like unit on which accelerators can be formed via loading of configuration registers Unused resources Feb. 2011 Computer Architecture, Advanced Architectures Slide 19 Graphic Processors, Network Processors, etc. Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems’ network processor. Input buffer PE 0 PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 PE 7 PE 8 PE 9 PE 10 PE 11 PE 12 PE 13 PE 14 PE 15 Output buffer Column memory Column memory Column memory Column memory Feedback path PE 5 Feb. 2011 Computer Architecture, Advanced Architectures Slide 20 25.6 Vector, Array, and Parallel Processing Figure 25.11 The Flynn-Johnson classification of computer systems. SISD SIMD MISD MIMD GMSV GMMP DMSV DMMP Single data stream Multiple data streams S i n g l e i n s t r s t r e a m M u l t i p l e i n s t r s t r e a m s Flynn’s categories J o h n s o n ’ s e x p a n s i o n Shared variables Message passing G l o b a l m e m o r y D i s t r i b u t e d m e m o r y Uniprocessors Rarely used Array or vector processors Mult iproc’s or mult icomputers Shared-memory mult iprocessors Rarely used Distributed shared memory Distrib-memory mult icomputers Feb. 2011 Computer Architecture, Advanced Architectures Slide 21 SIMD Architectures Data parallelism: executing one operation on multiple data streams Concurrency in time – vector processing Concurrency in space – array processing Example to provide context Multiplying a coefficient vector by a data vector (e.g., in filtering) y[i] := c[i] × x[i], 0 ≤ i < n Sources of performance improvement in vector processing (details in the first half of Chapter 26) One instruction is fetched and decoded for the entire operation The multiplications are known to be independent (no checking) Pipelining/concurrency in memory access as well as in arithmetic Array processing is similar (details in the second half of Chapter 26) Feb. 2011 Computer Architecture, Advanced Architectures Slide 22 MISD Architecture Example Figure 25.12 Multiple instruction streams operating on a single data stream (MISD). I n s t r u c t i o n s t r e a m s 1-5 Data in Data out Feb. 2011 Computer Architecture, Advanced Architectures Slide 23 MIMD Architectures Control parallelism: executing several instruction streams in parallel GMSV: Shared global memory – symmetric multiprocessors DMSV: Shared distributed memory – asymmetric multiprocessors DMMP: Message passing – multicomputers Figure 27.1 Centralized shared memory. Figure 28.1 Distributed memory. 0 0 1 1 m−1 Processor- to- memory network Processor- to- processor network Processors Memory modules Parallel I/O . . . . . . . . . p−1 0 1 Inter- connection network Memories and processors P a r a l l e l i n p u t / o u t p u t p−1 . . . Routers A computing node . . . Feb. 2011 Computer Architecture, Advanced Architectures Slide 24 Amdahl’s Law Revisited 0 10 20 40 50 0 10 20 30 40 50 Enhancement factor (p ) S p e e d u p ( s ) f = 0 f = 0.1 f = 0.05 f = 0.02 30 f = 0.01 Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast. s = ≤ min(p, 1/f) 1 f+ (1 – f)/p f = sequential fraction with p processors p = speedup of the rest Feb. 2011 Computer Architecture, Advanced Architectures Slide 25 26 Vector and Array Processing Single instruction stream operating on multiple data streams • Data parallelism in time = vector processing • Data parallelism in space = array processing Topics in This Chapter 26.1 Operations on Vectors 26.2 Vector Processor Implementation 26.3 Vector Processor Performance 26.4 Shared-Control Systems 26.5 Array Processor Implementation 26.6 Array Processor Performance Feb. 2011 Computer Architecture, Advanced Architectures Slide 26 26.1 Operations on Vectors Sequential processor: for i = 0 to 63 do P[i] := W[i] × D[i] endfor Vector processor: load W load D P := W × D store P for i = 0 to 63 do X[i+1] := X[i] + Z[i] Y[i+1] := X[i+1]+Y[i] endfor Unparallelizable Feb. 2011 Computer Architecture, Advanced Architectures Slide 27 26.2 Vector Processor Implementation Figure 26.1 Simplified generic structure of a vector processor. Function unit 1 pipeline T o a n d f r o m m e m o r y u n i t From scalar registers Vector register file Function unit 2 pipeline Function unit 3 pipeline Forwarding muxes Load unit A Load unit B Store unit Feb. 2011 Computer Architecture, Advanced Architectures Slide 28 Conflict-Free Memory Access Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory. Elements of column 0 are highlighted in both diagrams . 0,0 2,0 . . . 62,0 63,0 0,1 2,1 . . . 62,1 63,1 0,2 2,2 . . . 62,2 63,2 0,62 2,62 . . . 62,62 63,62 0,63 2,63 . . . 62,63 63,63 ... ... . . . ... ... 0,0 2,62 . . . 62,2 63,1 0,1 2,63 . . . 62,3 63,2 0,2 2,0 . . . 62,4 63,3 0,62 2,60 . . . 62,0 63,63 0,63 2,61 . . . 62,1 63,0 ... ... . . . ... ... (a) Conventional row-major order (b) Skewed row-major order Bank number 0 1 62 63 2 . . . 0 1 62 63 2 . . . 1,0 1,1 1,2 1,62 1,63 ... 1,63 1,0 1,0 1,61 1,62 ... Feb. 2011 Computer Architecture, Advanced Architectures Slide 29 Overlapped Memory Access and Computation Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme. Solid (dashed) lines show data flow in the current (next) segment. Vector reg 0 Vector reg 1 Vector reg 5 Vector reg 2 Vector reg 3 Vector reg 4 Load X Load Y Store Z T o a n d f r o m m e m o r y u n i t Pipelined adder Feb. 2011 Computer Architecture, Advanced Architectures Slide 30 26.3 Vector Processor Performance Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining. Multiplication start-up Addition start-up + × + × Without chaining With pipeline chaining Time Feb. 2011 Computer Architecture, Advanced Architectures Slide 31 Performance as a Function of Vector Length Figure 26.5 The per-element execution time in a vector processor as a function of the vector length. Vector length 100 200 300 400 0 C l o c k c y c l e s p e r v e c t o r e l e m e n t 5 4 3 2 1 0 Feb. 2011 Computer Architecture, Advanced Architectures Slide 32 26.4 Shared-Control Systems Figure 26.6 From completely shared control to totally separate controls. (a) Shared-control array processor, SIMD (b) Multiple shared controls, MSIMD (c) Separate controls, MIMD Processing Control . . . Processing Control . . . Processing Control . . . . . . Feb. 2011 Computer Architecture, Advanced Architectures Slide 33 Example Array Processor Figure 26.7 Array processor with 2D torus interprocessor communication network. Control broadcast Parallel I/O Processor array Control Switches Feb. 2011 Computer Architecture, Advanced Architectures Slide 34 26.5 Array Processor Implementation Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding. ALU Reg file CommunDir CommunEn PE state FF Data memory To array state reg To reg f ile and data memory Commun buffer N E W S To NEWS neighbors 0 1 Feb. 2011 Computer Architecture, Advanced Architectures Slide 35 Configuration Switches Figure 26.9 I/O switch states in the array processor of Figure 26.7. Control broadcast Parallel I/O Processor array Control Switches Figure 26.7 (a) Torus operation In (b) Clockwise I/O (c) Counterclockwise I/O Out In Out Feb. 2011 Computer Architecture, Advanced Architectures Slide 36 26.6 Array Processor Performance Array processors perform well for the same class of problems that are suitable for vector processors For embarrassingly (pleasantly) parallel problems, array processors A criticism of array processing: For conditional computations, a significant part of the array remains idle while the “then” part is performed; subsequently, idle and busy processors reverse roles during the “else” part However: Considering array processors inefficient due to idle processors is like criticizing mass transportation because many seats are unoccupied most of the time It’s the total cost of computation that counts, not hardware utilization! can be faster and more energy-efficient than vector processors Feb. 2011 Computer Architecture, Advanced Architectures Slide 37 27 Shared-Memory Multiprocessing Multiple processors sharing a memory unit seems naïve • Didn’t we conclude that memory is the bottleneck? • How then does it make sense to share the memory? Topics in This Chapter 27.1 Centralized Shared Memory 27.2 Multiple Caches and Cache Coherence 27.3 Implementing Symmetric Multiprocessors 27.4 Distributed Shared Memory 27.5 Directories to Guide Data Access 27.6 Implementing Asymmetric Multiprocessors Feb. 2011 Computer Architecture, Advanced Architectures Slide 38 Parallel Processing as a Topic of Study Graduate course ECE 254B: Adv. Computer Architecture – Parallel Processing An important area of study that allows us to overcome fundamental speed limits Our treatment of the topic is quite brief (Chapters 26-27) Feb. 2011 Computer Architecture, Advanced Architectures Slide 39 27.1 Centralized Shared Memory Figure 27.1 Structure of a multiprocessor with centralized shared-memory. 0 0 1 1 m−1 Processor- to- memory network Processor- to- processor network Processors Memory modules Parallel I/O . . . . . . . . . p−1 Feb. 2011 Computer Architecture, Advanced Architectures Slide 40 Processor-to-Memory Interconnection Network Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection network in a multiprocessor. (a) Butterfly network (b) Beneš network 0 2 4 6 8 10 12 14 Processors Memories P r o c e s s o r s M e m o r i e s 1 3 5 7 9 11 13 15 0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15 0 2 4 6 0 2 4 6 1 3 5 7 1 3 5 7 Row 0 Row 1 Row 2 Row 3 Row 4 Row 5 Row 6 Row 7 Feb. 2011 Computer Architecture, Advanced Architectures Slide 41 Processor-to-Memory Interconnection Network Figure 27.3 Interconnection of eight processors to 256 memory banks in Cray Y-MP, a supercomputer with multiple vector processors. 0 1 2 3 4 5 6 7 8 × 8 8 × 8 8 × 8 8 × 8 4 × 4 4 × 4 4 × 4 4 × 4 4 × 4 4 × 4 4 × 4 4 × 4 Sections Subsections Memory banks 0, 4, 8, 12, 16, 20, 24, 28 32, 36, 40, 44, 48, 52, 56, 60 1, 5, 9, 13, 17, 21, 25, 29 2, 6, 10, 14, 18, 22, 26, 30 3, 7, 11, 15, 19, 23, 27, 31 Processors 1 × 8 switches 224, 228, 232, 236, . . . , 252 225, 229, 233, 237, . . . , 253 226, 230, 234, 238, . . . , 254 227, 231, 235, 239, . . . , 255 8 / 8 / 8 / 8 / Feb. 2011 Computer Architecture, Advanced Architectures Slide 42 Shared-Memory Programming: Broadcasting Copy B[0] into all B[i] so that multiple processors can read its value without memory access conflicts for k = 0 to ⎡log2 p⎤ – 1 processor j, 0 ≤ j < p, do B[j + 2k] := B[j] endfor 0 1 2 3 4 5 6 7 8 9 10 11 B Recursive doubling Feb. 2011 Computer Architecture, Advanced Architectures Slide 43 Shared-Memory Programming: Summation Sum reduction of vector X processor j, 0 ≤ j < p, do Z[j] := X[j] s := 1 while s < p processor j, 0 ≤ j < p – s, do Z[j + s] := X[j] + X[j + s] s := 2 × s endfor 0 1 2 3 4 5 6 7 8 9 S 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9 0:0 0:1 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9 0:0 0:1 0:2 0:3 1:4 2:5 3:6 4:7 5:8 6:9 0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 1:8 2:9 0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9 Recursive doubling Feb. 2011 Computer Architecture, Advanced Architectures Slide 44 27.2 Multiple Caches and Cache Coherence Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems. 0 0 1 1 m−1 Processor- to- memory network p−1 Processor- to- processor network Processors Caches Memory modules Parallel I/O . . . . . . . . . Feb. 2011 Computer Architecture, Advanced Architectures Slide 45 Status of Data Copies Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches. 0 1 Processor- to- memory network p–1 Processor- to- processor network Processors Caches Memory modules Parallel I/O . . . . . . . . . w x y z ′ w z ′ w y ′ x z Multiple consistent Single consistent Single inconsistent Invalid m–1 0 1 Feb. 2011 Computer Architecture, Advanced Architectures Slide 46 A Snoopy Cache Coherence Protocol Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches. CPU read or write hit Invalid Shared (read-only) Exclusive (writable) CPU read hit CPU read miss: signal read miss on bus CPU w rite miss: signal write miss on bus CPU w rite hit: signal write miss on bus Bus write miss: write back cache line Bus write miss Bus read miss: write back cache line P C P C P C P C Bus Memory Feb. 2011 Computer Architecture, Advanced Architectures Slide 47 27.3 Implementing Symmetric Multiprocessors Figure 27.6 Structure of a generic bus-based symmetric multiprocessor. Computing nodes (typically, 1-4 CPUs and caches per node) Interleaved memory Bus adapter I/O modules Standard interfaces Bus adapter Very wide, high-bandwidth bus Feb. 2011 Computer Architecture, Advanced Architectures Slide 48 Bus Bandwidth Limits Performance Example 27.1 Consider a shared-memory multiprocessor built around a single bus with a data bandwidth of x GB/s. Instructions and data words are 4 B wide, each instruction requires access to an average of 1.4 memory words (including the instruction itself). The combined hit rate for caches is 98%. Compute an upper bound on the multiprocessor performance in GIPS. Address lines are separate and do not affect the bus data bandwidth. Solution Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112B. Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. Assuming a bus width of 32 B, no bus cycle or data going to waste, and a bus clock rate of y GHz, the performance bound becomes 286y GIPS. This bound is highly optimistic. Buses operate in the range 0.1 to 1 GHz. Thus, a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is beyond reach with this type of architecture. Feb. 2011 Computer Architecture, Advanced Architectures Slide 49 Implementing Snoopy Caches Figure 27.7 Main structure for a snoop-based cache coherence algorithm. Tags Cache data array Duplicate tags and state store for snoop side CPU Main tags and state store for processor side =? =? Processor side cache control Snoop side cache control Addr Addr Cmd Cmd Buffer Buffer Snoop state System bus Tag Addr Cmd State Feb. 2011 Computer Architecture, Advanced Architectures Slide 50 27.4 Distributed Shared Memory Figure 27.8 Structure of a distributed shared-memory multiprocessor. 0 1 z : 0 x : 0 y : 1 Inter- connection network Processors with memory P a r a l l e l i n p u t / o u t p u t . . . p−1 y := -1 z := 1 while z=0 do x := x + y endwhile Routers Feb. 2011 Computer Architecture, Advanced Architectures Slide 51 27.5 Directories to Guide Data Access Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor. 0 1 Inter- connection network Processors & caches P a r a l l e l i n p u t / o u t p u t . . . p−1 Memories Directories Communication & memory interfaces Feb. 2011 Computer Architecture, Advanced Architectures Slide 52 Directory-Based Cache Coherence Figure 27.10 States and transitions for a directory entry in a directory- based cache coherence protocol (c is the requesting cache). Write miss: return value, set sharing set to {c} Uncached Shared (read-only) Exclusive (writable) Read miss: return value, include c in sharing set Read miss: return value, set sharing set to {c} Write miss: invalidate all cached copies, set sharing set to {c}, return value Data w rite-back: set sharing set to { } Read miss: fetch data from owner, return value, include c in sharing set Write miss: fetch data from owner, request invalidation, return value, set sharing set to {c} Feb. 2011 Computer Architecture, Advanced Architectures Slide 53 27.6 Implementing Asymmetric Multiprocessors Figure 27.11 Structure of a ring-based distributed-memory multiprocessor. Computing nodes (typically, 1-4 CPUs and associated memory) Link To I/O controllers Memory Ring network Link Link Link Node 0 Node 1 Node 2 Node 3 Feb. 2011 Computer Architecture, Advanced Architectures Slide 54 Scalable Coherent Interface (SCI) Figure 27.11 Structure of a ring-based distributed-memory multiprocessor. 0 1 Processors and caches T o i n t e r c o n n e c t i o n n e t w o r k 3 Memories 2 Feb. 2011 Computer Architecture, Advanced Architectures Slide 55 28 Distributed Multicomputing Computer architects’ dream: connect computers like toy blocks • Building multicomputers from loosely connected nodes • Internode communication is done via message passing Topics in This Chapter 28.1 Communication by Message Passing 28.2 Interconnection Networks 28.3 Message Composition and Routing 28.4 Building and Using Multicomputers 28.5 Network-Based Distributed Computing 28.6 Grid Computing and Beyond Feb. 2011 Computer Architecture, Advanced Architectures Slide 56 28.1 Communication by Message Passing Figure 28.1 Structure of a distributed multicomputer. 0 1 Inter- connection network Memories and processors P a r a l l e l i n p u t / o u t p u t p−1 . . . Routers A computing node Feb. 2011 Computer Architecture, Advanced Architectures Slide 57 Router Design Figure 28.2 The structure of a generic router. Switch I n p u t c h a n n e l s Routing and arbitration Input queues Q Q Q Q Q Q Q Q LC LC LC LC LC LC LC LC O u t p u t c h a n n e l s Output queues Q Q LC LC Link controller Message queue Injection channel Ejection channel Feb. 2011 Computer Architecture, Advanced Architectures Slide 58 Building Networks from Switches Straight through Crossed connection Lower broadcast Upper broadcast Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities. (a) Butterfly network (b) Beneš network 0 2 4 6 8 10 12 14 Processors Memories P r o c e s s o r s M e m o r i e s 1 3 5 7 9 11 13 15 0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15 0 2 4 6 0 2 4 6 1 3 5 7 1 3 5 7 Row 0 Row 1 Row 2 Row 3 Row 4 Row 5 Row 6 Row 7 Figure 27.2 Butterfly and Beneš networks Feb. 2011 Computer Architecture, Advanced Architectures Slide 59 Interprocess Communication via Messages Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes. Process A Process B ... ... ... ... ... ... send x ... ... ... ... ... ... ... ... ... receive x ... ... ... Time Communication latency Process B is suspended Process B is awakened Feb. 2011 Computer Architecture, Advanced Architectures Slide 60 28.2 Interconnection Networks Figure 28.5 Examples of direct and indirect interconnection networks. (a) Direct network (b) Indirect network Routers Nodes Nodes Feb. 2011 Computer Architecture, Advanced Architectures Slide 61 Direct Interconnection Networks Figure 28.6 A sampling of common direct interconnection networks. Only routers are shown; a computing node is implicit for each router. (a) 2D torus (b) 4D hypercube (c) Chordal ring (d) Ring of rings Feb. 2011 Computer Architecture, Advanced Architectures Slide 62 Indirect Interconnection Networks Figure 28.7 Two commonly used indirect interconnection networks. (a) Hierarchical buses (b) Omega network Level-1 bus Level-2 bus Level-3 bus Feb. 2011 Computer Architecture, Advanced Architectures Slide 63 28.3 Message Composition and Routing Figure 28.8 Messages and their parts for message passing. Message Padding Packet data Last packet Header Trailer A transmitted packet Flow control digits (flits) Data or payload First packet Feb. 2011 Computer Architecture, Advanced Architectures Slide 64 Wormhole Switching Figure 28.9 Concepts of wormhole switching. Worm 1: moving (a) Two worms en route to their respective destinations Source 2 Source 1 Destination 1 Destination 2 Worm 2: blocked (b) Deadlock due to circular waiting of four blocked worms Each worm is blocked at the point of attempted right turn Feb. 2011 Computer Architecture, Advanced Architectures Slide 65 28.4 Building and Using Multicomputers Figure 28.10 A task system and schedules on 1, 2, and 3 computers. (a) Static task graph (b) Schedules on 1-3 computers Inputs Outputs t = 1 t = 1 t = 2 t = 2 t = 2 t = 3 B A C D E F G H t = 1 t = 2 B A C D E F G H B A C D E H F G B A C D E F G H 0 5 10 15 Time Feb. 2011 Computer Architecture, Advanced Architectures Slide 66 Building Multicomputers from Commodity Nodes Figure 28.11 Growing clusters using modular nodes. (a) Current racks of modules (b) Futuristic toy-block construction Expansion slots One module: CPU, memory, disks One module: CPU(s), memory, disks Wireless connection surfaces Feb. 2011 Computer Architecture, Advanced Architectures Slide 67 28.5 Network-Based Distributed Computing Figure 28.12 Network of workstations. System or I/O bus PC Fast network interface with large memory NIC Network built of high-speed wormhole switches Feb. 2011 Computer Architecture, Advanced Architectures Slide 68 28.6 Grid Computing and Beyond Computational grid is analogous to the power grid Decouples the “production” and “consumption” of computational power Homes don’t have an electricity generator; why should they have a computer? Advantages of computational grid: Near continuous availability of computational and related resources Resource requirements based on sum of averages, rather than sum of peaks Paying for services based on actual usage rather than peak demand Distributed data storage for higher reliability, availability, and security Universal access to specialized and one-of-a-kind computing resources Still to be worked out as of late 2000s: How to charge for compute usage Feb. 2011 Computer Architecture, Advanced Architectures Slide 69 Computing in the Cloud Image from Wikipedia Computational resources, both hardware and software, are provided by, and managed within, the cloud Users pay a fee for access Managing / upgrading is much more efficient in large, centralized facilities (warehouse-sized data centers or server farms) This is a natural continuation of the outsourcing trend for special services, so that companies can focus their energies on their main business Feb. 2011 Computer Architecture, Advanced Architectures Slide 70 The Shrinking Supercomputer Feb. 2011 Computer Architecture, Advanced Architectures Slide 71 Warehouse-Sized Data Centers Image from IEEE Spectrum, June 2009