Transport Layer (TCP/UDP)

Where we are in the Course
• Moving on up to the Transport Layer!
• [Layer stack: Physical, Link, Network, Transport, Application]

Recall
• The transport layer provides end-to-end connectivity across the network
• [Diagram: app/TCP/IP/802.11 on one host, IP-only forwarding at the router, app/TCP/IP/Ethernet on the other host]

Recall (2)
• Segments carry application data across the network
• Segments are carried within packets, which are carried within frames
• [Diagram: application data (e.g., HTTP) inside a TCP segment, inside an IP packet, inside an 802.11 frame]

Transport Layer Services
• Provide different kinds of data delivery across the network to applications
  • Unreliable messages: Datagrams (UDP)
  • Reliable bytestream: Streams (TCP)

Comparison of Internet Transports
• TCP is full-featured; UDP is a glorified packet

TCP (Streams)                                    | UDP (Datagrams)
Connections                                      | Datagrams
Bytes are delivered once, reliably, and in order | Messages may be lost, reordered, or duplicated
Arbitrary-length content                         | Limited message size
Flow control matches sender to receiver          | Can send regardless of receiver state
Congestion control matches sender to network     | Can send regardless of network state

Socket API
• Simple abstraction to use the network
• The "network" API (really the Transport service) used to write all Internet apps
• Part of all major OSes and languages; originally Berkeley (Unix) ~1983
• Supports both Internet transport services (Streams and Datagrams)

Socket API (2)
• Sockets let apps attach to the local network at different ports
• [Diagram: two sockets on one host, each bound to its own port]

Socket API (3)
• Same API used for Streams and Datagrams

Primitive     | Meaning
SOCKET        | Create a new communication endpoint
BIND          | Associate a local address (port) with a socket
LISTEN        | Announce willingness to accept connections (Streams only)
ACCEPT        | Passively establish an incoming connection (Streams only)
CONNECT       | Actively attempt to establish a connection (Streams only)
SEND(TO)      | Send some data over the socket (TO form for Datagrams)
RECEIVE(FROM) | Receive some data over the socket (FROM form for Datagrams)
CLOSE         | Release the socket

Ports
• An application process is identified by the tuple (IP address, transport protocol, port)
• Ports are 16-bit integers representing local "mailboxes" that a process leases
• Servers often bind to "well-known ports"
  • <1024; require administrative privileges
• Clients are often assigned "ephemeral" ports
  • Chosen by the OS, used temporarily

Some Well-Known Ports

Port   | Protocol | Use
20, 21 | FTP      | File transfer
22     | SSH      | Remote login, replacement for Telnet
25     | SMTP     | Email
80     | HTTP     | World Wide Web
110    | POP-3    | Remote email access
143    | IMAP     | Remote email access
443    | HTTPS    | Secure Web (HTTP over SSL/TLS)
554    | RTSP     | Media player control
631    | IPP      | Printer sharing

Topics
• Service models
• Socket API and ports
• Datagrams, Streams
• User Datagram Protocol (UDP)
• Connections (TCP)
• Sliding Window (TCP)
• Flow control (TCP)
• Retransmission timers (TCP)
• Congestion control (TCP)

UDP

User Datagram Protocol (UDP)
• Used by apps that don't want reliability or bytestreams. Like what?
  • Voice-over-IP
  • DNS, RPC
  • DHCP
• (If an application wants reliability and messages, then it has work to do!)
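To make these datagram primitives concrete, here is a minimal sketch of a UDP request/reply exchange using the Berkeley socket API via Python's socket module; the address, port, and message are arbitrary illustration values, and the call sequence mirrors the Datagram Sockets slides that follow.

```python
import socket

SERVER_ADDR = ("127.0.0.1", 9999)        # arbitrary example address and port

def server():
    # SOCKET + BIND: create a datagram endpoint and lease a local port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(SERVER_ADDR)
    data, client = sock.recvfrom(2048)   # RECEIVE(FROM): blocks until a request arrives
    sock.sendto(data, client)            # SEND(TO): reply to whoever asked
    sock.close()                         # CLOSE

def client():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"hello", SERVER_ADDR)   # no connection setup needed for datagrams
    reply, _ = sock.recvfrom(2048)       # blocks until the reply arrives
    print(reply)
    sock.close()
```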
Datagram Sockets
• [Diagram: the client (host 1) sends a request to the server (host 2); the server returns a reply]

Datagram Sockets (2)
• Call sequence over time (* = call blocks):
  • Server (host 2): 1: socket, 2: bind, 3: recvfrom* (request), 6: sendto (reply), 7: close
  • Client (host 1): 1: socket, 4: sendto (request), 5: recvfrom* (reply), 7: close

UDP Buffering
• [Diagram: applications attach to ports; the transport layer (UDP) demultiplexes arriving packets by port into per-port message queues above the network layer (IP)]

UDP Header
• Uses ports to identify the sending and receiving application processes
• Datagram length up to 64K
• Checksum (16 bits) for reliability

UDP Header (2)
• Optional checksum covers the UDP segment and an IP pseudoheader
  • Checks key IP fields (addresses)
  • A value of zero means "no checksum"

TCP

TCP
• TCP consists of 3 primary phases:
  • Connection establishment (setup)
  • Sliding windows / flow control
  • Connection release (teardown)

Connection Establishment
• Both sender and receiver must be ready before we start the transfer of data
  • Need to agree on a set of parameters, e.g., the Maximum Segment Size (MSS)
• This is signaling
  • It sets up state at the endpoints
  • Like "dialing" for a telephone call

Three-Way Handshake
• Used in TCP; opens a connection for data in both directions
• Each side probes the other with a fresh Initial Sequence Number (ISN)
  • Sent on a SYNchronize segment
  • Echoed on an ACKnowledge segment
• Chosen to be robust even against delayed duplicates

Three-Way Handshake (2)
• Three steps (active party = client, passive party = server):
  1. Client sends SYN(x)
  2. Server replies with SYN(y)ACK(x+1)
  3. Client replies with ACK(y+1)
• SYNs are retransmitted if lost
• Sequence and ACK numbers are carried on further segments
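For comparison with the datagram sketch earlier, here is a minimal stream-socket sketch showing where the handshake above happens: connect() on the active (client) side and listen()/accept() on the passive (server) side. The address, port, and payload are arbitrary illustration values.

```python
import socket

SERVER_ADDR = ("127.0.0.1", 8888)        # arbitrary example address and port

def server():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(SERVER_ADDR)
    sock.listen(1)                       # LISTEN: announce willingness to accept connections
    conn, client = sock.accept()         # ACCEPT: passive end of the three-way handshake
    data = conn.recv(2048)
    conn.sendall(data)                   # echo the bytestream back
    conn.close()                         # begins the FIN/ACK release for this connection
    sock.close()

def client():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(SERVER_ADDR)            # CONNECT: sends SYN and completes the handshake
    sock.sendall(b"hello")
    print(sock.recv(2048))
    sock.close()
```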
Three-Way Handshake (3)
• Suppose delayed, duplicate copies of the SYN and ACK arrive at the server!
• Improbable, but anyhow …
• The connection will be cleanly rejected on both sides
• [Diagram: the delayed duplicate SYN and the delayed duplicate ACK each draw a REJECT from the other party]

TCP Connection State Machine
• Captures the states ([]) and transitions (->)
  • A/B means event A triggers the transition, with action B
• Both parties run instances of this state machine

TCP Connections (2)
• Follow the path of the client

TCP Connections (3)
• And the path of the server

TCP Connections (4)
• Again, with states …
• [Diagram: as handshake steps 1-3 complete, the client goes CLOSED → SYN_SENT → ESTABLISHED; the server goes CLOSED → LISTEN → SYN_RCVD → ESTABLISHED]

TCP Connections (5)
• Finite state machines are a useful tool to specify and check the handling of all cases that may occur
• TCP allows for simultaneous open
  • i.e., both sides open, instead of the client-server pattern
  • Try it at home to confirm that it works

Connection Release
• Orderly release by both parties when done
  • Delivers all pending data and "hangs up"
  • Cleans up state in sender and receiver
• The key problem is to provide reliability while releasing
  • TCP uses a "symmetric" close in which both sides shut down independently

TCP Connection Release
• Two steps:
  1. The active side sends FIN(x); the passive side ACKs
  2. The passive side sends FIN(y); the active side ACKs
• FINs are retransmitted if lost
• Each FIN/ACK closes one direction of data transfer

TCP Connection State Machine (again)
• Both parties run instances of the same state machine during release

TCP Release
• Follow the active party

TCP Release (2)
• Follow the passive party

TCP Release (3)
• Again, with states …
• [Diagram: the active party goes ESTABLISHED → FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT → (timeout) → CLOSED; the passive party goes ESTABLISHED → CLOSE_WAIT → LAST_ACK → CLOSED]

TIME_WAIT State
• Wait a long time after sending all segments and before completing the close
  • Two times the maximum segment lifetime of 60 seconds
• Why?
  • The final ACK might have been lost, in which case the FIN will be resent for an orderly close
  • Lingering segments could otherwise interfere with a subsequent connection

Flow Control

Recall
• ARQ with one message at a time is Stop-and-Wait (normal case below)
• [Diagram: the sender transmits Frame 0, waits up to a timeout for ACK 0, then sends Frame 1 and waits for ACK 1]

Limitation of Stop-and-Wait
• It allows only a single message to be outstanding from the sender:
  • Fine for a LAN (only one frame fits in the network anyhow)
  • Not efficient for network paths with BD >> 1 packet

Limitation of Stop-and-Wait (2)
• Example: R = 1 Mbps, D = 50 ms, 10 kb packets
• RTT (Round-Trip Time) = 2D = 100 ms
• How many packets/sec? What if R = 10 Mbps?
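A quick worked answer to the question above, counting one transmission plus one RTT per packet (a sketch; only the example's numbers are used):

```python
# Stop-and-Wait throughput for the example above.
D = 0.050            # one-way delay (s)
packet = 10e3        # packet size (bits)

for R in (1e6, 10e6):                    # link rates (bps)
    cycle = packet / R + 2 * D           # transmit one packet, then wait one RTT for its ACK
    pps = 1 / cycle
    print(f"R = {R/1e6:g} Mbps: {pps:.1f} packets/s, "
          f"throughput ≈ {pps * packet / 1e3:.0f} kbps, "
          f"utilization ≈ {pps * packet / R:.1%}")

# R = 1 Mbps:  ~9.1 packets/s, ~91 kbps, ~9% of the link used
# R = 10 Mbps: ~9.9 packets/s, ~99 kbps, ~1% of the link used
```

The faster link barely helps: Stop-and-Wait is limited to roughly one packet per RTT, which is exactly the motivation for the sliding window that follows.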
Sliding Window
• Generalization of Stop-and-Wait
  • Allows W packets to be outstanding
  • Can send W packets per RTT (= 2D)
• Pipelining improves performance
• Need W = 2BD to fill the network path

Sliding Window (2)
• What W will use the network capacity?
• Assume 10 kb packets
• Ex: R = 1 Mbps, D = 50 ms
• Ex: What if R = 10 Mbps?

Sliding Window (3)
• Ex: R = 1 Mbps, D = 50 ms
  • 2BD = 10^6 b/s × 100×10^-3 s = 100 kbit
  • W = 2BD = 10 packets of 1250 bytes
• Ex: What if R = 10 Mbps?
  • 2BD = 1000 kbit
  • W = 2BD = 100 packets of 1250 bytes

Sliding Window Protocol
• Many variations, depending on how buffers, acknowledgements, and retransmissions are handled
• Go-Back-N
  • Simplest version, can be inefficient
• Selective Repeat
  • More complex, better performance

Sliding Window – Sender
• The sender buffers up to W segments until they are acknowledged
  • LFS = LAST FRAME SENT, LAR = LAST ACK RECEIVED
  • Sends while LFS – LAR < W
• [Diagram: sequence-number line with a window of W = 5; segments at or below LAR are acked, segments between LAR and LFS are unacked, later numbers are available or unavailable]

Sliding Window – Sender (2)
• Transport accepts another segment of data from the application …
• Transport sends it (as LFS – LAR < 5)

Sliding Window – Sender (3)
• The next higher ACK arrives from the peer …
  • The window advances and a buffer is freed
  • LFS – LAR < 5 again (can send one more)

Sliding Window – Go-Back-N
• The receiver keeps only a single packet buffer for the next segment
  • State variable LAS = LAST ACK SENT
• On receive:
  • If the sequence number is LAS+1, accept it, pass it to the app, update LAS, send an ACK
  • Otherwise discard it (as out of order)

Sliding Window – Selective Repeat
• The receiver passes data to the app in order, and buffers out-of-order segments to reduce retransmissions
• The ACK conveys the highest in-order segment, plus hints about out-of-order segments
• TCP uses a selective repeat design; we'll see the details later

Sliding Window – Selective Repeat (2)
• Buffers W segments, keeps state variable LAS = LAST ACK SENT
• On receive:
  • Buffer segments in [LAS+1, LAS+W]
  • Pass in-order segments from LAS+1 to the app, and update LAS
  • Send an ACK for LAS regardless

Sliding Window – Selective Retransmission (3)
• Keep the normal sliding window
• If something arrives out of order
  • Send the last unacked packet (LAR+1) again!

Sliding Window – Selective Retransmission (4)
• Keep the normal sliding window
• If the correct packet arrives, move the window and LAR, and send more messages
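To make the two receiver disciplines concrete, here is a minimal sketch of each; sequence numbers are assumed to start at 1 and `deliver` stands in for handing data to the application.

```python
class GoBackNReceiver:
    """Keeps a single state variable LAS; out-of-order segments are discarded."""
    def __init__(self):
        self.las = 0                       # last ACK sent (numbering assumed to start at 1)

    def on_segment(self, seq, data, deliver):
        if seq == self.las + 1:            # exactly the next expected segment
            deliver(data)
            self.las = seq
        return self.las                    # cumulative ACK to send either way


class SelectiveRepeatReceiver:
    """Buffers up to W out-of-order segments in [LAS+1, LAS+W]."""
    def __init__(self, w):
        self.w = w
        self.las = 0
        self.buf = {}                      # seq -> data for out-of-order arrivals

    def on_segment(self, seq, data, deliver):
        if self.las < seq <= self.las + self.w:
            self.buf[seq] = data
        while self.las + 1 in self.buf:    # pass any in-order run to the app
            deliver(self.buf.pop(self.las + 1))
            self.las += 1
        return self.las                    # ACK for LAS regardless
```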
Sliding Window – Retransmissions
• Go-Back-N uses a single timer to detect losses
  • On timeout, it resends the buffered packets starting at LAR+1
• Selective Repeat uses a timer per unacked segment to detect losses
  • On a timeout for a segment, resend just that segment
  • The hope is to resend fewer segments

Sequence Numbers
• Need more than 0/1 for Stop-and-Wait … but how many?
  • For Selective Repeat, need W numbers for packets plus W for ACKs of earlier packets → 2W sequence numbers
  • Fewer for Go-Back-N (W+1)
• Typically implement sequence numbers with an N-bit counter that wraps around at 2^N – 1
  • E.g., N = 8: …, 253, 254, 255, 0, 1, 2, 3, …

Sequence Time Plot
• [Plot: sequence number vs. time; transmissions (at the sender) and ACKs (at the receiver) form parallel lines separated by a delay of RTT/2]

Sequence Time Plot (2)
• [Plot: the same picture in a Go-Back-N scenario]

Sequence Time Plot (3)
• [Plot: a loss leads to a timeout followed by retransmissions]

ACK Clocking

Sliding Window ACK Clock
• Each in-order ACK advances the sliding window and lets a new segment enter the network
• ACKs "clock" data segments
• [Diagram: data segments 11-20 flow toward the receiver while ACKs 1-10 return to the sender]

Benefit of ACK Clocking
• Consider what happens when a sender injects a burst of segments into the network
• [Diagram: fast link → queue → slow (bottleneck) link → fast link]

Benefit of ACK Clocking (2)
• Segments are buffered and spread out on the slow link
• [Diagram: segments "spread out" as they cross the bottleneck]

Benefit of ACK Clocking (3)
• ACKs maintain the spread back to the original sender

Benefit of ACK Clocking (4)
• The sender clocks new segments with the spread
• Now it is sending at the bottleneck rate without queuing!

Benefit of ACK Clocking (5)
• Helps run with low levels of loss and delay!
• The network smooths out the burst of data segments
• The ACK clock transfers this smooth timing back to the sender
• Subsequent data segments are not sent in bursts, so they do not queue up in the network

TCP Uses ACK Clocking
• TCP uses a sliding window because of the value of ACK clocking
• The sliding window controls how many segments are inside the network
• TCP sends only small bursts of segments to let the network keep the traffic smooth

Problem
• The sliding window uses pipelining to keep the network busy
• What if the receiver is overloaded?
• [Cartoon: "Big Iron" streaming video to a wee mobile]

Sliding Window – Receiver
• Consider a receiver with W buffers
  • LAS = LAST ACK SENT; the app pulls in-order data from the buffer with recv() calls
• [Diagram: receive window of W = 5 starting at LAS+1; earlier numbers are finished, in-window numbers are acceptable, later numbers are too high]

Sliding Window – Receiver (2)
• Suppose the next two segments arrive but the app does not call recv()
• LAS rises, but we can't slide the window!
Sliding Window – Receiver (3)
• Further segments arrive (in order) and fill the buffer
• Must drop segments until the app recvs!
• [Diagram: the window is full of acked segments; nothing more is acceptable]

Sliding Window – Receiver (4)
• The app's recv() takes two segments
• The window slides (phew)

Flow Control
• Avoid loss at the receiver by telling the sender the available buffer space
  • WIN = #Acceptable, not W (counted from LAS)

Flow Control (2)
• The sender uses the lower of the sliding window and the flow-control window (WIN) as the effective window size

Flow Control (3)
• TCP-style example
  • SEQ/ACK sliding window
  • Flow control with WIN
  • SEQ + length < ACK + WIN
  • 4 KB buffer at the receiver
  • Circular buffer of bytes

Topic
• How to set the timeout for sending a retransmission
  • Adapting to the network path
• [Diagram: a segment disappears somewhere in the network. Lost?]

Retransmissions
• With a sliding window, losses are detected with a timeout
  • Set a timer when a segment is sent
  • Cancel the timer when its ACK is received
  • If the timer fires, retransmit the data as lost

Timeout Problem
• The timeout should be "just right"
  • Too long wastes network capacity
  • Too short leads to spurious resends
  • But what is "just right"?
• Easy to set on a LAN (Link)
  • Short, fixed, predictable RTT
• Hard on the Internet (Transport)
  • Wide range, variable RTT

Example of RTTs
• [Plot: round-trip time (ms) vs. time (s) for a BCN→SEA→BCN path]

Example of RTTs (2)
• [Same plot, annotated: the floor is the propagation (+ transmission) delay ≈ 2D; the variation above it is due to queuing at routers, changes in network paths, etc.]

Example of RTTs (3)
• [Same plot, annotated: a fixed timer set too high or too low both fail; we need to adapt to the network conditions]

Adaptive Timeout
• Keep smoothed estimates of (1) the RTT and (2) the variation in RTT
• Update the estimates with a moving average:
  1. SRTT_(N+1) = 0.9 × SRTT_N + 0.1 × RTT_(N+1)
  2. Svar_(N+1) = 0.9 × Svar_N + 0.1 × |RTT_(N+1) − SRTT_(N+1)|
• Set the timeout to a multiple of the estimates, to estimate the upper RTT in practice:
  • TCP Timeout_N = SRTT_N + 4 × Svar_N
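A minimal sketch of this estimator; the 0.9/0.1 gains and the factor of 4 are the values on the slide, while the starting value for Svar is an assumption (real stacks also clamp and round the result):

```python
class RtoEstimator:
    """Adaptive retransmission timeout from smoothed RTT and RTT variation."""
    def __init__(self, first_rtt):
        self.srtt = first_rtt
        self.svar = first_rtt / 2          # assumed initial variation estimate

    def update(self, rtt_sample):
        self.srtt = 0.9 * self.srtt + 0.1 * rtt_sample
        self.svar = 0.9 * self.svar + 0.1 * abs(rtt_sample - self.srtt)
        return self.timeout()

    def timeout(self):
        return self.srtt + 4 * self.svar

# Example: feed in a few RTT samples (seconds) and watch the timeout adapt.
est = RtoEstimator(0.100)
for sample in (0.110, 0.300, 0.120, 0.105):
    print(f"RTT {sample*1000:.0f} ms -> timeout {est.update(sample)*1000:.0f} ms")
```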
Example of Adaptive Timeout
• [Plot: RTT samples over time, with the SRTT and Svar estimates tracking them]

Example of Adaptive Timeout (2)
• [Same plot with the timeout (SRTT + 4×Svar) overlaid; only one early timeout occurs]

Adaptive Timeout (2)
• Simple to compute, and does a good job of tracking the actual RTT
  • Little "headroom" to lower it
  • Yet very few early timeouts
• Turns out to be important for good performance and robustness

Congestion

TCP to date:
• We can set up a connection (connection establishment)
• Tear down a connection (connection release)
• Keep the sending and receiving buffers from overflowing (flow control)
• What's missing?

Network Congestion
• A "traffic jam" in the network
• Later we will learn how to control it
• [Cartoon: packets stuck in a jam. What's the hold-up?]

Congestion Collapse in the 1980s
• Early TCP used a fixed-size window (e.g., 8 packets)
  • Initially fine for reliability
• But something happened as the ARPANET grew
  • Links stayed busy, but transfer rates fell by orders of magnitude!

Nature of Congestion
• Routers/switches have internal buffering
• [Diagram: input buffers, switching fabric, and output buffers inside a router]

Nature of Congestion (2)
• Simplified view: per-port output queues
• Typically FIFO (First In, First Out); discard when full
• [Diagram: router modeled as a FIFO queue of packets per output port]

Nature of Congestion (3)
• Queues help by absorbing bursts when the input rate > output rate
• But if input > output rate persistently, the queue will overflow
  • This is congestion
• Congestion is a function of the traffic patterns; it can occur even if every link has the same capacity

Effects of Congestion
• What happens to performance as we increase load?

Effects of Congestion (2)
• As offered load rises, congestion occurs as queues begin to fill:
  • Delay and loss rise sharply with more load
  • Throughput falls below load (due to loss)
  • Goodput may fall below throughput (due to spurious retransmissions)
• None of the above is good!
  • We want network performance just before congestion

Van Jacobson (born 1950)
• Widely credited with saving the Internet from congestion collapse in the late 80s
  • Introduced congestion control principles
  • Practical solutions (TCP Tahoe/Reno)
• Much other pioneering work:
  • Tools like traceroute, tcpdump, pathchar
  • IP header compression, multicast tools

TCP Tahoe/Reno
• TCP extensions and features we will study:
  • AIMD
  • Fair Queuing
  • Slow-start
  • Fast Retransmission
  • Fast Recovery
TCP Timeline
• Pre-history:
  • Origins of "TCP" (Cerf & Kahn, '74)
  • 3-way handshake (Tomlinson, '75)
  • TCP and IP (RFC 791/793, '81)
  • TCP/IP "flag day" (BSD Unix 4.2, '83)
• Congestion control:
  • Congestion collapse observed, '86
  • TCP Tahoe (Jacobson, '88)
  • TCP Reno (Jacobson, '90)

TCP Timeline (2)
• Classic congestion control:
  • TCP Reno (Jacobson, '90)
  • TCP Vegas (Brakmo, '93), delay based
  • ECN (Floyd, '94), router support
  • TCP New Reno (Hoe, '95)
  • TCP with SACK (Floyd, '96)
• Diversification:
  • TCP BIC (Linux, '04)
  • FAST TCP (Low et al., '04), delay based
  • TCP CUBIC (Linux, '06)
  • Compound TCP (Windows, '07), delay based
  • TCP LEDBAT (IETF, '08), background transfers

Bandwidth Allocation
• An important task for the network is to allocate its capacity to senders
  • A good allocation is both efficient and fair
• Efficient means most capacity is used, but there is no congestion
• Fair means every sender gets a reasonable share of the network

Bandwidth Allocation (2)
• Key observation:
  • In an effective solution, the Transport and Network layers must work together
• The Network layer witnesses congestion
  • Only it can provide direct feedback
• The Transport layer causes congestion
  • Only it can reduce the offered load

Bandwidth Allocation (3)
• Why is it hard? (Just split equally!)
  • The number of senders and their offered load changes
  • Senders may lack capacity in different parts of the network
  • The network is distributed; no single party has an overall picture of its state

Bandwidth Allocation (4)
• Solution context:
  • Senders adapt concurrently based on their own view of the network
  • Design this adaptation so that network usage as a whole is efficient and fair
  • Adaptation is continuous, since offered loads continue to change over time

Fair Allocations

Fair Allocation
• What's a "fair" bandwidth allocation?
  • The max-min fair allocation

Recall
• We want a good bandwidth allocation to be both fair and efficient
  • Now we learn what fair means
• Caveat: in practice, efficiency is more important than fairness

Efficiency vs. Fairness
• Cannot always have both!
• Example network with traffic A→B, B→C, and A→C, where each link has capacity 1
• How much traffic can we carry?

Efficiency vs. Fairness (2)
• If we care about fairness:
  • Give equal bandwidth to each flow
  • A→B: ½ unit, B→C: ½, and A→C: ½
  • Total traffic carried is 1½ units

Efficiency vs. Fairness (3)
• If we care about efficiency:
  • Maximize the total traffic in the network
  • A→B: 1 unit, B→C: 1, and A→C: 0
  • Total traffic rises to 2 units!
The Slippery Notion of Fairness
• Why is "equal per flow" fair anyway?
  • A→C uses more network resources than A→B or B→C
  • Host A sends two flows, B sends one
• It is not productive to seek exact fairness
  • More important to avoid starvation (a sender that cannot use any bandwidth)
  • "Equal per flow" is good enough

Generalizing "Equal per Flow"
• The bottleneck for a flow of traffic is the link that limits its bandwidth
  • Where congestion occurs for the flow
  • For A→C, link A-B is the bottleneck
• [Diagram: link A-B has capacity 1, link B-C has capacity 10]

Generalizing "Equal per Flow" (2)
• Flows may have different bottlenecks
  • For A→C, link A-B is the bottleneck
  • For B→C, link B-C is the bottleneck
• Can no longer divide links equally …

Max-Min Fairness
• Intuitively, flows bottlenecked on a link get an equal share of that link
• A max-min fair allocation is one in which a flow's rate cannot be increased without decreasing the rate of a flow with a smaller (or equal) rate
  • This "maximizes the minimum" flow

Max-Min Fairness (2)
• To find it for a given network, imagine "pouring water into the network":
  1. Start with all flows at rate 0
  2. Increase the flows until there is a new bottleneck in the network
  3. Hold fixed the rate of the flows that are bottlenecked
  4. Go to step 2 for any remaining flows

Max-Min Example
• Example: a network with 4 flows, link bandwidth = 1
• What is the max-min fair allocation?

Max-Min Example (2)
• When the rate is 1/3, flows B, C, and D bottleneck link R4-R5
• Fix B, C, and D; continue to increase A

Max-Min Example (3)
• When the rate is 2/3, flow A bottlenecks link R2-R3. Done.

Max-Min Example (4)
• End with A = 2/3; B, C, D = 1/3; links R2-R3 and R4-R5 full
• Other links have extra capacity that can't be used
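A small sketch of the "pouring water" procedure above, applied to a topology assumed to match the 4-flow example (flow paths and link names here are illustrative):

```python
def max_min_fair(capacity, flows):
    """capacity: {link: capacity}; flows: {flow: [links it crosses]}.
    Progressive filling: grow all unfrozen flows equally until some link
    saturates, freeze the flows on it, and repeat."""
    rate = {f: 0.0 for f in flows}
    frozen = set()
    while len(frozen) < len(flows):
        # For each link, how much more can each unfrozen flow crossing it grow?
        increments = []
        for link, cap in capacity.items():
            active = [f for f in flows if link in flows[f] and f not in frozen]
            if active:
                used = sum(rate[f] for f in flows if link in flows[f])
                increments.append(((cap - used) / len(active), link, active))
        if not increments:
            break
        inc, link, active = min(increments)    # the next link to saturate
        for f in flows:
            if f not in frozen:
                rate[f] += inc
        frozen.update(active)                  # flows on the new bottleneck stop growing
    return rate

# The 4-flow example from the slides (topology assumed): B, C, D share one
# bottleneck link and stop at 1/3; A then grows to 2/3 on its own bottleneck.
caps = {"R2-R3": 1.0, "R4-R5": 1.0}
flows = {"A": ["R2-R3"], "B": ["R2-R3", "R4-R5"], "C": ["R4-R5"], "D": ["R4-R5"]}
print(max_min_fair(caps, flows))   # ≈ {'A': 0.667, 'B': 0.333, 'C': 0.333, 'D': 0.333}
```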
Adapting over Time
• The allocation changes as flows start and stop

Adapting over Time (2)
• [Plot: Flow 1 slows when Flow 2 starts and speeds up when Flow 2 stops; Flow 3's limit is elsewhere]

Bandwidth Allocation

Recall
• We want to allocate capacity to senders
  • The network layer provides feedback
  • The transport layer adjusts the offered load
  • A good allocation is efficient and fair
• How should we perform the allocation?
  • Several different possibilities …

Bandwidth Allocation Models
• Open loop versus closed loop
  • Open: reserve bandwidth before use
  • Closed: use feedback to adjust rates
• Host versus network support
  • Who sets/enforces the allocations?
• Window versus rate based
  • How is the allocation expressed?
• TCP is closed loop, host-driven, and window-based

Bandwidth Allocation Models (2)
• We'll look at closed-loop, host-driven, window-based allocation too
• The network layer returns feedback on the current allocation to senders
  • For TCP, the signal is "a packet was dropped"
• The transport layer adjusts the sender's behavior via its window in response
  • How senders adapt is a control law

Additive Increase Multiplicative Decrease (AIMD)
• AIMD is a control law hosts can use to reach a good allocation
  • Hosts additively increase their rate while the network is not congested
  • Hosts multiplicatively decrease their rate when congestion occurs
  • Used by TCP
• Let's explore the AIMD game …

AIMD Game
• Hosts 1 and 2 share a bottleneck
  • But do not talk to each other directly
• The router provides binary feedback
  • It tells the hosts whether the network is congested
• [Diagram: Host 1 and Host 2 each feed a unit-capacity link into a shared unit-capacity bottleneck toward the rest of the network]

AIMD Game (2)
• Each point is a possible allocation
• [Plot: Host 1's rate vs. Host 2's rate; allocations above x + y = 1 are congested; the fair line is y = x, the efficient line is x + y = 1, and their intersection is the optimal allocation]

AIMD Game (3)
• AI and MD move the allocation
  • Additive Increase adds the same amount to both rates (moves parallel to the fair line)
  • Multiplicative Decrease scales both rates down (moves toward the origin)

AIMD Game (4)
• Play the game from some starting point!

AIMD Game (5)
• It always converges to a good allocation!

AIMD Sawtooth
• Produces a "sawtooth" pattern over time for the rate of each host
  • Additive increase ramps up until congestion, multiplicative decrease cuts back, repeat
• This is the TCP sawtooth (later)

AIMD Properties
• Converges to an allocation that is efficient and fair when hosts run it
  • Holds for more general topologies
• Other increase/decrease control laws do not! (Try AIAD, MIMD, MIAD)
• Requires only binary feedback from the network

Feedback Signals
• Several possible signals, with different pros and cons
• We'll look at classic TCP, which uses packet loss as a signal

Signal            | Example Protocol                           | Pros / Cons
Packet loss       | TCP NewReno, Cubic TCP (Linux)             | Hard to get wrong; hear about congestion late
Packet delay      | TCP BBR (YouTube)                          | Hear about congestion early; need to infer congestion
Router indication | TCPs with Explicit Congestion Notification | Hear about congestion early; requires router support
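Before turning to slow start, here is a sketch of the AIMD control law above as a per-RTT update; the increase of one packet and the decrease factor of ½ are TCP's usual choices, and the capacity in the demo loop is an arbitrary illustration:

```python
def aimd_step(cwnd, congested, increase=1.0, decrease=0.5):
    """One RTT of the AIMD control law: additive increase while the network is
    not congested, multiplicative decrease when the binary feedback says it is."""
    if congested:
        return max(1.0, cwnd * decrease)
    return cwnd + increase

# Example: the sawtooth emerges if congestion is signaled whenever cwnd exceeds
# an (assumed) capacity of 20 packets in flight.
cwnd, capacity = 1.0, 20
trace = []
for rtt in range(60):
    cwnd = aimd_step(cwnd, congested=(cwnd > capacity))
    trace.append(round(cwnd, 1))
print(trace)   # ramps to ~21, halves to ~10.5, ramps again: the TCP sawtooth
```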
Slow Start (TCP Additive Increase)

Practical AIMD
• We want TCP to follow an AIMD control law for a good allocation
• The sender uses a congestion window, or cwnd, to set its rate (≈ cwnd/RTT)
• The sender uses loss as the network congestion signal
• TCP needs to work across a very large range of rates and RTTs

TCP Startup Problem
• We want to quickly get near the right rate, cwnd_IDEAL, but it varies greatly
  • A fixed sliding window doesn't adapt and is rough on the network (loss!)
  • Additive Increase with small bursts adapts cwnd gently to the network, but might take a long time to become efficient

Slow-Start Solution
• Start by doubling cwnd every RTT
  • Exponential growth (1, 2, 4, 8, 16, …)
  • Start slow, quickly reach large values
• [Plot: window (cwnd) vs. time; slow-start overtakes both a fixed window and plain AI]

Slow-Start Solution (2)
• Eventually packet loss will occur when the network is congested
  • A loss timeout tells us cwnd is too large
  • Next time, switch to AI beforehand
  • Slowly adapt cwnd near the right value
• In terms of cwnd:
  • Expect loss for cwnd_C ≈ 2BD + queue
  • Use ssthresh = cwnd_C / 2 to switch to AI

Slow-Start Solution (3)
• Combined behavior, after the first time
  • Most time is spent near the right value
• [Plot: cwnd vs. time; slow-start up to ssthresh, then an AI phase approaching cwnd_IDEAL, with cwnd_C above]

Slow-Start (Doubling) Timeline
• Increment cwnd by 1 packet for each ACK

Additive Increase Timeline
• Increment cwnd by 1 packet every cwnd ACKs (or 1 RTT)

TCP Tahoe (Implementation)
• Initial slow-start (doubling) phase
  • Start with cwnd = 1 (or a small value)
  • cwnd += 1 packet per ACK
• Later, Additive Increase phase
  • cwnd += 1/cwnd packets per ACK
  • Roughly adds 1 packet per RTT
• Switching threshold (initially infinity)
  • Switch to AI when cwnd > ssthresh
  • Set ssthresh = cwnd/2 after a loss
  • Begin with slow-start after a timeout

Timeout Misfortunes
• Why do a slow-start after a timeout, instead of MD on cwnd (as AIMD would)?
  • Timeouts are sufficiently long that the ACK clock will have run down
  • Slow-start ramps the ACK clock back up
• We need to detect loss before a timeout to get to full AIMD
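A sketch of the Tahoe rules just listed, per ACK and per timeout; cwnd and ssthresh are in packets, and the starting values are the ones on the slide:

```python
class TahoeWindow:
    """TCP Tahoe congestion window: slow start, then additive increase."""
    def __init__(self):
        self.cwnd = 1.0
        self.ssthresh = float("inf")       # switching threshold, initially infinity

    def on_ack(self):
        if self.cwnd < self.ssthresh:      # slow start: +1 per ACK doubles cwnd every RTT
            self.cwnd += 1.0
        else:                              # additive increase: ~1 packet added per RTT
            self.cwnd += 1.0 / self.cwnd

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2      # remember roughly where loss happened
        self.cwnd = 1.0                    # restart with slow start (the ACK clock is gone)
```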
Fast Recovery (TCP Multiplicative Decrease)

Practical AIMD (2)
• We want TCP to follow an AIMD control law for a good allocation
  • The sender uses a congestion window, or cwnd, to set its rate (≈ cwnd/RTT)
  • The sender uses slow-start to ramp up the ACK clock, followed by Additive Increase
• But after a timeout, the sender slow-starts again with cwnd = 1 (as it has no ACK clock)

Inferring Loss from ACKs
• TCP uses a cumulative ACK
  • It carries the highest in-order sequence number
  • Normally a steady advance
• Duplicate ACKs give us hints about what data hasn't arrived
  • They tell us some new data did arrive, but it was not the next segment
  • Thus the next segment may be lost

Fast Retransmit
• Treat three duplicate ACKs as a loss
  • Retransmit the next expected segment
  • Some repetition allows for reordering, but still detects loss quickly

Fast Retransmit (2)
• [Timeline: data 14 was lost earlier, but 15-20 arrived, so the receiver keeps ACKing 13; on the third duplicate ACK 13 the sender retransmits 14; the retransmission fills the hole and the ACK jumps to 20]

Fast Retransmit (3)
• It can repair a single segment loss quickly, typically before a timeout
• However, there is quiet time at the sender/receiver while waiting for the ACK to jump
• And we still need to MD cwnd …

Inferring Non-Loss from ACKs
• Duplicate ACKs also give us hints about what data has arrived
  • Each new duplicate ACK means that some new segment has arrived
  • It will be one of the segments after the loss
  • Thus advancing the sliding window will not increase the number of segments stored in the network

Fast Recovery
• First fast retransmit, and MD cwnd
• Then pretend further duplicate ACKs are the expected ACKs
  • This lets new segments be sent for those ACKs
  • Reconcile views when the ACK jumps

Fast Recovery (2)
• [Timeline: on the third duplicate ACK 13, the sender retransmits 14 and sets ssthresh, cwnd = cwnd/2; further duplicate ACKs advance the window, so segments 21, 22, … may be sent before the ACK jumps to 20 and fast recovery is exited]

Fast Recovery (3)
• With fast retransmit, it repairs a single segment loss quickly and keeps the ACK clock running
• This allows us to realize AIMD
  • No timeouts or slow-start after a loss, just continue with a smaller cwnd
• TCP Reno combines slow-start, fast retransmit, and fast recovery
  • Multiplicative Decrease is ½

TCP Reno
• [Plot: the TCP sawtooth, with an MD of ½, no slow-start, and the ACK clock running]

TCP Reno, NewReno, and SACK
• Reno can repair one loss per RTT
  • Multiple losses cause a timeout
• NewReno further refines the ACK heuristics
  • Repairs multiple losses without a timeout
• Selective ACK (SACK) is a better idea
  • The receiver sends ACK ranges so the sender can retransmit without guesswork

TCP CUBIC
• The standard TCP stack in Linux (since kernel 2.6.19) and Windows (since 10 version 1709)
• The Internet has grown to have more long-distance, high-bandwidth connections
• Seeks to resolve two key problems with "standard" TCP:
  • Flows with lower RTTs "grow" faster than those with higher RTTs
  • Flows grow too "slowly" (linearly) after congestion

TCP CUBIC (2)
1) When a congestion event is experienced, the window size at that instant is recorded as Wmax, the maximum window size.
2) Wmax is set as the inflection point of the cubic function that will govern the growth of the congestion window.
3) Transmission then restarts with a smaller window (reduced by roughly 20%); if no congestion is experienced, the window grows according to the concave portion of the cubic function (its cadence depends on the time since the loss, not on received ACKs).
4) As the window approaches Wmax, the increments slow down.
5) Around the inflection point Wmax, the window increases only very slowly.
6) Finally, if the network is still not experiencing any congestion, the window size continues to increase according to the convex portion of the function.
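For reference, the cubic growth function behind steps 2-6, as published in the CUBIC paper (Ha, Rhee, and Xu), is sketched below; C ≈ 0.4 and β ≈ 0.2 are the constants the paper uses (implementations tune them), and t is the time since the last congestion event:

```python
def cubic_window(t, w_max, C=0.4, beta=0.2):
    """CUBIC congestion-window target at time t (seconds) after a loss.
    The curve is concave up to the inflection point at w_max (reached at t = K),
    then convex beyond it; growth depends on elapsed time, not on the ACK rate."""
    K = ((w_max * beta) / C) ** (1.0 / 3.0)   # time at which the window returns to w_max
    return C * (t - K) ** 3 + w_max

# Example: after a loss with w_max = 100 packets, the window restarts near
# (1 - beta) * w_max = 80, climbs back toward 100, then probes beyond it.
for t in (0, 1, 2, 3, 4, 5):
    print(t, round(cubic_window(t, 100), 1))
```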
TCP CUBIC (3)
• [Plot: the CUBIC window curve, with concave growth toward Wmax, a plateau at the inflection point, then convex probing beyond it]

TCP CUBIC vs Everyone
• [Plot: CUBIC's window growth compared with other TCP variants]

TCP BBR
• Bottleneck Bandwidth and Round-trip propagation time
• Developed at Google in 2016, primarily for YouTube traffic
• Attempts to solve the "bufferbloat" problem
• "Model-based" (like Vegas) rather than "loss-based" (like CUBIC)
  • Measure RTT, latency, and bottleneck bandwidth
  • Use this model to predict the window size

Bufferbloat
• Larger queues are better than smaller queues, right?

Bufferbloat (2)
• Given TCP's loss semantics …
• Performance can decrease as buffer size is increased
• Consider a full buffer:
  • New packets arrive and are dropped ("tail drop")
  • The SACK reporting the drop doesn't arrive until the entire buffer has been sent

TCP BBR (2)
• BBR has 4 distinct phases:
  1. Startup: basically identical to CUBIC's startup. Grow exponentially until RTTs start to increase (instead of until a packet is dropped). Set cwnd.
  2. Drain: startup filled a queue; temporarily reduce the sending rate (the "pacing gain")
  3. Probe Bandwidth: increase the sending rate to see if there's more capacity; if not, drain again
  4. Probe RTT: reduce the rate dramatically (to 4 packets) to measure RTT; use this as the baseline for the above

TCP BBR vs Everyone
• [Plot: BBR's behavior compared with other TCP variants]

Network-Side Congestion Control

Congestion Avoidance vs. Control
• Classic TCP drives the network into congestion and then recovers
  • It needs to see loss to slow down
• It would be better to use the network but avoid congestion altogether!
  • Reduces loss and delay
• But how can we do this?

Feedback Signals
• Delay and router signals can let us avoid congestion

Signal            | Example Protocol                           | Pros / Cons
Packet loss       | Classic TCP, Cubic TCP (Linux)             | Hard to get wrong; hear about congestion late
Packet delay      | TCP BBR (YouTube)                          | Hear about congestion early; need to infer congestion
Router indication | TCPs with Explicit Congestion Notification | Hear about congestion early; requires router support

ECN (Explicit Congestion Notification)
• The router detects the onset of congestion via its queue
  • When congested, it marks affected packets (in the IP header)

ECN (2)
• Marked packets arrive at the receiver and are treated like loss
  • The TCP receiver reliably informs the TCP sender of the congestion

ECN (3)
• Advantages:
  • Routers deliver a clear signal to hosts
  • Congestion is detected early, with no loss
  • No extra packets need to be sent
• Disadvantages:
  • Routers and hosts must be upgraded (currently ~1%)
  • More work at the router
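A toy sketch of the marking idea, not a real router or kernel implementation; the queue threshold, capacity, and field names are illustrative assumptions:

```python
def router_enqueue(queue, packet, capacity=100, mark_threshold=50):
    """Toy ECN router: mark (rather than drop) packets once the queue passes a
    threshold; drop only when the queue is actually full."""
    if len(queue) >= capacity:
        return False                       # forced tail drop
    if len(queue) >= mark_threshold:
        packet["ce"] = True                # set Congestion Experienced in the IP header
    queue.append(packet)
    return True

def sender_on_ack(cwnd, ack_ece):
    """The sender treats an echoed ECN mark like a loss signal: multiplicative
    decrease, but with no retransmission, since nothing was actually dropped."""
    return max(1.0, cwnd / 2) if ack_ece else cwnd + 1.0 / cwnd
```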
Random Early Detection (RED)
• Jacobson (again!) and Floyd
• Alternative idea: instead of marking packets, drop them
  • We know the hosts are using TCP; make use of that fact
  • Signals congestion to the sender, but without adding headers or doing packet inspection
• Drop at random, with a probability that depends on the queue size
  • If the queue is empty, always accept the packet
  • If the queue is full, always drop
  • As the queue approaches full, increase the likelihood of a drop
  • Example: 1 queue slot left and 10 packets expected, so a 90% chance of drop

RED (Random Early Detection) (2)
• The router detects the onset of congestion via its queue
  • Prior to congestion, it drops a packet to signal the sender

RED (Random Early Detection) (3)
• The sender enters MD and slows its packet flow
• We shed load, and everyone is happy
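A simplified sketch of a RED-style drop decision; real RED works on an exponentially weighted average of the queue length with separate min/max thresholds, so the linear ramp and constants here are only illustrative:

```python
import random

def red_should_drop(queue_len, capacity):
    """Simplified RED: accept everything when the queue is empty, always drop
    when it is full, and in between drop with a probability that grows with
    queue occupancy (a linear ramp here; real RED averages the queue length
    and uses a min/max threshold pair)."""
    if queue_len <= 0:
        return False
    if queue_len >= capacity:
        return True
    return random.random() < queue_len / capacity

# With 9 of 10 slots occupied (1 slot left), an arriving packet is dropped
# with probability 0.9, matching the example on the slide.
print(sum(red_should_drop(9, 10) for _ in range(10000)) / 10000)   # ≈ 0.9
```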