The Network Stack (1)
Lecture 5, Part 2: Network Stack Implementation
Prof. Robert N. M. Watson
2021-2022

Memory flow in hardware
• Key idea: follow the memory
• Historically, memory copying was avoided due to instruction count
• Today, memory copying is avoided due to cache footprint
• Recent Intel CPUs push and pull DMA via the LLC (“DDIO”)
• If we differentiate ‘send’ vs. ‘transmit’ and ‘receive’ vs. ‘deliver’, is this a good idea?
• … it depends on the latency between DMA and processing
[Figure: NIC DMA over PCI into the last-level cache (DDIO) and DRAM; per-core L1 (32K, 3-4 cycles) and L2 (256K, 8-12 cycles) caches, shared LLC (25M, 32-40 cycles), DRAM (up to 256-290 cycles)]

Memory flow in software
• Socket API implies one software-driven copy to/from user memory
• Historically, zero-copy VM tricks for the socket API have proven ineffective
• Network buffers cycle through the slab allocator (see the receive-path sketch below)
  • Receive: allocate in NIC driver, free in socket layer
  • Transmit: allocate in socket layer, free in NIC driver
• DMA performs a second copy; can affect cache/memory bandwidth
• NB: what if the packet-buffer working set is larger than the cache?
[Figure: receive path: DMA receive, NIC receive, socket/protocol deliver, copyout() to recv(); transmit path: send(), copyin(), socket/protocol send, NIC transmit, DMA transmit; buffers are alloc()ed and free()d through the network memory allocator]

The mbuf abstraction
• Unit of work allocation and distribution throughout the stack
• mbuf chains represent in-flight packets, streams, etc.
• Operations: alloc, free, prepend, append, truncate, enqueue, dequeue (see the mbuf KPI sketch below)
• Internal or external data buffer (e.g., VM page)
• Reflects bi-modal packet-size distribution (e.g., TCP ACKs vs. data)
• Similar structures in other OSes – e.g., skbuff in Linux
[Figure: struct mbuf: mbuf header, optional packet header, internal data region (m_data, m_len, pad) or external storage referencing a VM/buffer-cache page; mbuf chains are linked into packet queues such as socket buffers, network-interface queues, the TCP reassembly queue, and netisr work queues]

Send/receive paths in the network stack (1/2)
[Figure: implementation (and sometimes protocol) layers. Output path (send, then transmit): application send(), system-call layer, socket layer sosend()/sbappend(), TCP layer tcp_send()/tcp_output(), IP layer ip_output(), link layer ether_output(), device driver em_start(), NIC. Input path (receive, then deliver): NIC, device driver em_intr(), link layer ether_input(), IP layer ip_input(), TCP layer tcp_input()/tcp_reass(), socket layer sbappend()/soreceive(), system-call layer recv(), application]

Send/receive paths in the network stack (2/2)
[Figure: the same paths annotated: link-layer protocol dispatch; queueing and asynchronous dispatch; TCP reassembly and socket delivery. Enqueued mbufs live in global data structures; an in-flight mbuf is thread local]

Forwarding path in the network stack
[Figure: NIC, device driver em_intr(), link layer ether_input(), IP layer ip_input() with input processing (decapsulation, etc.); ip_forward() performs a routing lookup for non-local delivery; then ip_output() with output processing (encapsulation, etc.), link layer ether_output(), device driver em_start(), NIC]
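The mbuf operations named on the mbuf slide (alloc, free, prepend, append, truncate) map onto the FreeBSD mbuf(9) KPI. The following is a minimal sketch, assuming a kernel-module context; the function name and the 8-byte header / 4-byte trim are hypothetical, chosen only to exercise each operation:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    /*
     * Sketch: allocate a packet-header mbuf backed by a 2K cluster, append
     * payload, prepend a hypothetical 8-byte header, truncate, and free.
     */
    static void
    mbuf_ops_sketch(const char *payload, int len)
    {
            struct mbuf *m;

            /* Alloc: packet-header mbuf with an external cluster attached. */
            m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR);
            if (m == NULL)
                    return;

            /* Append: copy payload into the chain, extending it if required. */
            if (!m_append(m, len, payload)) {
                    m_freem(m);
                    return;
            }
            m->m_pkthdr.len = len;

            /* Prepend: make room for an 8-byte header before the data. */
            M_PREPEND(m, 8, M_NOWAIT);
            if (m == NULL)
                    return;         /* chain already freed on failure */
            bzero(mtod(m, char *), 8);

            /* Truncate: trim 4 bytes from the tail of the chain. */
            m_adj(m, -4);

            /* Free: return the whole chain (and cluster) to the allocator. */
            m_freem(m);
    }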
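The allocate-in-driver / free-in-socket-layer life cycle from the memory-flow slide can be sketched the same way. This is a heavily simplified stand-in, not the real em(4) or soreceive() code: the two wrapper functions and the frame length are hypothetical, while m_getcl(), if_input, uiomove(), and m_freem() are real KPIs:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <sys/uio.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /* Receive side of the cycle: the driver allocates the buffer. */
    static void
    driver_rx_sketch(struct ifnet *ifp)
    {
            struct mbuf *m;

            m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR);  /* alloc in NIC driver */
            if (m == NULL)
                    return;                            /* overload: drop early */
            /* ...NIC DMA has filled the cluster; lengths come from the RX descriptor... */
            m->m_len = m->m_pkthdr.len = 64;           /* hypothetical frame length */
            m->m_pkthdr.rcvif = ifp;
            ifp->if_input(ifp, m);                     /* ether_input() and up the stack */
    }

    /* Delivery side: copy to user memory in the socket layer, then free. */
    static int
    socket_copyout_sketch(struct mbuf *m0, struct uio *uio)
    {
            struct mbuf *m;
            int error = 0;

            for (m = m0; m != NULL && error == 0; m = m->m_next)
                    error = uiomove(mtod(m, char *), m->m_len, uio);
            m_freem(m0);                               /* free in socket layer */
            return (error);
    }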
Work dispatch: input path
• Deferred dispatch: ithread → netisr thread → user thread
• Direct dispatch: ithread → user thread (see the netisr sketch below)
  • Pros: reduced latency, better cache locality, drop early on overload
  • Cons: reduced parallelism and work-placement opportunities
[Figure: the hardware device interrupt runs an ithread in the link layer + driver (receive, validate checksum; interpret and strip the link-layer header); from there, either netisr dispatch to a netisr software ithread or direct dispatch in the same ithread runs IP (validate checksum, strip IP header) and TCP + socket (validate checksum, strip TCP header; reassemble segments; look up socket and deliver); finally a user thread in the socket layer copies out mbufs + clusters as a data stream to the application]

Work dispatch: output path
• Fewer deferred dispatch opportunities implemented
  • (Deferred dispatch on device-driver handoff in the new iflib KPIs)
• Gradual shift of work from software to hardware (see the offload sketch below)
  • Checksum calculation, segmentation, …
[Figure: a user thread carries the application's data stream through the socket layer (kernel copies data into mbufs + 2k/4k/9k/16k clusters), TCP (segmentation to MSS, header encapsulation, checksum), IP (header encapsulation, checksum), and the link layer + driver (Ethernet frame encapsulation, insertion into the descriptor ring); an ithread completes driver work; the device performs checksum and transmit]

Work dispatch: TOE input path
• Kernel provides socket buffers and resource allocation
• Remainder, including state, retransmissions, etc., lives in the NIC
• But: two network stacks? Less flexible/updateable structure?
• Better with an explicit HW/SW architecture – e.g., Microsoft Chimney
[Figure: with TCP offload, the device receives frames, validates Ethernet/IP/TCP checksums, strips headers, reassembles segments, and looks up and delivers to the socket; an ithread in the driver + socket layer hands off to a user thread, which copies out mbufs + clusters as a data stream to the application. The majority of TCP processing, up through socket delivery, moves to hardware]
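The deferred-vs.-direct choice on the input-path slide is exposed by the netisr(9) KPI: link-layer code hands an inbound packet to netisr, which either runs the protocol handler (e.g., ip_input()) immediately in the calling ithread or enqueues the mbuf for a netisr software thread, according to the net.isr.dispatch policy. A minimal sketch, with a hypothetical force_deferred knob standing in for that policy:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/netisr.h>

    /*
     * Sketch: hand an inbound IP packet to netisr.  netisr_dispatch() may
     * call ip_input() inline (direct dispatch) or defer to a netisr thread,
     * depending on the configured policy; netisr_queue() always defers.
     */
    static void
    deliver_ip_sketch(struct mbuf *m, bool force_deferred)
    {
            if (force_deferred)
                    (void)netisr_queue(NETISR_IP, m);    /* ithread -> netisr thread -> user thread */
            else
                    (void)netisr_dispatch(NETISR_IP, m); /* possibly ithread -> user thread */
    }

This is the trade-off summarised on the slide: running the handler inline reduces latency and improves cache locality, at the cost of parallelism and work-placement flexibility.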
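The shift of checksumming and segmentation to hardware noted on the output-path slide is driven by mbuf packet-header flags: tcp_output() and ip_output() mark the mbuf when the interface's if_hwassist advertises the capability, and the driver/NIC then do the work. A hedged sketch of that marking (the function and the mss parameter are hypothetical; the flags and fields are the real mbuf/ifnet ones):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/if_var.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /*
     * Sketch: request hardware IP/TCP checksum offload and TCP segmentation
     * offload (TSO) for an outbound packet when the interface supports them.
     * The real logic in tcp_output()/ip_output() is considerably richer.
     */
    static void
    mark_offload_sketch(struct ifnet *ifp, struct mbuf *m, u_int mss)
    {
            if (ifp->if_hwassist & CSUM_IP)
                    m->m_pkthdr.csum_flags |= CSUM_IP;   /* NIC fills the IP header checksum */
            if (ifp->if_hwassist & CSUM_TCP) {
                    m->m_pkthdr.csum_flags |= CSUM_TCP;  /* NIC fills the TCP checksum */
                    m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum);
            }
            if (ifp->if_hwassist & CSUM_TSO) {
                    m->m_pkthdr.csum_flags |= CSUM_TSO;  /* NIC segments to MSS-sized frames */
                    m->m_pkthdr.tso_segsz = mss;         /* segment size for the hardware */
            }
    }

The driver then inserts the marked mbuf into the descriptor ring and the device performs the checksum (and, with TSO, the segmentation) on transmit, matching the output-path figure.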