Evolving TCP Using FreeBSD
Code, Tools, Research & Results

Lawrence Stewart <lastewart@swin.edu.au>
Centre for Advanced Internet Architectures (CAIA)
Swinburne University of Technology

Outline
1 Recap
2 FreeBSD as an R&D Platform
3 Some Research Results
4 Wrapping Up

FastSoft http://www.caia.swin.edu.au lastewart@swin.edu.au

Section 1 of 4: Recap (Where are we today; Open issues)

Where are we today
- Many incremental (partially implemented) improvements
- State of the CC union
  - NewReno is the de facto standard, with warts (LFN, wireless)
  - Many new proposals
  - BSD still uses NewReno
  - Linux uses CUBIC
  - Windows Vista uses Compound TCP
- TCP/IP stack enhancements
  - e.g. CSO/TSO/LRO/TOE
  - Various locking/caching tricks
  - Socket buffer autotuning

Open issues
- High-speed CC algorithms [1]: FAST, HS-TCP, H-TCP, CTCP, CUBIC, etc.
- Delay-based CC algorithms
- How do we compare and evaluate TCPs?
- Multipath
- CSO/TSO/LRO/TOE obscure behaviours
- Testing/verification of TCP/IP stack behaviour

[1] Nice summary: http://kb.pert.geant2.net/PERTKB/TcpHighSpeedVariants

Section 2 of 4: FreeBSD as an R&D Platform (At a Glance; Modular Congestion Control; SIFTR; ALQ; DPD; TCP Reassembly Queue)

At a Glance
- Modular congestion control
  - In svn project branch, coming to FreeBSD 7 and 8 soon
  - BSD-licensed NewReno, H-TCP and CUBIC implementations available
  - Sponsored by Cisco Systems
- Statistical Information for TCP Research (SIFTR)
  - FreeBSD kld that gathers in-kernel TCP endpoint connection data as CSV
  - Similar concept to Web100, with more variables
  - Sponsored by Cisco Systems and the FreeBSD Foundation
- Deterministic Packet Discard (DPD)
  - Adds a 'pls' (packet loss set) option for dummynet pipes, e.g. ipfw pipe 1 config pls 1,5-10,30 drops packets 1, 5-10 (inclusive) and 30
- Dummynet forensic logging support
  - Logs queue state on each packet event

Modular Congestion Control
NEWS
- Project moved into public svn repository: projects/tcp_cc_8.x
- Completed CUBIC implementation (unlikely to be any more from me)
- Significant locking improvements
- Maintaining both 7.x and 8.x patches
TODO for 8.x (roughly in order)
- Commit ABI-breaking parts
- Finish ECN/ABC/VIMAGE integration
- Complete documentation
- Commit to 8.x with experimental status, i.e. no ABI guarantees
ISSUES
- A simple framework may be needed for CC-related, algorithm-agnostic tasks
- Should we consider moving more variables into a CC struct?
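To make the 'pls' semantics from the At a Glance slide concrete, here is a minimal userspace sketch in C that parses a packet loss set such as 1,5-10,30 and decides whether a given packet should be dropped. It is not the DPD patch code: struct pls_set, pls_parse(), pls_should_drop(), pls_read_num() and the PLS_MAX_RANGES limit are all hypothetical, and it assumes packets are numbered from 1 in the order they arrive at the pipe.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define PLS_MAX_RANGES 64	/* hypothetical cap on ranges per set */

/* One inclusive range of packet numbers, e.g. 5-10 or 30-30. */
struct pls_range {
	unsigned long lo, hi;
};

/* Hypothetical in-memory form of a packet loss set. */
struct pls_set {
	struct pls_range r[PLS_MAX_RANGES];
	int n;
};

/* Read one unsigned decimal number; returns false if none is present. */
static bool pls_read_num(const char **p, unsigned long *out)
{
	char *end;

	if (!isdigit((unsigned char)**p))
		return false;	/* rejects signs, spaces and empty input */
	*out = strtoul(*p, &end, 10);
	*p = end;
	return true;
}

/* Parse a spec such as "1,5-10,30"; returns 0 on success, -1 if malformed. */
static int pls_parse(const char *spec, struct pls_set *set)
{
	const char *p = spec;

	assert(spec != NULL && set != NULL);
	set->n = 0;
	for (;;) {
		unsigned long lo, hi;

		if (set->n == PLS_MAX_RANGES || !pls_read_num(&p, &lo))
			return -1;
		hi = lo;			/* single packet unless "lo-hi" */
		if (*p == '-') {
			p++;
			if (!pls_read_num(&p, &hi) || hi < lo)
				return -1;
		}
		set->r[set->n].lo = lo;
		set->r[set->n].hi = hi;
		set->n++;
		if (*p == '\0')
			return 0;		/* consumed the whole spec */
		if (*p++ != ',')
			return -1;		/* unexpected character */
	}
}

/* True if packet number pktnum falls inside any range of the set. */
static bool pls_should_drop(const struct pls_set *set, unsigned long pktnum)
{
	for (int i = 0; i < set->n; i++)
		if (pktnum >= set->r[i].lo && pktnum <= set->r[i].hi)
			return true;
	return false;
}
```

For example, after pls_parse("1,5-10,30", &set) succeeds, pls_should_drop(&set, 7) is true and pls_should_drop(&set, 11) is false. Storing inclusive ranges rather than a per-packet bitmap keeps the set small even when the packet numbers are large.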
Modular Congestion Control
Defined in struct cc_algo:

    struct cc_algo {
        /* specify one per CC algorithm */
        char name[TCP_CA_NAME_MAX];
        int  (*mod_init)      (struct tcpcb *tp);
        int  (*mod_destroy)   (struct tcpcb *tp);
        int  (*cb_init)       (struct tcpcb *tp);
        void (*cb_destroy)    (struct tcpcb *tp);
        void (*conn_init)     (struct tcpcb *tp);
        void (*ack_received)  (struct tcpcb *tp, struct tcphdr *th);
        void (*pre_fr)        (struct tcpcb *tp, struct tcphdr *th);
        void (*post_fr)       (struct tcpcb *tp, struct tcphdr *th);
        void (*after_idle)    (struct tcpcb *tp);
        void (*after_timeout) (struct tcpcb *tp);
        STAILQ_ENTRY(cc_algo) entries;
    };

Modular Congestion Control
Housekeeping

    /* called during TCP/IP stack initialisation on boot */
    void cc_init(void);
    /* dynamically registers a new CC algorithm */
    int cc_register_algo(struct cc_algo *);
    /* dynamically deregisters a CC algorithm */
    int cc_deregister_algo(struct cc_algo *);
    /* macro that hides housekeeping code from modules */
    DECLARE_CC_MODULE(ccname, ccalgo);

Modular Congestion Control
Minor ABI-breaking additions to struct tcpcb

    struct tcpcb {
        ...
        /* CC function pointers to use for this connection */
        struct cc_algo *cc_algo;
        /* connection specific CC algorithm data */
        void *cc_data;
    };

KPI/API/Configuration
- New net.inet.tcp.cc sysctl tree with variables:
  - available: comma-separated list of available CC algorithms
  - algorithm: current system default CC algorithm
- Removed the net.inet.tcp.newreno sysctl variable
- New socket option TCP_CONGESTION defined in tcp.h
  - Overrides the system default CC algorithm via setsockopt(2)
  - Same name as the Linux define, e.g.
    Iperf's -Z option works

SIFTR
Statistical Information For TCP Research
- FreeBSD [6,7,8] kernel module
- BSD-licensed source [2]
- Similar base concept to Web100
- Event triggered (not poll based)
- Currently logs 25 different variables to file as CSV data [3]
- Plan to integrate into the base system for 8.x
- Work on v1.2.x sponsored by the FreeBSD Foundation

[2] Available from: http://caia.swin.edu.au/urp/newtcp/tools.html
[3] See the README in the SIFTR distribution for specific details

SIFTR
[Figure: SIFTR architecture. In kernel space, SIFTR hooks the IPv4/6 in and out paths between layer 2 (L2 In/L2 Out) and tcp_input()/tcp_output(), and queries the TCP control block (src_port, dst_port, cwnd, rtt, ...) of each connection; user-space applications reach TCP through the socket API.]

SIFTR
[Figure: SIFTR packet processing flowchart. Network thread(s): non-TCP packets pass straight through; for each TCP packet, look up its TCP control block, copy stats into a pkt_node and enqueue it (possible lock contention). pkt_manager thread: dequeue all pkt_nodes and, for each, update the flow's counter (counter++, counter = counter % ppl), generate and write a log message when counter == 0, and delete the pkt_node, until no pkt_nodes remain.]

Asynchronous Logging Queues (ALQ)
- Jeff Roberson's KPI for in-kernel file logging
- Made it build as an LKM
- Extended the KPI to support variable-length messages
- Reworked under the hood to use a circular buffer
- Useful fallout from the SIFTR work
- Would like to add high-water-mark-triggered flushing
- Will commit to 8.x; also backportable

Asynchronous Logging Queues (ALQ)

    /* unchanged; count=0 now means the size arg specifies the buffer size */
    int alq_open(struct alq **, const char *file, struct ucred *cred,
        int cmode, int size, int count);
    /* legacy fixed-length write, wraps alq_writen() */
    int alq_write(struct alq *alq, void *data, int flags);
    /* new variable-length write */
    int alq_writen(struct alq *alq, void *data, int len, int flags);
    /* legacy fixed-length ale, wraps alq_getn() */
    struct ale *alq_get(struct alq *alq, int flags);
    /* new variable-length ale */
    struct ale *alq_getn(struct alq *alq, int len, int flags);

Deterministic Packet Discard (DPD)
- Patch against FreeBSD 8.x IPFW/Dummynet
- BSD-licensed source [4]
- Useful for protocol (not just TCP!) verification and testing
- Adds a 'pls' (packet loss set) option for dummynet pipes, e.g.
  ipfw pipe 1 config pls 1,5-10,30 drops packets 1, 5-10 (inclusive) and 30
- Need to catch up with Luigi's work
- Low priority, but hope to commit to 7.x and 8.x soon

[4] Available from: http://caia.swin.edu.au/urp/newtcp/tools.html

TCP Reassembly Queue
- TCP reassembly queue tuning is inherently connection specific
- The current method, a fixed system-wide limit, is wasteful and can severely damage TCP performance
- Aim to do away with net.inet.tcp.reass.maxqlen
- Adapt the reassembly queue based on connection dynamics
  - Somewhat akin to socket buffer autotuning
- Currently WIP (building on Andre's work)
- Sponsored by the FreeBSD Foundation

Section 3 of 4: Some Research Results (Testbed; Connection Dynamics; Collateral Damage; Subtle Queuing Implications)

Testbed
- Linux/FreeBSD hosts
  - Modular congestion control
  - Web100/SIFTR for Linux/FreeBSD testing
  - Iperf/Tcpreplay for traffic generation
- FreeBSD dummynet router
- Endace DAG 3.7GF capture card
[Figure: testbed topology. Hosts A and B and hosts C and D sit on either side of the dummynet router, which applies a drop-tail queue and an RTT/2 delay in each direction; an Endace DAG 3.7GF card captures the traffic.]

Connection Dynamics
1 TCP flow, H-TCP, 100ms RTT, 1Mbps, 60000 byte queue
[Figure: flow 1 cwnd (pkts) and queue occupancy (Kbytes) vs time (secs).]

Appropriate Byte Counting (ABC)
100ms RTT, 10Mbps, 62500 byte queue
[Figure: cwnd (pkts) vs time (secs), without ABC (noabc) and with ABC (abc).]

Collateral Damage
Induced delay: 1 TCP vs 1 CBR UDP flow, 50ms RTT, 1Mbps, 60000 byte queue
[Figure: CDF of induced delay (ms) for newreno, htcp and cubic.]

Collateral Damage
Induced delay: 1 TCP vs 1 CBR UDP flow, 50ms RTT, 1.5Mbps/256Kbps, 20000 byte queue
[Figure: one-way queueing delay (ms) vs one-way fixed propagation delay, RTT/2 (12, 25, 50, 100 ms), for CUBIC and NewReno in ns-2 and on the testbed.]

Collateral Damage
Retransmissions: n TCP vs 1 CBR UDP flow, 50ms RTT, 1Mbps, 60000 byte queue
[Figure: average retransmits vs number of TCP flows (1 to 5) for newreno, htcp and cubic.]

Subtle Queuing Implications
Induced CBR loss: 1 TCP vs 1 CBR UDP flow, 100ms RTT, 1.5Mbps/256Kbps, NS
[Figure: CBR % dropped packets vs queue size (10^3 B) for the FB-loose, PS and FB-strict configurations, each with CUBIC and NewReno.]

Section 4 of 4: Wrapping Up (Ideas for Future Work; Further Information; Acknowledgements; Questions)

Ideas for Future Work
TCP specific:
- Improve the RTT estimator
- Share CC between TCP and SCTP
- Rework the host cache
- Comprehensive RFC compliance check
- Fix slow start and fast retransmit/fast recovery (FR/FR)
TCP/IP stack in general:
- Framework for dealing with CSO/TSO/LRO/TOE
- DTrace-esque instrumentation
- Testing framework (the next project I want to tackle)

Further Information
Papers
- Lawrence Stewart, Grenville Armitage, Alana Huebner, "Collateral Damage: The Impact of Optimised TCP Variants On Real-time Traffic Latency in Consumer Broadband Environments", IFIP/TC6 NETWORKING 2009, Aachen, Germany, 11-15 May 2009.
- Grenville Armitage, Lawrence Stewart, Michael Welzl, James Healy, "An independent H-TCP implementation under FreeBSD 7.0 - description and observed behaviour", ACM SIGCOMM Computer Communication Review, vol. 38, no. 3, pp. 29-38, July 2008.
Links
- http://caia.swin.edu.au/urp/newtcp/
- http://caia.swin.edu.au/freebsd/etcp09/
- http://people.freebsd.org/~lstewart/
- http://lists.freebsd.org/pipermail/freebsd-net/

Acknowledgements
- Cisco Systems
- The FreeBSD Foundation

The End
tp->t_state = TCPS_QUESTIONS
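For a concrete picture of the struct cc_algo KPI from the Modular Congestion Control slides, the following is a minimal userspace sketch in C of registering a NewReno-style algorithm and driving its hooks. It is a mock, not the tcp_cc project code: struct tcpcb is reduced to three window fields plus the cc_algo pointer, the hooks take only tp (no struct tcphdr), a plain next pointer stands in for STAILQ_ENTRY, and the helpers cc_lookup() and nr_half_window(), the -1 error return and the TCP_CA_NAME_MAX value are assumptions. The window arithmetic follows the classic NewReno rules in simplified form (for example, ssthresh is derived from cwnd alone, ignoring the receive window).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TCP_CA_NAME_MAX 16	/* assumed size for this sketch */

struct cc_algo;

/* Mock: only the fields this sketch needs, not FreeBSD's struct tcpcb. */
struct tcpcb {
	unsigned long snd_cwnd;		/* congestion window (bytes) */
	unsigned long snd_ssthresh;	/* slow-start threshold (bytes) */
	unsigned long t_maxseg;		/* maximum segment size (bytes) */
	struct cc_algo *cc_algo;	/* CC hooks used by this connection */
};

/* Simplified subset of the struct cc_algo hooks shown on the slides. */
struct cc_algo {
	char name[TCP_CA_NAME_MAX];
	void (*ack_received)(struct tcpcb *tp);
	void (*pre_fr)(struct tcpcb *tp);
	void (*post_fr)(struct tcpcb *tp);
	void (*after_timeout)(struct tcpcb *tp);
	struct cc_algo *next;		/* stands in for STAILQ_ENTRY */
};

static struct cc_algo *cc_list;	/* registry of available algorithms */

/* Register an algorithm; returns -1 if the name is already taken. */
static int cc_register_algo(struct cc_algo *algo)
{
	assert(algo != NULL);
	for (struct cc_algo *a = cc_list; a != NULL; a = a->next)
		if (strcmp(a->name, algo->name) == 0)
			return -1;
	algo->next = cc_list;
	cc_list = algo;
	return 0;
}

/* Hypothetical helper: find a registered algorithm by name. */
static struct cc_algo *cc_lookup(const char *name)
{
	for (struct cc_algo *a = cc_list; a != NULL; a = a->next)
		if (strcmp(a->name, name) == 0)
			return a;
	return NULL;
}

/* Hypothetical helper: half of cwnd in whole segments, at least 2. */
static unsigned long nr_half_window(const struct tcpcb *tp)
{
	assert(tp->t_maxseg > 0);
	unsigned long segs = tp->snd_cwnd / 2 / tp->t_maxseg;
	return (segs < 2 ? 2 : segs) * tp->t_maxseg;
}

/* Slow start at or below ssthresh, congestion avoidance above it. */
static void newreno_ack_received(struct tcpcb *tp)
{
	unsigned long incr = tp->t_maxseg;

	if (tp->snd_cwnd > tp->snd_ssthresh) {
		incr = incr * incr / tp->snd_cwnd;
		if (incr == 0)
			incr = 1;
	}
	tp->snd_cwnd += incr;
}

/* Entering fast recovery: ssthresh becomes half the window. */
static void newreno_pre_fr(struct tcpcb *tp)
{
	tp->snd_ssthresh = nr_half_window(tp);
}

/* Leaving fast recovery: deflate cwnd to ssthresh. */
static void newreno_post_fr(struct tcpcb *tp)
{
	tp->snd_cwnd = tp->snd_ssthresh;
}

/* Retransmit timeout: ssthresh = half the window, cwnd = one segment. */
static void newreno_after_timeout(struct tcpcb *tp)
{
	tp->snd_ssthresh = nr_half_window(tp);
	tp->snd_cwnd = tp->t_maxseg;
}

static struct cc_algo newreno_cc_algo = {
	.name = "newreno",
	.ack_received = newreno_ack_received,
	.pre_fr = newreno_pre_fr,
	.post_fr = newreno_post_fr,
	.after_timeout = newreno_after_timeout,
};
```

For example, with t_maxseg = 1460 and cwnd = 14600 bytes above ssthresh, one ack_received() call grows cwnd by 1460*1460/14600 = 146 bytes, i.e. roughly one segment per window's worth of ACKs. Keeping the per-algorithm behaviour behind function pointers is what allows a connection to carry its own cc_algo, selected by default or via TCP_CONGESTION.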