Bolt (2023)
===========

Why do we care about congestion control in the data center in 2023?
  Widespread 1986-style congestive collapse of the internet not likely
  But any time you lose packets, you get abysmal performance
    E.g., drive TCP into backoff, get 200 ms to over 1 second of delay
    "Incast" can be a problem with many-to-one communication
  More importantly, buffering hurts latency-sensitive applications
Can we avoid buffering by overprovisioning the network?
  Most applications care about 99th or 99.9th percentile latency
    So limited utility in reducing median latency through overprovisioning
  Also sounds like applications need to juice 100% utilization out of the network
Why are existing CC schemes decreasingly effective?
  Networks are getting faster, the speed of light is not...
    So the bandwidth-delay product (BDP) is getting huge
  Traditionally 2 inputs to a CC scheme: packet loss and queuing delay
  We've ruled out packet loss, leaving only queuing delay.  Problems:
    1. We don't want queuing delay, which leaves us no signaling mechanism!
    2. Need at least 1 RTT to deliver the signal, which is many bytes at line rate
       Entire message transmission could be less than 1 RTT
         E.g., 10 usec * 400 Gbps = 500 KB
p. 219: "Figure 1 also reveals that a 400Gbps link with just 40% load sees
an RPC arrival or completion roughly every RTT!" -- How to calculate this?
  400 Gb/s BDP looks like about 250 KB
    250 KB / (50 GB/s) = 5 usec one-way delay => 10 usec RTT
    Footnote 2 later talks about 8 usec RTT, so that's about right
  Mean RPC size is area above the CDF (between blue line and y=1.0)
    How to calculate?  Ask ChatGPT - says 300 KB [show printout]
  40% of 400 Gb/s is 160 Gb/s = 20 GB/s
    300 KB / 20 GB/s = 15 usec, so an RPC made every 15 usec on average
    So a start or completion every 7.5 usec, roughly checks out to 1 RTT
Look at Figure 6.  What's happening with Swift?
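(Aside, before turning to Swift: the Figure 1 arithmetic above can be sanity-checked in a few lines of Python.  The 300 KB mean RPC size is the value assumed in these notes, not a number from the paper's text.)

```python
# Sanity-check: "a 400Gbps link with just 40% load sees an RPC arrival
# or completion roughly every RTT" (p. 219), assuming ~300 KB mean RPC.

LINK_BPS = 400e9          # 400 Gbps link
LOAD = 0.40               # 40% utilization
MEAN_RPC_BYTES = 300e3    # ~300 KB mean RPC size (assumed from the CDF)

goodput_Bps = LINK_BPS * LOAD / 8           # 20 GB/s of RPC payload
rpc_interarrival_s = MEAN_RPC_BYTES / goodput_Bps
event_interval_s = rpc_interarrival_s / 2   # each RPC has a start and an end

print(f"RPC every {rpc_interarrival_s * 1e6:.1f} usec")             # 15.0 usec
print(f"start/completion every {event_interval_s * 1e6:.1f} usec")  # 7.5 usec
# Comparable to the ~8-10 usec RTT, so the claim roughly checks out.
```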
Swift (2020) = modern CC depending only on queuing delay, no packet loss
  Imposes queuing delay at all times as a signal to senders
  Huge delay spike on flow arrival
    Takes at least 1 RTT for the sender to notice increased delay
    Additional delay because sender only adjusts rate once per RTT
      (to avoid reacting multiple times to the same signal)
  But at least the network is fully utilized
HPCC (high-precision CC) gives us a better tool for CC
  Leverages In-Network Telemetry (INT) [c.f. XCP, RCP]
  Stop messing with delay, just stamp packets with actual link load info
  Receiver echoes to sender, so sender knows exactly how fast to send
Why is HPCC not ideal in Figure 6?
  Sender can't react until it hears feedback echoed by the receiver
    So big queue spike on flow arrival, underutilization on flow completion
  Why is there also underutilization after flow arrival?
    HPCC actually overreacts and induces oscillation
    Could adjust only once per RTT like Swift, but then slow convergence
Let's go over the 3 big ideas in Bolt, on p. 222
Idea 1 (SRC): Bolt changes network hardware; what does this buy us?
  Switches can send feedback directly to senders, no need for receiver to echo
  New type of Sub-RTT Control (SRC) packets, sent from switches to senders
  SRC packets used only to decrease sending rate, not increase--why?
    Any switch can be congested and want the sender to reduce cwnd
    But every switch on the path should have capacity before increasing
Idea 2 (PRU): Changes the sender to pre-notify of flow completion
  Figure 3 shows how Proactive Rampup (PRU) works
  How does the transport layer predict when the application will be done?
    Just look at boundaries of the write/send system calls
      Probably 1 RPC request or reply = 1 write system call
    Note TCP already conveys write boundaries via the Push flag
      Hint to receiver OS scheduler to schedule the receiving process
    Unlike Push, Bolt signals earlier, 1 cwnd before the end of the flow
      Leaves time for receiver to echo in an ACK before sender increases cwnd
Idea 3 (SM): Avoid underutilization when PRU might fail
  When can PRU lose tokens?
    Granted tokens to a sender that had a downstream bottleneck
    Granted tokens to a sender host that fails or is rerouted
  Look at Algorithm 3 (p. 225)
    Adjust sm_token based on size and time between packets vs. line rate
    Almost like a kind of negative queue measurement
      Queue size is 0 no matter how underutilized, while sm_token gives a signal
See Listing 1: new 9-byte header in all Bolt packets (data and SRC)
  q_size - 24-bit size of queue
  link_rate - 8-bit rate of bottleneck link (presumably a table index)
  data/ack/src - type of packet (SRC given priority over data)
  last/first - sender hint that this is the first/last cwnd of the flow
  inc/dec - in ACK or SRC, tells sender to increase/decrease cwnd
  t_data_tx - 32-bit sender timestamp (echoed in SRC/ACK), reduces sender state
Let's walk through what happens at the switch (Algorithm 1)
  Sender presumably sets the inc bit in all packets (unless it doesn't need b/w)
  Congestion if queue size > CC_thresh (which is just 1 MTU!).  Response:
    Send SRC packet to slow the sender
    Also clear inc bit, mark dec bit in the packet before forwarding--why?
      Don't want to generate too many SRC packets; limit to one per data packet
      dec suppresses SRC downstream--so at most one SRC packet per data packet
  Are we in the last cwnd of the flow?  Then increase pru_token
    This will allow other flows to claim the PRU token and ramp up
  Otherwise:
    If pru_token is available, decrement pru_token and leave inc = 1
    If sm_token, then decrement that and leave inc = 1
    Otherwise, clear inc
Why are PRU/SM tokens handled differently?
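(Before answering: the switch-side decision logic just walked through can be sketched roughly as below.  This is a simplification of Algorithm 1, not the paper's code; field names follow Listing 1, but the Pkt/Port classes and the 1500-byte MTU constant are assumptions for illustration.)

```python
from dataclasses import dataclass

MTU = 1500
CC_THRESH = MTU  # congestion threshold: just 1 MTU of queuing

@dataclass
class Pkt:                 # caricature of the 9-byte Bolt header (Listing 1)
    size: int = MTU        # bytes on the wire
    inc: bool = True       # sender asks to grow cwnd
    dec: bool = False      # set by a congested switch upstream
    last: bool = False     # sender hint: last cwnd of the flow
    t_data_tx: int = 0     # sender timestamp, echoed back in SRC/ACK

@dataclass
class Port:
    link_rate: int = 0     # index describing the bottleneck rate (assumed)
    queue_bytes: int = 0   # egress queue occupancy
    pru_token: int = 0     # packets freed by flows in their last cwnd
    sm_token: int = 0      # measured spare bytes, capped at 1 MTU

def switch_rx(port: Port, pkt: Pkt, src_out: list) -> None:
    """Rough sketch of Bolt's per-data-packet switch logic (Algorithm 1)."""
    if port.queue_bytes > CC_THRESH:
        if not pkt.dec:
            # At most one SRC per data packet, sent straight to the sender.
            src_out.append(dict(q_size=port.queue_bytes,
                                link_rate=port.link_rate,
                                t_data_tx=pkt.t_data_tx))
        pkt.inc, pkt.dec = False, True   # dec suppresses SRCs downstream
    elif pkt.last:
        port.pru_token += 1              # ending flow donates its capacity
    elif pkt.inc:
        if port.pru_token > 0:
            port.pru_token -= 1          # spend a PRU token, leave inc set
        elif port.sm_token >= pkt.size:
            port.sm_token -= pkt.size    # spend SM bytes, leave inc set
        else:
            pkt.inc = False              # no known spare capacity here
```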
  pru_token: counts packets, can grow large
  sm_token: counts bytes, capped at 1 MTU (packet size)
  pru_token only relevant when first != last, so a multi-cwnd-sized flow
    At that point most packets are probably MTU-sized anyway
    Ending flow will add 1 cwnd (many packets) of capacity to the link
  sm_token needs finer granularity
    Supply might exceed demand by 1 packet every 10 packets
    sm_token more speculative, so be conservative
      Can go very negative under congestion, but capped at 1 MTU
Let's walk through the sender logic (Algorithm 2):
  What happens when SRC packets signal congestion?
    t_data_tx lets sender estimate the RTT of SRC packets statelessly
    How much should the sender reduce cwnd?
      Want all senders together to reduce by pkt_src.queue_size
      Assume aggregate send rate is approximately link_rate
        Sender's portion of reduction is reaction_factor = send_rate/link_rate
        What if aggregate rate is higher?  Still okay, as more senders need to reduce
        What if aggregate rate is lower?  Then probably won't get an SRC packet
      Remember we want to drive queue size essentially to zero (1 MTU)
      So sender must reduce cwnd by target_q = reaction_factor * queue_size
    How fast should the sender reduce cwnd?
      Switch keeps sending SRC packets until overload is resolved
      So sender won't see the effect of its reduction for 1 RTT_src
      Could reduce by target_q and wait 1 RTT_src
      But no harm doing it gradually, one packet per RTT_src/target_q
        If congestion resolves faster than 1 RTT_src, will have reduced less
        Otherwise will get one SRC per packet sent, so also okay
  How are ACKs handled?
    If the inc bit made it through to the sender and was echoed, increment cwnd
    Otherwise, increment once per round trip (exactly like TCP additive increase)
      Why?  Bolt could leak tokens or not be supported on all switches
      So make sure we never ramp up more slowly than TCP
So how well does Bolt do?  Look at the Fig. 10 simulation
  How does Bolt do better than RTT ideal?
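(Before looking at the results: the sender-side reaction just described can be sketched as below.  Again a simplification of Algorithm 2, not the paper's code; the Sender fields and the pacing formula are the notes' reading of "one packet per RTT_src/target_q".)

```python
from dataclasses import dataclass

MTU = 1500

@dataclass
class Sender:
    cwnd: float = 20 * MTU
    send_rate: float = 25e9   # bytes/sec this sender is pushing (assumed)
    last_dec: float = -1.0    # time of the last cwnd decrement

def on_src(s: Sender, q_size: float, link_rate: float,
           t_data_tx: float, now: float) -> None:
    """Rough sketch of the sender's reaction to an SRC packet (Algorithm 2)."""
    rtt_src = now - t_data_tx                  # stateless RTT estimate via echo
    reaction_factor = s.send_rate / link_rate  # this sender's share of the drain
    target_q = reaction_factor * q_size        # bytes this sender should shed
    if target_q < MTU:
        return
    # Spread the reduction over rtt_src: one MTU every rtt_src/(target_q/MTU),
    # so if congestion resolves early (SRCs stop), we have over-reduced less.
    interval = rtt_src / (target_q / MTU)
    if now - s.last_dec >= interval:
        s.cwnd = max(s.cwnd - MTU, MTU)
        s.last_dec = now
```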
    With ideal, a new flow still dumps 1 BDP into the queue
    With Bolt, SRC packet slows the sender before 1 RTT, so it only sends BW * RTT_src
  Table 1 shows the effectiveness of PRU and SM after flow completion
    As expected, PRU much more important for the scenario it targets
    So do we really need SM?  What if a flow is re-routed?  See Figure 12
How fair is Bolt?  See Figure 15
  Add or remove a sender every 10 msec
  All we can really conclude is fairness +/- 20 Gbps
    But this is a problem with the graph, not the experiment
    Should maybe have averaged noise over a longer time period
    Want to see what's happening under those noisy rectangles
Let's look at Figure 14
  Footnote 4: FCT slowdown is a flow's actual FCT normalized by its ideal FCT
    when the flow sends at line rate...
  We are looking at FCT for the worst flows (99th percentile)
  Why does Swift mostly trend down above 0.3 BDP?
    Queuing delay gets smaller relative to transmission time
  Why does HPCC go nuts at 0.8 BDP?
    Wouldn't expect queuing delay to be worse than Swift
    Maybe packet loss, since HPCC expects to queue less?
Should we believe these simulations?
  At least Figure 18 shows hardware matches NS3 in one scenario
How would you deploy Bolt incrementally in a datacenter?
  Senders could run Swift on top of Bolt, target the min cwnd of both
    So works with either kind of switch, better with Bolt switches
  Somehow make Bolt coexist with other protocols like TCP CUBIC
    Maybe use QoS bits to isolate Bolt vs. other traffic
  Somehow deal with NICs that like to batch transmission
How would Bolt work with QoS-segregated traffic?  (Appendix B)
  Queue occupancy could be for the sender's specific queue
    [What about rate?  Is 8 bits enough to cover rate now?]
  But sm_token needs to be weighted by queue weight
    Queues reading each other's stats is hard to do in hardware
    Alternative: use probabilistic SRC packets
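(For reference, the sm_token bookkeeping discussed earlier, Algorithm 3's "negative queue" measurement, which Appendix B would need to weight per QoS queue, might look roughly like the sketch below.  The class and the 1 MTU cap are the notes' reading of the algorithm, not the paper's code.)

```python
MTU = 1500

class SelfMeasurement:
    """Sketch of the sm_token supply-vs-demand update (Algorithm 3)."""
    def __init__(self, line_rate_Bps: float):
        self.line_rate = line_rate_Bps  # bottleneck line rate in bytes/sec
        self.last_arrival = 0.0
        self.sm_token = 0.0             # spare bytes; can go very negative

    def on_packet(self, size: int, now: float) -> None:
        # Bytes the link could have carried since the last packet,
        # minus the bytes it actually did carry:
        supply = (now - self.last_arrival) * self.line_rate
        self.sm_token += supply - size
        # Cap at 1 MTU: spare capacity is speculative, so stay conservative.
        self.sm_token = min(self.sm_token, MTU)
        self.last_arrival = now
```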