Bolt (2023)
===========

Why do we care about congestion control in the data center in 2023?
  Widespread 1986-style congestive collapse of the internet not likely
  But any time you lose packets, you get abysmal performance
    E.g., drive TCP into backoff, get 200 ms to over 1 second of delay
    "Incast" can be a problem with many-to-one communication
  More importantly, buffering hurts latency-sensitive applications
Can we avoid buffering by overprovisioning the network?
  Most applications care about 99th or 99.9th percentile latency
    So limited utility in reducing median latency through overprovisioning
  Also sounds like applications need to juice 100% utilization out of the network
Why are existing CC schemes decreasingly effective?
  Networks are getting faster, the speed of light is not...
    So the bandwidth-delay product (BDP) is getting huge
  Traditionally 2 inputs to a CC scheme: packet loss and queuing delay
  We've ruled out packet loss, leaving only queuing delay.  Problems:
    1. We don't want queuing delay, which leaves us no signaling mechanism!
    2. Need at least 1 RTT to deliver the signal, which is many bytes at line rate
       Entire message transmission could be less than 1 RTT
         E.g., 10 usec * 400 Gbps = 500 KB
p. 219: "Figure 1 also reveals that a 400Gbps link with just 40% load sees
an RPC arrival or completion roughly every RTT!" -- How to calculate this?
  400 Gb/s BDP looks like about 250 KB
    250 KB / (50 GB/s) = 5 usec one-way delay => 10 usec RTT
    Footnote 2 later talks about 8 usec RTT, so that's about right
  Mean RPC size is area above the CDF (between blue line and y=1.0)
    How to calculate?  Ask ChatGPT - says 300 KB [show printout]
  40% of 400 Gb/s is 160 Gb/s = 20 GB/s
    300 KB / 20 GB/s = 15 usec, so an RPC made every 15 usec on average
    So a start or completion every 7.5 usec, roughly checks out to 1 RTT
Look at Figure 6.  What's happening with Swift?
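(Aside, before turning to Swift: the Figure 1 arithmetic above can be sanity-checked in a few lines of Python.  The 300 KB mean RPC size is the value assumed in these notes, not a number from the paper's text.)

```python
# Sanity-check: "a 400Gbps link with just 40% load sees an RPC arrival
# or completion roughly every RTT" (p. 219), assuming ~300 KB mean RPC.

LINK_BPS = 400e9          # 400 Gbps link
LOAD = 0.40               # 40% utilization
MEAN_RPC_BYTES = 300e3    # ~300 KB mean RPC size (assumed from the CDF)

goodput_Bps = LINK_BPS * LOAD / 8           # 20 GB/s of RPC payload
rpc_interarrival_s = MEAN_RPC_BYTES / goodput_Bps
event_interval_s = rpc_interarrival_s / 2   # each RPC has a start and an end

print(f"RPC every {rpc_interarrival_s * 1e6:.1f} usec")             # 15.0 usec
print(f"start/completion every {event_interval_s * 1e6:.1f} usec")  # 7.5 usec
# Comparable to the ~8-10 usec RTT, so the claim roughly checks out.
```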
Swift (2020) = modern CC depending only on queuing delay, no packet loss
  Imposes queuing delay at all times as a signal to senders
  Huge delay spike on flow arrival
    Takes at least 1 RTT for the sender to notice increased delay
    Additional delay because sender only adjusts rate once per RTT
      (to avoid reacting multiple times to the same signal)
  But at least the network is fully utilized
HPCC (high-precision CC) gives us a better tool for CC
  Leverages In-Network Telemetry (INT) [c.f. XCP, RCP]
  Stop messing with delay, just stamp packets with actual link load info
  Receiver echoes to sender, so sender knows exactly how fast to send
Why is HPCC not ideal in Figure 6?
  Sender can't react until it hears feedback echoed by the receiver
    So big queue spike on flow arrival, underutilization on flow completion
  Why is there also underutilization after flow arrival?
    HPCC actually overreacts and induces oscillation
    Could adjust only once per RTT like Swift, but then slow convergence
Let's go over the 3 big ideas in Bolt, on p. 222
Idea 1 (SRC): Bolt changes network hardware; what does this buy us?
  Switches can send feedback directly to senders, no need for receiver to echo
  New type of Sub-RTT Control (SRC) packets, sent from switches to senders
  SRC packets used only to decrease sending rate, not increase--why?
    Any switch can be congested and want the sender to reduce cwnd
    But every switch on the path should have capacity before increasing
Idea 2 (PRU): Changes the sender to pre-notify of flow completion
  Figure 3 shows how Proactive Rampup (PRU) works
  How does the transport layer predict when the application will be done?
    Just look at boundaries of the write/send system calls
      Probably 1 RPC request or reply = 1 write system call
    Note TCP already conveys write boundaries via the Push flag
      Hint to receiver OS scheduler to schedule the receiving process
    Unlike Push, Bolt signals earlier, 1 cwnd before the end of the flow
      Leaves time for receiver to echo in an ACK before sender increases cwnd
Idea 3 (SM): Avoid underutilization when PRU might fail
  When can PRU lose tokens?
    Granted tokens to a sender that had a downstream bottleneck
    Granted tokens to a sender host that fails or is rerouted
  Look at Algorithm 3 (p. 225)
    Adjust sm_token based on size and time between packets vs. line rate
    Almost like a kind of negative queue measurement
      Queue size is 0 no matter how underutilized, while sm_token gives a signal
See Listing 1: new 9-byte header in all Bolt packets (data and SRC)
  q_size - 24-bit size of queue
  link_rate - 8-bit rate of bottleneck link (presumably a table index)
  data/ack/src - type of packet (SRC given priority over data)
  last/first - sender hint that this is the first/last cwnd of the flow
  inc/dec - in ACK or SRC, tells sender to increase/decrease cwnd
  t_data_tx - 32-bit sender timestamp (echoed in SRC/ACK), reduces sender state
Let's walk through what happens at the switch (Algorithm 1)
  Sender presumably sets the inc bit in all packets (unless it doesn't need b/w)
  Congestion if queue size > CC_thresh (which is just 1 MTU!).  Response:
    Send SRC packet to slow the sender
    Also clear inc bit, mark dec bit in the packet before forwarding--why?
      Don't want to generate too many SRC packets; limit to one per data packet
      dec suppresses SRC downstream--so at most one SRC packet per data packet
  Are we in the last cwnd of the flow?  Then increase pru_token
    This will allow other flows to claim the PRU token and ramp up
  Otherwise:
    If pru_token is available, decrement pru_token and leave inc = 1
    If sm_token, then decrement that and leave inc = 1
    Otherwise, clear inc
Why are PRU/SM tokens handled differently?
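(Before answering: the switch-side decision logic just walked through can be sketched roughly as below.  This is a simplification of Algorithm 1, not the paper's code; field names follow Listing 1, but the Pkt/Port classes and the 1500-byte MTU constant are assumptions for illustration.)

```python
from dataclasses import dataclass

MTU = 1500
CC_THRESH = MTU  # congestion threshold: just 1 MTU of queuing

@dataclass
class Pkt:                 # caricature of the 9-byte Bolt header (Listing 1)
    size: int = MTU        # bytes on the wire
    inc: bool = True       # sender asks to grow cwnd
    dec: bool = False      # set by a congested switch upstream
    last: bool = False     # sender hint: last cwnd of the flow
    t_data_tx: int = 0     # sender timestamp, echoed back in SRC/ACK

@dataclass
class Port:
    link_rate: int = 0     # index describing the bottleneck rate (assumed)
    queue_bytes: int = 0   # egress queue occupancy
    pru_token: int = 0     # packets freed by flows in their last cwnd
    sm_token: int = 0      # measured spare bytes, capped at 1 MTU

def switch_rx(port: Port, pkt: Pkt, src_out: list) -> None:
    """Rough sketch of Bolt's per-data-packet switch logic (Algorithm 1)."""
    if port.queue_bytes > CC_THRESH:
        if not pkt.dec:
            # At most one SRC per data packet, sent straight to the sender.
            src_out.append(dict(q_size=port.queue_bytes,
                                link_rate=port.link_rate,
                                t_data_tx=pkt.t_data_tx))
        pkt.inc, pkt.dec = False, True   # dec suppresses SRCs downstream
    elif pkt.last:
        port.pru_token += 1              # ending flow donates its capacity
    elif pkt.inc:
        if port.pru_token > 0:
            port.pru_token -= 1          # spend a PRU token, leave inc set
        elif port.sm_token >= pkt.size:
            port.sm_token -= pkt.size    # spend SM bytes, leave inc set
        else:
            pkt.inc = False              # no known spare capacity here
```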
  pru_token: counts packets, can grow large
  sm_token: counts bytes, capped at 1 MTU (packet size)
  pru_token only relevant when first != last, so a multi-cwnd-sized flow
    At that point most packets are probably MTU-sized anyway
    Ending flow will add 1 cwnd (many packets) of capacity to the link
  sm_token needs finer granularity
    Supply might exceed demand by 1 packet every 10 packets
    sm_token more speculative, so be conservative
      Can go very negative under congestion, but capped at 1 MTU
Let's walk through the sender logic (Algorithm 2):
  What happens when SRC packets signal congestion?
    t_data_tx lets sender estimate the RTT of SRC packets statelessly
    How much should the sender reduce cwnd?
      Want all senders together to reduce by pkt_src.queue_size
      Assume aggregate send rate is approximately link_rate
        Sender's portion of reduction is reaction_factor = send_rate/link_rate
        What if aggregate rate is higher?  Still okay, as more senders need to reduce
        What if aggregate rate is lower?  Then probably won't get an SRC packet
      Remember we want to drive queue size essentially to zero (1 MTU)
      So sender must reduce cwnd by target_q = reaction_factor * queue_size
    How fast should the sender reduce cwnd?
      Switch keeps sending SRC packets until overload is resolved
      So sender won't see the effect of its reduction for 1 RTT_src
      Could reduce by target_q and wait 1 RTT_src
      But no harm doing it gradually, one packet per RTT_src/target_q
        If congestion resolves faster than 1 RTT_src, will have reduced less
        Otherwise will get one SRC per packet sent, so also okay
  How are ACKs handled?
    If the inc bit made it through to the sender and was echoed, increment cwnd
    Otherwise, increment once per round trip (exactly like TCP additive increase)
      Why?  Bolt could leak tokens or not be supported on all switches
      So make sure we never ramp up more slowly than TCP
So how well does Bolt do?  Look at the Fig. 10 simulation
  How does Bolt do better than RTT ideal?
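(Before looking at the results: the sender-side reaction just described can be sketched as below.  Again a simplification of Algorithm 2, not the paper's code; the Sender fields and the pacing formula are the notes' reading of "one packet per RTT_src/target_q".)

```python
from dataclasses import dataclass

MTU = 1500

@dataclass
class Sender:
    cwnd: float = 20 * MTU
    send_rate: float = 25e9   # bytes/sec this sender is pushing (assumed)
    last_dec: float = -1.0    # time of the last cwnd decrement

def on_src(s: Sender, q_size: float, link_rate: float,
           t_data_tx: float, now: float) -> None:
    """Rough sketch of the sender's reaction to an SRC packet (Algorithm 2)."""
    rtt_src = now - t_data_tx                  # stateless RTT estimate via echo
    reaction_factor = s.send_rate / link_rate  # this sender's share of the drain
    target_q = reaction_factor * q_size        # bytes this sender should shed
    if target_q < MTU:
        return
    # Spread the reduction over rtt_src: one MTU every rtt_src/(target_q/MTU),
    # so if congestion resolves early (SRCs stop), we have over-reduced less.
    interval = rtt_src / (target_q / MTU)
    if now - s.last_dec >= interval:
        s.cwnd = max(s.cwnd - MTU, MTU)
        s.last_dec = now
```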
    With ideal, a new flow still dumps 1 BDP into the queue
    With Bolt, SRC packet slows the sender before 1 RTT, so it only sends BW * RTT_src
  Table 1 shows the effectiveness of PRU and SM after flow completion
    As expected, PRU much more important for the scenario it targets
    So do we really need SM?  What if a flow is re-routed?  See Figure 12
How fair is Bolt?  See Figure 15
  Add or remove a sender every 10 msec
  All we can really conclude is fairness +/- 20 Gbps
    But this is a problem with the graph, not the experiment
    Should maybe have averaged noise over a longer time period
    Want to see what's happening under those noisy rectangles
Let's look at Figure 14
  Footnote 4: FCT slowdown is a flow's actual FCT normalized by its ideal FCT
    when the flow sends at line rate...
  We are looking at FCT for the worst flows (99th percentile)
  Why does Swift mostly trend down above 0.3 BDP?
    Queuing delay gets smaller relative to transmission time
  Why does HPCC go nuts at 0.8 BDP?
    Wouldn't expect queuing delay to be worse than Swift
    Maybe packet loss, since HPCC expects to queue less?
Should we believe these simulations?
  At least Figure 18 shows hardware matches NS3 in one scenario
How would you deploy Bolt incrementally in a datacenter?
  Senders could run Swift on top of Bolt, target the min cwnd of both
    So works with either kind of switch, better with Bolt switches
  Somehow make Bolt coexist with other protocols like TCP CUBIC
    Maybe use QoS bits to isolate Bolt vs. other traffic
  Somehow deal with NICs that like to batch transmission
How would Bolt work with QoS-segregated traffic?  (Appendix B)
  Queue occupancy could be for the sender's specific queue
    [What about rate?  Is 8 bits enough to cover rate now?]
  But sm_token needs to be weighted by queue weight
    Queues reading each other's stats is hard to do in hardware
    Alternative: use probabilistic SRC packets
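(For reference, the sm_token bookkeeping discussed earlier, Algorithm 3's "negative queue" measurement, which Appendix B would need to weight per QoS queue, might look roughly like the sketch below.  The class and the 1 MTU cap are the notes' reading of the algorithm, not the paper's code.)

```python
MTU = 1500

class SelfMeasurement:
    """Sketch of the sm_token supply-vs-demand update (Algorithm 3)."""
    def __init__(self, line_rate_Bps: float):
        self.line_rate = line_rate_Bps  # bottleneck line rate in bytes/sec
        self.last_arrival = 0.0
        self.sm_token = 0.0             # spare bytes; can go very negative

    def on_packet(self, size: int, now: float) -> None:
        # Bytes the link could have carried since the last packet,
        # minus the bytes it actually did carry:
        supply = (now - self.last_arrival) * self.line_rate
        self.sm_token += supply - size
        # Cap at 1 MTU: spare capacity is speculative, so stay conservative.
        self.sm_token = min(self.sm_token, MTU)
        self.last_arrival = now
```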