eRPC (2019)
===========

What is the premise of this paper? Why is RDMA maybe not the right solution for many problems?
  Requires re-architecting systems (e.g., FaRM instead of a normal DB)
  May require many RTTs if you chase pointers and/or need RMW
    Probably not an accident that FaRM needed function shipping
  Requires lossless networking to work well
    Specialized InfiniBand fabric may not be available everywhere
    Ethernet has PFC (priority flow control) -- what's this?
      Pause frames apply backpressure and avoid overflowing receiver bufs
      Can get head-of-line blocking, even deadlock from cycles
      Microsoft did deploy it at scale, but other cloud providers reluctant
  Limited number of connections (queue pairs) can be cached on the NIC

What assumptions does the paper make that prior systems might not?
  User-level networking access
    Primarily Mellanox libibverbs, but also works with DPDK
  Support for many and larger receive queues
    But via intelligence, not giant NIC caches
    Multi-packet RQ descriptors
      Introduced for multi-packet messages; eRPC reuses them for multiple messages
  Dynamic switch buffers can handle many-way incast with no packet loss
  Lightly loaded datacenter networks (10% utilization on average)
  Assumes small BDP: 25 Gbps edge link * 6 usec = 19 KiB
    Hardware seems to require 2 usec at the NIC + 300 ns/switch

The authors wrote 6,700 lines of C++ -- what API does it provide?
  General-purpose RPC lib for off-the-shelf software (Raft, Masstree)
  Provides at-most-once RPC semantics, survives packet loss (but expensive)
  No serialization, security, service discovery, etc.
  Might require a hint from the server about which RPCs will be fast
  Better if longer-running (e.g., SCAN) queries are moved to worker threads

How does a 1-packet RPC work, with no congestion or packet loss?
  Client puts message in msgbuf, hands it off to eRPC, zero copy
  eRPC assigns request to next slot in session
  Puts msgbuf in UD TX queue, does *not* request a TX completion event
    Uses an unconnected endpoint that can talk to multiple servers
    Why not request a completion event?
      Requires NIC to write to memory over PCIe, expensive
      (maybe occasionally request a completion to truncate the TX queue)
  Server polls RX completion queue, discovers message
    Will this RPC be fast to process?
      Yes: run RPC directly in the dispatch thread
      No: copy message from RX ring buffer to msgbuf, hand msgbuf to worker thread
    Either way: dispatch with a pre-allocated reply msgbuf
  Place reply msgbuf in TX queue (zero copy), don't request a completion event
    (maybe occasionally request a completion to truncate the TX queue)
  Client receives reply, frees request msgbuf
    Copy message from RX ring buffer to msgbuf, hand back to application
  On next request, client will tell server to advance the request window
    Allows server to reclaim the reply buffer for the slot

How does eRPC prevent packet loss? Conjunction of 3 CC mechanisms
  Never send more than 1 BDP
    With 12 MB switch buffers, can survive 12 MB / 19 KB BDP =~ 646-way incast
    Usually worry about 50-way incast
  Use session credits dispatched by the server
    Helps pace when many large messages would fill too many BDPs
  Use Timely (rate-based flow control). Generic Timely:
    RTT under 50 usec: additive increase
    RTT over 500 usec: multiplicative decrease
    In between: increase/decrease based on the rate of change of RTT

How do they actually pace the packets? Carousel (SIGCOMM '17)
  For each packet enqueued, set the earliest allowed send time
  Quantize send times into buckets in a timing wheel
  Use one wheel/core to reduce cache coherence overhead
  Much more efficient than keeping one queue per session!
What changes for a multi-packet RPC, still with no packet loss?
  Client uses scatter-gather to add headers to a contiguous message (Figure 2)
  Send also limited by *session credits* returned by the server (Figure 3)
  Server now always copies from the RX ring buffer into a msgbuf
    Message and headers interspersed in ring buffer, must make contiguous
  Server only sends one reply packet per request from the client
    Client directly paces replies with "request-for-response" messages
  Client copies reply into msgbuf, making payloads contiguous

What happens if a reply gets lost?
  Server keeps session state, including old reply buffers
  Recognizes the duplicate request, sends a duplicate response

What happens if a request gets lost?
  Client eventually times out, retransmits the request
  Can you reclaim the buffer when you get the reply as usual? No
    Problem: maybe the original request wasn't lost, retransmission still in queue
    What can we do? Drain the entire send queue
      Queue drain is more expensive than TX completion events
      But packet loss happens much less often than TX completions, so a good trade-off
  What happens if the retransmitted packet is in the rate limiter (Carousel)?
    Deleting from Carousel turns out to be complicated (roll back timing state)
    Just drop the reply and pretend it was lost

How expensive is it to implement Timely/Carousel for congestion control?
  Paper claims 9% in section 5.2 -- is that right?
    More like 20%, but when you bypass it, it's 9%
  Or would need to see experiments from a congested network
    (Maybe CPU doesn't matter if the network becomes the bottleneck)

Let's go over all the optimizations in Table 3
  Batched RTT timestamps
    Amortize the 8-nanosecond RDTSC over a batch of received packets
    (Actually matters at 800 nanoseconds/packet!)
</gr-replace>
  Timely bypass (common case of no congestion)
    If RTT < 50 usec, no congestion, can send at line rate
    Don't waste CPU time tracking the allowed rate with additive increase
  Rate limiter bypass (common case of no congestion)
    Just hand the packets directly to the NIC instead of Carousel
  Multi-packet RQ entries make the RQ really small (cache is at a premium on the NIC)
    Intended for individual large messages, but used for multiple messages
  Preallocated responses (straightforward)
  0-copy request processing
    Because single-packet requests are contiguous in memory

What eval questions should we ask?
  Is eRPC really as fast as RDMA over lossless networks?
  Is this too good to be true? What are the impacts of corner cases/deployment issues?
  Is the prototype complete enough to run real applications?

What are the e2e benchmarks?
  Raft (already discussed)
    Why are we comparing to NetChain and ZabFPGA -- what are these?
      NetChain: chain replication over programmable Tofino switches
        What's chain replication? A simple way to replicate data
          Feed data to the head, it gets passed down the chain, ack from the tail
          If the tail acks, you know all nodes have the data -- much simpler than Raft
      ZabFPGA: ZooKeeper's atomic broadcast (Zab) implemented in hardware
    Why compare to these?
  Masstree (nice, very optimized parallel key-value store)
    See section 7.2 -- compared to Cell

Multi-packet RX reduces RX queue size, not RX CQ size. Is the CQ an issue? (Appendix A)
  Only care how many packets have been received
  So could just have a single-entry RX CQ slot written over and over
  Turns out to cause contention between PCIe and cores, so use 8 entries

Discussion: does eRPC show that FaRM is stupid?
  RPC isn't distributed transactions, so maybe an apples-to-oranges comparison
  Bing used the A1 graph database, which is built on FaRM
  Probably lots of production requirements not addressed by eRPC's 6,700 SLOC
  Editorial: people make systems too general and complex
    Fewer (good) engineers can make the right trade-offs and build superior systems
  Microsoft did hire Anuj (now at OpenAI), so it obviously found eRPC compelling

Discussion: is RDMA stupid?
  Meta uses it for AI training, and doesn't even need PFC (SIGCOMM '24)
  Also seems useful for storage (Alibaba, NSDI '21)