eRPC (2019)
===========

What is the premise of this paper? Why is RDMA maybe not the right solution for many problems?
  Requires re-architecting systems (e.g., FaRM instead of a normal DB)
  May require many RTTs if you chase pointers and/or need RMW
    Probably not an accident that FaRM needed function shipping
  Requires lossless networking to work well
    Specialized InfiniBand fabric may not be available everywhere
    Ethernet has PFC (priority flow control) -- what's this?
      Pause frames apply backpressure and avoid overflowing receiver bufs
      Can get head-of-line blocking, even deadlock from cycles
      Microsoft did deploy it at scale, but other cloud providers reluctant
  Limited number of connections (queue pairs) can be cached on the NIC

What assumptions does the paper make that prior systems might not?
  User-level networking access
    Primarily Mellanox libibverbs, but also works with DPDK
  Support for many and larger receive queues
    But via intelligence, not giant NIC caches
    Multi-packet RQ descriptors
      Introduced for multi-packet messages; eRPC reuses them for multiple messages
  Dynamic switch buffers can handle many-way incast with no packet loss
  Lightly loaded datacenter networks (10% utilization on average)
  Assumes small BDP: 25 Gbps edge link * 6 usec = 19 KiB
    Hardware seems to require 2 usec at the NIC + 300 ns/switch

The authors wrote 6,700 lines of C++ -- what API does it provide?
  General-purpose RPC lib for off-the-shelf software (Raft, Masstree)
  Provides at-most-once RPC semantics, survives packet loss (but expensive)
  No serialization, security, service discovery, etc.
  Might require a hint from the server about which RPCs will be fast
  Better if longer-running (e.g., SCAN) queries are moved to worker threads

How does a 1-packet RPC work, with no congestion or packet loss?
  Client puts message in msgbuf, hands it off to eRPC, zero copy
  eRPC assigns request to next slot in session
  Puts msgbuf in UD TX queue, does *not* request a TX completion event
    Uses an unconnected endpoint that can talk to multiple servers
    Why not request a completion event?
      Requires NIC to write to memory over PCIe, expensive
      (maybe occasionally request a completion to truncate the TX queue)
  Server polls RX completion queue, discovers message
    Will this RPC be fast to process?
      Yes: run RPC directly in the dispatch thread
      No: copy message from RX ring buffer to msgbuf, hand msgbuf to worker thread
    Either way: dispatch with a pre-allocated reply msgbuf
  Place reply msgbuf in TX queue (zero copy), don't request a completion event
    (maybe occasionally request a completion to truncate the TX queue)
  Client receives reply, frees request msgbuf
    Copy message from RX ring buffer to msgbuf, hand back to application
  On next request, client will tell server to advance the request window
    Allows server to reclaim the reply buffer for the slot

How does eRPC prevent packet loss? Conjunction of 3 CC mechanisms
  Never send more than 1 BDP
    With 12 MB switch buffers, can survive 12 MB / 19 KB BDP =~ 646-way incast
    Usually worry about 50-way incast
  Use session credits dispatched by the server
    Helps pace when many large messages would fill too many BDPs
  Use Timely (rate-based flow control). Generic Timely:
    RTT under 50 usec: additive increase
    RTT over 500 usec: multiplicative decrease
    In between: increase/decrease based on the rate of change of RTT

How do they actually pace the packets? Carousel (SIGCOMM '17)
  For each packet enqueued, set the earliest allowed send time
  Quantize send times into buckets in a timing wheel
  Use one wheel/core to reduce cache coherence overhead
  Much more efficient than keeping one queue per session!
What changes for a multi-packet RPC, still with no packet loss?
  Client uses scatter-gather to add headers to a contiguous message (Figure 2)
  Send also limited by *session credits* returned by the server (Figure 3)
  Server now always copies from the RX ring buffer into a msgbuf
    Message and headers interspersed in ring buffer, must make contiguous
  Server only sends one reply packet per request from the client
    Client directly paces replies with "request-for-response" messages
  Client copies reply into msgbuf, making payloads contiguous

What happens if a reply gets lost?
  Server keeps session state, including old reply buffers
  Recognizes the duplicate request, sends a duplicate response

What happens if a request gets lost?
  Client eventually times out, retransmits the request
  Can you reclaim the buffer when you get the reply as usual? No
    Problem: maybe the original request wasn't lost, retransmission still in queue
    What can we do? Drain the entire send queue
      Queue drain is more expensive than TX completion events
      But packet loss happens much less often than TX completions, so a good trade-off
  What happens if the retransmitted packet is in the rate limiter (Carousel)?
    Deleting from Carousel turns out to be complicated (roll back timing state)
    Just drop the reply and pretend it was lost

How expensive is it to implement Timely/Carousel for congestion control?
  Paper claims 9% in section 5.2 -- is that right?
    More like 20%, but when you bypass it, it's 9%
  Or would need to see experiments from a congested network
    (Maybe CPU doesn't matter if the network becomes the bottleneck)

Let's go over all the optimizations in Table 3
  Batched RTT timestamps
    Amortize the 8-nanosecond RDTSC over a batch of received packets
    (Actually matters at 800 nanoseconds/packet!)
</gr-replace>
  Timely bypass (common case of no congestion)
    If RTT < 50 usec, no congestion, can send at line rate
    Don't waste CPU time tracking the allowed rate with additive increase
  Rate limiter bypass (common case of no congestion)
    Just hand the packets directly to the NIC instead of Carousel
  Multi-packet RQ entries make the RQ really small (cache is at a premium on the NIC)
    Intended for individual large messages, but used for multiple messages
  Preallocated responses (straightforward)
  0-copy request processing
    Because single-packet requests are contiguous in memory

What eval questions should we ask?
  Is eRPC really as fast as RDMA over lossless networks?
  Is this too good to be true? What are the impacts of corner cases/deployment issues?
  Is the prototype complete enough to run real applications?

What are the e2e benchmarks?
  Raft (already discussed)
    Why are we comparing to NetChain and ZabFPGA -- what are these?
      NetChain: chain replication over programmable Tofino switches
        What's chain replication? A simple way to replicate data
          Feed data to the head, it gets passed down the chain, ack from the tail
          If the tail acks, you know all nodes have the data -- much simpler than Raft
      ZabFPGA: ZooKeeper's atomic broadcast (Zab) implemented in hardware
    Why compare to these?
  Masstree (nice, very optimized parallel key-value store)
    See section 7.2 -- compared to Cell

Multi-packet RX reduces RX queue size, not RX CQ size. Is the CQ an issue? (Appendix A)
  Only care how many packets have been received
  So could just have a single-entry RX CQ slot written over and over
  Turns out to cause contention between PCIe and cores, so use 8 entries

Discussion: does eRPC show that FaRM is stupid?
  RPC isn't distributed transactions, so maybe an apples-to-oranges comparison
  Bing used the A1 graph database, which is built on FaRM
  Probably lots of production requirements not addressed by eRPC's 6,700 SLOC
  Editorial: people make systems too general and complex
    Fewer (good) engineers can make the right trade-offs and build superior systems
  Microsoft did hire Anuj (now at OpenAI), so it obviously found eRPC compelling

Discussion: is RDMA stupid?
  Meta uses it for AI training, and doesn't even need PFC (SIGCOMM '24)
  Also seems useful for storage (Alibaba, NSDI '21)