==== Useful RDMA Links ====

https://www.rdmamojo.com/links/
https://www.pdl.cmu.edu/PDL-FTP/Storage/rdma_bench_atc.pdf
https://arxiv.org/pdf/1905.12143.pdf

==== Lecture Notes ====

Takeaways
- Pushing hardware to its limits means understanding precisely, and thinking carefully about, what the hardware does AND what the hardware can do
- If the hardware doesn't do what you need it to do, then modify the hardware
  - e.g. this is how we get RDMA-enabled NICs

RDMA origin: let's make networking faster by cutting the CPU out of the data path.
Then we get this: RDMA has a nice permissioning property, and that lets us build a really fast consensus protocol. Not really what people were thinking about 30 years ago.
Hard to tell which came "first"; network/CPU interaction seems to have been an active topic at the time (as people figured out the ideas).

Old paper "U-Net", worth reading for historical interest
- Outlines ideas very much like RDMA
- Focus applications: high-performance computing, massively parallel computation
- Their point seems to be: "hey, instead of really fancy hardware, we can actually do even better with off-the-shelf commodity hardware"

https://blogs.nvidia.com/blog/2020/04/29/what-is-rdma/
^ claims an early RDMA system cost $5 million and hit #3 on the fastest-computer list (in 2003), compared to $350 million for the fastest. (Marketing copy, so many caveats to that.)

Our story here is essentially the opposite! We have this fancy hardware - how can we get the best performance out of it? Requires understanding precisely what the hardware can do.

What problem are they trying to solve?
- The replicated state machine problem
- Want microsecond-scale replication
- Example apps:
  - Robot control systems
  - Trading systems
  - KV stores

Why microseconds?
- Why are we thinking about "microseconds" and not "nanoseconds"?
- What's possible in a microsecond?
  - About 3000 CPU cycles @ 3 GHz
  - Computation has to be very limited
  - Mu handles only the log replication; application processing is downstream.
  - The application, of course, still needs to handle requests in O(microseconds)
  - 1 memory access ~ a few hundred cycles

Persistence?
- Something that's not possible: disk writes
  - Modern SSD random writes take tens of microseconds, at best
- Persistent memory?
  - Intel actually started making persistent memory chips sometime in 2019/2020
    - Not available on the consumer market; costs thousands of dollars (it seems?)
  - Random access latency is maybe 2-3x worse than regular DRAM
  - Obviously benefits from the CPU cache, but - to our earlier point - requires new instructions to force cache writes to persist
    - Think something like fsync, but for the CPU cache
    - Another example of the takeaway
  - Source: https://arxiv.org/pdf/1903.05714.pdf

Network accesses?
- What happens in an RDMA request?
  - CPU -> NIC
  - NIC processing
  - NIC -> other NIC
  - other NIC processing
  - other NIC <-> other memory
  - other NIC -> original NIC
  - original NIC -> CPU

NIC <-> CPU
- I couldn't find numbers, but presumably pretty fast

NIC processing
- Implements (not TCP, but) a reliable message-ordering abstraction
  - Obviously can't have the OS implement TCP here
  - The algorithms in this paper would be much more complicated without a reliable connection

NIC <-> DRAM
- These NICs have direct memory access
- About as fast as CPU <-> DRAM; again, I couldn't find numbers

NIC <-> other NIC
- The message goes through a fibre cable
- Speed of light: 3.0 * 10^8 m/s; a microsecond is 10^-6 s
- In a microsecond, light can travel 300m
- Round trip, in a fibre cable: ~100m

Why are we not talking about nanosecond (100s of nanoseconds) consensus?
- The "replicas" would basically all have to be in one physical device
- Current memory hardware: one access is ~100ns from local memory

TL;DR: O(microsecond) consensus is basically at the limit of what current electrical systems can do, and basically the limit for replication across separated physical machines.
- But maybe for some applications, 100ns is possible? E.g.
  replication of an airplane control system might need something crazy like a memory stick with its own network card.

Before we get into the system, let's talk about the RDMA spec.
- The system's design carefully uses the precise details of the RDMA spec - again, thinking about exactly what the hardware does.

==== RDMA ====

- Saw this last time

What's the interface? Queue Pairs
- Inside the NIC are "work queues"
- A QP is something like a connection
- When you create a queue pair, you specify two work queues: an outgoing "send" queue and an incoming "receive" queue (possibly the same work queue)
- We're not going to explicitly use the "receive" queue
  - It mainly exists as a way for the CPU to tell the NIC to notify it when a request comes in

What actions does the CPU put in these queues?
- Read / Write
- Also supports atomic memory operations
- Also "send"/"receive", where the receiver's CPU has to provide buffers for the NIC to copy data into

Reliable/Unreliable
- Reliable uses acknowledgements to ensure messages arrive in order
  - (Think TCP vs. UDP, but not exactly: there's no initial TCP-style handshake, and there are timeouts)
- Note: queue entries, retries/acks, and buffers all have to be implemented physically on the NIC

Connected/unconnected ("datagram")
- QPs can be linked 1-1 across NICs, or not
- Putting it together, you get RC, UC, RD, UD
  - RD is in the original RDMA spec (2003ish), but nobody implements it
- We'll use RC, and as we'll see, it's really important that we get all results back, in the right order

Memory Regions
- How do we stop accidental interference? How do we keep processes isolated from each other? Prevent accidental write conflicts?
- You have to register a memory region (MR) with the NIC to make it accessible: Register(address region)
- Give a QP an access right to an MR
- It'll be important that we can change the access rights on QPs - this means we can disable/enable them at will
- A message that comes in on a disabled QP gets an error response sent back by the NIC, not the CPU
- This will be very important for leader election, and for getting elections done quickly.
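To make the permission mechanism concrete, here is a toy model of the NIC-side check described above: a write arriving on a QP that lacks write access to the target MR is rejected by the "NIC" itself, without ever involving the host CPU. All the names here (`Nic`, `QueuePair`, `grant_write`) are hypothetical; real code would go through the libibverbs API (registering MRs, modifying QP access flags), not Python objects.

```python
# Toy model (NOT real RDMA code): the NIC filters incoming writes by
# per-QP permission, so a disabled peer gets an error reply and the
# host CPU never runs.

class MemoryRegion:
    def __init__(self, size):
        self.buf = bytearray(size)

class QueuePair:
    """One end of a reliable connection; carries a write-permission bit."""
    def __init__(self, peer_id):
        self.peer_id = peer_id
        self.write_allowed = False   # toggled by the local CPU

class Nic:
    def __init__(self, mr):
        self.mr = mr
        self.qps = {}

    def connect(self, peer_id):
        self.qps[peer_id] = QueuePair(peer_id)
        return self.qps[peer_id]

    def grant_write(self, peer_id):
        # The local CPU decides who the leader is, then flips permission
        # bits so that only the chosen replica may write.
        for qp in self.qps.values():
            qp.write_allowed = (qp.peer_id == peer_id)

    def incoming_write(self, peer_id, offset, data):
        # The NIC's job: check permission, then DMA into the MR.
        qp = self.qps[peer_id]
        if not qp.write_allowed:
            return "ERROR"           # NIC replies with an error itself
        self.mr.buf[offset:offset + len(data)] = data
        return "ACK"

nic = Nic(MemoryRegion(64))
nic.connect("replica-1")
nic.connect("replica-2")
nic.grant_write("replica-1")                          # chosen leader
print(nic.incoming_write("replica-1", 0, b"entry"))   # ACK
print(nic.incoming_write("replica-2", 0, b"entry"))   # ERROR
```

The key property being modeled: flipping which QP has write access is a purely local decision at each replica, which is what lets the protocol avoid a coordinated de-confliction step later.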
==== Microsecond consensus ====

Architecture (maybe draw a diagram?)
- Replication plane and background plane
- Replication plane
  - One QP per each other replica
  - The MR is the log buffer
- Background plane
  - One QP per each other replica
  - The MR is metadata for leader election

What's in the log at each replica?
- Log buffer
- "Minimum proposal number" - the minimum number a proposal needs in order not to conflict with a previously proposed number
- First uncommitted offset (in the log)

Easy/common case: there is a single leader, and everyone knows who the leader is.
- The leader generates proposals and adds them to the log sequentially, with increasing proposal numbers
- The leader sends out a proposal and waits for a majority of responses before sending the next
- Doesn't invoke the CPU on followers!
- So, how do replicas know when a proposal commits?
  - A replica treats proposal N as a commit message for proposal N-1.

Cost of replication? One RDMA write (in parallel across replicas). Doesn't invoke the CPU on followers.
- Note: this is not an RPC; the leader can't just say "add this to your log"
- To make this happen, the leader keeps a local copy of each follower's log offset addresses

What happens if two nodes both think they're the leader?
- Done naively, this could get messy really fast: two uncoordinated leaders writing to the same addresses will conflict with each other
- Each one maintains a local copy of a log offset (a memory address), so they'll overwrite each other's entries and overwrite min proposal numbers
- Solution: each replica locally decides who the leader is (without coordination) and configures its NIC to accept writes only from that replica (disabling write permission on all other QPs).
Observation: by offloading just a little bit of computation to the NIC, we keep the CPU out of the loop AND avoid a whole lot of protocol complexity and resource usage executing some kind of de-confliction protocol.
- As things are, we just need a very small amount of filtering in the NIC, and then we wait for new log entries to appear in memory (spinning on memory reads)
- Another aspect: the permission bit automatically signals to a leader that a follower thinks it is the leader. It signals very quickly when a node might no longer be the leader (i.e. something has gone wrong) and may need to stop proposing or re-establish itself as leader.

Leader Election
- One big source of delay in replication protocols is timeouts
  - Timeout on the leader lease
  - Timeout on message delivery: is the leader unresponsive, or is the message just delayed?
  - Don't want too many elections, as this slows down the common case
- We want fail-over to happen as quickly as possible
  - They don't get microsecond failover: ~800 microseconds
  - The biggest sources of latency are (1) failure detection (~600us) and (2) reconfiguring the NIC (~200us)

Their solution? Heartbeats.
- Every replica continually writes locally to an increasing counter
- Replicas continually check these counters on the other replicas
- On each check, if the value differs from the previous value, "liveness score++"; otherwise, "score--"
- "Live" replicas have high liveness scores
- Scores are capped between 0 and 15, so as to take only recent information into account
- Experimentally chosen thresholds: score < 2 => failed, score > 6 => valid (to dampen oscillation)
- When you need to pick a new leader, pick the valid replica with the lowest ID

Why this over leases? (make sure to ask class)
- Lease: you wait out the entire timeout before you can start electing a new leader
- Heartbeat: if network delay is stable, then time-to-detect-failure is the current network delay (plus loop spin time), not a worst-case maximum bound
- We don't have to actually compute the network delay, or even estimate it.
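The heartbeat scoring rule above is simple enough to sketch directly. This is a minimal reconstruction under the parameters the notes give (scores capped 0-15, fail below 2, valid above 6); the class structure and names are illustrative, not the paper's implementation.

```python
# Sketch of the heartbeat-based failure detector: each check compares a
# remote heartbeat counter against its last observed value and bumps a
# capped score up or down; thresholds classify the replica.

SCORE_MIN, SCORE_MAX = 0, 15
FAIL_BELOW, VALID_ABOVE = 2, 6     # experimentally chosen in the paper

class LivenessTracker:
    def __init__(self):
        self.last_counter = None
        self.score = SCORE_MAX      # start optimistic

    def check(self, counter):
        # Models one RDMA read of the remote replica's heartbeat counter.
        if counter != self.last_counter:
            self.score = min(SCORE_MAX, self.score + 1)   # it advanced: alive
        else:
            self.score = max(SCORE_MIN, self.score - 1)   # stuck: decay
        self.last_counter = counter

    @property
    def failed(self):
        return self.score < FAIL_BELOW

    @property
    def valid(self):
        return self.score > VALID_ABOVE

t = LivenessTracker()
for c in [1, 2, 3]:          # counter advancing: replica looks alive
    t.check(c)
assert t.valid and not t.failed

for _ in range(14):          # counter stuck: score decays toward 0
    t.check(3)
assert t.failed
```

The gap between the two thresholds (2 and 6) gives hysteresis: a replica that briefly stalls doesn't flap between "failed" and "valid" on every check.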
- Not necessarily "fundamentally different" from a lease - the question is how it's implemented. There's still some notion of a "timeout" built in, but we're not waiting on a single "timeout"/binary event
- We don't have to constantly request leases (saves network traffic, but more importantly saves CPU interaction)
- We could constantly write to followers to maintain a lease, but read traffic is generally cheaper than writing to memory

How do they do the leader election?
- Background plane: to request leadership, write "1" into your slot in an array, which each follower continually scans
- If a follower thinks you're acceptable as a leader, you're granted write permission on its log (and the previous leader's permission is revoked)
- If you notice the leader is down, compute who should be next. If next is you, initiate the change.

Recovery & Leader Change
- Have to first bring all replicas (a majority, at least) up to date with all committed values
- Look at all the replicas that have acked us as leader, take the highest proposed value among them, and replicate the longest log to all replicas
- Can't necessarily tell whether the highest proposed value committed or not, but that's OK (implicitly commit it)
- If some replica later acks us as leader (i.e. one that was offline), bring it up to date (essentially, copy my log to it)

==== Evaluation ====

What do you think? (ask class, have discussion)
- Normal-case latency ~1us for Mu
  - Depends on payload size
- Failure detection? ~800us to switch, or so - much faster than prior work, but the main bottleneck is still leader failure detection
- Very predictable latency-vs-throughput graph
  - Good that it's predictable - and it's predictable in a way that's good for an application, too
- Figure 7 - why don't they show batch size 1? TBH I wish they did, but maybe we can interpolate from the latency numbers?
- They claim random OS events occasionally deschedule the leader for 10s of microseconds
  - Seems like the OS scheduler should be aware of process latency requirements (this is an active area of research)
  - Makes cutting down the timeout hard.
- I wish they had explored more of the tradeoffs around setting the timeouts/liveness thresholds for leader election
  - What happens if this is run on a microsecond-app-aware scheduler? Can we cut the thresholds down and get a faster leader election?

==== Other implementation things ====

Permission management
- Several ways we could disable/enable QPs:
  - Remap the memory region
  - Disable the QP
  - Change the access flags
- What do they find? (ask class)
  - Access flags are fastest, but sometimes error out
  - Remap runtime scales linearly with size? Why? Size of the virtual address table?

How do we prevent read shear? A canary byte.
- The canary is at the END of the message. Why? Their NICs give a left-to-right write guarantee.
- Note: this guarantee is NOT in the RDMA spec! Again, the value of understanding precisely what the hardware does (it avoids computing a checksum).

Log recycling
- Replicas maintain a "min apply" counter, logging what they've sent to the application.

Leader: improving memory access patterns (ask class)
- Inline payloads vs. payloads read from memory
  - Small payloads fit within the message to the NIC, saving one memory read by the NIC
- Do you share threads or use separate threads between the application and Mu?
  - Different threads: transferring the payload from app to Mu is one cache miss (adds ~400ns overall)
  - However, they contend for CPU time - bad if the application has expensive computation to do
- Why not have applications directly send messages to the NIC (i.e. with a small amount of embedded Mu logic in the application, instead of talking to a Mu thread)?
  - We want a fixed number (usually 1) of concurrent requests outstanding, so this needs some synchronization protocol between application threads
  - Need interaction with the leadership detection module
  - Not an obvious "impossibility" result, but a lot more work, and it's not clear it would be an actual savings (given the synchronization overhead - lots of atomic memory operations that could mess with the cache)
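The canary-byte trick for read shear, described above, can be sketched as follows. This is an illustrative model, not the paper's layout: the writer puts a known canary as the LAST byte of the message, and the reader treats the buffer as complete only once the canary is visible - which is sound only because their NICs write left to right (a guarantee the RDMA spec itself does not make).

```python
# Sketch: detect torn (sheared) reads of an RDMA-written buffer by
# checking a canary byte placed at the end of the message.

CANARY = 0x7F

def write_message(buf, payload):
    """Writer side: payload bytes first, canary at the very end.

    Because the NIC writes left-to-right, the canary lands last.
    """
    buf[0:len(payload)] = payload
    buf[len(payload)] = CANARY

def try_read_message(buf, length):
    """Reader side: accept the message only if the canary is in place."""
    if buf[length] != CANARY:
        return None                  # torn/in-progress write: retry later
    return bytes(buf[0:length])

buf = bytearray(16)
payload = b"log entry"

buf[0:4] = payload[0:4]              # simulate a partially completed write
assert try_read_message(buf, len(payload)) is None

write_message(buf, payload)          # complete write, canary arrives last
assert try_read_message(buf, len(payload)) == b"log entry"
```

One caveat worth noting (my own, not from the notes): a recycled log slot must have its old canary cleared before reuse, or a stale canary could make a half-written new message look complete.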
- Pin threads to cores in the same NUMA node as the NIC
  - They complain that the OS could migrate application threads, which leads to weird/bad cache effects
  - Why not pin application threads too? You could, but this can require really invasive application modifications, and it can take a lot of engineering to figure out which threads to pin where - not every app is amenable to a fixed thread placement.

Linked in the notes (also in the citations): the authors have a separate, earlier paper describing a theoretical design for an RDMA-based replication protocol.
- Interesting to compare the theoretical abstraction to how they actually implemented the protocol
- No discussion of heartbeats, failover time, or efficiency - a very different focus of paper
- "Just unregister the MR, re-register later" - whereas these authors find that's very inefficient.