==== Useful RDMA Links ====

https://www.rdmamojo.com/links/
https://www.pdl.cmu.edu/PDL-FTP/Storage/rdma_bench_atc.pdf
https://arxiv.org/pdf/1905.12143.pdf

==== Lecture Notes ====

Takeaways
- Pushing hardware to its limits means understanding precisely, and thinking carefully about, what the hardware does AND what the hardware can do
- If the hardware doesn't do what you need it to do, then modify the hardware
  - e.g. this is how we get RDMA-enabled NICs

RDMA origin: let's make networking faster by cutting the CPU out of the data path.
Then we get this: RDMA has a nice permissioning property, and that lets us build a really fast consensus protocol. Not really what people were thinking about 30 years ago.
Hard to tell which came "first"; network/CPU interaction seems to have been an active topic at the time (as people figured out the ideas).

Old paper "U-Net", worth reading for historical interest
- Outlines ideas very much like RDMA
- Focus applications: high-performance computing, massively parallel computation
- Their point seems to be: "hey, instead of really fancy hardware, we can actually do even better with off-the-shelf commodity hardware"

https://blogs.nvidia.com/blog/2020/04/29/what-is-rdma/
^ claims an early RDMA system cost $5 million and hit #3 on the fastest-computer list (in 2003), compared to $350 million for the fastest. (Marketing copy, so many caveats to that.)

Our story here is essentially the opposite! We have this fancy hardware - how can we get the best performance out of it? Requires understanding precisely what the hardware can do.

What problem are they trying to solve?
- The replicated state machine problem
- Want microsecond-scale replication
- Example apps:
  - Robot control systems
  - Trading systems
  - KV stores

Why microseconds?
- Why are we thinking about "microseconds" and not "nanoseconds"?
- What's possible in a microsecond?
  - About 3000 CPU cycles @ 3 GHz
  - Computation has to be very limited
  - Mu handles only the log replication; application processing is downstream.
  - The application, of course, still needs to handle requests in O(microseconds)
  - 1 memory access ~ a few hundred cycles

Persistence?
- Something that's not possible: disk writes
  - Modern SSD random writes take tens of microseconds, at best
- Persistent memory?
  - Intel actually started making persistent memory chips sometime in 2019/2020
    - Not available on the consumer market; costs thousands of dollars (it seems?)
  - Random access latency is maybe 2-3x worse than regular DRAM
  - Obviously benefits from the CPU cache, but - to our earlier point - requires new instructions to force cache writes to persist
    - Think something like fsync, but for the CPU cache
    - Another example of the takeaway
  - Source: https://arxiv.org/pdf/1903.05714.pdf

Network accesses?
- What happens in an RDMA request?
  - CPU -> NIC
  - NIC processing
  - NIC -> other NIC
  - other NIC processing
  - other NIC <-> other memory
  - other NIC -> original NIC
  - original NIC -> CPU

NIC <-> CPU
- I couldn't find numbers, but presumably pretty fast

NIC processing
- Implements (not TCP, but) a reliable message-ordering abstraction
  - Obviously can't have the OS implement TCP here
  - The algorithms in this paper would be much more complicated without a reliable connection

NIC <-> DRAM
- These NICs have direct memory access
- About as fast as CPU <-> DRAM; again, I couldn't find numbers

NIC <-> other NIC
- The message goes through a fibre cable
- Speed of light: 3.0 * 10^8 m/s; a microsecond is 10^-6 s
- In a microsecond, light can travel 300m
- Round trip, in a fibre cable: ~100m

Why are we not talking about nanosecond (100s of nanoseconds) consensus?
- The "replicas" would basically all have to be in one physical device
- Current memory hardware: one access is ~100ns from local memory

TL;DR: O(microsecond) consensus is basically at the limit of what current electrical systems can do, and basically the limit for replication across separated physical machines.
- But maybe for some applications, 100ns is possible? E.g.
  replication of an airplane control system might need something crazy like a memory stick with its own network card.

Before we get into the system, let's talk about the RDMA spec.
- The system's design carefully uses the precise details of the RDMA spec - again, thinking about exactly what the hardware does.

==== RDMA ====

- Saw this last time

What's the interface? Queue Pairs
- Inside the NIC are "work queues"
- A QP is something like a connection
- When you create a queue pair, you specify two work queues: an outgoing "send" queue and an incoming "receive" queue (possibly the same work queue)
- We're not going to explicitly use the "receive" queue
  - It mainly exists as a way for the CPU to tell the NIC to notify it when a request comes in

What actions does the CPU put in these queues?
- Read / Write
- Also supports atomic memory operations
- Also "send"/"receive", where the receiver's CPU has to provide buffers for the NIC to copy data into

Reliable/Unreliable
- Reliable uses acknowledgements to ensure messages arrive in order
  - (Think TCP vs. UDP, but not exactly: there's no initial TCP-style handshake, and there are timeouts)
- Note: queue entries, retries/acks, and buffers all have to be implemented physically on the NIC

Connected/unconnected ("datagram")
- QPs can be linked 1-1 across NICs, or not
- Putting it together, you get RC, UC, RD, UD
  - RD is in the original RDMA spec (2003ish), but nobody implements it
- We'll use RC, and as we'll see, it's really important that we get all results back, in the right order

Memory Regions
- How do we stop accidental interference? How do we keep processes isolated from each other? Prevent accidental write conflicts?
- You have to register a memory region (MR) with the NIC to make it accessible: Register(address region)
- Give a QP an access right to an MR
- It'll be important that we can change the access rights on QPs - this means we can disable/enable them at will
- A message that comes in on a disabled QP gets an error response sent back by the NIC, not the CPU
- This will be very important for leader election, and for getting elections done quickly.
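To make the permission mechanism concrete, here is a toy model of the NIC-side check described above: a write arriving on a QP that lacks write access to the target MR is rejected by the "NIC" itself, without ever involving the host CPU. All the names here (`Nic`, `QueuePair`, `grant_write`) are hypothetical; real code would go through the libibverbs API (registering MRs, modifying QP access flags), not Python objects.

```python
# Toy model (NOT real RDMA code): the NIC filters incoming writes by
# per-QP permission, so a disabled peer gets an error reply and the
# host CPU never runs.

class MemoryRegion:
    def __init__(self, size):
        self.buf = bytearray(size)

class QueuePair:
    """One end of a reliable connection; carries a write-permission bit."""
    def __init__(self, peer_id):
        self.peer_id = peer_id
        self.write_allowed = False   # toggled by the local CPU

class Nic:
    def __init__(self, mr):
        self.mr = mr
        self.qps = {}

    def connect(self, peer_id):
        self.qps[peer_id] = QueuePair(peer_id)
        return self.qps[peer_id]

    def grant_write(self, peer_id):
        # The local CPU decides who the leader is, then flips permission
        # bits so that only the chosen replica may write.
        for qp in self.qps.values():
            qp.write_allowed = (qp.peer_id == peer_id)

    def incoming_write(self, peer_id, offset, data):
        # The NIC's job: check permission, then DMA into the MR.
        qp = self.qps[peer_id]
        if not qp.write_allowed:
            return "ERROR"           # NIC replies with an error itself
        self.mr.buf[offset:offset + len(data)] = data
        return "ACK"

nic = Nic(MemoryRegion(64))
nic.connect("replica-1")
nic.connect("replica-2")
nic.grant_write("replica-1")                          # chosen leader
print(nic.incoming_write("replica-1", 0, b"entry"))   # ACK
print(nic.incoming_write("replica-2", 0, b"entry"))   # ERROR
```

The key property being modeled: flipping which QP has write access is a purely local decision at each replica, which is what lets the protocol avoid a coordinated de-confliction step later.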
==== Microsecond consensus ====

Architecture (maybe draw a diagram?)
- Replication plane and background plane
- Replication plane
  - One QP per each other replica
  - The MR is the log buffer
- Background plane
  - One QP per each other replica
  - The MR is metadata for leader election

What's in the log at each replica?
- Log buffer
- "Minimum proposal number" - the minimum number a proposal needs in order not to conflict with a previously proposed number
- First uncommitted offset (in the log)

Easy/common case: there is a single leader, and everyone knows who the leader is.
- The leader generates proposals and adds them to the log sequentially, with increasing proposal numbers
- The leader sends out a proposal and waits for a majority of responses before sending the next
- Doesn't invoke the CPU on followers!
- So, how do replicas know when a proposal commits?
  - A replica treats proposal N as a commit message for proposal N-1.

Cost of replication? One RDMA write (in parallel across replicas). Doesn't invoke the CPU on followers.
- Note: this is not an RPC; the leader can't just say "add this to your log"
- To make this happen, the leader keeps a local copy of each follower's log offset addresses

What happens if two nodes both think they're the leader?
- Done naively, this could get messy really fast: two uncoordinated leaders writing to the same addresses will conflict with each other
- Each one maintains a local copy of a log offset (a memory address), so they'll overwrite each other's entries and overwrite min proposal numbers
- Solution: each replica locally decides who the leader is (without coordination) and configures its NIC to accept writes only from that replica (disabling write permission on all other QPs).
Observation: by offloading just a little bit of computation to the NIC, we keep the CPU out of the loop AND avoid a whole lot of protocol complexity and resource usage executing some kind of de-confliction protocol.
- As things are, we just need a very small amount of filtering in the NIC, and then we wait for new log entries to appear in memory (spinning on memory reads)
- Another aspect: the permission bit automatically signals to a leader that a follower thinks it is the leader. It signals very quickly when a node might no longer be the leader (i.e. something has gone wrong) and may need to stop proposing or re-establish itself as leader.

Leader Election
- One big source of delay in replication protocols is timeouts
  - Timeout on the leader lease
  - Timeout on message delivery: is the leader unresponsive, or is the message just delayed?
  - Don't want too many elections, as this slows down the common case
- We want fail-over to happen as quickly as possible
  - They don't get microsecond failover: ~800 microseconds
  - The biggest sources of latency are (1) failure detection (~600us) and (2) reconfiguring the NIC (~200us)

Their solution? Heartbeats.
- Every replica continually writes locally to an increasing counter
- Replicas continually check these counters on the other replicas
- On each check, if the value differs from the previous value, "liveness score++"; otherwise, "score--"
- "Live" replicas have high liveness scores
- Scores are capped between 0 and 15, so as to take only recent information into account
- Experimentally chosen thresholds: score < 2 => failed, score > 6 => valid (to dampen oscillation)
- When you need to pick a new leader, pick the valid replica with the lowest ID

Why this over leases? (make sure to ask class)
- Lease: you wait out the entire timeout before you can start electing a new leader
- Heartbeat: if network delay is stable, then time-to-detect-failure is the current network delay (plus loop spin time), not a worst-case maximum bound
- We don't have to actually compute the network delay, or even estimate it.
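The heartbeat scoring rule above is simple enough to sketch directly. This is a minimal reconstruction under the parameters the notes give (scores capped 0-15, fail below 2, valid above 6); the class structure and names are illustrative, not the paper's implementation.

```python
# Sketch of the heartbeat-based failure detector: each check compares a
# remote heartbeat counter against its last observed value and bumps a
# capped score up or down; thresholds classify the replica.

SCORE_MIN, SCORE_MAX = 0, 15
FAIL_BELOW, VALID_ABOVE = 2, 6     # experimentally chosen in the paper

class LivenessTracker:
    def __init__(self):
        self.last_counter = None
        self.score = SCORE_MAX      # start optimistic

    def check(self, counter):
        # Models one RDMA read of the remote replica's heartbeat counter.
        if counter != self.last_counter:
            self.score = min(SCORE_MAX, self.score + 1)   # it advanced: alive
        else:
            self.score = max(SCORE_MIN, self.score - 1)   # stuck: decay
        self.last_counter = counter

    @property
    def failed(self):
        return self.score < FAIL_BELOW

    @property
    def valid(self):
        return self.score > VALID_ABOVE

t = LivenessTracker()
for c in [1, 2, 3]:          # counter advancing: replica looks alive
    t.check(c)
assert t.valid and not t.failed

for _ in range(14):          # counter stuck: score decays toward 0
    t.check(3)
assert t.failed
```

The gap between the two thresholds (2 and 6) gives hysteresis: a replica that briefly stalls doesn't flap between "failed" and "valid" on every check.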
- Not necessarily "fundamentally different" from a lease - the question is how it's implemented. There's still some notion of a "timeout" built in, but we're not waiting on a single "timeout"/binary event
- We don't have to constantly request leases (saves network traffic, but more importantly saves CPU interaction)
- We could constantly write to followers to maintain a lease, but read traffic is generally cheaper than writing to memory

How do they do the leader election?
- Background plane: to request leadership, write "1" into your slot in an array, which each follower continually scans
- If a follower thinks you're acceptable as a leader, you're granted write permission on its log (and the previous leader's permission is revoked)
- If you notice the leader is down, compute who should be next. If next is you, initiate the change.

Recovery & Leader Change
- Have to first bring all replicas (a majority, at least) up to date with all committed values
- Look at all the replicas that have acked us as leader, take the highest proposed value among them, and replicate the longest log to all replicas
- Can't necessarily tell whether the highest proposed value committed or not, but that's OK (implicitly commit it)
- If some replica later acks us as leader (i.e. one that was offline), bring it up to date (essentially, copy my log to it)

==== Evaluation ====

What do you think? (ask class, have discussion)
- Normal-case latency ~1us for Mu
  - Depends on payload size
- Failure detection? ~800us to switch, or so - much faster than prior work, but the main bottleneck is still leader failure detection
- Very predictable latency-vs-throughput graph
  - Good that it's predictable - and it's predictable in a way that's good for an application, too
- Figure 7 - why don't they show batch size 1? TBH I wish they did, but maybe we can interpolate from the latency numbers?
- They claim random OS events occasionally deschedule the leader for 10s of microseconds
  - Seems like the OS scheduler should be aware of process latency requirements (this is an active area of research)
  - Makes cutting down the timeout hard.
- I wish they had explored more of the tradeoffs around setting the timeouts/liveness thresholds for leader election
  - What happens if this is run on a microsecond-app-aware scheduler? Can we cut the thresholds down and get a faster leader election?

==== Other implementation things ====

Permission management
- Several ways we could disable/enable QPs:
  - Remap the memory region
  - Disable the QP
  - Change the access flags
- What do they find? (ask class)
  - Access flags are fastest, but sometimes error out
  - Remap runtime scales linearly with size? Why? Size of the virtual address table?

How do we prevent read shear? A canary byte.
- The canary is at the END of the message. Why? Their NICs give a left-to-right write guarantee.
- Note: this guarantee is NOT in the RDMA spec! Again, the value of understanding precisely what the hardware does (it avoids computing a checksum).

Log recycling
- Replicas maintain a "min apply" counter, logging what they've sent to the application.

Leader: improving memory access patterns (ask class)
- Inline payloads vs. payloads read from memory
  - Small payloads fit within the message to the NIC, saving one memory read by the NIC
- Do you share threads or use separate threads between the application and Mu?
  - Different threads: transferring the payload from app to Mu is one cache miss (adds ~400ns overall)
  - However, they contend for CPU time - bad if the application has expensive computation to do
- Why not have applications directly send messages to the NIC (i.e. with a small amount of embedded Mu logic in the application, instead of talking to a Mu thread)?
  - We want a fixed number (usually 1) of concurrent requests outstanding, so this needs some synchronization protocol between application threads
  - Need interaction with the leadership detection module
  - Not an obvious "impossibility" result, but a lot more work, and it's not clear it would be an actual savings (given the synchronization overhead - lots of atomic memory operations that could mess with the cache)
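The canary-byte trick for read shear, described above, can be sketched as follows. This is an illustrative model, not the paper's layout: the writer puts a known canary as the LAST byte of the message, and the reader treats the buffer as complete only once the canary is visible - which is sound only because their NICs write left to right (a guarantee the RDMA spec itself does not make).

```python
# Sketch: detect torn (sheared) reads of an RDMA-written buffer by
# checking a canary byte placed at the end of the message.

CANARY = 0x7F

def write_message(buf, payload):
    """Writer side: payload bytes first, canary at the very end.

    Because the NIC writes left-to-right, the canary lands last.
    """
    buf[0:len(payload)] = payload
    buf[len(payload)] = CANARY

def try_read_message(buf, length):
    """Reader side: accept the message only if the canary is in place."""
    if buf[length] != CANARY:
        return None                  # torn/in-progress write: retry later
    return bytes(buf[0:length])

buf = bytearray(16)
payload = b"log entry"

buf[0:4] = payload[0:4]              # simulate a partially completed write
assert try_read_message(buf, len(payload)) is None

write_message(buf, payload)          # complete write, canary arrives last
assert try_read_message(buf, len(payload)) == b"log entry"
```

One caveat worth noting (my own, not from the notes): a recycled log slot must have its old canary cleared before reuse, or a stale canary could make a half-written new message look complete.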
- Pin threads to cores in the same NUMA node as the NIC
  - They complain that the OS could migrate application threads, which leads to weird/bad cache effects
  - Why not pin application threads too? You could, but this can require really invasive application modifications, and it can take a lot of engineering to figure out which threads to pin where - not every app is amenable to a fixed thread placement.

Linked in the notes (also in the citations): the authors have a separate, earlier paper describing a theoretical design for an RDMA-based replication protocol.
- Interesting to compare the theoretical abstraction to how they actually implemented the protocol
- No discussion of heartbeats, failover time, or efficiency - a very different focus of paper
- "Just unregister the MR, re-register later" - whereas these authors find that's very inefficient.