===================================
Replication in the Harp File System
===================================

Outline basic operation.
  Client, primary, backup, witness.
  Voting. Reply message. Log.

Why does Harp have so many log pointers?
  CP   commit point (real in primary, latest heard in backup)
  AP   highest record sent to disk on this node
  LB   disk has completed up to here
  GLB  all nodes have completed disk up to here?

At the primary, what does it mean for an event to be before/after CP?
  Before CP: n ACKs received, so can reply to client and apply to disk.
  After CP: still waiting for ACKs from other machines.

How about at the backup?
  CP is the latest commit point it has heard from the primary.
  AP(backup) <= AP(primary) -- why?
    In case applying the change would cause a machine to crash.

Why AP -- why not apply to disk as committed, at CP?
And why is LB not AP-1?
  Want to issue asynchronous disk requests for better performance.
  Structured as separate processes:
    "apply process" issues async I/O
    separate process updates LB when I/Os finish

Why do we need GLB?
  Allows us to discard log entries.
  Why not discard at LB?
    In case another node lost its log, but its disk is OK.
    Though doesn't UPS protect against that?
    No: crashes due to software will lose the log.

When can Harp reclaim log space?
  Ordinarily, can clean up before GLB.
  But if witness is promoted, it keeps the complete log. Why?
    Does not have state to apply log entries to, so it preserves the complete history.

Why does Harp even need a log at all?
  Need state of partially completed operations, i.e. those being committed.
  But mostly to allow concurrent operations.
  Linear because operations must be ordered.
    If OP1 precedes OP2, can't commit OP2 but not OP1 (if crashes, etc.).

What if power failure, operation committed, but not done writing?
  All nodes lose power.
  Already replied to client, can't forget about committed operations.
  UPS...

What is the point of the witness if it doesn't store data?
  Breaks ties, ensures majority partition.

If primary fails, what does witness do?
  Promoted to pseudo-backup.
  Has no copy of the file system.
  Logs all messages, even before GLB. On disk/tape.
  What is the point?
    Witness has the log required to bring the old primary up to date when it recovers.
    And the new primary might fail.
      We cannot continue then, since we don't have a majority.
      But we can still restore stable storage when someone recovers.
      Assuming they recover with disk intact.

Why is serving reads just on the primary complex?
  What if primary just became a minority partition?
  Read will miss committed writes...

Could Harp operate over a WAN?
Or do the machines have to be in the same building?
  Extra round trips might make it painful over a high-latency, low-bandwidth network.

What exactly is the UPS for?
  I think only for simultaneous power failure.
  They don't depend on it to recover from partial failure.
    Other nodes' logs are enough for that.

Do we even believe the UPS story?
  What if the UPS battery runs out?
    They flush to disk and halt (?) when main power fails.
    So not as vulnerable as a RAID controller battery.

What exactly are the failures they can survive?
  One node permanently fails, or loses network connection.
  Network separates all nodes, then they re-join?
  Witness and backup permanently fail?
  All nodes reboot w/o losing any non-volatile data?

Why not just use one server with a UPS?
  Perhaps with a RAID array.

Does Harp have performance benefits?
  Yes, due to UPS, no need for sync disk writes.
  But in general, not 3x performance.
  But maybe if you had 3 file systems, could get near 3x performance.
  Or at least use the witness for something useful.
  Or buy a much cheaper machine for the witness.

Why graph x=load y=response-time?
  Why does this graph make sense?
  Why not just graph total time to perform X operations?
  One reason is that systems sometimes get more/less efficient w/ high load.
  And we care a lot how they perform w/ overload.

Why does response time go up with load?
  Why first gradual...
    Queuing and random bursts?
    And some ops are more expensive than others, causing temporary delays.
  Then almost straight up?
    Probably hits hard limits, like disk I/Os per second.
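
One way to see the "gradual, then almost straight up" shape is a single-server queueing model. The sketch below assumes an M/M/1 queue (Poisson arrivals, exponentially distributed service times), where mean response time is 1 / (service rate - arrival rate). That model is an assumption made here for illustration, not something the notes or Harp's measurements establish, and SERVICE_RATE is a made-up number standing in for a hard limit such as disk I/Os per second.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time in seconds for an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        # At or past the hard limit the queue grows without bound.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0  # hypothetical hard limit, in operations per second

for load in (10, 50, 80, 90, 95, 99):
    t = mm1_response_time(load, SERVICE_RATE)
    print(f"load={load:3d} ops/s  mean response={t * 1000:7.1f} ms")
```

At low load the response time is close to the bare service time. As load approaches the limit, the denominator shrinks toward zero and the curve turns nearly vertical, which matches the shape the questions above are probing.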
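
The log pointers from the start of these notes (CP, AP, LB, GLB) can be sketched as a toy model of one node. This is a simplified illustration, not Harp's implementation: the class and method names are hypothetical, everything runs in one thread, and "discard at or below GLB" is this sketch's reading of "can clean up before GLB".

```python
class LogPointers:
    """Toy model of one node's log pointers (pointer names from the notes)."""

    def __init__(self):
        self.records = {}  # index -> operation, for entries not yet discarded
        self.top = 0       # index of the newest appended record
        self.cp = 0        # commit point
        self.ap = 0        # highest record whose disk write has been issued
        self.lb = 0        # disk writes completed up to here on this node
        self.glb = 0       # disk writes completed up to here on all nodes

    def append(self, op):
        self.top += 1
        self.records[self.top] = op
        return self.top

    def commit_through(self, index):
        # Enough ACKs arrived: everything up to index is committed.
        self.cp = max(self.cp, min(index, self.top))

    def issue_writes(self):
        # "Apply process": start async disk I/O for committed records only.
        issued = list(range(self.ap + 1, self.cp + 1))
        self.ap = self.cp
        return issued

    def io_completed_through(self, index):
        # Separate step: advance LB as I/Os finish, never past AP.
        self.lb = max(self.lb, min(index, self.ap))

    def learn_lbs(self, other_lbs):
        # GLB is the minimum LB across all nodes, including this one.
        self.glb = max(self.glb, min([self.lb, *other_lbs]))

    def discard(self):
        # Reclaim log space for records every node has on disk.
        for i in [i for i in self.records if i <= self.glb]:
            del self.records[i]

    def check(self):
        assert self.glb <= self.lb <= self.ap <= self.cp <= self.top


log = LogPointers()
for op in ("write A", "write B", "write C"):
    log.append(op)
log.commit_through(2)        # records 1 and 2 are committed
log.issue_writes()           # async I/O issued for 1 and 2
log.io_completed_through(1)  # only record 1 has reached this node's disk
log.learn_lbs([2])           # another node reports LB = 2
log.discard()
log.check()
print(sorted(log.records))   # [2, 3]
```

Record 2 is committed and on the other node's disk, but it stays in the log because this node's own LB is still 1. That is the case GLB is there to cover: a record is dropped only when no node could still need it from someone else's log.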