Harp (1991)
===========

What are the goals of this work?
  Replicated network file system server with high
    Durability, Availability, Performance
  Backwards compatibility with existing clients
  Backwards compatibility with existing on-disk file systems (VFS layer)

Why not just put an existing NFS server implementation behind Raft?
  Straw man: feed a Raft log of NFS requests to vanilla NFS server code
  File systems are big; 2f+1 copies is expensive for surviving f faults
  Semantics/reliability might not be as good
    Harp can survive simultaneous power failure (UPS not optional)
    Various sources of non-determinism in NFS: file times, maybe allocation
    Want resistance to "packets of death"
  Does Harp maintain ordering of overlapping operations (A-linearizability)?
    Maybe, if you use TCP or UDP and there is no packet loss
    Maybe not, if the NFS client implementation uses multiple TCP connections
  Client compatibility?  Maybe okay with a server-side NFS proxy
  Performance might not be good
    File system layouts are heavily optimized for access patterns
    Raft's logging might compete for the disk arm at bad times
    Good throughput requires concurrent ops (e.g., for disk arm scheduling)
      But concurrency introduces non-determinism
    Somehow need to work client caching into the system
  Also want:
    More availability during view changes
      Can bring rebooted nodes mostly up to date before interrupting service
    Optimizations for file reads (primary "lease", update atime in background)

What is normal-case operation (no failures)?
  One designated primary, f backups, and f witnesses
  Client sends a request to the primary
  Primary multicasts the request to the backups
  Backups log the request and reply to the primary
  Primary waits for acks from all f backups, then replies to the client
  In the background, the primary tells the backups that the operation committed
  Witnesses send/receive no messages; they do absolutely nothing

Logs are often below the VFS layer (e.g., XFS, ext3)--why does Harp log above it?
  Logging is super fast because it doesn't go to disk
    Relies on a UPS and a kernel hack to avoid wiping state in a "soft crash"
  Helps ensure concurrent ops are applied in the same order everywhere
  Logs make it easy to bring machines up to date after a network partition
  Weird hack in case the FS code is vulnerable to a "packet of death" (Sec 4.3)
    Apply at the primary first
    Primary crashes?  View change before applying at a backup
    View change causes the witness to log, so no lost state if a backup then crashes

What are all the different log pointers held at each node (Fig 4-1)?
  top - most recent log entry
  CP  - most recently committed operation (primary + all backups have it)
  AP  - most recently applied operation at the local node (but its disk write might be pending)
  LB  - most recent operation such that it and all prior changes have reached the disk
  GLB - largest index that bounds LB at all nodes from below
  Invariant: GLB <= LB <= AP <= CP <= top
    (a code sketch of these pointers follows this block)
  What operations lie between top and CP?
    Concurrent pending operations--what would cause this?
      NFS clients use kernel threads to issue concurrent operations
      Also, multiple clients could be accessing the server
  Why not have CP == AP (apply as soon as all backups ack)?
    Ops are applied in the background by a separate apply process
    Lowers client latency--the primary replies as soon as CP passes the request
  Why not just delay advancing AP until the operations are on disk, so LB == AP?
    Lets writes be async (AP advanced when writes are issued, not when they complete)
      For performance, want multiple concurrent disk writes from the apply process
      So the apply process needs to remember where it is (AP)
    Also, want the primary and backups to be able to overlap writes
    The "packet of death" trick requires waiting for the primary to advance AP, not LB
  Why track the GLB rather than just deleting the log below LB?
    Can't throw away log records before all backups have applied them
    Might need the log to bring other nodes up to date
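A minimal Go sketch of these log pointers and the order in which they are allowed to advance. This is not Harp's code: the type, method names, and transitions are invented for illustration, and a real replica would also persist these values and exchange LB/GLB information with its peers.

  package harplog

  import "fmt"

  // LogPointers tracks the per-node log indices from Fig 4-1.
  type LogPointers struct {
      Top int // most recent log entry (may not be committed yet)
      CP  int // commit point: primary and all backups hold everything <= CP
      AP  int // application point: handed to the FS; disk write may still be pending
      LB  int // lower bound: this op and everything before it has reached the disk
      GLB int // global lower bound: bounds LB at every node; log below it can be discarded
  }

  // check enforces GLB <= LB <= AP <= CP <= Top after every transition.
  func (p *LogPointers) check() {
      if !(p.GLB <= p.LB && p.LB <= p.AP && p.AP <= p.CP && p.CP <= p.Top) {
          panic(fmt.Sprintf("pointer invariant violated: %+v", *p))
      }
  }

  // Append: the primary adds a client op to the log; only Top moves.
  func (p *LogPointers) Append() int {
      p.Top++
      p.check()
      return p.Top
  }

  // Commit: all f backups have acked through index, so the primary may
  // reply to clients whose requests are at or below CP.
  func (p *LogPointers) Commit(index int) {
      if index > p.CP && index <= p.Top {
          p.CP = index
      }
      p.check()
  }

  // Apply: the background apply process hands entry AP+1 to the file
  // system; the write may still be buffered, so LB does not move yet.
  func (p *LogPointers) Apply() {
      if p.AP < p.CP {
          p.AP++
      }
      p.check()
  }

  // DiskDone: the file system confirms everything through index is on
  // disk; GLB advances separately, once every node has reported its LB.
  func (p *LogPointers) DiskDone(index int) {
      if index > p.LB && index <= p.AP {
          p.LB = index
      }
      p.check()
  }

The point of the sketch is just that each pointer chases the one above it, which is exactly the invariant GLB <= LB <= AP <= CP <= top.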
What is needed for correctness in the face of server failures/view changes?
  Must ensure only one view is active at a time (use a majority)
  Must ensure no events are lost across view changes (agree on the last op)
  Must ensure f+1 copies of all committed events (promote witnesses)

What are the steps of a view change?
  Phase 0: nodes bring themselves up to date after recovery
    Suffered a disk failure + replacement?  Have to copy state from a non-witness
    No disk failure?  Just get the log from a witness
  Phase 1: coordinator solicits votes
    At this point, Harp stops processing new client requests
    Nodes ahead of the coordinator send it their missing log entries
    Need f+1 nodes for the new view, so can't lose a committed op from the last view
  Phase 2: coordinator informs nodes of the new view
    Coordinator writes the new view number to disk
    Coordinator brings other nodes up to date if they are missing log entries
      If promoting a witness, must send it everything since the GLB
    Other nodes write a log entry to disk
    Witnesses are demoted if enough backups are in the new view

What is the difference between a promoted witness and a normal backup?
  A promoted witness has no file system state, just the log since the node failure
    Hence, it can never truncate its log

What could go wrong if witnesses didn't log, but just voted on view changes?
  Would prevent simultaneously active views, but not prevent data loss
  E.g., designated primary P, designated backup B, designated witness W
    A network partition causes B & W to form a view and execute some requests
    B crashes and loses its disk
    The partition heals; P & W form a view, but the ops from the previous view are lost

How are read-only requests handled specially? (p. 5)
  Primary can reply immediately without talking to the backups
  "they are serialized at the CP"--why?
    Anything past CP might not complete successfully after a view change
      E.g., a backup never heard of it, then formed a view with the witness
      Meanwhile the client timed out and gave up
    Don't want to show clients a write that never happened!
  But what if a backup and witness have already formed a new view?
    The primary of the old view could violate linearizability--how?
      The new view could complete a write operation
      Meanwhile, the primary of the old view replies to a read request with stale data
    How to avoid this?  Assume roughly synchronized clocks and give the primary a lease
      In each reply, backups promise not to form a new view until T = now+delta
      Primary can reply to RO requests until T-epsilon (epsilon is the clock skew)
      This kind of promise is known as a *lease*
    Could you make leases depend on *drift* instead of *skew*?
      Primary sends its timestamp T_p to a backup, which receives it at local time T_r
      Backup promises a lease until T_p+delta, but holds the promise until T_r+delta
      Safer (more conservative), but must lengthen delta by the network latency
      Most systems do leases this way
    (see the lease sketch below, after the atime question)

Is reading a file a read-only file system operation?
  Not technically, because you are supposed to update the atime
  But maybe we don't care if the atime isn't updated in exceptional circumstances
  So Harp offers a second, "weaker" mode of operation
    Allows a small chance that an atime update could get lost during a failure
  Is this reasonable?  Probably--Linux relatime (the default) is arguably weaker
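A minimal Go sketch contrasting the two lease calculations discussed above (skew-based vs. drift-based). The constants delta, epsilon, and driftPPM are invented assumptions for this sketch; a real system would configure or measure its skew, drift, and latency bounds.

  package lease

  import "time"

  // Assumed bounds, invented for this sketch.
  const (
      delta    = 5 * time.Second        // lease length a backup grants
      epsilon  = 100 * time.Millisecond // assumed bound on clock skew between nodes
      driftPPM = 200.0                  // assumed bound on relative clock drift rate (ppm)
  )

  // Skew-based lease: in each reply the backup promises, on its own clock,
  // not to form a new view before T = now + delta.  The primary serves
  // local read-only requests only until T - epsilon, so a clock that is
  // ahead or behind by at most epsilon cannot overrun the promise.
  func skewLeaseExpiry(backupNow time.Time) time.Time {
      return backupNow.Add(delta).Add(-epsilon)
  }

  // Drift-based lease: the primary stamps its message with its own clock
  // reading Tp; the backup holds its promise until Tr + delta, where Tr is
  // its local receive time.  Because the backup starts counting only after
  // the message was sent, the primary just has to allow for how far the two
  // clocks can drift apart over an interval of length delta, not for
  // absolute skew.  The usable window shrinks by the network latency, which
  // is why delta must be lengthened to compensate.
  func driftLeaseExpiry(Tp time.Time) time.Time {
      maxDrift := time.Duration(2 * float64(delta) * driftPPM / 1e6)
      return Tp.Add(delta - maxDrift)
  }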
Is the file system VFS layer a deterministic replicated finite state machine?
  I.e., apply the same operation to two identical file systems, get the same result?
  No, because file times may not be identical
  How does Harp address this?  Change the VFS to allow the caller to specify the mtime (p. 9)

What's in an event record (Sec 4.4-4.5)?
  A call to execute on the file system state
  Enough information to describe the "outcome" of the call completely
  Why is the outcome needed?
    E.g., if you change file permissions: write A (permission denied), chmod A
      If you replay this log a second time, you get a different result
    E.g., for file creation (Sec 4.4, p. 8):
      Directory state is more than just a file name -> inode mapping
        At the time, you could read raw directory structures with "cat /dir/"
        Even today, files are stored in a particular order
        Cookies/offsets are used to encode entry positions
      All of these things depend on the order of directory operations
        create B, create A, unlink A, create A, unlink B
          != unlink A, create A, unlink B
        Replay just the last 3 and you will get a different result
      Note: a simpler example might just be the inode number or NFS generation number
  Multiple things could go wrong here:
    Harp has to reply to requests at CP, not AP
      So it needs to be able to predict the outcome before applying the event record
    Harp must ensure backups maintain identical state
  What's the solution?
    Keep shadow state for inodes and directory pages (Sec 4.4, p. 9)
    Also need to violate VFS layering for directory ops, to apply the outcome

Why is fsck (the scavenger utility) a problem? (Sec 4.3, p. 8)
  It completely messes up outcomes such as positions of directory entries
  Solution?  Don't run fsck, because you have a log

How to organize multiple file systems with Harp?
  Spread the load so nodes are primary, backup, and witness for different FSes

What happens to duplicate RPCs (same xids, Sec 4.5)?
  In a single NFS server, the RPC layer caches replies to avoid re-executing requests
  Harp must embed RPC xids in op results so a new primary can rebuild the reply cache
    (see the sketch below)
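A minimal Go sketch of rebuilding a duplicate-suppression ("replay") cache from the committed log, as a newly promoted primary might after a view change. The types and field names are invented; real NFS duplicate caches typically key on more than the xid (client address, program, procedure), and Harp's event records carry more than is shown here.

  package replay

  // EventRecord is one committed log entry: the NFS call plus enough of
  // its predicted outcome to reproduce the reply, keyed by the RPC xid.
  type EventRecord struct {
      Xid     uint32 // RPC transaction id from the client
      Op      string // e.g., "CREATE", "WRITE"
      Outcome []byte // encoded reply (status, new inode/generation number, ...)
  }

  // BuildReplyCache scans the committed prefix of the log so a newly
  // promoted primary can answer retransmitted requests without re-executing them.
  func BuildReplyCache(committedLog []EventRecord) map[uint32][]byte {
      cache := make(map[uint32][]byte, len(committedLog))
      for _, rec := range committedLog {
          cache[rec.Xid] = rec.Outcome // a later duplicate overwrites an earlier one
      }
      return cache
  }

  // Lookup returns the cached reply for a duplicate xid, if any.
  func Lookup(cache map[uint32][]byte, xid uint32) ([]byte, bool) {
      reply, ok := cache[xid]
      return reply, ok
  }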
What is the comparison point for the evaluation?  A single NFS server
  Is this a fair comparison point?
    Could also have compared to a single NFS server with the UPS hack
  Pro argument:
    People tolerate existing NFS performance
    Harp shows availability/reliability is possible at existing performance
    Don't need to beat a hypothetical world's-fastest NFS server to be useful
  Con argument:
    Maybe the UPS story is scary
      What if the UPS fails and you don't learn about it until it's too late?
      What if the log is somehow corrupted during a reboot?
    Today's datacenters often do have UPSes, so in retrospect it looks okay

Why graph x=load, y=response-time?  Why does this graph make sense?
  Why not just graph the total time to perform X operations?
    One reason is that systems sometimes get more or less efficient with high load
    And we care a lot about how they perform under overload
  Why does response time go up with load?
    Why gradually at first?  Queuing and random bursts?
      And some ops are more expensive than others, causing temporary delays
    Then almost straight up?  Probably hits hard limits, like disk I/Os per second

What is multicast and why does it help?
  At the time, Ethernet was mostly a shared medium (coax), like WiFi today
    Everyone sees all packets and ignores the ones not destined for them
    IP multicast maps down to Ethernet multicast addresses
  With multicast, can send to additional recipients for free
    E.g., use a multicast address that the backup and a hot witness are listening on
    Could also use one for the primary, so clients don't notice a change of IP address
      Problem: possibly not enough multicast buckets
    Today, would probably use a virtual MAC address (+IP) held by the active primary
      E.g., VRRP: https://datatracker.ietf.org/doc/html/rfc5798
    (a small multicast sketch appears at the end of these notes)

Footnote: can leverage network properties to simplify consensus; see NOPaxos
  https://www.usenix.org/system/files/conference/osdi16/osdi16-li.pdf
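A minimal Go sketch of the multicast idea using only the standard library: the sender (think primary) transmits one datagram to a group address that a backup and a hot witness could both have joined. The group address, port, and payload are arbitrary examples, and whether the sender sees its own packet locally depends on the OS multicast-loopback setting.

  package main

  import (
      "log"
      "net"
      "time"
  )

  func main() {
      // Example administratively-scoped group address; not Harp's.
      group := &net.UDPAddr{IP: net.ParseIP("239.1.2.3"), Port: 9999}

      // Receiver side (a backup or hot witness would do this): join the group.
      conn, err := net.ListenMulticastUDP("udp4", nil, group)
      if err != nil {
          log.Fatal(err)
      }
      defer conn.Close()
      go func() {
          buf := make([]byte, 1500)
          n, src, err := conn.ReadFromUDP(buf)
          if err == nil {
              log.Printf("received %d bytes from %v: %q", n, src, buf[:n])
          }
      }()

      // Sender side (the primary): a single Write reaches every group member.
      out, err := net.DialUDP("udp4", nil, group)
      if err != nil {
          log.Fatal(err)
      }
      defer out.Close()
      if _, err := out.Write([]byte("append log entry 42")); err != nil {
          log.Fatal(err)
      }

      time.Sleep(500 * time.Millisecond) // give the receiver goroutine a moment
  }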