Replication in the Harp File System
===================================

What is the goal of the project?
  Network file system that can remain available when a server crashes
  Compatibility with existing NFS clients

What is the basic Harp architecture?
  Use three servers: Primary, Backup, Witness
  Outfit machines with small UPSs
  Have a log region of virtual memory that can survive a soft reboot
    Hope simultaneous crashes will be unlikely
  Presents an NFS interface to clients
  Clients send requests to the primary server, or to the backup (acting as
      primary) if the primary is down
    Could use multicast, or insert a small redirection layer at the client
    Today: could play tricks with Ethernet ARP (CARP, VRRP)
  When a server fails, change roles with a "view change"
    If the primary is dead, the backup takes over as the new primary
    Witness becomes like a new backup

Why do we need three servers?
  Otherwise, might have two inconsistent servers after a network partition
  But note the witness doesn't need a copy of the file system
    Cheaper machine; its role is primarily to break ties after a partition

What happens during normal-case operation for, e.g., a WRITE?
  Client sends NFS WRITE request to the primary
  Primary places an event record in the log, relays the record to the backup
  Backup also adds the event record to its log, sends an ACK to the primary
  When the primary gets the ACK, it COMMITS the operation by advancing CP
    (CP = commit point, index of last committed message)
  Primary replies to the client
  Note: all messages from primary to backup contain the CP
    A backup keeps its own CP, the max CP it has seen from the primary
  (A sketch of this write path appears below, after the view-change discussion)

How do log records get applied to file system state?
  Background process applies committed log records to disk
    Most recent record applied (disk write issued) is AP = application point <= CP
  Once a record's writes to the file system complete, advance LB (lower bound)
  All messages also contain the sender's LB
  Also keep track of GLB (lower bound of all known LBs)

Reads: suppose the primary just answered w/o logging or contacting the backup?
  Problem 1: atime on files might be wrong
    This is probably tolerable; happens anyway with the NFS client cache
  Problem 2: violation of external consistency
    - Network partitions
    - Backup becomes primary
    - Client in backup's partition performs a write
    - Client in primary's partition reads stale data
  What's an example where this would be a problem?
    I call you, say "just changed paper, print latest version"
    You print and get the old version
    Note we communicated *outside* the file system (hence "external" consistency)
  Solution: primary essentially holds a read lease from the backup
    Every message from the backup says:
      "I promise not to form a new view until time now + d"
    So the primary can answer reads locally as long as it's before
      now + d - max_clock_skew

What events trigger a new view change?
  Primary fails or becomes unreachable
  Backup fails or becomes unreachable
  Previously down/unreachable node re-joins the system

What if the witness fails?
  No view change needed--primary and backup continue

Suppose the primary drops off the network.  What happens?
  Backup discovers the missing primary, agrees with the Witness to form a new view
    Stop accepting new user requests for the duration of the view change
  Backup sends all log records after GLB to the Witness
  Witness is "promoted" to act as the new backup
    Witness *never* applies the log, since it has no file system state
    Therefore it can never truncate the log (until the primary recovers)
    But it writes log records to stable storage (tape)
    Witness's LB is the last log record on stable storage
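To pull the pieces above together, here is a minimal sketch (in Go) of the
primary's write path and the log pointers.  It is an illustration under
assumptions, not Harp's implementation: Server, EventRecord, handleWrite, and
sendToBackup are hypothetical names, the RPC is a stand-in function, and
failure handling and the witness are left out.

  // Minimal sketch of the primary's write path and the log pointers above.
  // Not Harp's code: names and the RPC stand-in are hypothetical.
  package main

  import "fmt"

  // EventRecord is one logged NFS operation (hypothetical shape).
  type EventRecord struct {
      Index int    // position in the log
      Op    string // e.g., "WRITE"
  }

  // Server holds the log and the three pointers discussed above.
  type Server struct {
      log []EventRecord
      CP  int // commit point: last record known to be in the backup's log
      AP  int // application point: last record whose disk write was issued (AP <= CP)
      LB  int // lower bound: last record whose disk write has completed (LB <= AP)
  }

  // handleWrite shows the normal case: log locally, replicate to the backup,
  // advance CP on the backup's ACK, then reply to the client.  sendToBackup
  // stands in for the real RPC; it returns once the backup has appended the
  // record to its own log.
  func (s *Server) handleWrite(op string, sendToBackup func(EventRecord) bool) string {
      rec := EventRecord{Index: len(s.log) + 1, Op: op}
      s.log = append(s.log, rec)
      if !sendToBackup(rec) {
          return "backup unreachable (would trigger a view change)"
      }
      s.CP = rec.Index // committed: safe to reply even though nothing is on disk yet
      return fmt.Sprintf("reply to client (committed at CP=%d)", s.CP)
  }

  // applyOne models the background apply process: issue the disk write for the
  // next committed record (advance AP); LB advances once that I/O completes.
  func (s *Server) applyOne() {
      if s.AP < s.CP {
          s.AP++      // async disk I/O issued here
          s.LB = s.AP // in reality LB advances later, when the I/O finishes
      }
  }

  func main() {
      s := &Server{}
      ack := func(EventRecord) bool { return true } // pretend the backup ACKs
      fmt.Println(s.handleWrite("WRITE fh=3 off=0", ack))
      s.applyOne()
      fmt.Printf("CP=%d AP=%d LB=%d\n", s.CP, s.AP, s.LB)
  }

The backup runs the symmetric path: append the record, ACK, and learn the
primary's CP and LB from later messages; GLB is the minimum of all the LBs a
node has seen.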
Then suppose the primary recovers.  What happens?
  Was there a media failure?
    If so, first transfer state from the Backup, which is "acting primary"
  Backup and Witness agree to form a new view with the Primary
    Stop accepting new user requests for the duration of the view change
  Witness sends its log records to the primary, which the primary applies
    If no media failure, recall the Witness has everything from GLB forward
    Backup only replied to a client after an ACK from the Witness
    So the primary can just apply the log to its state

What's in a log record?
  Straw man: log the actual NFS request arguments
  Problem 1: not all operations are idempotent
    E.g., client re-transmits a MKDIR RPC, gets NFSERR_EXIST
  Problem 2: an operation's applicability may depend on state
    E.g., A issues an unauthorized WRITE operation
          B subsequently grants write permission to A
          Crash after already applying both operations
    When re-applying the log, A's write will succeed the second time
  Problem 3: similar to above, but with internal state
    E.g., CREATE B, CREATE A, (*) REMOVE A, CREATE A, REMOVE B
    All 5 applied, but after a crash, end up re-doing from (*)
    A will move from the second to the first slot in the directory list
    Why do we care?  It might cause a READDIR on the client to miss file A
      Recall in a large directory, the readdir "cookie" says where to read from

What needs to be done to fix the above problems?
  Log the RPC XID and result of non-idempotent operations
    So after a view change the new primary can populate the replay cache
  Log whether or not an operation will succeed
  Include "shadow copies" of inodes & directory blocks in log records

When can Harp reclaim log space?
  If Primary and Backup both exist, can discard log entries up to GLB
  What if the Primary or Backup is down?
    GLB = min(Witness LB, acting primary LB)
    So a recovering node *must* transfer the log from the Witness
  What if the Primary goes down, then the Witness, then the Primary recovers?
    Can't make progress without transferring whole state from the Backup

What exactly are the failures they can survive?
  One node permanently fails, or loses its network connection?
    OK, until the Witness's log fills up
  Network separates all nodes, then they re-join?
    OK
  Witness and backup experience media failure?
    OK
  Primary and backup experience media failure?
    Not OK, though could maybe reconstruct from the complete log if on tape
  All nodes reboot w/o losing any non-volatile data?
    OK
    Since a power outage is a likely cause of simultaneous failure, use UPSs
  All nodes reboot and lose non-volatile data?
    Problematic... may have replied to a client for an operation that is now lost

When might a simultaneous primary/backup failure with a lost log occur?
  Software bugs--don't want to trust the logs if there's buggy code
  What do they do to mitigate this?
    AP = application point, most recently applied log record
      (LB <= AP, since data is written back to disk in the background)
    Backup only applies an operation once the primary has successfully applied it
      I.e., AP_backup <= AP_primary
  So what happens when a client sends a "Packet of Death"?
    Primary sends it to the backup, gets an ACK, increments CP, replies to client
    Primary applies the operation, and crashes
    Backup and Witness form a new view
    Backup transfers the log (including the PoD) to the Witness
    Backup eventually applies the PoD, crashes
    Note the Witness never applies, so it will keep the PoD in its log
    Will at least make it easier to identify and debug the problem
      (And no lost work, even if the bug trashes the log)
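A minimal sketch of the apply-lag rule (AP_backup <= AP_primary), assuming each
message from the primary reports the primary's own AP; Backup, applyUpTo, and
primaryAP are hypothetical names, not Harp's.  The point is only that the
backup never applies a record the primary has not already applied, so a packet
of death crashes the primary first and is still sitting, un-applied, in the
backup's and witness's logs.

  // Sketch of the apply-lag rule; names are hypothetical, not Harp's.
  package main

  import "fmt"

  type Backup struct {
      log []string
      CP  int // last committed record, learned from the primary
      AP  int // last record this backup has applied
  }

  // applyUpTo applies committed records, but never past the primary's AP.
  func (b *Backup) applyUpTo(primaryAP int, apply func(string) error) {
      limit := b.CP
      if primaryAP < limit {
          limit = primaryAP
      }
      for b.AP < limit {
          next := b.log[b.AP] // log slot b.AP holds record number b.AP+1
          if err := apply(next); err != nil {
              return // a real crash would halt the backup here
          }
          b.AP++
      }
  }

  func main() {
      b := &Backup{log: []string{"WRITE ok", "PACKET OF DEATH"}, CP: 2}
      // The primary has only applied record 1 so far, so the backup stops there.
      b.applyUpTo(1, func(r string) error { fmt.Println("apply:", r); return nil })
      fmt.Println("backup AP =", b.AP) // 1: the poison record stays un-applied
  }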
Why does Harp even need a log at all?  What if the witness kept complete state?
  Logs make incremental transfers possible--otherwise recovery is very expensive
  Logs also make it easy to agree on the state of the system
    LB, AP, etc., tell you whether two nodes have the same state, or one is ahead
  Logs make it possible to recover from a simultaneous power failure
    Well, sort of... Harp's logs are in volatile memory
    But the log memory survives a soft reboot, plus there's the UPS
    Probably the UPS has a serial port, so get the signal, write the log to
      disk, shut down
  Otherwise, what if we crash in the middle of a complex operation like rename?
    Primary and backup both crash, fsck, have different state.  Which to use?

How fast can a view change take place?
  Bottleneck will be transferring the log to the Witness
  They suggest solving this with a warm standby... what's this?
    Send all events to the Witness, but don't wait for its ACK
    Upon primary/backup failure, the Witness likely has most/all needed records

Why AP -- why not apply to disk as committed, at CP?  And why is LB not AP-1?
  Want to issue asynchronous disk requests for better performance
  Structured as separate processes:
    an "apply process" issues the async I/O
    a separate process updates LB as I/Os finish

Could Harp operate over a WAN?  Or do the machines have to be in the same building?
  The whole point is that a network RTT is faster than a disk write
  Extra round trips are painful over a high-latency network

Does Harp have performance benefits?
  Yes: due to the UPS, no need for synchronous disk writes
  But in general, not 3x performance
    Though maybe if you had 3 file systems, could get near 3x performance
    Or at least use the witness for something useful
    Or buy a much cheaper machine for the witness

Why graph x=load, y=response-time?  Why does this graph make sense?
  Why not just graph total time to perform X operations?
  One reason is that systems sometimes get more/less efficient under high load
  And we care a lot about how they perform under overload

Why does response time go up with load?
  Why first gradual?
    Queuing and random bursts?
    And some ops are more expensive than others, causing temporary delays
  Then almost straight up?
    Probably has hard limits, like disk I/Os per second
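That load vs. response-time shape is the classic queuing curve.  A minimal
illustration, assuming a single server modeled as an M/M/1 queue with a
capacity of 100 operations/second (made-up numbers, not Harp's measurements):
response time barely moves at low load, then shoots up as the offered load
approaches the server's hard limit.

  // Illustration of the queuing intuition above; an assumed M/M/1 model,
  // not data from the Harp paper.
  package main

  import "fmt"

  func main() {
      const mu = 100.0 // assumed capacity: operations per second
      for _, lambda := range []float64{10, 50, 80, 90, 95, 99} {
          // Mean response time of an M/M/1 queue: 1 / (mu - lambda).
          resp := 1.0 / (mu - lambda)
          fmt.Printf("offered load %3.0f ops/s -> mean response %6.1f ms\n",
              lambda, resp*1000)
      }
  }

The "almost straight up" region of the paper's graphs is the offered load
running into that hard limit (e.g., disk I/Os per second).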