Replication in the Harp File System
===================================

What is the goal of the project?
  Network file system that can remain available when a server crashes
  Compatibility with existing NFS clients

What is the basic Harp architecture?
  Use three servers: Primary, Backup, Witness
  Outfit machines with small UPSs
  Have a log region of virtual memory that can survive a soft reboot
    Hope simultaneous crashes will be unlikely
  Presents an NFS interface to clients
  Clients send requests to the primary server, or to the backup (acting as
      primary) if the primary is down
    Could use multicast, or insert a small redirection layer at the client
    Today: could play tricks with Ethernet ARP (CARP, VRRP)
  When a server fails, change roles with a "view change"
    If the primary is dead, the backup takes over as the new primary
    Witness becomes like a new backup

Why do we need three servers?
  Otherwise, might have two inconsistent servers after a network partition
  But note the witness doesn't need a copy of the file system
    Cheaper machine; its role is primarily to break ties after a partition

What happens during normal-case operation for, e.g., a WRITE?
  Client sends NFS WRITE request to the primary
  Primary places an event record in the log, relays the record to the backup
  Backup also adds the event record to its log, sends an ACK to the primary
  When the primary gets the ACK, it COMMITS the operation by advancing CP
    (CP = commit point, index of last committed message)
  Primary replies to the client
  Note: all messages from primary to backup contain the CP
    A backup keeps its own CP, the max CP it has seen from the primary
  (A sketch of this write path appears below, after the view-change discussion)

How do log records get applied to file system state?
  Background process applies committed log records to disk
    Most recent record applied (disk write issued) is AP = application point <= CP
  Once a record's writes to the file system complete, advance LB (lower bound)
  All messages also contain the sender's LB
  Also keep track of GLB (lower bound of all known LBs)

Reads: suppose the primary just answered w/o logging or contacting the backup?
  Problem 1: atime on files might be wrong
    This is probably tolerable; happens anyway with the NFS client cache
  Problem 2: violation of external consistency
    - Network partitions
    - Backup becomes primary
    - Client in backup's partition performs a write
    - Client in primary's partition reads stale data
  What's an example where this would be a problem?
    I call you, say "just changed paper, print latest version"
    You print and get the old version
    Note we communicated *outside* the file system (hence "external" consistency)
  Solution: primary essentially holds a read lease from the backup
    Every message from the backup says:
      "I promise not to form a new view until time now + d"
    So the primary can answer reads locally as long as it's before
      now + d - max_clock_skew

What events trigger a new view change?
  Primary fails or becomes unreachable
  Backup fails or becomes unreachable
  Previously down/unreachable node re-joins the system

What if the witness fails?
  No view change needed--primary and backup continue

Suppose the primary drops off the network.  What happens?
  Backup discovers the missing primary, agrees with the Witness to form a new view
    Stop accepting new user requests for the duration of the view change
  Backup sends all log records after GLB to the Witness
  Witness is "promoted" to act as the new backup
    Witness *never* applies the log, since it has no file system state
    Therefore it can never truncate the log (until the primary recovers)
    But it writes log records to stable storage (tape)
    Witness's LB is the last log record on stable storage
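To pull the pieces above together, here is a minimal sketch (in Go) of the
primary's write path and the log pointers.  It is an illustration under
assumptions, not Harp's implementation: Server, EventRecord, handleWrite, and
sendToBackup are hypothetical names, the RPC is a stand-in function, and
failure handling and the witness are left out.

  // Minimal sketch of the primary's write path and the log pointers above.
  // Not Harp's code: names and the RPC stand-in are hypothetical.
  package main

  import "fmt"

  // EventRecord is one logged NFS operation (hypothetical shape).
  type EventRecord struct {
      Index int    // position in the log
      Op    string // e.g., "WRITE"
  }

  // Server holds the log and the three pointers discussed above.
  type Server struct {
      log []EventRecord
      CP  int // commit point: last record known to be in the backup's log
      AP  int // application point: last record whose disk write was issued (AP <= CP)
      LB  int // lower bound: last record whose disk write has completed (LB <= AP)
  }

  // handleWrite shows the normal case: log locally, replicate to the backup,
  // advance CP on the backup's ACK, then reply to the client.  sendToBackup
  // stands in for the real RPC; it returns once the backup has appended the
  // record to its own log.
  func (s *Server) handleWrite(op string, sendToBackup func(EventRecord) bool) string {
      rec := EventRecord{Index: len(s.log) + 1, Op: op}
      s.log = append(s.log, rec)
      if !sendToBackup(rec) {
          return "backup unreachable (would trigger a view change)"
      }
      s.CP = rec.Index // committed: safe to reply even though nothing is on disk yet
      return fmt.Sprintf("reply to client (committed at CP=%d)", s.CP)
  }

  // applyOne models the background apply process: issue the disk write for the
  // next committed record (advance AP); LB advances once that I/O completes.
  func (s *Server) applyOne() {
      if s.AP < s.CP {
          s.AP++      // async disk I/O issued here
          s.LB = s.AP // in reality LB advances later, when the I/O finishes
      }
  }

  func main() {
      s := &Server{}
      ack := func(EventRecord) bool { return true } // pretend the backup ACKs
      fmt.Println(s.handleWrite("WRITE fh=3 off=0", ack))
      s.applyOne()
      fmt.Printf("CP=%d AP=%d LB=%d\n", s.CP, s.AP, s.LB)
  }

The backup runs the symmetric path: append the record, ACK, and learn the
primary's CP and LB from later messages; GLB is the minimum of all the LBs a
node has seen.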
Then suppose the primary recovers.  What happens?
  Was there a media failure?
    If so, first transfer state from the Backup, which is "acting primary"
  Backup and Witness agree to form a new view with the Primary
    Stop accepting new user requests for the duration of the view change
  Witness sends its log records to the primary, which the primary applies
    If no media failure, recall the Witness has everything from GLB forward
    Backup only replied to a client after an ACK from the Witness
    So the primary can just apply the log to its state

What's in a log record?
  Straw man: log the actual NFS request arguments
  Problem 1: not all operations are idempotent
    E.g., client re-transmits a MKDIR RPC, gets NFSERR_EXIST
  Problem 2: an operation's applicability may depend on state
    E.g., A issues an unauthorized WRITE operation
          B subsequently grants write permission to A
          Crash after already applying both operations
    When re-applying the log, A's write will succeed the second time
  Problem 3: similar to above, but with internal state
    E.g., CREATE B, CREATE A, (*) REMOVE A, CREATE A, REMOVE B
    All 5 applied, but after a crash, end up re-doing from (*)
    A will move from the second to the first slot in the directory list
    Why do we care?  It might cause a READDIR on the client to miss file A
      Recall in a large directory, the readdir "cookie" says where to read from

What needs to be done to fix the above problems?
  Log the RPC XID and result of non-idempotent operations
    So after a view change the new primary can populate the replay cache
  Log whether or not an operation will succeed
  Include "shadow copies" of inodes & directory blocks in log records

When can Harp reclaim log space?
  If Primary and Backup both exist, can discard log entries up to GLB
  What if the Primary or Backup is down?
    GLB = min(Witness LB, acting primary LB)
    So a recovering node *must* transfer the log from the Witness
  What if the Primary goes down, then the Witness, then the Primary recovers?
    Can't make progress without transferring whole state from the Backup

What exactly are the failures they can survive?
  One node permanently fails, or loses its network connection?
    OK, until the Witness's log fills up
  Network separates all nodes, then they re-join?
    OK
  Witness and backup experience media failure?
    OK
  Primary and backup experience media failure?
    Not OK, though could maybe reconstruct from the complete log if on tape
  All nodes reboot w/o losing any non-volatile data?
    OK
    Since a power outage is a likely cause of simultaneous failure, use UPSs
  All nodes reboot and lose non-volatile data?
    Problematic... may have replied to a client for an operation that is now lost

When might a simultaneous primary/backup failure with a lost log occur?
  Software bugs--don't want to trust the logs if there's buggy code
  What do they do to mitigate this?
    AP = application point, most recently applied log record
      (LB <= AP, since data is written back to disk in the background)
    Backup only applies an operation once the primary has successfully applied it
      I.e., AP_backup <= AP_primary
  So what happens when a client sends a "Packet of Death"?
    Primary sends it to the backup, gets an ACK, increments CP, replies to client
    Primary applies the operation, and crashes
    Backup and Witness form a new view
    Backup transfers the log (including the PoD) to the Witness
    Backup eventually applies the PoD, crashes
    Note the Witness never applies, so it will keep the PoD in its log
    Will at least make it easier to identify and debug the problem
      (And no lost work, even if the bug trashes the log)
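A minimal sketch of the apply-lag rule (AP_backup <= AP_primary), assuming each
message from the primary reports the primary's own AP; Backup, applyUpTo, and
primaryAP are hypothetical names, not Harp's.  The point is only that the
backup never applies a record the primary has not already applied, so a packet
of death crashes the primary first and is still sitting, un-applied, in the
backup's and witness's logs.

  // Sketch of the apply-lag rule; names are hypothetical, not Harp's.
  package main

  import "fmt"

  type Backup struct {
      log []string
      CP  int // last committed record, learned from the primary
      AP  int // last record this backup has applied
  }

  // applyUpTo applies committed records, but never past the primary's AP.
  func (b *Backup) applyUpTo(primaryAP int, apply func(string) error) {
      limit := b.CP
      if primaryAP < limit {
          limit = primaryAP
      }
      for b.AP < limit {
          next := b.log[b.AP] // log slot b.AP holds record number b.AP+1
          if err := apply(next); err != nil {
              return // a real crash would halt the backup here
          }
          b.AP++
      }
  }

  func main() {
      b := &Backup{log: []string{"WRITE ok", "PACKET OF DEATH"}, CP: 2}
      // The primary has only applied record 1 so far, so the backup stops there.
      b.applyUpTo(1, func(r string) error { fmt.Println("apply:", r); return nil })
      fmt.Println("backup AP =", b.AP) // 1: the poison record stays un-applied
  }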
Why does Harp even need a log at all?  What if the witness kept complete state?
  Logs make incremental transfers possible--otherwise recovery is very expensive
  Logs also make it easy to agree on the state of the system
    LB, AP, etc., tell you whether two nodes have the same state, or one is ahead
  Logs make it possible to recover from a simultaneous power failure
    Well, sort of... Harp's logs are in volatile memory
    But the log memory survives a soft reboot, plus there's the UPS
    Probably the UPS has a serial port, so get the signal, write the log to
      disk, shut down
  Otherwise, what if we crash in the middle of a complex operation like rename?
    Primary and backup both crash, fsck, have different state.  Which to use?

How fast can a view change take place?
  Bottleneck will be transferring the log to the Witness
  They suggest solving this with a warm standby... what's this?
    Send all events to the Witness, but don't wait for its ACK
    Upon primary/backup failure, the Witness likely has most/all needed records

Why AP -- why not apply to disk as committed, at CP?  And why is LB not AP-1?
  Want to issue asynchronous disk requests for better performance
  Structured as separate processes:
    an "apply process" issues the async I/O
    a separate process updates LB as I/Os finish

Could Harp operate over a WAN?  Or do the machines have to be in the same building?
  The whole point is that a network RTT is faster than a disk write
  Extra round trips are painful over a high-latency network

Does Harp have performance benefits?
  Yes: due to the UPS, no need for synchronous disk writes
  But in general, not 3x performance
    Though maybe if you had 3 file systems, could get near 3x performance
    Or at least use the witness for something useful
    Or buy a much cheaper machine for the witness

Why graph x=load, y=response-time?  Why does this graph make sense?
  Why not just graph total time to perform X operations?
  One reason is that systems sometimes get more/less efficient under high load
  And we care a lot about how they perform under overload

Why does response time go up with load?
  Why first gradual?
    Queuing and random bursts?
    And some ops are more expensive than others, causing temporary delays
  Then almost straight up?
    Probably has hard limits, like disk I/Os per second
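That load vs. response-time shape is the classic queuing curve.  A minimal
illustration, assuming a single server modeled as an M/M/1 queue with a
capacity of 100 operations/second (made-up numbers, not Harp's measurements):
response time barely moves at low load, then shoots up as the offered load
approaches the server's hard limit.

  // Illustration of the queuing intuition above; an assumed M/M/1 model,
  // not data from the Harp paper.
  package main

  import "fmt"

  func main() {
      const mu = 100.0 // assumed capacity: operations per second
      for _, lambda := range []float64{10, 50, 80, 90, 95, 99} {
          // Mean response time of an M/M/1 queue: 1 / (mu - lambda).
          resp := 1.0 / (mu - lambda)
          fmt.Printf("offered load %3.0f ops/s -> mean response %6.1f ms\n",
              lambda, resp*1000)
      }
  }

The "almost straight up" region of the paper's graphs is the offered load
running into that hard limit (e.g., disk I/Os per second).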