Harp
====

What are the goals of this work?
  Replicated network file system server with high
    Durability, Availability, Performance
  Backwards compatibility with existing clients
  Backwards compatibility with existing on-disk file systems (VFS layer)

Why not just use viewstamped replication to implement the file server?
  File systems are big; 2f+1 copies is expensive for surviving f faults
  Sacrifices compatibility with existing file systems
  Performance might not be good
    File system layouts heavily optimized for access patterns
    Somehow need to work client caching into system
    VR's aid (action id) tracking is overkill, would likely hurt performance
  More availability during view changes
    Can bring rebooted nodes mostly up to date before interrupting service
  Optimizations for file reads (update atime in background)
  Semantics might not be as good
    Harp can survive simultaneous power failure (UPS not optional)
    Harp maintains ordering of overlapping operations
      Implies A-linearizability with TCP (or w/ UDP if no packet loss)
    File systems not allowed to abort transactions like VR module groups
  Client compatibility?  Maybe okay with server-side NFS proxy

What is normal-case operation (no failures)?
  One designated primary, f backups, f witnesses
  Client sends request to primary
  Primary multicasts requests to backups
  Backups log requests, reply to primary
  Primary waits for acks from all f backups, then replies to client
  In background, primary tells backups operation committed
  Witnesses send/receive no messages, do absolutely nothing

Logs often appear below the VFS layer (e.g., in ext3)--why does Harp use a log?
  Logging is super fast because it doesn't go to disk
    Rely on UPS and kernel hack to avoid wiping state in "soft crash"
  Helps ensure concurrent ops are applied in the same order everywhere
  Logs make it easy to bring machines up to date after network partition
  Weird hack in case FS code vulnerable to "packet of death" (Sec 4.3)
    Apply at primary first
    Primary crashes?  View change before applying at backup
    View change causes witness to log, so no lost state if backup crashes

What are all the different log pointers held at each node (Fig 4-1)?
  top - most recent log entry
  CP  - most recently committed (primary + all backups have it)
  AP  - most recently applied at local node
  LB  - most recent operation such that it + all prior changes applied to disk
  GLB - largest index that bounds LB at all nodes
  Invariant: GLB <= LB <= AP <= CP <= top

What operations lie between top and CP?
  Concurrent pending operations
    NFS clients use kernel threads to issue concurrent operations
    Also, could have multiple clients accessing server

Why not have CP == AP (apply as soon as all backups ack)?
  Ops are applied in background by a separate apply process
  Lowers client latency--primary replies as soon as CP passes request
  All writes async (AP advanced when writes issued, not when completed)

Why not just delay advancing AP until operations are on disk, so LB == AP?
  For performance, want multiple concurrent disk writes from apply process
    So apply process needs to remember where it is (AP)
  Also, want primary and backup to be able to overlap writes
  "Packet of death" trick requires waiting for primary to apply op

Why track the GLB?  Could we just delete the log below LB?
  Can't throw away log records before all backups apply them
    Might need log to bring other nodes up to date
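A minimal sketch of the pointer bookkeeping, assuming a primary that waits
for f backup acks (invented for these notes; harpLog, appendOp, onBackupAck
are made-up names, not code from the paper):

  package main

  import "fmt"

  // A minimal sketch of Harp's per-node log pointers (Fig 4-1).
  // harpLog, appendOp, onBackupAck, invariantHolds are invented names.
  type harpLog struct {
      f    int         // number of backups that must ack each entry
      top  int         // index of most recent log entry
      cp   int         // commit point: primary + all backups hold entries <= cp
      ap   int         // applied point: entries <= ap handed to the local FS
      lb   int         // entries <= lb are on the local disk
      glb  int         // min of lb across nodes; log below glb may be discarded
      acks map[int]int // log index -> number of backup acks so far
  }

  // appendOp logs a new client operation at the primary and returns its index.
  func (l *harpLog) appendOp() int {
      l.top++
      l.acks[l.top] = 0
      return l.top
  }

  // onBackupAck records one backup's ack; once all f backups have acked the
  // next entry, CP advances and the primary may reply to that client.
  func (l *harpLog) onBackupAck(index int) {
      l.acks[index]++
      for l.cp < l.top && l.acks[l.cp+1] == l.f {
          l.cp++
      }
  }

  // invariantHolds checks GLB <= LB <= AP <= CP <= top.
  func (l *harpLog) invariantHolds() bool {
      return l.glb <= l.lb && l.lb <= l.ap && l.ap <= l.cp && l.cp <= l.top
  }

  func main() {
      l := &harpLog{f: 2, acks: map[int]int{}}
      op := l.appendOp() // primary logs a client write
      l.onBackupAck(op)  // first backup acks: not yet committed
      l.onBackupAck(op)  // second backup acks: committed, reply to client
      fmt.Println("CP =", l.cp, "invariant holds:", l.invariantHolds())
  }

Committing (advancing CP) is just counting acks; applying (AP), writing to
disk (LB), and log garbage collection (GLB) all lag behind in the background.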
What is needed for correctness in the face of server failures/view changes?
  Must ensure only one view active at a time (use majority)
  Must ensure no events lost across view changes (agree on last op)
  Must ensure f+1 copies of all committed events (promote witnesses)

What are the steps of a view change?
  Phase 0: Nodes bring themselves up to date after recovery
    Suffered disk failure + replacement?  Have to copy state from non-witness
    No disk failure?  Just get log from witness
  Phase 1: Coordinator solicits votes
    At this point, Harp stops processing new client requests
    Nodes ahead of coordinator send it missing log entries
    Need f+1 nodes for new view, so can't lose committed op from last view
  Phase 2: Coordinator informs nodes of new view
    Coordinator writes new view number to disk
    Coordinator brings other nodes up to date if missing log entries
      If promoting witness, must send it everything since GLB
    Other nodes write log entry to disk
    Witnesses demoted if enough backups in new view

What is the difference between a promoted witness and a normal backup?
  Promoted witness has no file system state, just the log accumulated since
    the backup it stands in for failed
  Hence, can never truncate the log

What could go wrong if witnesses didn't log, but just voted on view changes?
  Would prevent simultaneous active views, but not prevent data loss
  E.g., designated primary P, designated backup B, designated witness W
    Network partition causes B & W to form a view, execute some requests
    B crashes and loses its disk
    Network partition heals, P & W form view, but lost ops from previous view

How are read-only requests handled specially? (p. 5)
  Primary can reply immediately without talking to backups
  "they are serialized at the CP"--why?
    Anything past CP might not complete successfully after view change
      E.g., backup never heard of it, then formed view with witness
      Meanwhile client timed out and gave up
    Don't want to show clients a write that never happened!
  But what if backup and witness already formed a new view?
    Primary of old view could violate linearizability--how?
      New view could complete a write operation
      Meanwhile, primary of old view replies to read request with stale data
    How to avoid?  Assume roughly synchronized clocks for primary lease
      Backups promise not to form new view for delta milliseconds

Is reading a file a read-only file system operation?
  Not technically, because you are supposed to update the atime
  But maybe we don't care if atime not updated in exceptional circumstances
  So Harp offers a second "weaker" mode of operation
    Allows small chance that atime update could get lost with failure

Is the file system VFS layer a deterministic replicated finite state machine?
  I.e., apply same operation on two identical file systems, get same result?
  No, because file times may not be identical
  How does Harp address this?  Change VFS to let caller specify mtime (p. 9)

What's in an event record (Sec 4.4-4.5)?
  A call to execute on the file system state
  Enough information to describe the "outcome" of the call completely
  Why is this tricky for file permissions?
    E.g., write A (permission denied), chmod A
    If you replay this log a second time, get a different result
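A toy sketch of the permission case, assuming event records carry the outcome
decided at first execution (file, eventRecord, applyRecorded, applyNaive are
invented names, not Harp's code):

  package main

  import "fmt"

  // Toy illustration of the write-then-chmod problem (invented for these
  // notes): event records must capture the outcome (was the write allowed?),
  // because re-checking permissions at replay time can disagree with the
  // original execution.

  type file struct {
      writable bool
      data     string
  }

  type eventRecord struct {
      op      string // "write" or "chmod"
      arg     string
      allowed bool // outcome recorded when the op was first executed
  }

  // applyRecorded replays an event using its recorded outcome: a write that
  // was denied originally stays denied, no matter what the current mode is.
  func applyRecorded(f *file, e eventRecord) {
      switch e.op {
      case "write":
          if e.allowed {
              f.data = e.arg
          }
      case "chmod":
          f.writable = true
      }
  }

  // applyNaive re-evaluates permissions against the current state, which can
  // differ when a log suffix is replayed on state that already saw the chmod.
  func applyNaive(f *file, e eventRecord) {
      switch e.op {
      case "write":
          if f.writable {
              f.data = e.arg
          }
      case "chmod":
          f.writable = true
      }
  }

  func main() {
      // Original execution: the file was not writable, so the write was
      // denied (allowed=false recorded), then the chmod made it writable.
      log := []eventRecord{
          {op: "write", arg: "hello", allowed: false},
          {op: "chmod", arg: "A"},
      }
      // Replay both entries on a node whose state already reflects the chmod.
      start := file{writable: true}
      rec, naive := start, start
      for _, e := range log {
          applyRecorded(&rec, e)
          applyNaive(&naive, e)
      }
      // Recorded outcome keeps the write denied; naive replay lets it succeed.
      fmt.Printf("recorded: %q  naive: %q\n", rec.data, naive.data)
  }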
Why is this tricky for file creation (Sec 4.4, p. 8)?
  Directory state is more than just a file name -> inode mapping
    At the time, you could read raw directory structures with "cat /dir/"
    Even today, files stored in a particular order
    Cookies/offsets used to encode entry positions
  All of these things depend on the order of directory operations
    create B, create A, unlink A, create A, unlink B
      != unlink A, create A, unlink B
    Replay just the last 3 and you will get a different result
    Note a simpler example might just be the inode number
  Multiple things could go wrong here:
    Harp has to reply to requests at CP, not AP
      So needs to be able to predict outcome before applying event record
    Harp must ensure backups maintain identical state
  What's the solution?
    Keep shadow state for inodes and directory pages (Sec 4.4, p. 9)
    Also presumably need to bypass VFS layer for directory ops as well

Why is fsck a problem? (Sec 4.3, p. 8)
  Completely messes up outcomes like position in directory entry

How to organize multiple file systems with Harp?
  Spread the load so nodes are primary, backup, witness for different FSes

What happens to duplicate RPCs (same xids, Sec 4.5)?
  In a single NFS server, should cache reply to avoid re-executing
  Harp must embed RPC xids in replies so new primary can build replay cache

What is the comparison point for evaluation?
  Single NFS server
Is this a fair comparison point?
  Could also have compared to single NFS server with UPS hack
  Pro argument:
    People tolerate existing NFS performance
    Harp shows availability/reliability is possible at existing performance
    Don't need to beat hypothetical world's fastest NFS server to be useful
  Con argument:
    Maybe the UPS story is scary
      What if UPS fails and you don't learn until it's too late?
      What if log somehow corrupted during a reboot?
  Today's datacenters do often have UPSes, so in retrospect looks okay

Why graph x=load, y=response-time?  Why does this graph make sense?
  Why not just graph total time to perform X operations?
  One reason is that systems sometimes get more/less efficient w/ high load
  And we care a lot how they perform w/ overload
Why does response time go up with load?
  Why gradual at first?
    Queuing and random bursts?
    And some ops more expensive than others, cause temporary delays
  Then almost straight up?
    Probably has hard limits, like disk I/Os per second
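A back-of-the-envelope M/M/1 queuing sketch of that shape (not from the
paper; mu = 100 ops/sec and the load levels are made-up numbers):

  package main

  import "fmt"

  // Back-of-the-envelope M/M/1 queuing model (not from the paper): a single
  // server (think: a disk bounded by I/Os per second) completes mu ops/sec;
  // with arrivals at lambda ops/sec, mean response time is 1/(mu - lambda).
  func main() {
      const mu = 100.0 // service capacity, ops/sec (made-up number)
      for _, lambda := range []float64{10, 50, 80, 90, 95, 99} {
          resp := 1.0 / (mu - lambda) // seconds per op, queueing included
          fmt.Printf("load %5.1f%%  response time %7.1f ms\n",
              100*lambda/mu, 1000*resp)
      }
      // As lambda approaches mu the denominator goes to zero: queues grow
      // without bound, which is the nearly vertical part of the graph.
  }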