No compromises (FaRM)
=====================

[Note: some details filled in from the previous NSDI'14 FaRM paper]

What is non-volatile DRAM?
  Provides durable memory without the cost of synchronous disk/SSD writes
  Key idea: use a "distributed UPS"
    Leverage existing cloud hardware specs, e.g., Open CloudServer (OCS)
      12U chassis holds 24 machines with shared power supplies
    So put Li-ion batteries + an SSD in each chassis; dump memory to SSD
      on a power outage
  Plausible argument this is cost-effective (adds 15% over base DRAM cost)

Why aren't traditional data-center UPSes adequate for the task?
  Giant centralized lead-acid batteries carry the load while switching
    over to auxiliary generators
  But a single rack or machine can still lose power if the wrong cable is
    yanked
    (Power-failure notification might not get through)

What is RDMA?
  NIC support for access to memory on another machine
    Request does not need to go through the kernel on the client machine
    Request does not need to go through the CPU at all on the server machine!
  Uses reliable connections called *queue pairs* between machines
  Regions must be registered on the host offering remote access
    Registration pins memory; the NIC maintains page-table-like structures
    Limited on-NIC memory to cache page mappings, else fetched with DMA
      Using 2GB pages helps fit all the mappings in the cache
    Each region is named by a *region capability*
  Also true on the platform used by the authors:
    DMA operations are cache coherent
    Multi-cache-line RDMA writes (but not reads) land in address order
  Note: "one-sided RDMA operation" means the CPU is used on only one side
    There is still a network-level round trip to return the write ack or
      read data

Let's build fast messaging on RDMA + circular buffers (sketch below):
  Allocate a circular buffer on the receiver; zero out all its memory
  Sender & receiver each maintain a head pointer; the sender also keeps
    the tail pointer
  1. Sender ensures the message would not wrap past the head (else waits)
  2. Sender RDMA-writes {Length || Message || Trailer} at the tail
  3. Receiver polls at the head until Length is non-zero
  4. Receiver polls at head+Length until the Trailer is non-zero
  5. Receiver processes the message, then zeros out its memory
  6. Halfway through the buffer, the receiver uses RDMA to update the
     head pointer at the sender

Can use this for really fast RPC--so why not just use RPC for remote objects?
  Step 5 consumes CPU on the server; the CPU becomes the bottleneck
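
  A minimal C++ sketch of the receiver's side of this scheme (all names
  and the 8-byte length word are illustrative assumptions, not FaRM's
  actual code; the sender side, cache-line alignment, and wraparound
  handling are elided):

    // Receiver side of the RDMA ring-buffer messaging scheme.  The
    // sender RDMA-writes {Length || Message || Trailer} at its tail;
    // because multi-cache-line RDMA writes land in address order on
    // this platform, a non-zero Trailer implies the whole message
    // has arrived.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct RingBuffer {
        uint8_t *buf;      // registered RDMA region, zero-initialized
        size_t   size;     // total buffer size
        size_t   head = 0; // receiver's read position
    };

    inline uint64_t load_volatile64(const uint8_t *p) {
        return *reinterpret_cast<const volatile uint64_t *>(p);
    }

    // Poll for one message and hand it to handle(payload, len).
    template <typename Handler>
    void receive_one(RingBuffer &rb, Handler handle) {
        uint8_t *slot = rb.buf + rb.head;

        // Step 3: spin until the length word becomes non-zero.
        uint64_t len;
        while ((len = load_volatile64(slot)) == 0) { /* spin */ }

        // Step 4: spin on the trailer byte just past the message;
        // in-order delivery means the body is complete once the
        // trailer is non-zero.
        volatile uint8_t *trailer = slot + 8 + len;
        while (*trailer == 0) { /* spin */ }

        // Step 5: process, then zero the slot so it reads as empty.
        handle(slot + 8, len);
        std::memset(slot, 0, 8 + len + 1);
        rb.head += 8 + len + 1;

        // Step 6: every half buffer, RDMA-write rb.head back to the
        // sender so it can reclaim space (actual RDMA write omitted).
    }

  Note that the polling and zeroing here are exactly the receiver-CPU
  work that makes RPC the bottleneck.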

What is the FaRM system API?
  ACID transactions on objects in a shared global address space
    Begin, Commit, Alloc, Free, Read, Write
    But: be careful about external side effects in case Commit fails
    Consistency checks are deferred until Commit
  Lock-free reads - single-object read-only transactions
  Some kind of function shipping (alluded to in Sec 6.2, p. 64)
    E.g., run ++x on the server storing x, avoiding a lock + 2 RDMAs
      (read, write)

What is the system architecture?
  Machines connected by an Infiniband (or RoCE) network supporting RDMA
  Set of machines S, each of which is both a storage and a compute server
    Each has a bunch of 2GB RDMA regions, and also runs FaRM computations
    Why this symmetric design instead of separate storage/compute nodes?
      Because RDMA uses less CPU, so make use of the CPU that's available
      Plus can sometimes leverage locality to operate on local data
  One dedicated configuration manager machine CM in S
    Assigns each region to one primary and f backup servers in S
      (the mapping is stored in the regions themselves, too)
    CM exchanges leases with all other servers to detect failure
  A zookeeper instance in which the CM stores a configuration:
    c = view number, F = mapping from S to failure domains

What happens to an RDMA read while the CPU is simultaneously writing the
object?
  RDMA might return a mishmash of old and new cache lines.  What to do?
  Idea 1: Put a 63-bit counter + 1 lock bit in the header of each object
    Read the counter, read the object, read the counter again
    Repeat if locked, or if the two counter values read are not equal
    Works, but requires 3 RDMA reads for each object read
  Idea 2: Put the counter in each cache line of the object
    Repeat the RDMA read if the counters are not all equal
    Works, but eats 8 bytes out of each 64-byte cache line (12.5% memory
      waste)
  Idea 3: Put the counter in the object header, and its low 16 bits in
    each cache line
    Almost works, unless you're unlucky and the low 16 bits of the
      counter wrap during the read
  Idea 4: Like 3, but retry if the RDMA read takes more than 2ms
    Reasonable to bound RDMA time because the network is so fast
    Chose a bound small enough that 65,536 lock/unlock cycles within it
      aren't practical
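
  A sketch of the resulting check, assuming an illustrative layout (a
  64-bit header word = lock bit + 63-bit version, with the version's low
  16 bits replicated at the start of every subsequent 64-byte cache
  line; the 2ms bound is Idea 4's):

    // Validate a lock-free RDMA read of one object, Ideas 3+4 style.
    // `copy` is the local snapshot returned by the RDMA read.

    #include <cstddef>
    #include <cstdint>

    constexpr size_t CACHE_LINE = 64;

    struct ObjectHeader {
        uint64_t version_and_lock;  // bit 0 = lock, bits 1..63 = version
    };

    bool read_is_consistent(const uint8_t *copy, size_t num_lines,
                            double elapsed_ms) {
        const auto *hdr = reinterpret_cast<const ObjectHeader *>(copy);
        if (hdr->version_and_lock & 1)  // locked: a writer is active
            return false;
        const uint16_t tag =
            static_cast<uint16_t>((hdr->version_and_lock >> 1) & 0xFFFF);
        for (size_t i = 1; i < num_lines; i++) {
            // First 2 bytes of each cache line hold the version tag.
            auto line_tag = *reinterpret_cast<const uint16_t *>(
                copy + i * CACHE_LINE);
            if (line_tag != tag)        // torn read: mixed versions
                return false;
        }
        // Idea 4: a read slower than 2ms could straddle 65,536 version
        // bumps, letting the 16-bit tags wrap back to matching values.
        return elapsed_ms <= 2.0;
    }

  The caller simply re-issues the RDMA read whenever this returns false.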

How does a normal-case (no-failure) transaction proceed? (Fig. 4)
  Read a bunch of objects using RDMA
  Buffer all writes on the coordinator (expose no writes before commit)
  Engage in a four-phase commit protocol (coordinator sketch below):
  1. Lock: Send a LOCK RPC to the primary of each written region
     Contains new object data + info on other objects/regions in the
       transaction
     Uses CPU on the primary to lock the specific version with a
       compare-and-swap
     If any lock fails, must abort the transaction
  2. Validate: RDMA-read the versions of all read-only objects and check
     that they are unchanged
     (Abort if any version number changed--may have inconsistent reads)
  3. RDMA-write COMMIT-BACKUP with the new data to the backup(s) of each
     written object
  4. RDMA-write COMMIT-PRIMARY to each primary to which LOCK was
     previously sent
  5. RDMA-write TRUNCATE to primaries and backups
  Note: steps 2-4 don't use the server's CPU; they complete with the
    network-level RDMA response
  Step 5 is lazy and can be piggybacked on other messages
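
  A hedged sketch of the coordinator's commit path; Fabric is a
  hypothetical interface standing in for FaRM's real RDMA/RPC
  primitives, and the record layouts are illustrative:

    // The four phases of Fig. 4, plus lazy truncation.  What matters
    // is the ordering and the two abort points, not the plumbing.

    #include <cstdint>
    #include <vector>

    struct OID   { uint64_t region = 0, offset = 0; };
    struct RdRec { OID oid; uint64_t version; };  // version seen at read
    struct WrRec { OID oid; std::vector<uint8_t> new_value; };

    struct Fabric {  // hypothetical plumbing, elided
        virtual bool lock(const WrRec &w) = 0;            // LOCK RPC
        virtual uint64_t read_version(const OID &o) = 0;  // 1-sided read
        virtual void commit_backup(const WrRec &w) = 0;   // to backup logs
        virtual void commit_primary(const WrRec &w) = 0;  // to primary log
        virtual void truncate(const WrRec &w) = 0;        // lazy cleanup
        virtual void abort_unlock(const std::vector<WrRec> &ws) = 0;
        virtual ~Fabric() = default;
    };

    bool commit(Fabric &net, const std::vector<RdRec> &reads,
                const std::vector<WrRec> &writes) {
        // Phase 1 (LOCK): uses the CPU at each written object's primary,
        // which compare-and-swaps the specific version to "locked" or
        // refuses.
        for (const WrRec &w : writes)
            if (!net.lock(w)) { net.abort_unlock(writes); return false; }
        // Serialization point: all locks held.

        // Phase 2 (VALIDATE): one-sided re-read of each read-only
        // object's version; any change means possibly inconsistent reads.
        for (const RdRec &r : reads)
            if (net.read_version(r.oid) != r.version) {
                net.abort_unlock(writes);
                return false;
            }

        // Phase 3 (COMMIT-BACKUP): must be acked at the NIC level before
        // phase 4, or a primary could expose writes that backup failures
        // would then lose.
        for (const WrRec &w : writes) net.commit_backup(w);

        // Phase 4 (COMMIT-PRIMARY): once one of these is logged, the
        // outcome is decided and may be externalized.
        for (const WrRec &w : writes) net.commit_primary(w);

        // TRUNCATE is lazy, often piggybacked on other messages.
        for (const WrRec &w : writes) net.truncate(w);
        return true;
    }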

What is the serialization point of a read-write transaction?
  The point where all locks have been acquired
  If two transactions simultaneously hold all their locks, they don't
    conflict
What is the serialization point of a read-only transaction?
  The point of the last read
  Since all read versions are subsequently validated not to have changed

What if the coordinator wrote COMMIT-BACKUP & COMMIT-PRIMARY in parallel?
  Could expose uncommitted data: the primary applies and exposes the
    writes, then the primary + coordinator fail while some backup never
    received COMMIT-BACKUP, losing the exposed writes

When can a coordinator externalize the effects of a transaction?
  After successfully logging COMMIT-PRIMARY for *one* object
  Why not earlier?  The coordinator might fail, and only the coordinator
    knows the read set
    Must abort the transaction if the coordinator never says the read
      set was okay
  But once a primary + f backups have commit records, at most f can
    fail, so the fact that the transaction validated survives

How is failure detection fast?  5 msec lease time.  How to avoid false
timeouts?
  Use a dedicated queue pair for lease messages (not queued behind other
    work)
  Use *unreliable* transport, because it is connectionless
    CM uses a single queue for all lease messages (think 1 UDP socket
      vs. N TCP connections)
    Otherwise would overflow the NIC's queue pair cache
  Use a dedicated highest-priority thread to process leases
    Preallocate and pin all memory (including code) of the lease thread
    *Don't* pin the lease thread to a core, in case the system usurps
      that core
      Instead, leave two hyperthreads without pinned FaRM threads
      So usually leases are processed on the unassigned hyperthreads
      But worst case the lease thread can preempt another,
        lower-priority FaRM thread

If a lease expires, does a server just stop serving stale objects?
  Only for external requests (from clients outside of S)
  FaRM servers read data with RDMA--can't check lease validity if the
    CPU isn't used
What's the solution?  Strict membership
  All machines (not just a majority) must acknowledge a new configuration
  No mutation until all machines acknowledge the new config
  Coordinators don't RDMA to, or accept replies from, non-members

What are the reconfiguration steps? (Fig. 5)
  1. Suspect: CM cuts off external client requests
  2. Probe: the [new] CM checks all other machines (in case of
     correlated failure)
     Stop if it can't probe a majority of machines (wait for the network
       to come back)
     Probe reply includes the "LastDrained" value (p. 61)
  3. Update config: write it in zookeeper
  4. Remap regions: make sure f+1 replicas exist for all regions
  5. CM sends NEW-CONFIG to all servers (resets lease protocol if the CM
     changed)
     For each region r that changed in the new configuration, CM also
       includes LastPrimaryChange[r] and LastReplicaChange[r]
  6. Servers block external requests and ack NEW-CONFIG (piggybacking
     lease msgs)
  7. After all NEW-CONFIG-ACKs received *and* all old leases expired,
     the new CM broadcasts NEW-CONFIG-COMMIT; all unblock and can
     process requests again
  Note steps 5-7 are basically 2PC plus waiting for old leases to expire

The paper defines a *recovering transaction* (p. 61) as "one whose commit
phase spans configuration changes, and for which some replica of a
written object, some primary of a read object, or the coordinator has
changed due to reconfiguration."

How does FaRM deal with recovering transactions? (Fig. 6)
  1. Block access to recovering regions if there is a new primary
  2. Drain logs, since the CPU cannot reject writes that were already
     acked at the RDMA level
     So must apply all log entries present when NEW-CONFIG-COMMIT is
       received
     Set LastDrained = unique ID <c, m, t, l> of the last transaction
       drained
       (c = view number, m = coordinator machine, t = thread on m,
        l = thread-local counter)
  3. Find recovering transactions = those that started committing in
     c-1, unless:
     - LastReplicaChange[r] < c for all written regions r, and
     - LastPrimaryChange[r] < c for all read-only regions r, and
     - the coordinator has not been removed
     Backups must notify primaries of all regions needing recovery
  4. Lock recovery - fetch missing log records to restore all lock state
     At this point the region is active again and can start accepting
       RDMA
  5. Replicate log records (on backups that might be missing them)
  6. Coordinator receives votes on what to do with each recovering
     transaction
     (If the coordinator failed, a new one is picked by consistent
       hashing)
     Each region's primary automatically sends one of 4 votes:
     - commit-primary: if any replica in the region saw COMMIT-PRIMARY,
       else
     - commit-backup: if any replica in the region saw COMMIT-BACKUP,
       else
     - lock: if any replica in the region saw LOCK, else
     - abort: if none of the above apply
     If the coordinator doesn't get a vote for a region within 250 usec,
       it requests one
     A primary that hasn't heard of the transaction can vote:
     - truncated: if the transaction must have been truncated from the
       log
     - unknown: if the transaction is unknown
     How does it know whether it was truncated or not?
       Basically uses the NPrC trick: remember committed transactions,
       but keep the set compact with a lower bound (p. 62)
  7. Decide to commit iff:
     - any primary votes commit-primary, or
     - any primary votes commit-backup *and* there are no votes for
       abort or unknown
     Broadcast COMMIT-RECOVERY or ABORT-RECOVERY to notify participants
     Once all ack, send TRUNCATE-RECOVERY

How does recovery impact throughput?
  A pause of 10s of milliseconds is not so bad
  Pacing of re-replication in the background only minimally impacts
    throughput

How does performance compare to Dynamo?
  Numbers in the Dynamo paper use milliseconds.  Here we see
    microseconds!
  But would be nice to see the 99.9th percentile, not just the 99th

Does FaRM's strict serializability make failure recovery too expensive?
  Actually 10s of ms is probably in the noise for Dynamo

Should Amazon ditch Dynamo for FaRM?  Unclear
  The FaRM paper doesn't have much evaluation of scalability
    Some vague talk of hierarchical leases, not implemented
  Requires more expensive hardware
  Violates Amazon's service-oriented architecture
    Linking with the FaRM C++ lib makes it harder to enforce SLAs and
      assign blame for crashes