Replication and consistency
===========================

Let's say you want to run a service like Hotmail
  10s to 100s of millions of mailboxes, stored in different machine clusters
  Must keep mapping of mailboxes to clusters
    E.g., compute bucket = H(mailbox) % 0x10000, to get 16-bit bucket value
    Store map of 65,536 bucket -> cluster mappings
    Can migrate buckets as new clusters added, or to load balance
  Two critical properties cluster map must have:
    1. Highly available (or all of Hotmail grinds to a halt)
    2. Consistent throughout system (or mail gets delivered to wrong cluster)
  #1 requires we replicate the map on multiple machines to tolerate failure

First attempt: use two-phase commit (2PC)
  Machine C ("coordinator") wants to change the bucket map
  C broadcasts "PREPARE to change bucket entries x, y, z" to all replicas
    Each replica replies with either a YES or NO "vote"
      Might say NO if already in the middle of changing x, y, or z
    Once a replica votes YES, it stops allowing clients to see x, y, or z
  C collects replies
    If anyone voted NO, broadcasts ABORT message
      Replicas discard changes and once again allow access to x, y, z
    Otherwise, broadcasts COMMIT
      Replicas apply changes and once again allow access to x, y, z
  What happens if one of the replicas (but not C) fails?
    It either failed before voting or after--C will know which
    If failed before voting (or voted NO), coordinator can broadcast ABORT
      Fix broken replica, then start over
    If failed after YES, coordinator can send COMMIT
  What happens if C and a replica fail?
    If anyone voted NO or anyone received ABORT from C, then abort
    If anyone received COMMIT from C, then can safely COMMIT
    Otherwise, we are stuck:
      All working replicas "prepared"--meaning could go either way
      Before crashing, C might have sent a COMMIT message that was lost
        May have shown new state to clients, so we must preserve transaction!
      Or maybe the dead replica voted NO, so we can't COMMIT
    Note: even if you bring C back up, its disk may have died, losing state

Second attempt: use three-phase commit (3PC)
  Problem was the "prepared" state, where replicas could go either way
  So now after PREPARE, if everyone sends YES, broadcast PREPARE-COMMIT
  Once all participants ACK PREPARE-COMMIT, broadcast COMMIT
  Why is this better?
    In 2PC, can show clients new state when everyone has voted YES
    In 3PC, only show new state when everyone *knows* everyone has voted YES
    So after failure, repair machines, then COMMIT iff any survivor saw
      PREPARE-COMMIT
  Big problem: asynchronous systems
    - Makes no assumptions about relative speeds of machines
    - Makes no assumptions about delay in message delivery
    Basically means you can't use timeouts
      E.g., can't assume a machine that doesn't respond in 2 minutes is dead
    Pretty much how you need to model the Internet, for example
    Note: this is not the same thing as async/non-blocking I/O
  Why is asynchronous communication bad for 3PC?
    Can only make progress once machines are repaired by a human
    Replicas can't differentiate network partitions from failed machines
      Not safe to proceed if other partition might have divergent values
  Also note 3PC not very efficient
    If many nodes initiate transactions, may have to abort a lot
    Change can't go through until all replicas have voted

Third attempt: viewstamped replication
  Designate one node the primary, and the others backups
  All requests must be sent to the primary
  The primary orders all requests and assigns them "viewstamps"
    viewstamp = view-ID, timestamp
      view-ID only changes when nodes fail (more on this later)
      timestamp is the number of the request in this transaction
  Primary broadcasts to all backups
    Backups execute requests in viewstamp order, so all get same state
  Primary proceeds when at least 1/2 of backups have acknowledged request
    Note: 1/2 of backups plus primary = a majority of replicas
  What to do if primary fails?
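Before turning to primary failure, the normal-case flow just described can be sketched in a few lines of Python. All class and field names here are invented for illustration; backup acknowledgments are modeled as synchronous return values rather than messages:

```python
class Backup:
    def __init__(self):
        self.log = []            # requests in viewstamp order
        self.last_seen = (0, 0)  # highest viewstamp accepted so far

    def accept(self, viewstamp, request):
        # Backups apply requests in viewstamp order, so all reach the
        # same state; stale or duplicate viewstamps are rejected.
        if viewstamp <= self.last_seen:
            return False
        self.log.append((viewstamp, request))
        self.last_seen = viewstamp
        return True              # ACK

class Primary:
    def __init__(self, view_id, backups):
        self.view_id = view_id
        self.backups = backups
        self.seq = 0             # per-view request counter

    def execute(self, request):
        # viewstamp = (view-ID, number of the request in this view)
        self.seq += 1
        viewstamp = (self.view_id, self.seq)
        acks = sum(b.accept(viewstamp, request) for b in self.backups)
        # Proceed once at least 1/2 of backups have acked: half the
        # backups plus the primary itself is a majority of replicas.
        return acks * 2 >= len(self.backups)
```

One design point the sketch makes concrete: viewstamps are compared lexicographically, so any request from a later view outranks every request from an earlier view.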
    Must *agree* on a new primary and set of backups--call this set a view
    Must ensure a majority of machines from previous view *agree* on new view
    Must reach agreement even with network delays and maybe more failures
    If majority agree, at least one has seen every committed op in last view
  Can we use 3PC to reach agreement among replicas?
    Problem: Fischer-Lynch-Paterson (FLP) result for asynchronous systems:
      For a non-trivial consensus problem (agreed-upon value depends on input)
      No protocol is guaranteed to reach consensus if even one machine fails
  Let's sidestep FLP with a different approach:
    3PC is guaranteed to reach consensus if no machine fails
    FLP tells us we can't always reach consensus if a machine fails
    But in an asynchronous system, a failed machine looks like a slow machine
      ... and consequently a slow machine looks like a failed machine
      E.g., might time out and initiate recovery for machine that didn't fail
    So let's use a protocol that's not guaranteed to terminate even without
      failures!  This is what Paxos gives us
  Distributed agreement: [simplified] view-change protocol
    Some node M (the "view manager") suspects the primary is down
    M broadcasts VIEW-CHANGE (new-view-ID) message
      new-view-ID chosen to be greater than any previously seen view-ID
      Least significant bits set to M's machine ID to ensure uniqueness
    Replicas accept new-view-ID if higher than any previously seen value
      When accepting, a replica sends reply to VIEW-CHANGE specifying:
      - highest viewstamp replica has seen
      - whether or not the replica was the primary in that view
    Meanwhile, if M receives VIEW-CHANGE with a higher new-view-ID, abandons
      its own
    When majority of nodes ACK and (if all didn't reply) timeout expires:
      M forms new view out of existing replicas
      Choose same primary as last view if that node is still available
        Otherwise, pick node with highest viewstamp as new primary
      Notify new primary, which will flush all old transactions to backups
  Good properties of the algorithm:
    During normal case,
      a majority of nodes required to commit operations
    During view change, a majority required to form a new view
    Therefore at least one node in new view will know all committed operations
    Network partitions cannot cause inconsistency
  Note 1: This description ignores "crashed" nodes that may have lost state
  Note 2: This is sort of an instance of the Paxos algorithm
    (For each view, one instance of Paxos is used to select the next view)

Chubby
======

So what is motivation for Chubby lock service?
  Many applications need to agree on some metadata consistently
  Hard to get details of agreement right--implement once and for all
What are example applications?
  GFS--a cell must have at most one master, everyone should agree on it
  Bigtable
Why a service and not a library?
  Easier to adapt existing software w/o restructuring
  Often need well-known rendezvous service anyway, and DNS suboptimal
  Programmers adopt because they *think* they understand locks (but don't)
  Reduces clients required for smaller systems
    E.g., can make progress with only one client if it has the lock
What does interface look like?
  Open (handle, name, mode) - opens file relative to handle
    (for bootstrapping, handle may be "/" handle supplied by library)
  Close (handle) - destroys handle
  Poison (handle) - makes all further ops fail w/o deallocating handle
  GetContentsAndStat (handle) - returns contents and metadata
  SetContents (handle) - set file contents
  SetACL (handle) - set access control list
  Delete (handle) - deletes file or empty directory
  Acquire (), TryAcquire (), Release () - shared/exclusive locks, one per file
  Subscribe (handle, events) - get callback when file modified, etc.
What's in a handle?
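The classic use of an interface like the one above is the GFS case: race for an exclusive lock, and whoever wins writes its identity into the file so everyone agrees on the single master. A toy in-memory stand-in (the `ChubbyCell` class and its method names are invented here, not the real API):

```python
class ChubbyCell:
    """Invented in-memory stand-in for a lock service cell."""
    def __init__(self):
        self.files = {}   # name -> contents
        self.locks = {}   # name -> current lock holder, or None

    def open(self, name):
        self.files.setdefault(name, b"")
        return name       # in this toy, a handle is just the name

    def try_acquire(self, handle, who):
        # Exclusive lock: succeeds only if nobody currently holds it.
        if self.locks.get(handle) is None:
            self.locks[handle] = who
            return True
        return False

    def set_contents(self, handle, data):
        self.files[handle] = data

    def get_contents(self, handle):
        return self.files[handle]

def elect(cell, me):
    # Whoever wins the lock records its identity in the file, so every
    # client that loses can read the agreed-upon master from the file.
    h = cell.open("/ls/cell/gfs-master")
    if cell.try_acquire(h, me):
        cell.set_contents(h, me.encode())
        return True       # we are the master
    return False          # someone else won; read the file to find them
```

This is why "can make progress with only one client": a single client holding the lock is enough to decide who the master is.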
  Probably some database record key
  Mode in which the file is opened
  "Check bits"--probably a cryptographic MAC (message authentication code)
    Allows new master to check that handle was not forged
    But could allow faulty client to re-open handle after fail-over
What is the metadata in a node/file?
  - Instance number (differs if create file with deleted file's name)
  - Content generation number (changes when a file written)
  - Lock generation number (increases when lock transitions free->held)
  - ACL generation number (increases if ACL names are written)
How does access control work?
  Indirection, through files in ACL directory that contain principal names
Does Chubby work with fine-grained locking?
  E.g., would you use one lock per Hotmail mailbox?  No
    Overhead is too high
  Designed for coarse-grained locks
    E.g., acquire once per week if your application primary fails
How would you implement fine-grained locks with Chubby?
  Hash into buckets (like Hotmail example)
  Use Chubby to claim lock on bucket
  Within buckets, handle locking in instance of application-specific server
  Use sequencer to make sure app's server didn't lose lock--what's a
    sequencer?
    GetSequencer () - returns "sequencer" value valid until lock changes hands
      Encodes lock name, mode (shared/exclusive), and lock generation number
    CheckSequencer () - checks that a sequencer is still valid
    SetSequencer () - associates sequencer with a particular handle
      So if sequencer becomes invalid, handle automatically stops working
  How would you use sequencer?
    When you get lock from app server, it tells you sequencer from its Chubby
      lock
    Whenever you use the app-specific lock, call CheckSequencer
    Note: Assumes you can abort/roll back whatever you are doing
How do clients find Chubby master (primary)?
  All Chubby replicas stored in DNS
  Any Chubby server will reply with identity of master
Chubby is heavily optimized for reads
  In viewstamped replication, all operations go to majority of replicas
  This includes reads--how does Chubby optimize this?
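The sequencer idea above is small enough to sketch before moving on. A sequencer names the lock, its mode, and the lock generation number; a server checks it and rejects requests whose generation is stale because the lock has since changed hands. All names below are invented for illustration:

```python
class Lock:
    """Toy lock whose generation number bumps on each free->held transition."""
    def __init__(self, name):
        self.name = name
        self.mode = "exclusive"
        self.generation = 0

    def acquire(self):
        self.generation += 1
        # The sequencer encodes lock name, mode, and generation number.
        return (self.name, self.mode, self.generation)

    def release(self):
        pass  # the next acquire bumps the generation

def check_sequencer(lock, sequencer):
    # Valid only until the lock changes hands (generation advances).
    name, mode, generation = sequencer
    return name == lock.name and generation == lock.generation

lock = Lock("/ls/cell/mailbox-bucket-42")
seq = lock.acquire()
assert check_sequencer(lock, seq)        # holder's requests go through
lock.release()
lock.acquire()                           # lock changed hands
assert not check_sequencer(lock, seq)    # stale sequencer now rejected
```

This is the fencing pattern: a delayed request from the old lock holder carries an old generation number and is safely refused.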
  The primary/"master" gets a "master lease"
    Essentially a promise by other Chubby replicas not to initiate view change
    Then master just replies to reads without contacting replicas
    Good idea since replicas initiate view change after timeout anyway
This allows Chubby to go one step further--clients can cache data--how?
  Clients have leases (but one per session, not one per file like LBFS)
  When write comes in, server sends invalidate to all caching clients
    If all caching clients ack, write proceeds.  Else, wait for lease timeout
    Note: Write blocks while waiting to invalidate caches
  This also allows server to send (uncacheable) replies to other reads
    So never need to block reads
    (though could maybe be unlucky if read reply delayed???)
What happens when a Chubby replica fails permanently?
  System automatically notices
  Allocates a new server (fires off the lock server on some new machine)
  Automatically updates the DNS
  Once DNS updated, updates list of replicas stored in Chubby itself
  New replica can vote in next view change once it commits ops from master
What if master fails (Fig. 2)?
  When does a client give up?
    When lease expires, clients can't use cached files any more
      Flush the cache, and just make all Chubby operations block
    But if master comes back, don't want to kill application
      Use grace period, during which lease is bad but could still get renewed
  Use Paxos to pick a new master
  New master conservatively estimates all client leases
  New master picks new client epoch number - why?
    Prevents old calls from getting re-executed with new master
    (E.g., old packet may have been floating around the net for a while)
  New master tells clients to flush their caches (fail-over event) - why?
    E.g., old master sent callback, died, then network lost callback
  Only once all caches flushed (or timed out) does master accept new ops
  May see old handles it hasn't seen before, re-create them on the fly
  Delete ephemeral files after a minute or so
How does file mirroring work?
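The invalidate-before-write path above can be sketched as follows. The classes are invented; "block until all acks or lease timeout" is compressed into a `lease_expired` flag on the write call:

```python
class CacheClient:
    def __init__(self, responsive=True):
        self.cache = {}
        self.responsive = responsive  # unresponsive clients never ack

    def invalidate(self, name):
        if not self.responsive:
            return False              # no ack; master must wait out lease
        self.cache.pop(name, None)    # drop the cached copy
        return True

class Master:
    def __init__(self, clients):
        self.data = {}
        self.clients = clients

    def write(self, name, value, lease_expired=False):
        # Send invalidations to every caching client.  The write blocks
        # (modeled as returning False) until all clients ack, or until
        # the non-responders' leases expire -- so no client can ever
        # serve stale data for `name`.
        acks = [c.invalidate(name) for c in self.clients]
        if not (all(acks) or lease_expired):
            return False
        self.data[name] = value
        return True
```

Reads never need this dance: while the write is blocked, the master can still serve `name` to readers, just marked uncacheable.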
  One cell can subscribe to events that change files in another cell
  Copy files over when created or changed
How sensitive is the system to synchronized clocks?
  If server's clock very fast, could think lease expired while client doesn't
  But can easily just be conservative given typical clock drifts
What are the ideas in this paper?
What are scaling issues?
  Lots of KeepAlives - how to deal?  Proxies
  How does a proxy work?
    Lots of clients send KeepAlives to proxy each time period
    Proxy only sends one KeepAlive to master each time period
  What if proxy fails?
    Special call allows a proxy to take over a session lock from another (4.4)
How is evaluation?
  Very anecdotal
Could the whole thing be much simpler?
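The proxy trick from the scaling discussion above fits in a few lines: many clients refresh their sessions at the proxy, and the proxy forwards a single aggregate KeepAlive to the master per period. Class and method names are invented for illustration:

```python
class Master:
    def __init__(self):
        self.keepalives = 0           # load seen by the master

    def keepalive(self):
        self.keepalives += 1

class Proxy:
    def __init__(self, master):
        self.master = master
        self.sessions = set()

    def keepalive(self, client_id):
        # Each client's KeepAlive only refreshes its session at the proxy.
        self.sessions.add(client_id)

    def tick(self):
        # Once per period: one KeepAlive to the master covers all sessions.
        if self.sessions:
            self.master.keepalive()

master = Master()
proxy = Proxy(master)
for client_id in range(1000):
    proxy.keepalive(client_id)        # 1000 client KeepAlives this period...
proxy.tick()                          # ...become 1 KeepAlive at the master
```

The cost, as the notes hint, is that the proxy is now a failure point of its own, hence the hand-off call in section 4.4.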