Chubby
======

Let's say you want to run a service like Hotmail
  10s to 100s of millions of mailboxes, stored in different machine clusters
  Must keep a mapping of mailboxes to clusters
    E.g., compute bucket = H(mailbox) % 0x10000, to get a 16-bit bucket value
    Store a map of 65,536 bucket -> cluster mappings
    Can migrate buckets as new clusters are added, or to load balance
  Two critical properties the cluster map must have:
    1. Highly available (or all of Hotmail grinds to a halt)
    2. Consistent throughout the system (or mail gets delivered to the
       wrong cluster)
  #1 requires we replicate the map on multiple machines to tolerate failures

First attempt: use two-phase commit (2PC)
  Machine C ("coordinator") wants to change the bucket map
  C broadcasts "PREPARE to change bucket entries x, y, z" to all replicas
  Each replica replies with either a YES or NO "vote"
    Might vote NO if already in the middle of changing x, y, or z
    Once a replica votes YES, it stops allowing clients to see x, y, or z
  C collects replies
    If anyone voted NO, broadcasts an ABORT message
      Replicas discard changes and once again allow access to x, y, z
    Otherwise, broadcasts COMMIT
      Replicas apply changes and once again allow access to x, y, z
  What happens if one of the replicas (but not C) fails?
    It either failed before voting or after--C will know which
    If it failed before voting (or voted NO), coordinator can broadcast ABORT
      Fix the broken replica, then start over
    If it failed after voting YES, coordinator can send COMMIT
  What happens if C and a replica fail?
    If anyone voted NO or anyone received ABORT from C, then abort
    If anyone received COMMIT from C, then can safely COMMIT
    Otherwise we are stuck:
      All working replicas are "prepared"--meaning they could go either way
      Before crashing, C might have sent a COMMIT message that was lost
        May have shown new state to clients, so we must preserve the
        transaction!
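The normal-case 2PC exchange just described can be sketched as a toy in-process model. This is a minimal sketch, not the actual protocol implementation: `Replica`, `prepare`, and `two_phase_commit` are hypothetical names, message passing is replaced by method calls, and coordinator failure is not modeled.

```python
# Toy sketch of two-phase commit over bucket-map entries.
# All names here are hypothetical illustrations, not from any real system.

class Replica:
    def __init__(self):
        self.state = {}           # committed bucket -> cluster entries
        self.pending = None       # staged change while in "prepared" state

    def prepare(self, change):
        if self.pending is not None:
            return False          # vote NO: already mid-change
        self.pending = change     # prepared: block client access to entries
        return True               # vote YES

    def commit(self):
        self.state.update(self.pending)
        self.pending = None       # entries visible to clients again

    def abort(self):
        self.pending = None       # discard staged change

def two_phase_commit(change, replicas):
    votes = [r.prepare(change) for r in replicas]   # phase 1: PREPARE
    if all(votes):
        for r in replicas:
            r.commit()            # phase 2: everyone voted YES -> COMMIT
        return True
    for r, voted_yes in zip(replicas, votes):
        if voted_yes:
            r.abort()             # roll back replicas that had prepared
    return False
```

The stuck case in the text corresponds to a coordinator crash between collecting all-YES votes and broadcasting COMMIT, which this single-process sketch cannot exhibit.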
      Or maybe the dead replica voted NO, so we can't COMMIT
  Note: Even if you bring C back up, its disk may have died, losing its state

Second attempt: use three-phase commit (3PC)
  Problem was the "prepared" state, where replicas can go either way
  So now after PREPARE, if everyone votes YES, broadcast PREPARE-COMMIT
  Once all participants ACK the PREPARE-COMMIT, broadcast COMMIT
  Why is this better?
    In 2PC, can show clients new state as soon as everyone has voted YES
    In 3PC, only show new state when everyone *knows* everyone has voted YES
    So after a failure, repair machines, then COMMIT iff any survivor saw
      PREPARE-COMMIT
  What are the limitations of 3PC?
    Can only make progress once machines are repaired by a human
    Replicas can't differentiate network partitions from failed machines
      Not safe to proceed if the other partition might have divergent values
    Also not very efficient
      If many nodes initiate transactions, may have to abort a lot
      A change can't go through until all replicas have voted

Third attempt: viewstamped replication
  Designate one node the primary, and the others backups
  All requests must be sent to the primary
  The primary orders all requests and assigns them "viewstamps"
    viewstamp = <view-ID, timestamp>
      view-ID only changes when nodes fail (more on this later)
      timestamp is the number of the request within the view
  Primary broadcasts each request to all backups
    Backups execute requests in viewstamp order, so all get the same state
  Primary proceeds when at least 1/2 of the backups have acknowledged the
    request
    Note: 1/2 of backups plus the primary = a majority of replicas
  What to do if the primary fails?
    Must agree on a new primary and set of backups--call this set a view
    Must ensure a majority of machines from the previous view agree on the
      new view
    Must reach agreement even with network delays and maybe more failures
    If a majority agree, at least one has seen every committed op in the
      last view
  Distributed agreement: [simplified] view-change protocol
    Some node M (the "view manager") suspects the primary is down
    M broadcasts a VIEW-CHANGE(new-view-ID) message
      new-view-ID chosen to be greater than any previously seen view-ID
      Least significant bits set to M's machine ID to ensure uniqueness
    Replicas accept if new-view-ID is higher than any previously seen value
      When accepting, a replica sends a reply to VIEW-CHANGE specifying:
        - the highest viewstamp the replica has seen
        - whether or not the replica was the primary in that view
      Meanwhile, if M receives a VIEW-CHANGE with a higher new-view-ID, it
        abandons its own
    When a majority of nodes ACK and (if not all replied) a timeout expires:
      M forms the new view out of existing replicas
        Choose the same primary as the last view if that node is still
          available
        Otherwise, pick the node with the highest viewstamp as the new primary
      Notify the new primary, which will flush all old transactions to backups
  Good properties of the algorithm:
    During the normal case, a majority of nodes is required to commit
      operations
    During a view change, a majority is required to form the new view
    Therefore at least one node in the new view will know all committed
      operations
    Network partitions cannot cause inconsistency
  Note 1: This description ignores "crashed" nodes that may have lost state
  Note 2: This is sort of an instance of the Paxos algorithm
    (For each view, one instance of Paxos is used to select the next view)

So what is the motivation for the Chubby lock service?
  Many applications need to agree on some metadata consistently
  Hard to get the details of agreement right--implement once and for all
  What are example applications?
    GFS--a cell must have at most one master, and everyone should agree on it
    Bigtable
  Why a service and not a library?
    Easier to adapt existing software without restructuring it
    Often need a well-known rendezvous service anyway, and DNS is suboptimal
    Programmers adopt it because they *think* they understand locks
      (but don't)
    Reduces the number of clients required for smaller systems
      E.g., can make progress with only one client if it holds the lock

What does the interface look like?
  Open(handle, name, mode) - opens a file relative to handle
    (for bootstrapping, a handle for "/" is supplied by the library)
  Close(handle) - destroys handle
  Poison(handle) - makes all further ops fail without deallocating the handle
  GetContentsAndStat(handle) - returns contents and metadata
  SetContents(handle) - sets file contents
  SetACL(handle) - sets the access control list
  Delete(handle) - deletes a file or empty directory
  Acquire(), TryAcquire(), Release() - shared/exclusive locks, one per file
What's in a handle?
  Probably some database record key
  The mode in which the file is opened
  "Check bits"--probably a cryptographic MAC (message authentication code)
    Allows a new master to check that the handle was not forged
    But could allow a faulty client to re-open a handle after fail-over
What is the metadata in a node/file?
  - Instance number (differs if a file is created with a deleted file's name)
  - Content generation number (changes when the file is written)
  - Lock generation number (increases when the lock transitions free->held)
  - ACL generation number (increases if ACL names are written)
How does access control work?
  Indirection, through files in an ACL directory that contain principal names

How to implement fine-grained locks with Chubby?
  Hash into buckets (like the Hotmail example)
  Use Chubby to claim a lock on a bucket
  Within a bucket, handle locking in an instance of an application-specific
    server
  Use a sequencer to make sure the app's server didn't lose its lock--what's
    a sequencer?
  GetSequencer() - returns a "sequencer" value valid until the lock changes
    hands
    Encodes the lock name, mode (shared/exclusive), and lock generation number
  CheckSequencer() - checks that a sequencer is still valid
  SetSequencer() - associates a sequencer with a particular handle
    So if the sequencer becomes invalid, the handle automatically stops
    working
How would you use a sequencer?
  When you get a lock from the app server, it tells you the sequencer from
    its Chubby lock
  Whenever you use the app-specific lock, call CheckSequencer
  Note: Assumes you can abort/roll back whatever you are doing

How do clients find the Chubby master (primary)?
  All Chubby replicas are stored in DNS
  Any Chubby server will reply with the identity of the master

Chubby is heavily optimized for reads
  In viewstamped replication, all operations go to a majority of replicas
    This includes reads--how does Chubby optimize this?
  The primary/"master" gets a "master lease"
    Essentially a promise by the other Chubby replicas not to initiate a
      view change
    Then the master just replies to reads without contacting replicas
    Good idea, since replicas initiate view changes after a timeout anyway
  This allows Chubby to go one step further--clients can cache data--how?
    Clients have leases (but one per session, not one per file as in LBFS)
    When a write comes in, the server sends an invalidation to all clients
      If all caching clients ack, the write proceeds; else, wait for the
        lease timeout
      Note: The write blocks while waiting to invalidate caches
    This also allows the server to send (uncacheable) replies to other reads
      So reads never need to block (though could maybe be unlucky if a read
        reply is delayed???)

What happens when a Chubby replica fails permanently?
  The system automatically notices
  Allocates a new server (fires off the lock server on some new machine)
  Automatically updates the DNS
  Once DNS is updated, updates the list of replicas stored in Chubby itself
  The new replica can vote in the next view change once it has committed the
    ops from the master

How sensitive is the system to synchronized clocks?

What are the ideas in this paper?
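The write-path invalidation described above (block the write until every caching client has acked or its session lease has expired) can be sketched in miniature. This is a hedged model, not Chubby's implementation: `Session`, `write`, and `LEASE_SECONDS` are hypothetical names, and real invalidations travel over keep-alive RPCs.

```python
# Sketch of lease-based cache invalidation on the write path.
# Names and structure are illustrative assumptions, not Chubby's API.

LEASE_SECONDS = 12.0   # hypothetical per-session lease duration

class Session:
    def __init__(self, now):
        self.lease_expiry = now + LEASE_SECONDS
        self.cached = set()        # file names this client is caching
        self.acked = False

def write(sessions, name, now, ack_fn):
    """Invalidate all cached copies of `name`; return the earliest time
    the write may commit.  ack_fn models whether a client acks promptly."""
    waiting = [s for s in sessions if name in s.cached]
    for s in waiting:
        s.acked = ack_fn(s)        # send invalidation; client may ack
    # Any holdout forces the write to block until its lease times out.
    deadline = max((s.lease_expiry for s in waiting if not s.acked),
                   default=now)
    for s in waiting:
        s.cached.discard(name)     # entry is now invalid everywhere
    return deadline
```

Note how an unreachable client delays the write by at most one lease period, which is why the text says the write blocks on the lease timeout rather than forever.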
What are the scaling issues?

How is the evaluation?
  Very anecdotal

Could the whole thing be much simpler?

* Other questions from e-mail:
  Why use keep-alives instead of having the server send out messages?
  Why aren't directory modification times maintained?
  How hard would it be to add path-dependent permission semantics?
  What exactly are "check digits"?  What if the client is malicious?
  - System structure: what if a majority of the servers in a cell are down
    (possibly due to a network partition)?
  - Is the understanding right when I say that a 'file' in Chubby is
    equivalent to a lock, and this file can have data as stored in the DB?
  - How does partitioning help scale the service?
  - How does Chubby's name service compare with DNS's name service mechanism?
  - How can Chubby's design be modified to support a large-scale
    publish/subscribe service?
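The sequencer mechanism discussed earlier (a token encoding lock name, mode, and lock generation number, checked before acting on a request) can be sketched as follows. This is a simplified single-process model under stated assumptions: `Lock`, `get_sequencer`, and `check_sequencer` here are illustrative stand-ins for the paper's GetSequencer/CheckSequencer calls, with no RPC or sessions.

```python
# Sketch of the sequencer pattern: a stale lock holder's token is
# rejected once the lock has changed hands.  Names are illustrative.

from collections import namedtuple

Sequencer = namedtuple("Sequencer", "lock_name mode generation")

class Lock:
    def __init__(self, name):
        self.name = name
        self.generation = 0        # bumps on each free->held transition
        self.held = False

    def acquire(self):
        if self.held:
            return False
        self.held = True
        self.generation += 1
        return True

    def release(self):
        self.held = False

    def get_sequencer(self):
        # Valid only until the lock changes hands (generation bumps).
        return Sequencer(self.name, "exclusive", self.generation)

    def check_sequencer(self, seq):
        return (self.held
                and seq.lock_name == self.name
                and seq.generation == self.generation)
```

This is why a server that crashed, lost its lock, and came back cannot silently keep acting as the lock holder: its old sequencer carries a stale generation number.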