Chubby
======

Let's say you want to run a service like Hotmail
  10s to 100s of millions of mailboxes, stored in different machine clusters
  Must keep a mapping of mailboxes to clusters
    E.g., compute bucket = H(mailbox) % 0x10000, to get a 16-bit bucket value
    Store a map of 65,536 bucket -> cluster mappings
    Can migrate buckets as new clusters are added, or to load balance
  Two critical properties the cluster map must have:
    1. Highly available (or all of Hotmail grinds to a halt)
    2. Consistent throughout the system (or mail gets delivered to the
       wrong cluster)
  #1 requires we replicate the map on multiple machines to tolerate failures

First attempt: use two-phase commit (2PC)
  Machine C ("coordinator") wants to change the bucket map
  C broadcasts "PREPARE to change bucket entries x, y, z" to all replicas
  Each replica replies with either a YES or NO "vote"
    Might vote NO if already in the middle of changing x, y, or z
    Once a replica votes YES, it stops allowing clients to see x, y, or z
  C collects replies
    If anyone voted NO, broadcasts an ABORT message
      Replicas discard changes and once again allow access to x, y, z
    Otherwise, broadcasts COMMIT
      Replicas apply changes and once again allow access to x, y, z
  What happens if one of the replicas (but not C) fails?
    It either failed before voting or after--C will know which
    If it failed before voting (or voted NO), coordinator can broadcast ABORT
      Fix the broken replica, then start over
    If it failed after voting YES, coordinator can send COMMIT
  What happens if C and a replica fail?
    If anyone voted NO or anyone received ABORT from C, then abort
    If anyone received COMMIT from C, then can safely COMMIT
    Otherwise we are stuck:
      All working replicas are "prepared"--meaning they could go either way
      Before crashing, C might have sent a COMMIT message that was lost
        May have shown new state to clients, so we must preserve the
        transaction!
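The normal-case 2PC exchange just described can be sketched as a toy in-process model. This is a minimal sketch, not the actual protocol implementation: `Replica`, `prepare`, and `two_phase_commit` are hypothetical names, message passing is replaced by method calls, and coordinator failure is not modeled.

```python
# Toy sketch of two-phase commit over bucket-map entries.
# All names here are hypothetical illustrations, not from any real system.

class Replica:
    def __init__(self):
        self.state = {}           # committed bucket -> cluster entries
        self.pending = None       # staged change while in "prepared" state

    def prepare(self, change):
        if self.pending is not None:
            return False          # vote NO: already mid-change
        self.pending = change     # prepared: block client access to entries
        return True               # vote YES

    def commit(self):
        self.state.update(self.pending)
        self.pending = None       # entries visible to clients again

    def abort(self):
        self.pending = None       # discard staged change

def two_phase_commit(change, replicas):
    votes = [r.prepare(change) for r in replicas]   # phase 1: PREPARE
    if all(votes):
        for r in replicas:
            r.commit()            # phase 2: everyone voted YES -> COMMIT
        return True
    for r, voted_yes in zip(replicas, votes):
        if voted_yes:
            r.abort()             # roll back replicas that had prepared
    return False
```

The stuck case in the text corresponds to a coordinator crash between collecting all-YES votes and broadcasting COMMIT, which this single-process sketch cannot exhibit.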
      Or maybe the dead replica voted NO, so we can't COMMIT
  Note: Even if you bring C back up, its disk may have died, losing its state

Second attempt: use three-phase commit (3PC)
  Problem was the "prepared" state, where replicas can go either way
  So now after PREPARE, if everyone votes YES, broadcast PREPARE-COMMIT
  Once all participants ACK the PREPARE-COMMIT, broadcast COMMIT
  Why is this better?
    In 2PC, can show clients new state as soon as everyone has voted YES
    In 3PC, only show new state when everyone *knows* everyone has voted YES
    So after a failure, repair machines, then COMMIT iff any survivor saw
      PREPARE-COMMIT
  What are the limitations of 3PC?
    Can only make progress once machines are repaired by a human
    Replicas can't differentiate network partitions from failed machines
      Not safe to proceed if the other partition might have divergent values
    Also not very efficient
      If many nodes initiate transactions, may have to abort a lot
      A change can't go through until all replicas have voted

Third attempt: viewstamped replication
  Designate one node the primary, and the others backups
  All requests must be sent to the primary
  The primary orders all requests and assigns them "viewstamps"
    viewstamp = <view-ID, timestamp>
      view-ID only changes when nodes fail (more on this later)
      timestamp is the number of the request within the view
  Primary broadcasts each request to all backups
    Backups execute requests in viewstamp order, so all get the same state
  Primary proceeds when at least 1/2 of the backups have acknowledged the
    request
    Note: 1/2 of backups plus the primary = a majority of replicas
  What to do if the primary fails?
    Must agree on a new primary and set of backups--call this set a view
    Must ensure a majority of machines from the previous view agree on the
      new view
    Must reach agreement even with network delays and maybe more failures
    If a majority agree, at least one has seen every committed op in the
      last view
  Distributed agreement: [simplified] view-change protocol
    Some node M (the "view manager") suspects the primary is down
    M broadcasts a VIEW-CHANGE(new-view-ID) message
      new-view-ID chosen to be greater than any previously seen view-ID
      Least significant bits set to M's machine ID to ensure uniqueness
    Replicas accept if new-view-ID is higher than any previously seen value
      When accepting, a replica sends a reply to VIEW-CHANGE specifying:
        - the highest viewstamp the replica has seen
        - whether or not the replica was the primary in that view
      Meanwhile, if M receives a VIEW-CHANGE with a higher new-view-ID, it
        abandons its own
    When a majority of nodes ACK and (if not all replied) a timeout expires:
      M forms the new view out of existing replicas
        Choose the same primary as the last view if that node is still
          available
        Otherwise, pick the node with the highest viewstamp as the new primary
      Notify the new primary, which will flush all old transactions to backups
  Good properties of the algorithm:
    During the normal case, a majority of nodes is required to commit
      operations
    During a view change, a majority is required to form the new view
    Therefore at least one node in the new view will know all committed
      operations
    Network partitions cannot cause inconsistency
  Note 1: This description ignores "crashed" nodes that may have lost state
  Note 2: This is sort of an instance of the Paxos algorithm
    (For each view, one instance of Paxos is used to select the next view)

So what is the motivation for the Chubby lock service?
  Many applications need to agree on some metadata consistently
  Hard to get the details of agreement right--implement once and for all
  What are example applications?
    GFS--a cell must have at most one master, and everyone should agree on it
    Bigtable
  Why a service and not a library?
    Easier to adapt existing software without restructuring it
    Often need a well-known rendezvous service anyway, and DNS is suboptimal
    Programmers adopt it because they *think* they understand locks
      (but don't)
    Reduces the number of clients required for smaller systems
      E.g., can make progress with only one client if it holds the lock

What does the interface look like?
  Open(handle, name, mode) - opens a file relative to handle
    (for bootstrapping, a handle for "/" is supplied by the library)
  Close(handle) - destroys handle
  Poison(handle) - makes all further ops fail without deallocating the handle
  GetContentsAndStat(handle) - returns contents and metadata
  SetContents(handle) - sets file contents
  SetACL(handle) - sets the access control list
  Delete(handle) - deletes a file or empty directory
  Acquire(), TryAcquire(), Release() - shared/exclusive locks, one per file
What's in a handle?
  Probably some database record key
  The mode in which the file is opened
  "Check bits"--probably a cryptographic MAC (message authentication code)
    Allows a new master to check that the handle was not forged
    But could allow a faulty client to re-open a handle after fail-over
What is the metadata in a node/file?
  - Instance number (differs if a file is created with a deleted file's name)
  - Content generation number (changes when the file is written)
  - Lock generation number (increases when the lock transitions free->held)
  - ACL generation number (increases if ACL names are written)
How does access control work?
  Indirection, through files in an ACL directory that contain principal names

How to implement fine-grained locks with Chubby?
  Hash into buckets (like the Hotmail example)
  Use Chubby to claim a lock on a bucket
  Within a bucket, handle locking in an instance of an application-specific
    server
  Use a sequencer to make sure the app's server didn't lose its lock--what's
    a sequencer?
  GetSequencer() - returns a "sequencer" value valid until the lock changes
    hands
    Encodes the lock name, mode (shared/exclusive), and lock generation number
  CheckSequencer() - checks that a sequencer is still valid
  SetSequencer() - associates a sequencer with a particular handle
    So if the sequencer becomes invalid, the handle automatically stops
    working
How would you use a sequencer?
  When you get a lock from the app server, it tells you the sequencer from
    its Chubby lock
  Whenever you use the app-specific lock, call CheckSequencer
  Note: Assumes you can abort/roll back whatever you are doing

How do clients find the Chubby master (primary)?
  All Chubby replicas are stored in DNS
  Any Chubby server will reply with the identity of the master

Chubby is heavily optimized for reads
  In viewstamped replication, all operations go to a majority of replicas
    This includes reads--how does Chubby optimize this?
  The primary/"master" gets a "master lease"
    Essentially a promise by the other Chubby replicas not to initiate a
      view change
    Then the master just replies to reads without contacting replicas
    Good idea, since replicas initiate view changes after a timeout anyway
  This allows Chubby to go one step further--clients can cache data--how?
    Clients have leases (but one per session, not one per file as in LBFS)
    When a write comes in, the server sends an invalidation to all clients
      If all caching clients ack, the write proceeds; else, wait for the
        lease timeout
      Note: The write blocks while waiting to invalidate caches
    This also allows the server to send (uncacheable) replies to other reads
      So reads never need to block (though could maybe be unlucky if a read
        reply is delayed???)

What happens when a Chubby replica fails permanently?
  The system automatically notices
  Allocates a new server (fires off the lock server on some new machine)
  Automatically updates the DNS
  Once DNS is updated, updates the list of replicas stored in Chubby itself
  The new replica can vote in the next view change once it has committed the
    ops from the master

How sensitive is the system to synchronized clocks?

What are the ideas in this paper?
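The write-path invalidation described above (block the write until every caching client has acked or its session lease has expired) can be sketched in miniature. This is a hedged model, not Chubby's implementation: `Session`, `write`, and `LEASE_SECONDS` are hypothetical names, and real invalidations travel over keep-alive RPCs.

```python
# Sketch of lease-based cache invalidation on the write path.
# Names and structure are illustrative assumptions, not Chubby's API.

LEASE_SECONDS = 12.0   # hypothetical per-session lease duration

class Session:
    def __init__(self, now):
        self.lease_expiry = now + LEASE_SECONDS
        self.cached = set()        # file names this client is caching
        self.acked = False

def write(sessions, name, now, ack_fn):
    """Invalidate all cached copies of `name`; return the earliest time
    the write may commit.  ack_fn models whether a client acks promptly."""
    waiting = [s for s in sessions if name in s.cached]
    for s in waiting:
        s.acked = ack_fn(s)        # send invalidation; client may ack
    # Any holdout forces the write to block until its lease times out.
    deadline = max((s.lease_expiry for s in waiting if not s.acked),
                   default=now)
    for s in waiting:
        s.cached.discard(name)     # entry is now invalid everywhere
    return deadline
```

Note how an unreachable client delays the write by at most one lease period, which is why the text says the write blocks on the lease timeout rather than forever.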
What are the scaling issues?

How is the evaluation?
  Very anecdotal

Could the whole thing be much simpler?

* Other questions from e-mail:
  Why use keep-alives instead of having the server send out messages?
  Why aren't directory modification times maintained?
  How hard would it be to add path-dependent permission semantics?
  What exactly are "check digits"?  What if the client is malicious?
  - System structure: what if a majority of the servers in a cell are down
    (possibly due to a network partition)?
  - Is the understanding right when I say that a 'file' in Chubby is
    equivalent to a lock, and this file can have data as stored in the DB?
  - How does partitioning help scale the service?
  - How does Chubby's name service compare with DNS's name service mechanism?
  - How can Chubby's design be modified to support a large-scale
    publish/subscribe service?
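The sequencer mechanism discussed earlier (a token encoding lock name, mode, and lock generation number, checked before acting on a request) can be sketched as follows. This is a simplified single-process model under stated assumptions: `Lock`, `get_sequencer`, and `check_sequencer` here are illustrative stand-ins for the paper's GetSequencer/CheckSequencer calls, with no RPC or sessions.

```python
# Sketch of the sequencer pattern: a stale lock holder's token is
# rejected once the lock has changed hands.  Names are illustrative.

from collections import namedtuple

Sequencer = namedtuple("Sequencer", "lock_name mode generation")

class Lock:
    def __init__(self, name):
        self.name = name
        self.generation = 0        # bumps on each free->held transition
        self.held = False

    def acquire(self):
        if self.held:
            return False
        self.held = True
        self.generation += 1
        return True

    def release(self):
        self.held = False

    def get_sequencer(self):
        # Valid only until the lock changes hands (generation bumps).
        return Sequencer(self.name, "exclusive", self.generation)

    def check_sequencer(self, seq):
        return (self.held
                and seq.lock_name == self.name
                and seq.generation == self.generation)
```

This is why a server that crashed, lost its lock, and came back cannot silently keep acting as the lock holder: its old sequencer carries a stale generation number.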