Replication and consistency
===========================

Let's say you want to run a service like Hotmail
  10s to 100s of millions of mailboxes, stored in different machine clusters
  Must keep mapping of mailboxes to clusters
    E.g., compute bucket = H(mailbox) % 0x10000, to get 16-bit bucket value
    Store map of 65,536 bucket -> cluster mappings
    Can migrate buckets as new clusters added, or to load balance
  Two critical properties cluster map must have:
    1. Highly available (or all of Hotmail grinds to a halt)
    2. Consistent throughout system (or mail gets delivered to wrong cluster)
  #1 requires we replicate the map on multiple machines to tolerate failure

First attempt: use two-phase commit (2PC)
  Machine C ("coordinator") wants to change the bucket map
  C broadcasts "PREPARE to change bucket entries x, y, z" to all replicas
    Each replica replies with either a YES or NO "vote"
      Might say NO if already in the middle of changing x, y, or z
    Once a replica votes YES, it stops allowing clients to see x, y, or z
  C collects replies
    If anyone voted NO, broadcasts ABORT message
      Replicas discard changes and once again allow access to x, y, z
    Otherwise, broadcasts COMMIT
      Replicas apply changes and once again allow access to x, y, z
  What happens if one of the replicas (but not C) fails?
    It either failed before voting or after--C will know which
    If failed before voting (or voted NO), coordinator can broadcast ABORT
      Fix broken replica, then start over
    If failed after YES, coordinator can send COMMIT
  What happens if C and a replica fail?
    If anyone voted NO or anyone received ABORT from C, then abort
    If anyone received COMMIT from C, then can safely COMMIT
    Otherwise, we are stuck:
      All working replicas "prepared"--meaning could go either way
      Before crashing, C might have sent a COMMIT message that was lost
        May have shown new state to clients, so we must preserve transaction!
      Or maybe the dead replica voted NO, so we can't COMMIT
    Note: even if you bring C back up, its disk may have died, losing state

Second attempt: use three-phase commit (3PC)
  Problem was the "prepared" state, where replicas could go either way
  So now after PREPARE, if everyone sends YES, broadcast PREPARE-COMMIT
  Once all participants ACK PREPARE-COMMIT, broadcast COMMIT
  Why is this better?
    In 2PC, can show clients new state when everyone has voted YES
    In 3PC, only show new state when everyone *knows* everyone has voted YES
    So after failure, repair machines, then COMMIT iff any survivor saw
      PREPARE-COMMIT
  Big problem: asynchronous systems
    - Makes no assumptions about relative speeds of machines
    - Makes no assumptions about delay in message delivery
    Basically means you can't use timeouts
      E.g., can't assume a machine that doesn't respond in 2 minutes is dead
    Pretty much how you need to model the Internet, for example
    Note: this is not the same thing as async/non-blocking I/O
  Why is asynchronous communication bad for 3PC?
    Can only make progress once machines are repaired by a human
    Replicas can't differentiate network partitions from failed machines
      Not safe to proceed if other partition might have divergent values
  Also note 3PC not very efficient
    If many nodes initiate transactions, may have to abort a lot
    Change can't go through until all replicas have voted

Third attempt: viewstamped replication
  Designate one node the primary, and the others backups
  All requests must be sent to the primary
  The primary orders all requests and assigns them "viewstamps"
    viewstamp = view-ID, timestamp
      view-ID only changes when nodes fail (more on this later)
      timestamp is the number of the request in this transaction
  Primary broadcasts to all backups
    Backups execute requests in viewstamp order, so all get same state
  Primary proceeds when at least 1/2 of backups have acknowledged request
    Note: 1/2 of backups plus primary = a majority of replicas
  What to do if primary fails?
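Before turning to primary failure, the normal-case flow just described can be sketched in a few lines of Python. All class and field names here are invented for illustration; backup acknowledgments are modeled as synchronous return values rather than messages:

```python
class Backup:
    def __init__(self):
        self.log = []            # requests in viewstamp order
        self.last_seen = (0, 0)  # highest viewstamp accepted so far

    def accept(self, viewstamp, request):
        # Backups apply requests in viewstamp order, so all reach the
        # same state; stale or duplicate viewstamps are rejected.
        if viewstamp <= self.last_seen:
            return False
        self.log.append((viewstamp, request))
        self.last_seen = viewstamp
        return True              # ACK

class Primary:
    def __init__(self, view_id, backups):
        self.view_id = view_id
        self.backups = backups
        self.seq = 0             # per-view request counter

    def execute(self, request):
        # viewstamp = (view-ID, number of the request in this view)
        self.seq += 1
        viewstamp = (self.view_id, self.seq)
        acks = sum(b.accept(viewstamp, request) for b in self.backups)
        # Proceed once at least 1/2 of backups have acked: half the
        # backups plus the primary itself is a majority of replicas.
        return acks * 2 >= len(self.backups)
```

One design point the sketch makes concrete: viewstamps are compared lexicographically, so any request from a later view outranks every request from an earlier view.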
    Must *agree* on a new primary and set of backups--call this set a view
    Must ensure a majority of machines from previous view *agree* on new view
    Must reach agreement even with network delays and maybe more failures
    If majority agree, at least one has seen every committed op in last view
  Can we use 3PC to reach agreement among replicas?
    Problem: Fischer-Lynch-Paterson (FLP) result for asynchronous systems:
      For a non-trivial consensus problem (agreed-upon value depends on input)
      No protocol is guaranteed to reach consensus if even one machine fails
  Let's sidestep FLP with a different approach:
    3PC is guaranteed to reach consensus if no machine fails
    FLP tells us we can't always reach consensus if a machine fails
    But in an asynchronous system, a failed machine looks like a slow machine
      ... and consequently a slow machine looks like a failed machine
      E.g., might time out and initiate recovery for machine that didn't fail
    So let's use a protocol that's not guaranteed to terminate even without
      failures!  This is what Paxos gives us
  Distributed agreement: [simplified] view-change protocol
    Some node M (the "view manager") suspects the primary is down
    M broadcasts VIEW-CHANGE (new-view-ID) message
      new-view-ID chosen to be greater than any previously seen view-ID
      Least significant bits set to M's machine ID to ensure uniqueness
    Replicas accept new-view-ID if higher than any previously seen value
      When accepting, a replica sends reply to VIEW-CHANGE specifying:
      - highest viewstamp replica has seen
      - whether or not the replica was the primary in that view
    Meanwhile, if M receives VIEW-CHANGE with a higher new-view-ID, abandons
      its own
    When majority of nodes ACK and (if all didn't reply) timeout expires:
      M forms new view out of existing replicas
      Choose same primary as last view if that node is still available
        Otherwise, pick node with highest viewstamp as new primary
      Notify new primary, which will flush all old transactions to backups
  Good properties of the algorithm:
    During normal case,
      a majority of nodes required to commit operations
    During view change, a majority required to form a new view
    Therefore at least one node in new view will know all committed operations
    Network partitions cannot cause inconsistency
  Note 1: This description ignores "crashed" nodes that may have lost state
  Note 2: This is sort of an instance of the Paxos algorithm
    (For each view, one instance of Paxos is used to select the next view)

Chubby
======

So what is motivation for Chubby lock service?
  Many applications need to agree on some metadata consistently
  Hard to get details of agreement right--implement once and for all
What are example applications?
  GFS--a cell must have at most one master, everyone should agree on it
  Bigtable
Why a service and not a library?
  Easier to adapt existing software w/o restructuring
  Often need well-known rendezvous service anyway, and DNS suboptimal
  Programmers adopt because they *think* they understand locks (but don't)
  Reduces clients required for smaller systems
    E.g., can make progress with only one client if it has the lock
What does interface look like?
  Open (handle, name, mode) - opens file relative to handle
    (for bootstrapping, handle may be "/" handle supplied by library)
  Close (handle) - destroys handle
  Poison (handle) - makes all further ops fail w/o deallocating handle
  GetContentsAndStat (handle) - returns contents and metadata
  SetContents (handle) - set file contents
  SetACL (handle) - set access control list
  Delete (handle) - deletes file or empty directory
  Acquire (), TryAcquire (), Release () - shared/exclusive locks, one per file
  Subscribe (handle, events) - get callback when file modified, etc.
What's in a handle?
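The classic use of an interface like the one above is the GFS case: race for an exclusive lock, and whoever wins writes its identity into the file so everyone agrees on the single master. A toy in-memory stand-in (the `ChubbyCell` class and its method names are invented here, not the real API):

```python
class ChubbyCell:
    """Invented in-memory stand-in for a lock service cell."""
    def __init__(self):
        self.files = {}   # name -> contents
        self.locks = {}   # name -> current lock holder, or None

    def open(self, name):
        self.files.setdefault(name, b"")
        return name       # in this toy, a handle is just the name

    def try_acquire(self, handle, who):
        # Exclusive lock: succeeds only if nobody currently holds it.
        if self.locks.get(handle) is None:
            self.locks[handle] = who
            return True
        return False

    def set_contents(self, handle, data):
        self.files[handle] = data

    def get_contents(self, handle):
        return self.files[handle]

def elect(cell, me):
    # Whoever wins the lock records its identity in the file, so every
    # client that loses can read the agreed-upon master from the file.
    h = cell.open("/ls/cell/gfs-master")
    if cell.try_acquire(h, me):
        cell.set_contents(h, me.encode())
        return True       # we are the master
    return False          # someone else won; read the file to find them
```

This is why "can make progress with only one client": a single client holding the lock is enough to decide who the master is.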
  Probably some database record key
  Mode in which the file is opened
  "Check bits"--probably a cryptographic MAC (message authentication code)
    Allows new master to check that handle was not forged
    But could allow faulty client to re-open handle after fail-over
What is the metadata in a node/file?
  - Instance number (differs if create file with deleted file's name)
  - Content generation number (changes when a file written)
  - Lock generation number (increases when lock transitions free->held)
  - ACL generation number (increases if ACL names are written)
How does access control work?
  Indirection, through files in ACL directory that contain principal names
Does Chubby work with fine-grained locking?
  E.g., would you use one lock per Hotmail mailbox?  No
    Overhead is too high
  Designed for coarse-grained locks
    E.g., acquire once per week if your application primary fails
How would you implement fine-grained locks with Chubby?
  Hash into buckets (like Hotmail example)
  Use Chubby to claim lock on bucket
  Within buckets, handle locking in instance of application-specific server
  Use sequencer to make sure app's server didn't lose lock--what's a
    sequencer?
    GetSequencer () - returns "sequencer" value valid until lock changes hands
      Encodes lock name, mode (shared/exclusive), and lock generation number
    CheckSequencer () - checks that a sequencer is still valid
    SetSequencer () - associates sequencer with a particular handle
      So if sequencer becomes invalid, handle automatically stops working
  How would you use sequencer?
    When you get lock from app server, it tells you sequencer from its Chubby
      lock
    Whenever you use the app-specific lock, call CheckSequencer
    Note: Assumes you can abort/roll back whatever you are doing
How do clients find Chubby master (primary)?
  All Chubby replicas stored in DNS
  Any Chubby server will reply with identity of master
Chubby is heavily optimized for reads
  In viewstamped replication, all operations go to majority of replicas
  This includes reads--how does Chubby optimize this?
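The sequencer idea above is small enough to sketch before moving on. A sequencer names the lock, its mode, and the lock generation number; a server checks it and rejects requests whose generation is stale because the lock has since changed hands. All names below are invented for illustration:

```python
class Lock:
    """Toy lock whose generation number bumps on each free->held transition."""
    def __init__(self, name):
        self.name = name
        self.mode = "exclusive"
        self.generation = 0

    def acquire(self):
        self.generation += 1
        # The sequencer encodes lock name, mode, and generation number.
        return (self.name, self.mode, self.generation)

    def release(self):
        pass  # the next acquire bumps the generation

def check_sequencer(lock, sequencer):
    # Valid only until the lock changes hands (generation advances).
    name, mode, generation = sequencer
    return name == lock.name and generation == lock.generation

lock = Lock("/ls/cell/mailbox-bucket-42")
seq = lock.acquire()
assert check_sequencer(lock, seq)        # holder's requests go through
lock.release()
lock.acquire()                           # lock changed hands
assert not check_sequencer(lock, seq)    # stale sequencer now rejected
```

This is the fencing pattern: a delayed request from the old lock holder carries an old generation number and is safely refused.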
  The primary/"master" gets a "master lease"
    Essentially a promise by other Chubby replicas not to initiate view change
    Then master just replies to reads without contacting replicas
    Good idea since replicas initiate view change after timeout anyway
This allows Chubby to go one step further--clients can cache data--how?
  Clients have leases (but one per session, not one per file like LBFS)
  When write comes in, server sends invalidate to all caching clients
    If all caching clients ack, write proceeds.  Else, wait for lease timeout
    Note: Write blocks while waiting to invalidate caches
  This also allows server to send (uncacheable) replies to other reads
    So never need to block reads
    (though could maybe be unlucky if read reply delayed???)
What happens when a Chubby replica fails permanently?
  System automatically notices
  Allocates a new server (fires off the lock server on some new machine)
  Automatically updates the DNS
  Once DNS updated, updates list of replicas stored in Chubby itself
  New replica can vote in next view change once it commits ops from master
What if master fails (Fig. 2)?
  When does a client give up?
    When lease expires, clients can't use cached files any more
      Flush the cache, and just make all Chubby operations block
    But if master comes back, don't want to kill application
      Use grace period, during which lease is bad but could still get renewed
  Use Paxos to pick a new master
  New master conservatively estimates all client leases
  New master picks new client epoch number - why?
    Prevents old calls from getting re-executed with new master
    (E.g., old packet may have been floating around the net for a while)
  New master tells clients to flush their caches (fail-over event) - why?
    E.g., old master sent callback, died, then network lost callback
  Only once all caches flushed (or timed out) does master accept new ops
  May see old handles it hasn't seen before, re-create them on the fly
  Delete ephemeral files after a minute or so
How does file mirroring work?
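The invalidate-before-write path above can be sketched as follows. The classes are invented; "block until all acks or lease timeout" is compressed into a `lease_expired` flag on the write call:

```python
class CacheClient:
    def __init__(self, responsive=True):
        self.cache = {}
        self.responsive = responsive  # unresponsive clients never ack

    def invalidate(self, name):
        if not self.responsive:
            return False              # no ack; master must wait out lease
        self.cache.pop(name, None)    # drop the cached copy
        return True

class Master:
    def __init__(self, clients):
        self.data = {}
        self.clients = clients

    def write(self, name, value, lease_expired=False):
        # Send invalidations to every caching client.  The write blocks
        # (modeled as returning False) until all clients ack, or until
        # the non-responders' leases expire -- so no client can ever
        # serve stale data for `name`.
        acks = [c.invalidate(name) for c in self.clients]
        if not (all(acks) or lease_expired):
            return False
        self.data[name] = value
        return True
```

Reads never need this dance: while the write is blocked, the master can still serve `name` to readers, just marked uncacheable.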
  One cell can subscribe to events that change files in another cell
  Copy files over when created or changed
How sensitive is the system to synchronized clocks?
  If server's clock very fast, could think lease expired while client doesn't
  But can easily just be conservative given typical clock drifts
What are the ideas in this paper?
What are scaling issues?
  Lots of KeepAlives - how to deal?  Proxies
  How does a proxy work?
    Lots of clients send KeepAlives to proxy each time period
    Proxy only sends one KeepAlive to master each time period
  What if proxy fails?
    Special call allows a proxy to take over a session lock from another (4.4)
How is evaluation?
  Very anecdotal
Could the whole thing be much simpler?
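The proxy trick from the scaling discussion above fits in a few lines: many clients refresh their sessions at the proxy, and the proxy forwards a single aggregate KeepAlive to the master per period. Class and method names are invented for illustration:

```python
class Master:
    def __init__(self):
        self.keepalives = 0           # load seen by the master

    def keepalive(self):
        self.keepalives += 1

class Proxy:
    def __init__(self, master):
        self.master = master
        self.sessions = set()

    def keepalive(self, client_id):
        # Each client's KeepAlive only refreshes its session at the proxy.
        self.sessions.add(client_id)

    def tick(self):
        # Once per period: one KeepAlive to the master covers all sessions.
        if self.sessions:
            self.master.keepalive()

master = Master()
proxy = Proxy(master)
for client_id in range(1000):
    proxy.keepalive(client_id)        # 1000 client KeepAlives this period...
proxy.tick()                          # ...become 1 KeepAlive at the master
```

The cost, as the notes hint, is that the proxy is now a failure point of its own, hence the hand-off call in section 4.4.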