Dynamo: Amazon's Highly Available Key-value Store
=================================================

Brief intro to linearizability
  Suppose you had one server getting requests from many clients
  Concurrent requests can be arbitrarily ordered by the network, however:
    1. The server will execute requests in some particular order
    2. Non-concurrent requests will be executed in temporal order
  The above two properties are known as *linearizability*
  No big deal with one server, but the gold standard for distributed systems--why?
    Easy to reason about, but costly to achieve
    Danger: under network partitions, could end up with "divergent" states
  Paxos lets you achieve linearizability with very few assumptions
  In class we are studying systems that push the envelope or "cheat" a bit

Dynamo is a simple key-value store, used in a production environment
  Amazon's applications do not need the full power of SQL queries
  But all else being equal, traditional consistency would be fine...

What are traditional "ACID" database semantics (§2.1)?
  Atomicity - all or none of a transaction's actions happen
  Consistency - a transaction leaves data in a valid state (all invariants hold)
  Isolation - all actions in a transaction appear to happen before or after
    those of other transactions
  Durability - effects of committed transactions survive reboots

Why doesn't Amazon want to pursue traditional ACID semantics?
  Problem: consistency vs. availability and response time (§2.3)
  In many scenarios, it is actually better to get a wrong answer than a slow
    answer or none at all
  Exacerbating the problem is the focus on 99.9th-percentile latency--why?
    Exceeding the latency target costs Amazon serious money (blocks purchases)
    A single end-user operation may require many Dynamo operations
    Any one of them could blow the whole operation and lose a customer
    Strict SLAs ensure such failures are kept to a minimum

What does the Dynamo API look like (§4.1)?
  get(key) -> (context, list of values)
  put(key, context, value) -> void
  What is this weird context value from the client's point of view?
    Just an opaque string of bytes that must be sent back in a put
    But the idea is to help the server resolve conflicts (more on this later)

What are the issues in deciding how to spread keys amongst the servers?
  Replicate values on at least N machines (in case of failure)
  Handle heterogeneous nodes (new machines more powerful than old)
  Minimize churn when machines are added or removed

What is consistent hashing?
  Idea: assign keys to nodes so as to minimize churn on addition/deletion
  Use the ring trick shown in Figure 2, used by Akamai
  Popularized by Chord, a multi-hop key-value storage system
  Note: each server needs many tokens to spread join/leave effects
  How to handle heterogeneity?  Scale the number of tokens with the power
    of the machine
  (A small sketch of the ring and preference lists appears below)

Some good alternatives to the Chord ring to know about:
  Kademlia - store on the server with token T that minimizes T XOR key
    Popularized by trackerless BitTorrent (and some malware...)
    Easy to find servers with a tree data structure
    Nearby servers will tend to share the same key subspace in a symmetric way
  CARP - store on the server S that maximizes: machine-power * Hash(S, key)
    + No need for multiple tokens--joins/leaves spread load uniformly
    - Need to re-hash everything on joins/leaves

How does Dynamo achieve geographic replication?
  §4.6: "In essence, the preference list of a key is constructed such that
    the storage nodes are spread across multiple data centers."
  Hmm... not clear how that is compatible with consistent hashing
  Maybe they ensure this while doing token assignment to servers?
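A minimal sketch of the ring-with-tokens idea and distinct-node preference
lists, assuming MD5 as the ring hash (as in §4.3 of the paper); the class,
node names, and token counts are illustrative, not Dynamo's actual code:

  import hashlib
  from bisect import bisect_right, insort

  def ring_hash(s):
      """Place a string on the ring; the paper hashes keys with MD5."""
      return int(hashlib.md5(s.encode()).hexdigest(), 16)

  class Ring:
      """Consistent-hashing ring with virtual nodes ("tokens")."""

      def __init__(self):
          self._positions = []   # sorted token positions on the ring
          self._owner = {}       # token position -> physical node name

      def add_node(self, node, num_tokens):
          # Each physical node gets many tokens; more powerful machines get
          # more, so they take a proportionally larger share of the key space.
          for i in range(num_tokens):
              t = ring_hash("%s#%d" % (node, i))
              self._owner[t] = node
              insort(self._positions, t)

      def preference_list(self, key, n):
          # Walk clockwise from the key's position, collecting the first n
          # *distinct* physical nodes (skip tokens of nodes already chosen).
          start = bisect_right(self._positions, ring_hash(key))
          nodes = []
          for i in range(len(self._positions)):
              pos = self._positions[(start + i) % len(self._positions)]
              node = self._owner[pos]
              if node not in nodes:
                  nodes.append(node)
              if len(nodes) == n:
                  break
          return nodes

  ring = Ring()
  ring.add_node("A", num_tokens=8)
  ring.add_node("B", num_tokens=8)
  ring.add_node("C", num_tokens=16)   # e.g., twice as powerful => twice the tokens
  print(ring.preference_list("shopping-cart:12345", n=3))  # 3 distinct nodes

Note how adding or removing one node only moves the keys adjacent to that
node's tokens, which is the whole point of the ring over naive mod-N hashing.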
What is a "quorum technique"?
  Say you store a value on N servers and care about linearizability...
  A write must reach some write-set of W servers before being deemed complete
  A read must hear from some read-set of R servers before being deemed complete
  If every write-set and every read-set have a non-empty intersection:
    Guarantees a reader will see the effects of a previously completed write
    An easy way to guarantee this is to ensure R + W > N
    Convince yourself this works even with failures within the fixed N nodes
  Each individual Dynamo instance is configured with its own N, R, W

What happens when a client issues a request (§4.5)?
  A key's N successors around the Chord ring are its *preference list* (§4.3)
  Option 1: send to a generic load balancer, which sends to a Dynamo server
  Option 2: clients have the server map and know where to send (§6.4)
    Clients poll a random server every 10 seconds for membership changes
  Then: if the receiving server is not in the preference list, forward to a
    server (ideally the first) in the preference list
    Will usually only be necessary for Option 1 above
    Forwarding is required for writes, optional for reads
  Wait for W or R servers (depending on the op), then reply
  Keep waiting for stragglers, to update them and simplify anti-entropy (§5)

What happens when a server is down or partitioned (§4.6)?
  Hinted handoff: write values to the next server around the ring
    E.g., in Fig. 2 with N=3, if B is down, store key K on E
    But tell E that the write was really intended for B
    Server E stores hinted values in a separate file,
      making it easy to transfer the whole file to the target (B) when it recovers
  What does this do to our quorum algorithm!?!
    Basically breaks it--they call this a "sloppy quorum"
    But it fits the philosophy of prioritizing availability over correctness

What happens if a server is permanently down?
  An administrator must manually decide this and configure the system
  Fortunately, new writes have been going to the right place
    But old values may now be replicated at only N-1 servers
  The administrator may also add new nodes to replace old ones
  After an add/remove/failure--how to ensure state is properly replicated?
    Transfer the needed ring arcs from other servers
    A confirmation round (§4.9) ensures you don't get something you don't need
    In many cases, don't need to transfer a whole arc
      Just want to fix up a few missing entries
      Merkle trees make it easy to compare state when only small parts are missing

How does every node know the composition of the whole Chord ring?
  Only a human administrator can change ring membership
  Gossip protocol (§4.8) - every second, exchange membership info with a random peer
  When adding servers, what's a seed? (§4.8.2)
    A server known to all--want to make sure you don't get divergent parallel rings

What can lead to an update conflict?
  Servers partitioned, multiple coordinators
  Concurrent client updates

How are such conflicts resolved?
  Servers may resolve "syntactically" using vector clocks (discuss; see the
    vector-clock sketch below)
  Note Dynamo truncates vector clocks at 10 entries
  Failing that, clients get multiple answers and resolve them "semantically"
  Can also have servers just do something (last update wins, based on timestamps)

How does Dynamo modify the Chord ring for load balancing (§6.2)?
  Split the ring into fixed, equal-size arcs (Figure 7, strategy 3)
  How does this help the system?
    Store keys for each segment in a separate file--makes transfer easy
  How does this reduce the size of membership info by three orders of magnitude?
    Could draw a horizontal line on the graph in Fig. 8 and see this
      But you wouldn't necessarily run Dynamo in those two states
    Segment numbers are smaller than MD5 hashes, but not 1,000x smaller
    Not totally clear...

No global failure state (§4.8.3)--why?
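A minimal sketch of the "syntactic" reconciliation step referenced above:
each version of a value carries a vector clock (node -> counter map); a
version dominated by another can be dropped, and whatever survives is
returned to the client for semantic merging.  This is illustrative only--the
node names, value shapes, and helper functions are made up, not Dynamo's API:

  def descends(a, b):
      """True if vector clock a has seen everything clock b has (a >= b)."""
      return all(a.get(node, 0) >= count for node, count in b.items())

  def bump(clock, coordinator):
      """The write coordinator increments its own entry before storing a new
      version.  (Dynamo also truncates clocks past ~10 entries by dropping
      the oldest pair; that bookkeeping is omitted here.)"""
      clock = dict(clock)
      clock[coordinator] = clock.get(coordinator, 0) + 1
      return clock

  def reconcile(versions):
      """Drop any version strictly dominated by another; survivors are either
      a single winner or a set of concurrent versions the client must merge
      semantically (e.g., taking the union of shopping-cart items)."""
      survivors = []
      for value, clock in versions:
          dominated = any(descends(other, clock) and not descends(clock, other)
                          for _, other in versions)
          if not dominated:
              survivors.append((value, clock))
      return survivors

  v1 = (["book"],        {"Sx": 2})                # old cart
  v2 = (["book", "pen"], bump({"Sx": 2}, "Sx"))    # descends from v1
  v3 = (["mug"],         bump({"Sx": 2}, "Sz"))    # concurrent with v2
  print(reconcile([v1, v2, v3]))   # v1 dropped; v2 and v3 returned as a conflict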
Durability compromises (§6.1)

Background tasks (§6.5)

What types of task might you tune N, R, W for?  (see the illustration below)

What are the scaling bottlenecks?
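An illustration of how the N, R, W knobs trade off.  The (3,2,2) row is the
common configuration reported in §6 of the paper; the other rows are
plausible tunings for discussion, not documented Amazon settings:

  def quorums_overlap(n, r, w):
      """R + W > N guarantees every read quorum intersects every write quorum
      (before the sloppy-quorum / hinted-handoff caveat above)."""
      return r + w > n

  # (N, R, W) settings and the kind of workload you might tune them for.
  configs = [
      (3, 2, 2, "balanced; the paper's common configuration"),
      (3, 1, 3, "read-heavy: fast reads, but writes must reach every replica"),
      (3, 3, 1, "write-heavy: fast, highly available writes, weaker durability"),
      (3, 1, 1, "lowest latency, but reads may miss the latest write"),
  ]
  for n, r, w, note in configs:
      print("N=%d R=%d W=%d overlap=%-5s %s" % (n, r, w, quorums_overlap(n, r, w), note))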