COPS (2011)
===========

Administrivia:
  Please sign up for project check-in meetings

Learning goals:
  - Logical clocks (a.k.a. Lamport clocks)
  - Causal consistency
  - Get transactions

Lots of motivation for "ALPS" systems (e.g., Dynamo on Monday)
  Availability, Low-latency, Partition-tolerance, Scalability
  Why doesn't Dynamo solve all of our problems?
    Durability--maybe you can't afford to lose data after a write
    Consistency--maybe you can't afford old or out-of-order data
    Transactions--maybe you need more than a simple key-value store

Review: why not use a giant, scalable ACID database (e.g., we'll see Spanner)?
  CAP Theorem (Gilbert & Lynch) says pick at most two of:
    Consistency, Availability, Partition tolerance
  Intuition: imagine a partition-tolerant system suffering a partition
    Can both partitions keep using the system (w. writes)?  Can't be consistent
    Do all partitions but one block?  Then the system isn't available

In an ideal world we want strict serializability
  Touch arbitrarily many objects in an atomic transaction
  Everybody sees the same order of operations
    If I see update 27, then create update 30
    Then everybody sees my update (30) after update 27
      Or at least knows 30 follows 27 and delays applying 30 until 27 is seen
    That is, we've preserved causality
    Example: 27 posts a photo, 28 reads the photo and comments in 30
  The "strict" part means transactions obey temporal order, too

Can we achieve this in a distributed system?
  Paxos/Raft/PBFT/Harp achieve strict serializability on single keys
    This is equivalent to linearizability
    Useful model for coordination (zookeeper), file systems (Harp)
    With optimizations like leader leases, can avoid logging reads (e.g., 28 above)
  How to extend to arbitrary transactions?
    Introduce two-phase locking, use 2PC across different Paxos groups
    But that's not scalable, and it requires good network connectivity

Consistency isn't binary linearizable-or-eventual (Fig. 3)
  Linearizable > Sequential > Causal+ > Causal > FIFO > Per-key sequential > Eventual
  Linearizable (Zookeeper): preserves temporal order of overlapping ops
  Sequential: as if all reads and writes happen one at a time globally
    Review: what's an example of Seq. Cst. that isn't Linearizable?
      Total order differs from temporal order--violates external consistency
      Write completes, I call you on the phone, you read and see old data
  Causal, Causal+: we'll get to these
  FIFO: operations by the same client never re-ordered, but no global order
  Per-key sequential (PNUTS): per-record, everyone sees the same write timeline
  Eventual (Dynamo): go long enough w/o writing a key, all will read the same val

Say A posts a photo, B posts a comment about the photo, C sees the comment
  Which consistency models guarantee that C sees the photo?  Causal
  Why not FIFO or per-key sequential?
    A and B are different clients writing different feeds, so no global order
  Can we fix this by issuing a zookeeper-like "sync()" before writing?
    Wait until all pending writes from all users propagate to all replicas
    Works, but: slow, and no availability under a partition
      At that point might as well go for linearizability...
    Would we even have sequential consistency w. sync?  No--no write atomicity
      E.g., two clusters could receive two writes in different orders

Today's questions:
  Can we preserve causal order in a decentralized, scalable way?
  Can we move beyond single-key operations w/o locks/aborts?

Fortunately, we don't need to sync all writes
  Only the ones that might have *caused* our writes (e.g., A's photo)
  Boils down to two problems:
    1. Apply writes to a key in the same order everywhere (or get the same result)
    2. If B reads A's write, make sure B also sees anything A read

Our task: get 1-2 while sharding data, replicating shards across clusters
  Complete reads and writes talking only to the local replica
  Requiring only the local replica gives the ALP of ALPS; sharding gives the S

Can we solve this by timestamping all writes, last writer wins?
  Store tuples <ts, id, val>
    ts is a timestamp on the local replica (real-time clock)
    id is the identity of the replica (or the replica's cluster)
  Introduce a local "log server" in each cluster
    All writes to all shards get pushed to the log server, which assigns ts
    Push tuples to other clusters, in ts order, in the background
  How to solve #1?  Always keep the highest ts for each key k (break ties by id)
  What about #2?  Must delay revealing newly received writes to the local cluster
    Say ts_i is the highest ts seen from the log server of cluster i
    Set ts_min to the min of all ts_i; hide a write with timestamp ts until ts_min >= ts
  Do we like this scheme?  No
    Single log per cluster undermines scalability (O(1) tps, not O(#shards) tps)
    Kind of lose write availability if a single cluster is down
    Vulnerable to clock skew--a fast clock on one log server = discarded writes
      A later write's local timestamp can be less than the fast timestamp
      already assigned to the earlier write it depends on!

Let's avoid the clock skew problem with logical clocks (a.k.a. Lamport clocks)
  Calculate a single timestamp integer for each operation
    Such that if A caused B, then B.ts > A.ts
  First, let's define "happens before," written ~>:
    - If op1 precedes op2 in the same thread, say op1 -> op2
    - If op1 "puts into" op2 (op2 sees op1), then say op1 |-> op2
    - op1 causally precedes op2, written op1 ~> op2, iff:
        1. op1 -> op2,  2. op1 |-> op2,  or
        3. op1 ~> op' and op' ~> op2 for some op' (transitivity)
  So we must ensure that if A ~> B, then B.ts > A.ts
  Choose timestamps based on a logical clock Ti at each server Si
    On each local operation, set Ti = max(realtime, Ti+1)
    Upon receiving a message m, set Ti = max(Ti, m.ts)
  Now if one cluster has a fast clock, the Tis will just become counters
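A minimal sketch of these clock rules (Python; the class and method names are
illustrative, not from the paper):

  import time

  class LamportClock:
      def __init__(self):
          self.t = 0                              # Ti at this server

      def local_op(self):
          # On each local operation: Ti = max(realtime, Ti + 1)
          self.t = max(int(time.time()), self.t + 1)
          return self.t                           # timestamp assigned to the op

      def on_receive(self, m_ts):
          # Upon receiving a message m: Ti = max(Ti, m.ts)
          self.t = max(self.t, m_ts)

  # If A ~> B, every step in the chain either stays on one server (local_op
  # bumps Ti) or crosses a message (on_receive raises Ti to at least A.ts),
  # so B's local_op yields B.ts > A.ts even under clock skew.

The max(realtime, Ti+1) rule keeps timestamps near wall-clock time when clocks
are sane, but correctness never depends on it.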
How do we move beyond the availability problem?
  What if, instead of ts_min, each write carried a vector clock vts = <ts_1, ..., ts_n>?
  When can you apply a tuple in the local cluster?
    Wait until, for all i, ts_i >= vts[i]
  And if a cluster is unavailable, let log servers download third-party logs from each other
  Drawbacks?  Still no scalability, and it may introduce a lot of latency

So how do we move beyond a single log server per datacenter?
  Shard servers communicate directly with *equivalent nodes* at other clusters
  Keep track of dependencies *per operation*, not per datacenter
  Introduce a context argument to the API (like Dynamo) [§4.3, p.6]

What's the difference between causal and causal+ consistency?
  Use commutative, associative reconciliation (e.g., highest ts wins)
  Causal: each client sees only increasing ts values for a particular key
    Each get returns versions at least as recent as those "put into" its previous gets
  Causal+: a client also can't see conflicting versions of a particular key
    E.g., on the last page, one client cannot see sigma_1 and then sigma_2,3
  Why do we care?  Suppose you don't use highest-ts-wins, e.g.:
    - A1 adds a photo
    - A2 adds another photo
      A2.ts > A1.ts, but the two conflict
    - A3 merges A1 and A2 to show both photos
    - B comments on the photo posted in A1
    - C reads A2 and B, but can't see the photo posted in A1

Why do we want multi-get consistency?
  Example of an ACL and Spring break photos
  Say the app does (1) get(ACL), then (2) get(photo-list)--what is the TOCTTOU problem?
    You change the ACL and then add a photo between (1) and (2)
  What if the app does (1) get(photo-list), then (2) get(ACL)?
    Now the problem is changing the ACL and then removing a photo between (1) and (2)
  This is solved by get transactions in COPS-GT

What is the interface the client library uses to access the local data store?
  * (bool,vers) <- put_after(k,val,[deps],nearest,vers=0)
    If vers=0, the local server assigns the value of its local Lamport clock
    Commits val only after nearest has been committed in each cluster
      Note this happens automatically in the local cluster (which is linearizable)
  * bool <- dep_check(key,vers)
    Blocks until the needed version is written (or returns false on timeout)
  * (val,vers,deps) <- get_by_version(k,vers=LATEST)
  * Also, global-checkpoint-time (i.e., ts_min) is piggybacked on responses
    Each server keeps track of the local vers guaranteed committed in all clusters
    Compute the global value by talking to servers of all key ranges in the cluster, 10x/sec

What happens when you invoke get(k,ctx) in the COPS library?
  - Call get_by_version with vers=LATEST
  - In COPS, add (k,vers) to the dependencies in the local ctx
  - In COPS-GT, add (k,vers,deps) to the local context

What happens when you invoke put(k,v,ctx) in the COPS library?
  - Library calls put_after with vers=0
    nearest = all dependencies not subsumed by others [Fig. 6]
    deps = all dependencies in COPS-GT, else empty
  - Server in the local cluster assigns vers, commits locally,
    returns vers to the client; the new (k,vers) supersedes all previous entries in nearest
  - Local server forwards the write to equivalent nodes in other clusters
    Remotely invokes put_after (with a non-zero vers)
    Remote equivalent nodes call dep_check on nearest within their cluster
    Once dep_check succeeds, they can commit the new value
  - Once all clusters complete the write, (key,version) gets flagged *never-depend*
  Why only dep_check nearest, instead of all deps?  Transitivity
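A minimal sketch of the client-library bookkeeping for plain COPS (not -GT),
assuming a local-store stub exposing the get_by_version/put_after calls named
above; the Context class and LATEST sentinel are illustrative:

  LATEST = None                                   # illustrative sentinel

  class Context:
      def __init__(self):
          self.deps = {}                          # key -> version we depend on

  def cops_get(store, k, ctx):
      val, vers, _deps = store.get_by_version(k, LATEST)
      ctx.deps[k] = max(vers, ctx.deps.get(k, 0))       # remember what we read
      return val

  def cops_put(store, k, val, ctx):
      # nearest = dependencies not subsumed by others; with the bookkeeping
      # here (last put plus gets since then), that is everything in ctx.deps
      nearest = list(ctx.deps.items())
      ok, vers = store.put_after(k, val, [], nearest)   # local server picks vers
      # The new (k,vers) subsumes the old dependencies: anything that later
      # depends on this put transitively depends on all of them
      ctx.deps = {k: vers}
      return ok

The reset of ctx.deps is what the garbage-collection discussion below calls
keeping only the last put plus the gets since that put.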
How does a get transaction work?  [Figure 7]
  - Library calls get_by_version(k,LATEST) for each key
  - For each k, find the latest version ccv[k] that a different key depended on
  - If k's version is less than ccv[k], issue get_by_version(k, ccv[k])
  - Update the context to include all keys and the versions ultimately read
  Why are 2 rounds of get_by_version always sufficient?
    All ccv[k] versions were locally committed at the time of the first round
    So their dependencies must have been transitively committed
    Thus, the second round won't introduce any new dependencies
  What goes wrong if we use nearest instead of deps in Figure 7?
    Suppose A1, B1, C1 have no dependencies, A2 |-> B2 and B2 |-> C2
    Now thread T1: put(A2), put(B2), put(C2)
    Concurrently, thread T2: get_trans(A,B,C) -> A1, B1, C2
    The first round flags the need to fetch B2, but not A2--violates causal order
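A sketch of the two-round protocol of Figure 7, reusing the Context/LATEST
helpers from the sketch above and assuming get_by_version returns
(val, vers, deps) with deps as a list of (key, version) pairs (names are
illustrative, not the paper's code):

  def get_trans(store, keys, ctx):
      # Round 1: optimistically fetch the latest local version of every key
      results = {k: store.get_by_version(k, LATEST) for k in keys}

      # ccv[k] = latest version of k that some *other* returned value depends on
      ccv = {k: 0 for k in keys}
      for k, (_val, _vers, deps) in results.items():
          for dk, dv in deps:
              if dk in ccv and dk != k:
                  ccv[dk] = max(ccv[dk], dv)

      # Round 2: re-fetch any key whose first-round version is too old
      for k in keys:
          if results[k][1] < ccv[k]:
              results[k] = store.get_by_version(k, ccv[k])

      # Update the context with the versions ultimately read
      for k, (_val, vers, _deps) in results.items():
          ctx.deps[k] = max(vers, ctx.deps.get(k, 0))
      return {k: results[k][0] for k in keys}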
What needs to be garbage collected?
  COPS-GT: old versions (kept around for get_by_version)
    Add a trans_time (5 second) timeout on get transactions
    So we only need to keep old versions for trans_time + max_clock_drift
  COPS-GT: the deps fields returned by get_by_version
    Can be cleaned trans_time after the never-depend flag gets set
    What happens when a cluster is offline?  Pause garbage collection
  COPS+COPS-GT: nearest and (for -GT) deps in ctx structures
    nearest: only keep the last put plus all gets since that put
    deps: remove any never-depend versions, plus their transitive dependencies
      Also remove any deps older than the global-checkpoint-time (ts_min)

How would you implement shopping-cart union?  Need COPS-CD (conflict detection)
  - put operations carry a new prev_vers argument
  - put operations implicitly depend on the prev_vers of the key
  - invoke an application-specific reconciliation routine on conflicting writes

Does COPS actually solve the B-comments-on-A's-photo problem?
  Maybe not if the HTTP requests go to different clusters, but that would be rare
  Would have to push COPS down to the JavaScript layer to guarantee it 100%
Does COPS provide external consistency?  (by phone: "hey, check out my picture")
  No, it can only track dependencies within the system

Evaluation--what questions should we ask?
  Is Causal+ a useful model for real systems?
  Does it scale?  (this one at least is answered in the graphs)
  How does it compare to other approaches?

Do people use causal consistency in deployed storage systems?  Not much
  What are the popular consistency models?
    Linearizability - Raft, zookeeper, PBFT, blockchains
    Strict serializability - Spanner (later)
    Eventual consistency - Dynamo, also Cassandra (if so configured)
    Per-key sequential consistency - PNUTS