COPS (2011)
===========

Administrivia:
  Please sign up for project check-in meetings

Learning goals:
  - Logical clocks (a.k.a. Lamport clocks)
  - Causal consistency
  - Get transactions

Lots of motivation for "ALPS" systems (e.g., Dynamo on Monday)
  Availability, Low-latency, Partition-tolerance, Scalability
  Why doesn't Dynamo solve all of our problems?
    Durability--maybe you can't afford to lose data after a write
    Consistency--maybe you can't afford old or out-of-order data
    Transactions--maybe you need more than a simple key-value store

Review: why not use a giant, scalable ACID database (e.g., we'll see Spanner)?
  CAP Theorem (Gilbert & Lynch) says pick at most two of:
    Consistency, Availability, Partition tolerance
  Intuition: imagine a partition-tolerant system suffering a partition
    Can both partitions keep using the system (w. writes)?  Can't be consistent
    Do all partitions but one block?  Then the system isn't available

In an ideal world we want strict serializability
  Touch arbitrarily many objects in an atomic transaction
  Everybody sees the same order of operations
    If I see update 27, then create update 30
    Then everybody sees my update (30) after update 27
      Or at least knows 30 follows 27 and delays applying 30 until 27 is seen
    That is, we've preserved causality
    Example: 27 posts a photo, 28 reads the photo and comments in 30
  The "strict" part means transactions obey temporal order, too

Can we achieve this in a distributed system?
  Paxos/Raft/PBFT/Harp achieve strict serializability on single keys
    This is equivalent to linearizability
    Useful model for coordination (zookeeper), file systems (Harp)
    With optimizations like leader leases, can avoid logging reads (e.g., 28 above)
  How to extend to arbitrary transactions?
    Introduce two-phase locking, use 2PC across different Paxos groups
    But that's not scalable, and it requires good network connectivity

Consistency isn't binary linearizable-or-eventual (Fig. 3)
  Linearizable > Sequential > Causal+ > Causal > FIFO > Per-key sequential > Eventual
  Linearizable (Zookeeper): preserves temporal order of overlapping ops
  Sequential: as if all reads and writes happen one at a time globally
    Review: what's an example of Seq. Cst. that isn't Linearizable?
      Total order differs from temporal order--violates external consistency
      Write completes, I call you on the phone, you read and see old data
  Causal, Causal+: we'll get to these
  FIFO: operations by the same client never re-ordered, but no global order
  Per-key sequential (PNUTS): per-record, everyone sees the same write timeline
  Eventual (Dynamo): go long enough w/o writing a key, all will read the same val

Say A posts a photo, B posts a comment about the photo, C sees the comment
  Which consistency models guarantee that C sees the photo?  Causal
  Why not FIFO or per-key sequential?
    A and B are different clients writing different feeds, so no global order
  Can we fix this by issuing a zookeeper-like "sync()" before writing?
    Wait until all pending writes from all users propagate to all replicas
    Works, but: slow, and no availability under a partition
      At that point might as well go for linearizability...
    Would we even have sequential consistency w. sync?  No--no write atomicity
      E.g., two clusters could receive two writes in different orders

Today's questions:
  Can we preserve causal order in a decentralized, scalable way?
  Can we move beyond single-key operations w/o locks/aborts?

Fortunately, we don't need to sync all writes
  Only the ones that might have *caused* our writes (e.g., A's photo)
  Boils down to two problems:
    1. Apply writes to a key in the same order everywhere (or get the same result)
    2. If B reads A's write, make sure B also sees anything A read

Our task: get 1-2 while sharding data, replicating shards across clusters
  Complete reads and writes talking only to the local replica
  Requiring only the local replica gives the ALP of ALPS; sharding gives the S

Can we solve this by timestamping all writes, last writer wins?
  Store tuples <ts, id, val>
    ts is a timestamp on the local replica (real-time clock)
    id is the identity of the replica (or the replica's cluster)
  Introduce a local "log server" in each cluster
    All writes to all shards get pushed to the log server, which assigns ts
    Push tuples to other clusters, in ts order, in the background
  How to solve #1?  Always keep the highest ts for each key k (break ties by id)
  What about #2?  Must delay revealing newly received writes to the local cluster
    Say ts_i is the highest ts seen from the log server of cluster i
    Set ts_min to the min of all ts_i; hide a write with timestamp ts until ts_min >= ts
  Do we like this scheme?  No
    Single log per cluster undermines scalability (O(1) tps, not O(#shards) tps)
    Kind of lose write availability if a single cluster is down
    Vulnerable to clock skew--a fast clock on one log server = discarded writes
      A later write's local timestamp can be less than the fast timestamp
      already assigned to the earlier write it depends on!

Let's avoid the clock skew problem with logical clocks (a.k.a. Lamport clocks)
  Calculate a single timestamp integer for each operation
    Such that if A caused B, then B.ts > A.ts
  First, let's define "happens before," written ~>:
    - If op1 precedes op2 in the same thread, say op1 -> op2
    - If op1 "puts into" op2 (op2 sees op1), then say op1 |-> op2
    - op1 causally precedes op2, written op1 ~> op2, iff:
        1. op1 -> op2,  2. op1 |-> op2,  or
        3. op1 ~> op' and op' ~> op2 for some op' (transitivity)
  So we must ensure that if A ~> B, then B.ts > A.ts
  Choose timestamps based on a logical clock Ti at each server Si
    On each local operation, set Ti = max(realtime, Ti+1)
    Upon receiving a message m, set Ti = max(Ti, m.ts)
  Now if one cluster has a fast clock, the Tis will just become counters
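A minimal sketch of these clock rules (Python; the class and method names are
illustrative, not from the paper):

  import time

  class LamportClock:
      def __init__(self):
          self.t = 0                              # Ti at this server

      def local_op(self):
          # On each local operation: Ti = max(realtime, Ti + 1)
          self.t = max(int(time.time()), self.t + 1)
          return self.t                           # timestamp assigned to the op

      def on_receive(self, m_ts):
          # Upon receiving a message m: Ti = max(Ti, m.ts)
          self.t = max(self.t, m_ts)

  # If A ~> B, every step in the chain either stays on one server (local_op
  # bumps Ti) or crosses a message (on_receive raises Ti to at least A.ts),
  # so B's local_op yields B.ts > A.ts even under clock skew.

The max(realtime, Ti+1) rule keeps timestamps near wall-clock time when clocks
are sane, but correctness never depends on it.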
How do we move beyond the availability problem?
  What if, instead of ts_min, each write carried a vector clock vts = <ts_1, ..., ts_n>?
  When can you apply a tuple in the local cluster?
    Wait until, for all i, ts_i >= vts[i]
  And if a cluster is unavailable, let log servers download third-party logs from each other
  Drawbacks?  Still no scalability, and it may introduce a lot of latency

So how do we move beyond a single log server per datacenter?
  Shard servers communicate directly with *equivalent nodes* at other clusters
  Keep track of dependencies *per operation*, not per datacenter
  Introduce a context argument to the API (like Dynamo) [§4.3, p.6]

What's the difference between causal and causal+ consistency?
  Use commutative, associative reconciliation (e.g., highest ts wins)
  Causal: each client sees only increasing ts values for a particular key
    Each get returns versions at least as recent as those "put into" its previous gets
  Causal+: a client also can't see conflicting versions of a particular key
    E.g., on the last page, one client cannot see sigma_1 and then sigma_2,3
  Why do we care?  Suppose you don't use highest-ts-wins, e.g.:
    - A1 adds a photo
    - A2 adds another photo
      A2.ts > A1.ts, but the two conflict
    - A3 merges A1 and A2 to show both photos
    - B comments on the photo posted in A1
    - C reads A2 and B, but can't see the photo posted in A1

Why do we want multi-get consistency?
  Example of an ACL and Spring break photos
  Say the app does (1) get(ACL), then (2) get(photo-list)--what is the TOCTTOU problem?
    You change the ACL and then add a photo between (1) and (2)
  What if the app does (1) get(photo-list), then (2) get(ACL)?
    Now the problem is changing the ACL and then removing a photo between (1) and (2)
  This is solved by get transactions in COPS-GT

What is the interface the client library uses to access the local data store?
  * (bool,vers) <- put_after(k,val,[deps],nearest,vers=0)
    If vers=0, the local server assigns the value of its local Lamport clock
    Commits val only after nearest has been committed in each cluster
      Note this happens automatically in the local cluster (which is linearizable)
  * bool <- dep_check(key,vers)
    Blocks until the needed version is written (or returns false on timeout)
  * (val,vers,deps) <- get_by_version(k,vers=LATEST)
  * Also, global-checkpoint-time (i.e., ts_min) is piggybacked on responses
    Each server keeps track of the local vers guaranteed committed in all clusters
    Compute the global value by talking to servers of all key ranges in the cluster, 10x/sec

What happens when you invoke get(k,ctx) in the COPS library?
  - Call get_by_version with vers=LATEST
  - In COPS, add (k,vers) to the dependencies in the local ctx
  - In COPS-GT, add (k,vers,deps) to the local context

What happens when you invoke put(k,v,ctx) in the COPS library?
  - Library calls put_after with vers=0
    nearest = all dependencies not subsumed by others [Fig. 6]
    deps = all dependencies in COPS-GT, else empty
  - Server in the local cluster assigns vers, commits locally,
    returns vers to the client; the new (k,vers) supersedes all previous entries in nearest
  - Local server forwards the write to equivalent nodes in other clusters
    Remotely invokes put_after (with a non-zero vers)
    Remote equivalent nodes call dep_check on nearest within their cluster
    Once dep_check succeeds, they can commit the new value
  - Once all clusters complete the write, (key,version) gets flagged *never-depend*
  Why only dep_check nearest, instead of all deps?  Transitivity
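A minimal sketch of the client-library bookkeeping for plain COPS (not -GT),
assuming a local-store stub exposing the get_by_version/put_after calls named
above; the Context class and LATEST sentinel are illustrative:

  LATEST = None                                   # illustrative sentinel

  class Context:
      def __init__(self):
          self.deps = {}                          # key -> version we depend on

  def cops_get(store, k, ctx):
      val, vers, _deps = store.get_by_version(k, LATEST)
      ctx.deps[k] = max(vers, ctx.deps.get(k, 0))       # remember what we read
      return val

  def cops_put(store, k, val, ctx):
      # nearest = dependencies not subsumed by others; with the bookkeeping
      # here (last put plus gets since then), that is everything in ctx.deps
      nearest = list(ctx.deps.items())
      ok, vers = store.put_after(k, val, [], nearest)   # local server picks vers
      # The new (k,vers) subsumes the old dependencies: anything that later
      # depends on this put transitively depends on all of them
      ctx.deps = {k: vers}
      return ok

The reset of ctx.deps is what the garbage-collection discussion below calls
keeping only the last put plus the gets since that put.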
How does a get transaction work?  [Figure 7]
  - Library calls get_by_version(k,LATEST) for each key
  - For each k, find the latest version ccv[k] that a different key depended on
  - If k's version is less than ccv[k], issue get_by_version(k, ccv[k])
  - Update the context to include all keys and the versions ultimately read
  Why are 2 rounds of get_by_version always sufficient?
    All ccv[k] versions were locally committed at the time of the first round
    So their dependencies must have been transitively committed
    Thus, the second round won't introduce any new dependencies
  What goes wrong if we use nearest instead of deps in Figure 7?
    Suppose A1, B1, C1 have no dependencies, A2 |-> B2 and B2 |-> C2
    Now thread T1: put(A2), put(B2), put(C2)
    Concurrently, thread T2: get_trans(A,B,C) -> A1, B1, C2
    The first round flags the need to fetch B2, but not A2--violates causal order
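A sketch of the two-round protocol of Figure 7, reusing the Context/LATEST
helpers from the sketch above and assuming get_by_version returns
(val, vers, deps) with deps as a list of (key, version) pairs (names are
illustrative, not the paper's code):

  def get_trans(store, keys, ctx):
      # Round 1: optimistically fetch the latest local version of every key
      results = {k: store.get_by_version(k, LATEST) for k in keys}

      # ccv[k] = latest version of k that some *other* returned value depends on
      ccv = {k: 0 for k in keys}
      for k, (_val, _vers, deps) in results.items():
          for dk, dv in deps:
              if dk in ccv and dk != k:
                  ccv[dk] = max(ccv[dk], dv)

      # Round 2: re-fetch any key whose first-round version is too old
      for k in keys:
          if results[k][1] < ccv[k]:
              results[k] = store.get_by_version(k, ccv[k])

      # Update the context with the versions ultimately read
      for k, (_val, vers, _deps) in results.items():
          ctx.deps[k] = max(vers, ctx.deps.get(k, 0))
      return {k: results[k][0] for k in keys}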
What needs to be garbage collected?
  COPS-GT: old versions (kept around for get_by_version)
    Add a trans_time (5 second) timeout on get transactions
    So we only need to keep old versions for trans_time + max_clock_drift
  COPS-GT: the deps fields returned by get_by_version
    Can be cleaned trans_time after the never-depend flag gets set
    What happens when a cluster is offline?  Pause garbage collection
  COPS+COPS-GT: nearest and (for -GT) deps in ctx structures
    nearest: only keep the last put plus all gets since that put
    deps: remove any never-depend versions, plus their transitive dependencies
      Also remove any deps older than the global-checkpoint-time (ts_min)

How would you implement shopping-cart union?  Need COPS-CD (conflict detection)
  - put operations carry a new prev_vers argument
  - put operations implicitly depend on the prev_vers of the key
  - invoke an application-specific reconciliation routine on conflicting writes

Does COPS actually solve the B-comments-on-A's-photo problem?
  Maybe not if the HTTP requests go to different clusters, but that would be rare
  Would have to push COPS down to the JavaScript layer to guarantee it 100%
Does COPS provide external consistency?  (by phone: "hey, check out my picture")
  No, it can only track dependencies within the system

Evaluation--what questions should we ask?
  Is Causal+ a useful model for real systems?
  Does it scale?  (this one at least is answered in the graphs)
  How does it compare to other approaches?

Do people use causal consistency in deployed storage systems?  Not much
  What are the popular consistency models?
    Linearizability - Raft, zookeeper, PBFT, blockchains
    Strict serializability - Spanner (later)
    Eventual consistency - Dynamo, also Cassandra (if so configured)
    Per-key sequential consistency - PNUTS