Spanner: Google's Globally-Distributed Database
===============================================

* What is the motivation?
    - F1 is Google's bread and butter
    - Needs the consistency and features of an RDBMS
    - Needs massive scalability
        + Tens of terabytes were causing serious problems with MySQL
        + Required years of work to re-shard the data!

* What is the Spanner API?
    - Looks a bit like SQL (Fig. 4, p. 255)
        + Except one B-tree (index) per table
        + Can interleave data from multiple tables
    - Break the key space up into directories to give a locality hint to the system
        + Each directory can have its own replication/placement policy
    - Supports linearizable read-write transactions
    - Also supports linearizable lock-free read-only transactions
        + Must predeclare a read-only transaction
        + Can run at a server-chosen recent time, or a client-specified time/range
        + Because there are no locks, read-only transactions don't abort
          (unless old data has been garbage-collected)

* How are the servers organized?
    - Servers are grouped into *zones*--the unit of administrative deployment
        + Each zone is physically isolated within a datacenter
        + Each zone contains one hundred to several thousand *spanservers*
    - Each spanserver stores 100-1,000 *tablets* ((key, timestamp) -> value map)
        + Basically a B-tree with a write-ahead log
        + Each tablet is managed by a Paxos state machine
        + Allows tablets to be replicated to spanservers in different zones
    - Zonemaster: assigns data to spanservers
    - Universe master: debugging console
    - Placement driver: decides when to move data, on a timescale of minutes
        + To meet updated replication constraints or to help load balancing

* What's a witness (parenthetical comment in Sec. 2.2 on p. 254)?
    - Need a quorum of servers to agree on the state of a state machine
    - But can separate agreement from replication
        + If the replicas alone lack a majority, the witness votes
          (so a network partition doesn't block progress)
        + The witness also logs operations that happened since it started voting
        + Eventually played back once the network partition heals

* Paxos (& Raft) are typically discussed as *asynchronous systems*
    - A theoretical model for distributed systems
    - Key property: no bound on message delay or relative execution speed
        + Can't tell the difference between a failed agent and a slow one
        + No clocks, so no timeouts are reasonable
    - The idea of the model is to be conservative
        + Want robustness under any possible timing conditions
        + E.g., say a backhoe tears fiber and it takes a day to repair
        + Could see message delays a billion times longer than usual

* What is a leader lease in Paxos (the same idea works for Raft)?
    - In straight-up Paxos, both reads and writes go through the same protocol
        + The leader must wait another round trip to hear from a quorum
    - Why not just handle reads locally at the leader (no data to replicate)?
        + A later leader could have externalized writes, violating linearizability
    - A lease is a promise by a follower not to acknowledge other leaders for time T
        + Given leases from a quorum, the leader knows there are no other leaders,
          so it can read locally
        + Assumes bounded clock skew

* Let's imagine a simpler straw man #1 to understand Spanner's design
    - Suppose each transaction stays entirely within a Paxos group
        + Hence, no synchronization between Paxos groups, no fancy time stuff
        + (Assumption will break down with the need to fragment tablets)
    - Use two-phase locking for concurrency control (see the sketch below)
        + Acquire locks during a transaction, release on commit
        + Ensures linearizability of transactions within a Paxos group
        + The Paxos leader maintains a lock table with shared and exclusive locks
        + If the leader fails or loses its lease, can always abort the transaction
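Aside: a minimal Go sketch of the lock table a Paxos leader might keep in
straw man #1. Everything here (LockTable, Acquire, the Mode constants) is made
up for illustration; it only shows shared/exclusive conflict checking and
release-at-commit, not Spanner's actual code.

    package main

    import "fmt"

    type Mode int

    const (
    	Shared Mode = iota
    	Exclusive
    )

    type LockTable struct {
    	shared    map[string]map[int]bool // key -> set of txn IDs holding shared locks
    	exclusive map[string]int          // key -> txn ID holding the exclusive lock
    }

    func NewLockTable() *LockTable {
    	return &LockTable{shared: map[string]map[int]bool{}, exclusive: map[string]int{}}
    }

    // Acquire returns false on conflict; the caller would block or abort.
    func (lt *LockTable) Acquire(txn int, key string, m Mode) bool {
    	if owner, ok := lt.exclusive[key]; ok && owner != txn {
    		return false // someone else holds the key exclusively
    	}
    	if m == Shared {
    		if lt.shared[key] == nil {
    			lt.shared[key] = map[int]bool{}
    		}
    		lt.shared[key][txn] = true
    		return true
    	}
    	// Exclusive: no other shared holders may remain.
    	for holder := range lt.shared[key] {
    		if holder != txn {
    			return false
    		}
    	}
    	delete(lt.shared, key)
    	lt.exclusive[key] = txn
    	return true
    }

    // ReleaseAll drops every lock held by txn: two-phase locking releases
    // everything at commit or abort time.
    func (lt *LockTable) ReleaseAll(txn int) {
    	for key, owner := range lt.exclusive {
    		if owner == txn {
    			delete(lt.exclusive, key)
    		}
    	}
    	for key, holders := range lt.shared {
    		delete(holders, txn)
    		if len(holders) == 0 {
    			delete(lt.shared, key)
    		}
    	}
    }

    func main() {
    	lt := NewLockTable()
    	fmt.Println(lt.Acquire(1, "comment/42", Shared))    // true
    	fmt.Println(lt.Acquire(2, "comment/42", Exclusive)) // false: conflicts with txn 1
    	lt.ReleaseAll(1)
    	fmt.Println(lt.Acquire(2, "comment/42", Exclusive)) // true
    }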
* What could go wrong in straw man #1?
    - Within a Paxos group, everything is totally ordered, but not externally
        + E.g., A and B concurrently post comments to different Paxos groups
        + C sees A's comment but not B's; D sees B's but not A's

* Straw man #2: implement cross-Paxos-group transactions
    - A transaction must only commit if the locks in all Paxos groups are intact
    - Use two-phase commit--what's this?
        + Distributed agents must agree whether a transaction committed or aborted
        + A *coordinator* agent solicits votes from all participant agents
        + Each agent replies with VOTE-COMMIT or VOTE-ABORT
        + If all agents send VOTE-COMMIT, the coordinator broadcasts GLOBAL-COMMIT
        + If any agent sends VOTE-ABORT, the coordinator broadcasts GLOBAL-ABORT
    - What fault-tolerance properties does two-phase commit guarantee?  None
        + Must typically wait for all replies
        + The coordinator could time out and GLOBAL-ABORT
        + But if the coordinator fails, either outcome could already have been
          externalized
        + But each two-phase-commit participant in Spanner is actually a Paxos group

* What's wrong with straw man #2?
    - That's a lot of locking, especially with many read-only transactions
    - That's a lot of load on the leader (which must handle all reads)
    - It might not make sense to cram everything into one transaction
        + E.g., the decision to load A's and B's comments might be made in the browser
        + The browser could fail or be slow--shouldn't hold even read locks

* How does Spanner solve this?
    - Totally ditches the asynchronous model with TrueTime

* What is TrueTime?
    - An API for retrieving an estimate of the current time
    - Can't globally synchronize exact time, so it returns a bounded interval:
        + TT.now() -> TTinterval: [earliest, latest]
    - Requires hardware (GPS, atomic clocks)
        + Uses a local daemon to coordinate
    - Assumes bounded clock drift (200us/sec = 0.02%)--is this reasonable?
        + Sec. 5.3 says clocks fail 6 times less often than CPUs, so maybe
    - Idea: use real time to order all transactions globally
        + Either A's or B's comment will have the later timestamp
        + If you see the effects of the later transaction, you are guaranteed
          to see the earlier one

* How does time eliminate the need for read locking?
    - Assign each transaction a timestamp that preserves linearizability
      (i.e., if A committed before B started, then s_A < s_B)
    - Within a Paxos log, transaction timestamps must increase monotonically
    - Tablets store a history of values and allow reading at a particular time
    - Just read values at a read-only transaction's timestamp to get linearizability
        + Now reads can be spread across the entire Paxos group,
          so long as the replica knows the history through the transaction's timestamp
        + Replica fails?  Try another one with the same timestamp
        + That's why read-only transactions can't fail

* How does a read-write transaction proceed?
    - First, a client reads a bunch of data, acquiring read locks as needed
    - When done, all writes are buffered at the client, which holds only read locks
    - Writes and lock releases must be sent to one or more Paxos groups
    - Let's say the whole transaction involves only one group--what happens?
      (see the sketch below)
        + The client sends its writes to the group leader in a *commit request*
        + The leader must pick a timestamp s--how does it proceed?
        + s must be greater than any previously committed transaction in the Paxos log
        + Two-phase locking implies a period where all locks are simultaneously held
        + Logically want the transaction timestamp to lie within this period,
          i.e., want s greater than the time of any lock acquisition
        + So, conservatively, ensure s > TT.now().latest at commit-request receipt
        + *Commit wait*: must wait until s < TT.now().earliest before replying--why?
          External consistency
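Aside: a minimal Go sketch of timestamp choice and commit wait for the
single-group case above, under an assumed TrueTime-style interval clock.
TTNow, Interval, and the fixed uncertainty epsilon are illustrative stand-ins,
not the real TrueTime API.

    package main

    import (
    	"fmt"
    	"time"
    )

    // Interval bounds true time: the true time lies in [Earliest, Latest].
    type Interval struct {
    	Earliest, Latest time.Time
    }

    // TTNow fakes TT.now() by padding the local clock with an assumed
    // uncertainty epsilon (real TrueTime derives its bound from GPS/atomic clocks).
    func TTNow(epsilon time.Duration) Interval {
    	now := time.Now()
    	return Interval{Earliest: now.Add(-epsilon), Latest: now.Add(epsilon)}
    }

    // commitTimestamp picks s when the commit request arrives: s must exceed
    // both TT.now().latest and the largest timestamp already in the group's log.
    func commitTimestamp(epsilon time.Duration, maxInLog time.Time) time.Time {
    	s := TTNow(epsilon).Latest
    	if !s.After(maxInLog) {
    		s = maxInLog.Add(time.Nanosecond) // keep the Paxos log monotonic
    	}
    	return s
    }

    // commitWait blocks until s < TT.now().earliest, so every clock everywhere
    // has passed s before the commit is externalized.
    func commitWait(s time.Time, epsilon time.Duration) {
    	for !TTNow(epsilon).Earliest.After(s) {
    		time.Sleep(100 * time.Microsecond)
    	}
    }

    func main() {
    	eps := 2 * time.Millisecond // assumed clock uncertainty
    	s := commitTimestamp(eps, time.Time{})
    	commitWait(s, eps)
    	fmt.Println("commit externalized at timestamp", s)
    }

Note that commit wait lasts roughly twice the clock uncertainty, which is one
reason keeping that uncertainty small matters so much.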
* What happens when a transaction involves multiple groups?
    - The client picks one group to be the two-phase-commit coordinator
    - Sends the commit request to the *coordinator leader* (the Paxos leader of that group)
    - The coordinator broadcasts a vote request to the participant leaders
    - On receipt of the vote request, the participant leaders must:
        + Acquire any write locks that will be necessary for the transaction
        + Pick a *prepare timestamp* to send back with VOTE-COMMIT
        + The prepare timestamp must lie within the term of the leader's lease
        + Must also be greater than any committed transaction
    - Once all participants send VOTE-COMMIT,
      the coordinator picks a timestamp s such that (see the sketch at the end):
        + s >= the max of the prepare timestamps
        + s > TT.now().latest when the commit request was received
        + As for the other participants (s within the leader's lease term,
          > the max timestamp in its log)
    - Again, "commit wait" until s < TT.now().earliest

* Schema changes
    - No explicit locking, because humans ensure only one schema change at a time
    - Hence, can pick a timestamp far in the future as the "flag point"
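Aside: a minimal Go sketch of how the coordinator leader's timestamp rule
above could be expressed. The function name, the lease-end check, and the
one-nanosecond bump are illustrative assumptions, not Spanner's actual
implementation; after choosing s, the coordinator still does commit wait
exactly as in the single-group sketch.

    package main

    import (
    	"fmt"
    	"time"
    )

    // coordinatorTimestamp returns a commit timestamp s that is
    //   (1) >= every participant's prepare timestamp,
    //   (2) > TT.now().latest observed when the commit request arrived, and
    //   (3) > the largest timestamp the coordinator has already committed.
    // It also reports whether s still falls inside the coordinator's leader
    // lease; if not, the transaction must wait for a new lease or abort.
    func coordinatorTimestamp(prepare []time.Time, ttLatestAtArrival, maxInLog,
    	leaseEnd time.Time) (time.Time, bool) {
    	s := ttLatestAtArrival
    	if maxInLog.After(s) {
    		s = maxInLog
    	}
    	for _, p := range prepare {
    		if p.After(s) {
    			s = p
    		}
    	}
    	s = s.Add(time.Nanosecond) // strictly greater than all of the above
    	return s, s.Before(leaseEnd)
    }

    func main() {
    	arrival := time.Now()
    	prepares := []time.Time{arrival.Add(1 * time.Millisecond), arrival.Add(3 * time.Millisecond)}
    	s, ok := coordinatorTimestamp(prepares, arrival.Add(2*time.Millisecond),
    		arrival, arrival.Add(10*time.Second))
    	fmt.Println("commit timestamp:", s, "within lease:", ok)
    }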