Aurora (2017)
=============

Administrivia:
- By next Friday, May 29, we need your final project title and team members
  Subject line: "final project title"
- Recall: the June 3 project presentations are now Tuesday, June 9
- Please email cs244b-staff if you have scheduling constraints

Learning goals:
- Chain replication, write-ahead logs (and their interaction with cloud scale)
- Rethinking abstractions
- Quorum systems (without R reads in the normal case)

Many sites use a 3-tier web architecture
  Public/presentation layer: runs web servers accessed by the public
  Application layer: runs the business logic of the web site
  Storage/database layer: runs the actual databases

What happens when you put such an architecture on EC2?
  The public and application layers become totally scalable
    Site overloaded?  Spin up more instances; the servers are stateless
  What about the storage layer?
    Can maybe scale by sharding the database, but need to move data
    Worse: insufficient reliability--can lose an EC2 instance and its data

Strawman solution: use elastic block store (EBS) for the storage layer
  Like a virtual disk that outlives your EC2 instance
  Data replicated two ways using chain replication

What's chain replication?  A replication technique for fail-stop storage
  Say you want to replicate linearizable state and you have Zookeeper
  Use Zookeeper to organize the replicas into a chain
  Send all writes to the head of the chain and all reads to the tail
  Both read & write replies come from the tail of the chain:
                              read --\
      write -> [head] -> [node] -> [tail] -> reply
  All updates are ordered by the head; read results are already replicated
  everywhere by the time the tail replies
  How do you remove a node?
    Remove the head?  Just make the second node the head
    Remove the tail?  Okay, it doesn't have any info the other nodes don't have
    Remove X in A -> X -> B?  Now must ensure B has all info A pushed to X
      So downstream nodes ack writes, starting with the tail
      Upstream nodes track unacknowledged writes so they can resend them
      Everything is ordered by the head, so writes can carry sequence numbers
  How do you add a node?  Must copy state from the predecessor
    E.g., pre-copy a snapshot; the predecessor logs and sends writes during
    the pre-copy
  (A code sketch of the write/ack path appears below, after the EBS discussion)

What's wrong with this approach?  (Fig. 2)
  EBS's 2-way replication is not good enough for durability or availability
  Need to replicate EBS contents across availability zones (AZs)
  How does the InnoDB (MySQL) backend store data?
    Uses B-trees organized as (e.g., 16KiB) pages
    Updated atomically using redo logging (a.k.a. a write-ahead log, or WAL)
  What gets written on transaction commit (InnoDB on a normal disk)?
    First write what you are going to do to the WAL--might touch many tables
    Then update all the B-trees.  If you crash, just (re-)apply the WAL
    But you also need to "double-write" pages--why?
      Presumably the log is a logical log, storing deltas to B-tree pages
      What if you crash and only some sectors of a page get written (torn pages)?
        Could be that log + on-disk data are insufficient to repair the damage
      So first store the complete updated page in a special double-write area
    Also log the actual SQL statements, and possibly update .frm (table metadata)
  What happens with EBS?
    Must wait for the primary to complete chain replication, then send to the backup
    So lots of writes over the network, lots of waiting for writes to complete
    Bad for both median and tail latency
    Also not great for availability
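The notes contain no code, so here is a minimal single-process Python sketch of
the chain-replication write path described above.  The names (ChainNode, Chain,
the dict-based store) are invented for illustration, not from the paper: the
head assigns sequence numbers, writes flow down the chain, acks flow back up
from the tail, and the tail serves reads.

    # Minimal single-process sketch of chain replication (illustrative only).
    # Each node keeps a key-value store plus a buffer of unacknowledged writes
    # so an upstream node can resend if its successor is replaced.

    class ChainNode:
        def __init__(self, name):
            self.name = name
            self.store = {}        # replicated key -> value state
            self.pending = {}      # seq -> (key, value), not yet acked by tail
            self.next = None       # successor in the chain (None for tail)
            self.prev = None       # predecessor (None for head)

        def handle_write(self, seq, key, value):
            # Apply locally, remember as pending, then forward downstream.
            self.store[key] = value
            self.pending[seq] = (key, value)
            if self.next is not None:
                self.next.handle_write(seq, key, value)
            else:
                # Tail: the write is now at every node; ack back up the chain
                # and (in a real system) reply to the client here.
                self.handle_ack(seq)

        def handle_ack(self, seq):
            # Acks flow from the tail toward the head; drop the pending entry.
            self.pending.pop(seq, None)
            if self.prev is not None:
                self.prev.handle_ack(seq)

    class Chain:
        def __init__(self, names):
            self.nodes = [ChainNode(n) for n in names]
            for a, b in zip(self.nodes, self.nodes[1:]):
                a.next, b.prev = b, a
            self.seq = 0           # the head orders all writes

        def write(self, key, value):
            self.seq += 1
            self.nodes[0].handle_write(self.seq, key, value)  # send to head

        def read(self, key):
            return self.nodes[-1].store.get(key)              # read from tail

    chain = Chain(["head", "middle", "tail"])
    chain.write("x", 1)
    print(chain.read("x"))         # -> 1, already replicated at every node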
Aurora strives for "AZ+1" availability--what's this?
  Survive the loss of 1 AZ plus the simultaneous loss of one server not in that AZ
  Even when an AZ is down, can recover from a lost disk by creating a new replica
  Is this good enough?
    Depends on the probability of a double fault
    MTTF = mean time to failure, MTTR = mean time to recovery
    Each MTTF, there is an MTTR-long window of vulnerability to losing a second disk
    So increase MTTF?  That's hard
    Instead concentrate on lowering MTTR--how?
      Break storage up into small (10GiB) *segments*
      A segment can be repaired (transferred to a new replica) in 10s over a
      10Gbps network
      Lose multiple disks?  Can parallelize recovery and still finish in 10s
  Short MTTR also has operational advantages (heat management, software upgrades)

What's the replica lag problem?  (6.2.3)
  For scalability, can issue read-only transactions to database replicas
  But then reads are no longer linearizable
  A 12-*minute* lag (MySQL+EBS) was unacceptable, but 20ms is okay

So what's the big idea behind Aurora?  (Fig. 3)
  Each segment is replicated 6 ways (across 3 AZs) in a protection group (PG)
  Do the log apply at the storage servers, not at the database server
    Vastly reduces what needs to be sent over the network
  Use a quorum system for writes: N=6, W=4, R=3
    Why choose W > R?  Because a lost segment can be rebuilt in 10s from a
    read quorum

Lots of TLAs (three-letter acronyms).  Build a "whiteboard" to define them:
  MTR - mini-transaction
  LSN - log sequence number
  VCL - volume complete LSN: all lower LSNs are guaranteed durably stored
        (during recovery, must truncate all LSNs above the VCL)
  CPL - consistency point LSN: an LSN that isn't in the middle of a
        mini-transaction
  VDL - volume durable LSN: the highest CPL <= VCL
        (so actually truncate all LSNs higher than the VDL)
  LAL - LSN allocation limit (10M): max difference between any new LSN and
        the VDL; beyond that, have to wait for prior writes to become durable
  SCL - segment complete LSN: per-segment LSN up to which all of that
        segment's log records have been received
  PGMRPL - protection group min read point LSN

What does a MySQL+Aurora transaction commit look like?
  For each B-tree update, Aurora sends log records to the 6 storage servers
  in the PG
    Each log record gets an LSN between VDL and VDL+LAL
  On commit, send a commit record (flagged as a CPL) to all affected PGs
  Wait until VDL >= the commit record's LSN, then release locks + reply to
  the client (a sketch of this bookkeeping appears below)
  How does the database instance determine the VDL?
    For each affected PG, must get acknowledgments from W servers

What happens when an Aurora storage server is missing LSNs?
  Background gossip of SCLs lets servers detect the gaps and catch up
  How does each server determine that it has all records below its SCL?
    It only sees the LSNs affecting its PG's segments, so it will have gaps
    But the database server tags log records with "backlinks"
    On a commit record, presumably multiple backlinks?  Or per-PG commit records

How does the server maintain the integrity of B-trees across splits/joins, etc.?
  Break each user transaction into a series of mini-transactions, or MTRs
  Each MTR uses consecutive LSNs, with the last one flagged as a CPL
  Since MTRs don't interleave, each CPL guarantees data structure integrity

What must the API look like between the database instance and storage?
  - write log record (affecting pages)
  - read data page (possibly as of a particular LSN)
  - delete log records in some LSN range (e.g., above VDL during recovery)
  - query info about SCL (presumably needed)

What's actually stored on an Aurora storage server?
  A log
  Data pages, but not necessarily the latest versions
  A list of the log records required to bring each page up to date
  When can a storage server reply to a write?
    It only needs to write the log records, not the data pages
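The commit path above maps to a small amount of bookkeeping.  Below is a
hedged Python sketch of a single-PG simplification (the class LogTracker and
its methods are invented for illustration, not Aurora's actual code): a record
is durable once W=4 of the 6 servers have acked it, the VDL is the highest CPL
at or below the fully-durable prefix of the log, and a commit completes only
once VDL >= the commit record's LSN.

    # Minimal sketch of Aurora-style commit tracking (illustrative only).
    # Assumes a single protection group.

    N, W = 6, 4                    # 6 copies, write quorum of 4

    class LogTracker:
        def __init__(self):
            self.acks = {}         # lsn -> set of server ids that acked
            self.cpls = set()      # LSNs flagged as consistency points
            self.issued = []       # all LSNs issued, in order
            self.vdl = 0           # volume durable LSN

        def append(self, lsn, is_cpl=False):
            self.issued.append(lsn)
            self.acks[lsn] = set()
            if is_cpl:
                self.cpls.add(lsn)

        def ack(self, lsn, server):
            self.acks[lsn].add(server)
            self._advance_vdl()

        def _durable(self, lsn):
            return len(self.acks[lsn]) >= W

        def _advance_vdl(self):
            # Highest prefix of issued LSNs that is fully durable (VCL-like).
            complete = 0
            for lsn in self.issued:
                if not self._durable(lsn):
                    break
                complete = lsn
            # VDL = highest CPL at or below that point.
            durable_cpls = [c for c in self.cpls if c <= complete]
            if durable_cpls:
                self.vdl = max(durable_cpls)

        def commit_done(self, commit_lsn):
            # Release locks and reply to the client only once
            # VDL >= the transaction's commit record LSN.
            return self.vdl >= commit_lsn

    # Example: a two-record MTR whose last record (LSN 2) is the commit CPL.
    log = LogTracker()
    log.append(1)
    log.append(2, is_cpl=True)
    for server in range(W):        # 4 of the 6 servers ack both records
        log.ack(1, server)
        log.ack(2, server)
    print(log.vdl, log.commit_done(2))  # -> 2 True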
What happens when the database instance reads a page?
  The page may be cached at the database instance--nothing to do
  Cache miss?  Send a request to one storage server
    Only consider servers with SCL >= VCL (highest LSN of a durable commit record)
    Don't need to wait for R reads--why not?
      There is only one database instance at a time, so it already knows the VCL
  When do you need a read quorum of R nodes?
    After a DB instance crash, the new DB instance needs to learn the VCL
    Guaranteed that out of any R servers, one will have SCL >= VCL
  What if a storage server must evict a page?
    Only evict a page whose LSN > the current VDL
    Means you can reconstruct it by re-reading and replaying from the current VDL

So what happens after a DB instance crash?  (See the sketch at the end of
these notes)
  Figure out the VCL by polling at least R servers from each PG
    May include log records for uncommitted transactions
  Compute the VDL, which is the highest CPL <= VCL
    Anything after the VDL is part of an incomplete transaction
  Tell servers to truncate all LSNs between VDL and VDL+LAL
    Tag the truncation with a durably recorded epoch--why?  (4.3, p. 1047)
      It would be a problem if the truncation didn't happen or were
      re-executed later
  Are we done?  No, there could be uncommitted log records before the VDL
    Must undo possibly uncommitted records below the VDL
    All committed transactions are already at W servers, so gossip will fix
    them up

What happens after a storage server crash?
  Don't need to apply the log synchronously (lowers latency)--why?
    Already might need to apply the log after re-fetching an evicted page in
    the normal path

How do read-only replicas work?
  Up to 15 DB instances can "mount" a PG read-only
  The primary (writer) sends log records asynchronously to the replicas
    It does not wait for replicas to ACK, so they may be out of date (e.g., by 20ms)
  A read-only replica applies log records to keep its cache up to date
  Which log records are safe to apply?
    Must ensure LSN <= VDL--why?  To avoid exposing state that could be aborted
    Also make sure you apply MTRs atomically

When can a storage server discard old log records?
  When old versions of a data page will no longer be needed for a read
  The DB instance knows the oldest serialization point needed by any pending
  transaction
  Gossip with the read-only instances to determine the PGMRPL
  Storage servers can discard anything older than the PGMRPL

What is the impact of Aurora on:
  Network bandwidth?  Saves by sending log records instead of pages
  IOs/sec?  Saves (see Table 1)
  Storage consumption?  Uses more (6 copies instead of 2 or 4)
  CPU consumption?  More, because logic is pushed down to 6 storage servers

What evaluation questions should we ask?
  Does it work and meet its performance goals?  Seems like it
  How does it compare to other alternatives?
    Clearly beats InnoDB on top of EBS
    Maybe InnoDB is a particularly bad comparison point (e.g., double writes)
  Do we believe the evaluation?
    It reads almost like marketing copy--but it's a real product, so
    presumably it works
    Basically this is highly plausible but non-reproducible research
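The crash-recovery walkthrough above boils down to a few comparisons.  Here is
a hedged Python sketch; the function recover, the per-PG response format, and
taking the max SCL per read quorum as each PG's complete point are assumptions
for illustration, not the paper's exact algorithm.

    # Sketch of the recovery computation (illustrative simplification).
    # Assumes each polled storage server reports its SCL, and that the
    # recovering database also learns the CPLs from the log records it reads.

    R = 3                          # read quorum size (N=6, W=4, R=3)
    LAL = 10_000_000               # LSN allocation limit

    def recover(pg_responses, cpls):
        """pg_responses: {pg_id: [scl, ...]} with >= R SCLs per PG.
           cpls: list of CPL LSNs seen in the log."""
        # VCL: everything at or below this LSN is guaranteed durable;
        # the volume is only complete up to the least-complete PG.
        vcl = min(max(scls) for scls in pg_responses.values())
        # VDL: highest CPL at or below the VCL; LSNs above it belong to
        # incomplete mini-transactions and must be discarded.
        vdl = max((c for c in cpls if c <= vcl), default=0)
        # Tell every segment to truncate LSNs in [VDL+1, VDL+LAL], tagged
        # with a new epoch so a stale or duplicate truncation is harmless.
        truncate_range = (vdl + 1, vdl + LAL)
        return vcl, vdl, truncate_range

    # Example: two PGs; read quorums report SCLs; CPLs at 90 and 130.
    responses = {"pg0": [120, 118, 121], "pg1": [95, 140, 96]}
    print(recover(responses, [90, 130]))   # -> (121, 90, (91, 10000090))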