Aurora (2017)
=============

Administrivia:
- By next Friday, May 29, we need your final project title and team members
  Subject line: "final project title"
- Recall: the June 3 project presentations are now Tuesday, June 9
- Please email cs244b-staff if you have scheduling constraints

Learning goals:
- Chain replication, write-ahead logs (and their interaction with cloud scale)
- Rethinking abstractions
- Quorum systems (without R reads in the normal case)

Many sites use a 3-tier web architecture
  Public/presentation layer: runs web servers accessed by the public
  Application layer: runs the business logic of the web site
  Storage/database layer: runs the actual databases

What happens when you put such an architecture on EC2?
  The public and application layers become totally scalable
    Site overloaded?  Spin up more instances; the servers are stateless
  What about the storage layer?
    Can maybe scale by sharding the database, but need to move data
    Worse: insufficient reliability--can lose an EC2 instance and its data

Strawman solution: use elastic block store (EBS) for the storage layer
  Like a virtual disk that outlives your EC2 instance
  Data replicated two ways using chain replication

What's chain replication?  A replication technique for fail-stop storage
  Say you want to replicate linearizable state and you have Zookeeper
  Use Zookeeper to organize the replicas into a chain
  Send all writes to the head of the chain and all reads to the tail
  Both read & write replies come from the tail of the chain:
                              read --\
      write -> [head] -> [node] -> [tail] -> reply
  All updates are ordered by the head; read results are already replicated
  everywhere by the time the tail replies
  How do you remove a node?
    Remove the head?  Just make the second node the head
    Remove the tail?  Okay, it doesn't have any info the other nodes don't have
    Remove X in A -> X -> B?  Now must ensure B has all info A pushed to X
      So downstream nodes ack writes, starting with the tail
      Upstream nodes track unacknowledged writes so they can resend them
      Everything is ordered by the head, so writes can carry sequence numbers
  How do you add a node?  Must copy state from the predecessor
    E.g., pre-copy a snapshot; the predecessor logs and sends writes during
    the pre-copy
  (A code sketch of the write/ack path appears below, after the EBS discussion)

What's wrong with this approach?  (Fig. 2)
  EBS's 2-way replication is not good enough for durability or availability
  Need to replicate EBS contents across availability zones (AZs)
  How does the InnoDB (MySQL) backend store data?
    Uses B-trees organized as (e.g., 16KiB) pages
    Updated atomically using redo logging (a.k.a. a write-ahead log, or WAL)
  What gets written on transaction commit (InnoDB on a normal disk)?
    First write what you are going to do to the WAL--might touch many tables
    Then update all the B-trees.  If you crash, just (re-)apply the WAL
    But you also need to "double-write" pages--why?
      Presumably the log is a logical log, storing deltas to B-tree pages
      What if you crash and only some sectors of a page get written (torn pages)?
        Could be that log + on-disk data are insufficient to repair the damage
      So first store the complete updated page in a special double-write area
    Also log the actual SQL statements, and possibly update .frm (table metadata)
  What happens with EBS?
    Must wait for the primary to complete chain replication, then send to the backup
    So lots of writes over the network, lots of waiting for writes to complete
    Bad for both median and tail latency
    Also not great for availability
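The notes contain no code, so here is a minimal single-process Python sketch of
the chain-replication write path described above.  The names (ChainNode, Chain,
the dict-based store) are invented for illustration, not from the paper: the
head assigns sequence numbers, writes flow down the chain, acks flow back up
from the tail, and the tail serves reads.

    # Minimal single-process sketch of chain replication (illustrative only).
    # Each node keeps a key-value store plus a buffer of unacknowledged writes
    # so an upstream node can resend if its successor is replaced.

    class ChainNode:
        def __init__(self, name):
            self.name = name
            self.store = {}        # replicated key -> value state
            self.pending = {}      # seq -> (key, value), not yet acked by tail
            self.next = None       # successor in the chain (None for tail)
            self.prev = None       # predecessor (None for head)

        def handle_write(self, seq, key, value):
            # Apply locally, remember as pending, then forward downstream.
            self.store[key] = value
            self.pending[seq] = (key, value)
            if self.next is not None:
                self.next.handle_write(seq, key, value)
            else:
                # Tail: the write is now at every node; ack back up the chain
                # and (in a real system) reply to the client here.
                self.handle_ack(seq)

        def handle_ack(self, seq):
            # Acks flow from the tail toward the head; drop the pending entry.
            self.pending.pop(seq, None)
            if self.prev is not None:
                self.prev.handle_ack(seq)

    class Chain:
        def __init__(self, names):
            self.nodes = [ChainNode(n) for n in names]
            for a, b in zip(self.nodes, self.nodes[1:]):
                a.next, b.prev = b, a
            self.seq = 0           # the head orders all writes

        def write(self, key, value):
            self.seq += 1
            self.nodes[0].handle_write(self.seq, key, value)  # send to head

        def read(self, key):
            return self.nodes[-1].store.get(key)              # read from tail

    chain = Chain(["head", "middle", "tail"])
    chain.write("x", 1)
    print(chain.read("x"))         # -> 1, already replicated at every node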
Aurora strives for "AZ+1" availability--what's this?
  Survive the loss of 1 AZ plus the simultaneous loss of one server not in that AZ
  Even when an AZ is down, can recover from a lost disk by creating a new replica
  Is this good enough?
    Depends on the probability of a double fault
    MTTF = mean time to failure, MTTR = mean time to recovery
    Each MTTF, there is an MTTR-long window of vulnerability to losing a second disk
    So increase MTTF?  That's hard
    Instead concentrate on lowering MTTR--how?
      Break storage up into small (10GiB) *segments*
      A segment can be repaired (transferred to a new replica) in 10s over a
      10Gbps network
      Lose multiple disks?  Can parallelize recovery and still finish in 10s
  Short MTTR also has operational advantages (heat management, software upgrades)

What's the replica lag problem?  (6.2.3)
  For scalability, can issue read-only transactions to database replicas
  But then reads are no longer linearizable
  A 12-*minute* lag (MySQL+EBS) was unacceptable, but 20ms is okay

So what's the big idea behind Aurora?  (Fig. 3)
  Each segment is replicated 6 ways (across 3 AZs) in a protection group (PG)
  Do the log apply at the storage servers, not at the database server
    Vastly reduces what needs to be sent over the network
  Use a quorum system for writes: N=6, W=4, R=3
    Why choose W > R?  Because a lost segment can be rebuilt in 10s from a
    read quorum

Lots of TLAs (three-letter acronyms).  Build a "whiteboard" to define them:
  MTR - mini-transaction
  LSN - log sequence number
  VCL - volume complete LSN: all lower LSNs are guaranteed durably stored
        (during recovery, must truncate all LSNs above the VCL)
  CPL - consistency point LSN: an LSN that isn't in the middle of a
        mini-transaction
  VDL - volume durable LSN: the highest CPL <= VCL
        (so actually truncate all LSNs higher than the VDL)
  LAL - LSN allocation limit (10M): max difference between any new LSN and
        the VDL; beyond that, have to wait for prior writes to become durable
  SCL - segment complete LSN: per-segment LSN up to which all of that
        segment's log records have been received
  PGMRPL - protection group min read point LSN

What does a MySQL+Aurora transaction commit look like?
  For each B-tree update, Aurora sends log records to the 6 storage servers
  in the PG
    Each log record gets an LSN between VDL and VDL+LAL
  On commit, send a commit record (flagged as a CPL) to all affected PGs
  Wait until VDL >= the commit record's LSN, then release locks + reply to
  the client (a sketch of this bookkeeping appears below)
  How does the database instance determine the VDL?
    For each affected PG, must get acknowledgments from W servers

What happens when an Aurora storage server is missing LSNs?
  Background gossip of SCLs lets servers detect the gaps and catch up
  How does each server determine that it has all records below its SCL?
    It only sees the LSNs affecting its PG's segments, so it will have gaps
    But the database server tags log records with "backlinks"
    On a commit record, presumably multiple backlinks?  Or per-PG commit records

How does the server maintain the integrity of B-trees across splits/joins, etc.?
  Break each user transaction into a series of mini-transactions, or MTRs
  Each MTR uses consecutive LSNs, with the last one flagged as a CPL
  Since MTRs don't interleave, each CPL guarantees data structure integrity

What must the API look like between the database instance and storage?
  - write log record (affecting pages)
  - read data page (possibly as of a particular LSN)
  - delete log records in some LSN range (e.g., above VDL during recovery)
  - query info about SCL (presumably needed)

What's actually stored on an Aurora storage server?
  A log
  Data pages, but not necessarily the latest versions
  A list of the log records required to bring each page up to date
  When can a storage server reply to a write?
    It only needs to write the log records, not the data pages
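The commit path above maps to a small amount of bookkeeping.  Below is a
hedged Python sketch of a single-PG simplification (the class LogTracker and
its methods are invented for illustration, not Aurora's actual code): a record
is durable once W=4 of the 6 servers have acked it, the VDL is the highest CPL
at or below the fully-durable prefix of the log, and a commit completes only
once VDL >= the commit record's LSN.

    # Minimal sketch of Aurora-style commit tracking (illustrative only).
    # Assumes a single protection group.

    N, W = 6, 4                    # 6 copies, write quorum of 4

    class LogTracker:
        def __init__(self):
            self.acks = {}         # lsn -> set of server ids that acked
            self.cpls = set()      # LSNs flagged as consistency points
            self.issued = []       # all LSNs issued, in order
            self.vdl = 0           # volume durable LSN

        def append(self, lsn, is_cpl=False):
            self.issued.append(lsn)
            self.acks[lsn] = set()
            if is_cpl:
                self.cpls.add(lsn)

        def ack(self, lsn, server):
            self.acks[lsn].add(server)
            self._advance_vdl()

        def _durable(self, lsn):
            return len(self.acks[lsn]) >= W

        def _advance_vdl(self):
            # Highest prefix of issued LSNs that is fully durable (VCL-like).
            complete = 0
            for lsn in self.issued:
                if not self._durable(lsn):
                    break
                complete = lsn
            # VDL = highest CPL at or below that point.
            durable_cpls = [c for c in self.cpls if c <= complete]
            if durable_cpls:
                self.vdl = max(durable_cpls)

        def commit_done(self, commit_lsn):
            # Release locks and reply to the client only once
            # VDL >= the transaction's commit record LSN.
            return self.vdl >= commit_lsn

    # Example: a two-record MTR whose last record (LSN 2) is the commit CPL.
    log = LogTracker()
    log.append(1)
    log.append(2, is_cpl=True)
    for server in range(W):        # 4 of the 6 servers ack both records
        log.ack(1, server)
        log.ack(2, server)
    print(log.vdl, log.commit_done(2))  # -> 2 True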
What happens when the database instance reads a page?
  The page may be cached at the database instance--nothing to do
  Cache miss?  Send a request to one storage server
    Only consider servers with SCL >= VCL (highest LSN of a durable commit record)
    Don't need to wait for R reads--why not?
      There is only one database instance at a time, so it already knows the VCL
  When do you need a read quorum of R nodes?
    After a DB instance crash, the new DB instance needs to learn the VCL
    Guaranteed that out of any R servers, one will have SCL >= VCL
  What if a storage server must evict a page?
    Only evict a page whose LSN > the current VDL
    Means you can reconstruct it by re-reading and replaying from the current VDL

So what happens after a DB instance crash?  (See the sketch at the end of
these notes)
  Figure out the VCL by polling at least R servers from each PG
    May include log records for uncommitted transactions
  Compute the VDL, which is the highest CPL <= VCL
    Anything after the VDL is part of an incomplete transaction
  Tell servers to truncate all LSNs between VDL and VDL+LAL
    Tag the truncation with a durably recorded epoch--why?  (4.3, p. 1047)
      It would be a problem if the truncation didn't happen or were
      re-executed later
  Are we done?  No, there could be uncommitted log records before the VDL
    Must undo possibly uncommitted records below the VDL
    All committed transactions are already at W servers, so gossip will fix
    them up

What happens after a storage server crash?
  Don't need to apply the log synchronously (lowers latency)--why?
    Already might need to apply the log after re-fetching an evicted page in
    the normal path

How do read-only replicas work?
  Up to 15 DB instances can "mount" a PG read-only
  The primary (writer) sends log records asynchronously to the replicas
    It does not wait for replicas to ACK, so they may be out of date (e.g., by 20ms)
  A read-only replica applies log records to keep its cache up to date
  Which log records are safe to apply?
    Must ensure LSN <= VDL--why?  To avoid exposing state that could be aborted
    Also make sure you apply MTRs atomically

When can a storage server discard old log records?
  When old versions of a data page will no longer be needed for a read
  The DB instance knows the oldest serialization point needed by any pending
  transaction
  Gossip with the read-only instances to determine the PGMRPL
  Storage servers can discard anything older than the PGMRPL

What is the impact of Aurora on:
  Network bandwidth?  Saves by sending log records instead of pages
  IOs/sec?  Saves (see Table 1)
  Storage consumption?  Uses more (6 copies instead of 2 or 4)
  CPU consumption?  More, because logic is pushed down to 6 storage servers

What evaluation questions should we ask?
  Does it work and meet its performance goals?  Seems like it
  How does it compare to other alternatives?
    Clearly beats InnoDB on top of EBS
    Maybe InnoDB is a particularly bad comparison point (e.g., double writes)
  Do we believe the evaluation?
    It reads almost like marketing copy--but it's a real product, so
    presumably it works
    Basically this is highly plausible but non-reproducible research
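The crash-recovery walkthrough above boils down to a few comparisons.  Here is
a hedged Python sketch; the function recover, the per-PG response format, and
taking the max SCL per read quorum as each PG's complete point are assumptions
for illustration, not the paper's exact algorithm.

    # Sketch of the recovery computation (illustrative simplification).
    # Assumes each polled storage server reports its SCL, and that the
    # recovering database also learns the CPLs from the log records it reads.

    R = 3                          # read quorum size (N=6, W=4, R=3)
    LAL = 10_000_000               # LSN allocation limit

    def recover(pg_responses, cpls):
        """pg_responses: {pg_id: [scl, ...]} with >= R SCLs per PG.
           cpls: list of CPL LSNs seen in the log."""
        # VCL: everything at or below this LSN is guaranteed durable;
        # the volume is only complete up to the least-complete PG.
        vcl = min(max(scls) for scls in pg_responses.values())
        # VDL: highest CPL at or below the VCL; LSNs above it belong to
        # incomplete mini-transactions and must be discarded.
        vdl = max((c for c in cpls if c <= vcl), default=0)
        # Tell every segment to truncate LSNs in [VDL+1, VDL+LAL], tagged
        # with a new epoch so a stale or duplicate truncation is harmless.
        truncate_range = (vdl + 1, vdl + LAL)
        return vcl, vdl, truncate_range

    # Example: two PGs; read quorums report SCLs; CPLs at 90 and 130.
    responses = {"pg0": [120, 118, 121], "pg1": [95, 140, 96]}
    print(recover(responses, [90, 130]))   # -> (121, 90, (91, 10000090))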