Goals
- Understand cloud-scale storage systems

What is Aurora?
  A MySQL database. Starts from a fork of a community MySQL engine.
  It wants to be a drop-in replacement for customers' SQL databases, so it needs to
  support the same types of queries and consistency semantics as other SQL databases.
  The SQL spec isn't always particularly clear on consistency semantics, but their
  implementation supports a set of "standard" semantics.

Context: many websites separate components into many stateless services (webserver,
business logic) and a stateful database. Stateless components can be scaled up easily
(just spin up more replicas); the stateful DB is harder.

Motivation:
- Reduce latency on the critical path (mostly IO)
- Reduce datacenter network usage
- They claim, in their experience, that network I/O, not storage or compute on an
  individual node, has become the bottleneck in large datacenter applications.
  The response time of one slow network path or disk can be the dominant factor.

Strawman: replicate a SQL DB
- We'll have several replicas of the database engine
- Each maintains its own storage system
- Each replicates data across several disks
- Each SQL replica writes each data page update
- How long do we have to wait?
  - until the backups on storage for the primary commit
  - until the backups on storage for the secondary commit
  - these waits are sequential: we don't want the secondary to commit an update
    that the primary does not commit
- Double write on each page (to avoid shorn writes) -> bad

Their idea: separate out storage as its own replicated system
  The log is the DB: only broadcast log entries, not actual data pages.
  What does this mean?
    Split the database state into 10GB chunks.
    SQL machines maintain a cache of data pages.
    SQL machines also maintain, for each cached page, the highest log entry number
    that modified it, and track which storage nodes are complete for each chunk.
    A max of 64 TB means at most ~6400 chunks, so this metadata is relatively small
    and fits easily in memory.
    Storage nodes maintain copies of all data pages (*)
    (*) Important observation: storage nodes don't actually need to maintain an
    up-to-date copy of data pages on disk. They need the ability to always answer a
    query of the form "what is the most recent version of a particular data page?"

What happens when we get a write transaction?
1. The primary computes the state updates/writes
2. It sends the state update to the SQL backups AND to all storage nodes
3. It waits for a "write quorum" of responses from storage nodes
4. Storage nodes confirm AS SOON AS they write the LOG, not the actual state update

What is a write quorum?
  They replicate each chunk 6 ways and require a write quorum of 4/6 (this ensures
  all write quorums overlap).

What happens on a cache miss?
  Request the page from *one* storage node, not a "quorum".
  Important: there's only one primary (writer) SQL node, so it always knows the
  highest sequence number that modified a given page, and since it tracks which
  storage nodes are complete for each chunk, it can always just pick one node.

What do storage nodes keep? (see the sketch below)
  For each page:
  1. Records of all log updates to the page.
     This is sufficient to compute the value of the data page; we don't actually need
     the data page itself, or any snapshot. But recomputing a page from the full log
     is prohibitively expensive.
  So what do we actually keep?
  1. The data page, at some recent but not necessarily latest version
  2. Records of all log updates since that snapshot
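A minimal sketch of that per-page state on one storage node, with made-up names
(PageState, apply_redo) standing in for the real redo machinery; this illustrates the
idea, not Aurora's actual code:

  # Toy model: a redo record is a byte string, "applying" it appends it to the page.
  class PageState:
      def __init__(self):
          self.snapshot = b""       # materialized page, possibly stale
          self.snapshot_lsn = 0     # highest LSN reflected in the snapshot
          self.log_tail = []        # (lsn, redo_record) pairs newer than the snapshot

      def append(self, lsn, redo_record):
          # The ack to the SQL engine can go out as soon as this is durable; we do
          # NOT apply the record to the page on the write path.
          self.log_tail.append((lsn, redo_record))

      def read_latest(self, apply_redo):
          # Answer "most recent version of this page": replay the log tail onto the
          # snapshot, lazily, off the write path.
          page, lsn = self.snapshot, self.snapshot_lsn
          for rec_lsn, rec in sorted(self.log_tail):
              if rec_lsn > lsn:
                  page, lsn = apply_redo(page, rec), rec_lsn
          return page, lsn

      def checkpoint(self, apply_redo, durable_lsn):
          # Background work: fold records up to the durable point into the snapshot,
          # then garbage-collect them (never GC above durable_lsn).
          kept = []
          for rec_lsn, rec in sorted(self.log_tail):
              if rec_lsn <= durable_lsn:
                  self.snapshot = apply_redo(self.snapshot, rec)
                  self.snapshot_lsn = rec_lsn
              else:
                  kept.append((rec_lsn, rec))
          self.log_tail = kept

  p = PageState()
  p.append(1, b"a"); p.append(2, b"b")
  assert p.read_latest(lambda page, rec: page + rec) == (b"ab", 2)

The write path is just the append; materializing pages and folding the log into the
snapshot happen in the background.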
Why is this a good design?
  - The IO path is parallelized: the updates to the storage nodes all happen in parallel.
  - The amount written is (probably) smaller: log record size << data page size.
  - Garbage collection and checkpointing can be done asynchronously, in the background.
    One catch: make sure not to GC log entries that haven't fully committed.

Big difference: we replicate storage, but the DB replicas use the same storage, rather
than each replica maintaining its own replicated storage.
  This eliminates the sequential IO waits.
  Compare with Harp, e.g. Recall that in Harp you have to keep log entries on a replica
  until all replicas have applied them to disk (keep them until the GLB has passed),
  because you might need the primary's log to update a backup; so you're stuck at the
  speed of the slowest replica's checkpoints/filesystem updates.
  Harp uses nonvolatile memory for its log. Why? To keep disk IO off the critical path.
  Could we do that here? Probably. So why don't they?
  1. The kernel mods and nonvolatile memory make life really hard for system
     administration, and a UPS is expensive (although some datacenters might have
     them, for certain machines).
  2. Disks are a lot faster now: waiting for a single log entry write on an SSD is not
     a huge cost.

Recovery
  The SQL server doesn't maintain the log or the data pages on persistent local
  storage, maybe just some metadata about how many pages there are and where the
  storage servers are.
  On startup: query a read quorum (3/6) for every chunk, and compute the max log
  sequence number for which the log entry achieved a write quorum.
  *This is why we need a read quorum: we want to make sure we don't miss any write
  quorum, and a 3/6 is guaranteed to overlap with a 4/6.
  They have a slightly odd notion of "write commit": the volume complete LSN and the
  volume durable LSN.
    VCL - written to a write quorum of storage nodes
    VDL - written to a write quorum, AND is the commit point for a SQL tx; VDL <= VCL
  This lets the SQL engine break a SQL transaction's writes into smaller "mini-txs".
  Each mini-tx is a log write; some are tagged as commit points.
  Why? Pipeline writes, reduce latency, make log updates smaller. They don't specify,
  but it might be convenient to break writes up by data page when a SQL tx modifies
  several pages.
  So on recovery: compute the VCL, then look backwards to find the highest VDL, then
  truncate the log to that VDL.
  This means the SQL engine has to continually track the VDL and tell the storage
  nodes what it is so they can garbage collect (they can't GC above the VDL).

Quorums
  Important takeaway: failures in modern datacenters can be highly correlated.
  Fires, power outages, whatever else can take down an entire datacenter.
  Who here remembers a time when their nearest AWS datacenter went down?
  Undersea cables get cut, DNS gets misconfigured (Meta).
  So what do they do? (see the check below)
    3 "availability zones" -- loosely speaking, regions within a datacenter with
    separate support infrastructure.
    2 replicas per chunk per AZ, 6 total.
    Write quorum: 4 replicas. Read quorum: 3 replicas.
    Ensures writes intersect; ensures every read sees every write.
    Write availability in response to 1 full AZ outage.
    No data loss (full read availability) in response to 1 AZ outage + 1 random failure.
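To make the overlap and availability claims concrete, a quick sanity check of the
6-replica, 2-per-AZ, write-4/read-3 configuration (a toy model written for these
notes, not code from the paper):

  from itertools import combinations

  N, W, R = 6, 4, 3
  az = {0: "az1", 1: "az1", 2: "az2", 3: "az2", 4: "az3", 5: "az3"}
  replicas = set(range(N))

  # Any two write quorums intersect (4 + 4 > 6), so two concurrent writes can't both
  # "succeed" on disjoint sets of nodes.
  assert all(set(a) & set(b)
             for a in combinations(replicas, W)
             for b in combinations(replicas, W))
  # Any read quorum intersects any write quorum (4 + 3 > 6), so a read quorum is
  # guaranteed to see every write that achieved a write quorum.
  assert all(set(w) & set(r)
             for w in combinations(replicas, W)
             for r in combinations(replicas, R))

  def alive(failed_az=None, extra_failures=0):
      return len({x for x in replicas if az[x] != failed_az}) - extra_failures

  assert alive("az1") >= W                     # one whole AZ down: can still write
  assert alive("az1", extra_failures=1) >= R   # AZ down + 1 more node: can still read
  print("quorum properties hold")

With an AZ plus one more node gone, you can still form a read quorum, so no data is
lost and the missing replicas can be rebuilt; new writes, though, stall until repair.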
  Is this sufficient? We need "mean time to repair" << "mean time to fault": we want a
  very small chance of a double fault. Hard to answer yes/no without looking in detail
  at their production system.
  A 10GB chunk on a 10Gbps link -> copy the chunk in around 10s.
  Losing a disk means losing many chunks, but recovery is parallelizable, maybe; it
  depends on the allocation scheme for chunks to storage nodes.
    If 100 chunks are all replicated on the same 6 servers, then recovering from one
    fault on those 6 is slow (and another 100 chunks on the next 6, and so on).
    But if the chunks are mixed up across many servers, then recovery of a lost disk
    can draw from many different storage servers (and maybe you can spread what was on
    the lost disk across many separate storage servers) -> parallelizable -> faster.
  There can be other kinds of failure correlations as well.
    Hard disks from the same manufacturing batch might all fail at around the same
    time, so maybe the datacenter should mix disk batches.
    As we talked about earlier (e.g. Harp), a software bug might hit several replicas
    simultaneously.

Other cool things
  Storage replication makes sysadmining easier.
  Overloaded node: mark one chunk on the node as failed and bring a replica up
  elsewhere. This is handled the same way as a node outage: some automated service
  notices a chunk is missing and assigns it to another node / brings up a new storage
  node.
  Patching or upgrading a storage node's software: just take the node offline; since
  the system handles failures, this doesn't significantly affect operation.
  Important: you don't have to turn off the SQL machine to update storage or adjust
  disk usage.
  Updating the SQL engine: because persistent storage lives on the storage machines,
  not on the SQL machine, there's very little state maintained by the SQL machine --
  mostly things like lists of open connections with clients. As such, the small
  remaining state can be written to local storage very quickly.
  This enables a so-called "zero-downtime patch": wait for a moment when there are no
  active database transactions (so no pending writes), write the state to local disk,
  restart the service with the new config, reload, and continue. This introduces
  latency for connections during the upgrade, but no connections are dropped.
  Note: this is only possible because we can do it without flushing, e.g., a data page
  cache to disk. Not impossible otherwise, but in general less local state -> less
  work -> less time to do an upgrade.

Read replicas (see the sketch below)
  Backup SQL instances can serve read requests.
  The goal is for their state not to lag the primary by too much.
  Lag time: the time to download the write request from the primary plus the time to
  apply it to the local cache. (Have to be careful to apply only log entries that have
  fully committed at the primary, i.e. those at or below the volume durable LSN.)
  They claim a lag time of about 20ms. They also claim one of their clients
  occasionally saw 12 minutes (!) of replica lag under the strawman setup.
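A sketch of that replica-side apply rule, with assumed names (ReadReplica, apply_redo);
the real engine does more, but the key check is "apply a record only if its LSN is at
or below the primary's advertised VDL, and only to pages already in the cache":

  # Toy model: the primary streams (lsn, page_id, record) and piggybacks its VDL.
  class ReadReplica:
      def __init__(self, apply_redo):
          self.cache = {}          # page_id -> (page_bytes, lsn)
          self.pending = []        # records above the current VDL, held back
          self.vdl = 0
          self.apply_redo = apply_redo

      def on_log_record(self, lsn, page_id, rec, vdl):
          self.vdl = max(self.vdl, vdl)
          self.pending.append((lsn, page_id, rec))
          self.pending.sort()
          held = []
          for l, pid, r in self.pending:
              if l > self.vdl:
                  held.append((l, pid, r))            # not durable yet, don't show it
              elif pid in self.cache:
                  page, _ = self.cache[pid]
                  self.cache[pid] = (self.apply_redo(page, r), l)
              # records for uncached pages are dropped; a later read just fetches
              # the page from a storage node instead
          self.pending = held

  r = ReadReplica(lambda page, rec: page + rec)
  r.cache["p1"] = (b"", 0)
  r.on_log_record(1, "p1", b"x", vdl=0)      # VDL still 0: record is held back
  r.on_log_record(2, "p2", b"y", vdl=2)      # VDL advances to 2: lsn 1 gets applied
  assert r.cache["p1"] == (b"x", 1)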
So, main takeaway?
- Think about your failure model
- Correlated failures
- How can you reduce mean time to recovery?
- You don't necessarily need many replicated copies of a replicated storage system,
  just one replicated storage system
- The log is the database; everything else serves as a cache of the log
- Huge operational benefits to having a resilient, easily repairable datastore
- Much easier to have many chunks distributed across many servers than a few large
  servers that are exactly replicated

Digression: Chain Replication (if there's time)
- The diagram in the paper compares their replicated datastore against something
  called "chain replication", a replication strategy that's particularly bad for their
  use case but quite useful in other contexts. And it's fairly simple.

What is chain replication? (toy sketch below)
  Replicas are ordered in a chain (head -> ... -> tail).
  Writes are sent to the head. Each node commits the write to its local disk and then
  forwards the write request to the next node. So each node always knows more than its
  successor and less than its predecessor.
  So where do queries go? We want to query data that's committed, so queries go to the
  tail.
  Benefit of this technique? Simplicity.
  What happens on failure?
    Head fails -> the next node becomes the head. Any writes pending at the head are
    dropped, but that's ok: they hadn't committed yet.
    Tail fails -> its predecessor becomes the tail. Any updates the old tail hadn't
    yet received from the second-to-last server are instantly committed (since they're
    already written at the new tail).
    A middle node fails -> the adjacent servers connect to each other. Each server
    maintains a list of all the updates it has sent that haven't committed at the
    tail; on connect, it forwards all uncommitted updates. (This list is GC'd by
    having the tail periodically send out an ack of its latest update.)
  Don't need any kind of voting protocol.
  Very little network usage: just a single dedicated link between adjacent nodes.
  Downside: high latency, many sequential writes.
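A toy, single-process sketch of such a chain (names and structure are mine, not from
either paper): writes enter at the head and flow down, reads go to the tail, and each
node keeps the list of updates not yet acked by the tail so a middle-node failure can
be repaired by re-forwarding:

  class Node:
      def __init__(self):
          self.store = {}       # this node's committed key/value state ("local disk")
          self.unacked = []     # updates forwarded but not yet acked by the tail
          self.next = None
          self.prev = None

      def write(self, key, value):
          self.store[key] = value
          if self.next is not None:
              self.unacked.append((key, value))
              self.next.write(key, value)      # forward down the chain
          else:
              self.ack(key, value)             # tail: update is now committed

      def ack(self, key, value):
          # The tail's ack flows back toward the head so each node can GC its list
          # (the real protocol batches these acks periodically).
          if (key, value) in self.unacked:
              self.unacked.remove((key, value))
          if self.prev is not None:
              self.prev.ack(key, value)

      def read(self, key):
          return self.store.get(key)           # only meaningful at the tail

  def build_chain(n):
      nodes = [Node() for _ in range(n)]
      for a, b in zip(nodes, nodes[1:]):
          a.next, b.prev = b, a
      return nodes

  chain = build_chain(3)
  head, tail = chain[0], chain[-1]
  head.write("x", 1)
  assert tail.read("x") == 1 and not head.unacked

In a real deployment the forwarding and acks are messages between machines, not
synchronous calls; the recursion just keeps the sketch short.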