Goals
- Understand cloud-scale storage systems

What is Aurora?
  A MySQL database. Starts from a fork of a community MySQL engine.
  It wants to be a drop-in replacement for customers' SQL databases, so it needs to
  support the same types of queries and consistency semantics as other SQL databases.
  The SQL spec isn't always particularly clear on consistency semantics, but their
  implementation supports a set of "standard" semantics.

Context: many websites separate components into many stateless services (webserver,
business logic) and a stateful database. Stateless components can be scaled up easily
(just spin up more replicas); the stateful DB is harder.

Motivation:
- Reduce latency on the critical path (mostly IO)
- Reduce datacenter network usage
- They claim, in their experience, that network I/O, not storage or compute on an
  individual node, has become the bottleneck in large datacenter applications.
  The response time of one slow network path or disk can be the dominant factor.

Strawman: replicate a SQL DB
- We'll have several replicas of the database engine
- Each maintains its own storage system
- Each replicates data across several disks
- Each SQL replica writes each data page update
- How long do we have to wait?
  - until the backups on storage for the primary commit
  - until the backups on storage for the secondary commit
  - these waits are sequential: we don't want the secondary to commit an update
    that the primary does not commit
- Double write on each page (to avoid shorn writes) -> bad

Their idea: separate out storage as its own replicated system
  The log is the DB: only broadcast log entries, not actual data pages.
  What does this mean?
    Split the database state into 10GB chunks.
    SQL machines maintain a cache of data pages.
    SQL machines also maintain, for each cached page, the highest log entry number
    that modified it, and track which storage nodes are complete for each chunk.
    A max of 64 TB means at most ~6400 chunks, so this metadata is relatively small
    and fits easily in memory.
    Storage nodes maintain copies of all data pages (*)
    (*) Important observation: storage nodes don't actually need to maintain an
    up-to-date copy of data pages on disk. They need the ability to always answer a
    query of the form "what is the most recent version of a particular data page?"

What happens when we get a write transaction?
1. The primary computes the state updates/writes
2. It sends the state update to the SQL backups AND to all storage nodes
3. It waits for a "write quorum" of responses from storage nodes
4. Storage nodes confirm AS SOON AS they write the LOG, not the actual state update

What is a write quorum?
  They replicate each chunk 6 ways and require a write quorum of 4/6 (this ensures
  all write quorums overlap).

What happens on a cache miss?
  Request the page from *one* storage node, not a "quorum".
  Important: there's only one primary (writer) SQL node, so it always knows the
  highest sequence number that modified a given page, and since it tracks which
  storage nodes are complete for each chunk, it can always just pick one node.

What do storage nodes keep? (see the sketch below)
  For each page:
  1. Records of all log updates to the page.
     This is sufficient to compute the value of the data page; we don't actually need
     the data page itself, or any snapshot. But recomputing a page from the full log
     is prohibitively expensive.
  So what do we actually keep?
  1. The data page, at some recent but not necessarily latest version
  2. Records of all log updates since that snapshot
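A minimal sketch of that per-page state on one storage node, with made-up names
(PageState, apply_redo) standing in for the real redo machinery; this illustrates the
idea, not Aurora's actual code:

  # Toy model: a redo record is a byte string, "applying" it appends it to the page.
  class PageState:
      def __init__(self):
          self.snapshot = b""       # materialized page, possibly stale
          self.snapshot_lsn = 0     # highest LSN reflected in the snapshot
          self.log_tail = []        # (lsn, redo_record) pairs newer than the snapshot

      def append(self, lsn, redo_record):
          # The ack to the SQL engine can go out as soon as this is durable; we do
          # NOT apply the record to the page on the write path.
          self.log_tail.append((lsn, redo_record))

      def read_latest(self, apply_redo):
          # Answer "most recent version of this page": replay the log tail onto the
          # snapshot, lazily, off the write path.
          page, lsn = self.snapshot, self.snapshot_lsn
          for rec_lsn, rec in sorted(self.log_tail):
              if rec_lsn > lsn:
                  page, lsn = apply_redo(page, rec), rec_lsn
          return page, lsn

      def checkpoint(self, apply_redo, durable_lsn):
          # Background work: fold records up to the durable point into the snapshot,
          # then garbage-collect them (never GC above durable_lsn).
          kept = []
          for rec_lsn, rec in sorted(self.log_tail):
              if rec_lsn <= durable_lsn:
                  self.snapshot = apply_redo(self.snapshot, rec)
                  self.snapshot_lsn = rec_lsn
              else:
                  kept.append((rec_lsn, rec))
          self.log_tail = kept

  p = PageState()
  p.append(1, b"a"); p.append(2, b"b")
  assert p.read_latest(lambda page, rec: page + rec) == (b"ab", 2)

The write path is just the append; materializing pages and folding the log into the
snapshot happen in the background.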
Why is this a good design?
  - The IO path is parallelized: the updates to the storage nodes all happen in parallel.
  - The amount written is (probably) smaller: log record size << data page size.
  - Garbage collection and checkpointing can be done asynchronously, in the background.
    One catch: make sure not to GC log entries that haven't fully committed.

Big difference: we replicate storage, but the DB replicas use the same storage, rather
than each replica maintaining its own replicated storage.
  This eliminates the sequential IO waits.
  Compare with Harp, e.g. Recall that in Harp you have to keep log entries on a replica
  until all replicas have applied them to disk (keep them until the GLB has passed),
  because you might need the primary's log to update a backup; so you're stuck at the
  speed of the slowest replica's checkpoints/filesystem updates.
  Harp uses nonvolatile memory for its log. Why? To keep disk IO off the critical path.
  Could we do that here? Probably. So why don't they?
  1. The kernel mods and nonvolatile memory make life really hard for system
     administration, and a UPS is expensive (although some datacenters might have
     them, for certain machines).
  2. Disks are a lot faster now: waiting for a single log entry write on an SSD is not
     a huge cost.

Recovery
  The SQL server doesn't maintain the log or the data pages on persistent local
  storage, maybe just some metadata about how many pages there are and where the
  storage servers are.
  On startup: query a read quorum (3/6) for every chunk, and compute the max log
  sequence number for which the log entry achieved a write quorum.
  *This is why we need a read quorum: we want to make sure we don't miss any write
  quorum, and a 3/6 is guaranteed to overlap with a 4/6.
  They have a slightly odd notion of "write commit": the volume complete LSN and the
  volume durable LSN.
    VCL - written to a write quorum of storage nodes
    VDL - written to a write quorum, AND is the commit point for a SQL tx; VDL <= VCL
  This lets the SQL engine break a SQL transaction's writes into smaller "mini-txs".
  Each mini-tx is a log write; some are tagged as commit points.
  Why? Pipeline writes, reduce latency, make log updates smaller. They don't specify,
  but it might be convenient to break writes up by data page when a SQL tx modifies
  several pages.
  So on recovery: compute the VCL, then look backwards to find the highest VDL, then
  truncate the log to that VDL.
  This means the SQL engine has to continually track the VDL and tell the storage
  nodes what it is so they can garbage collect (they can't GC above the VDL).

Quorums
  Important takeaway: failures in modern datacenters can be highly correlated.
  Fires, power outages, whatever else can take down an entire datacenter.
  Who here remembers a time when their nearest AWS datacenter went down?
  Undersea cables get cut, DNS gets misconfigured (Meta).
  So what do they do? (see the check below)
    3 "availability zones" -- loosely speaking, regions within a datacenter with
    separate support infrastructure.
    2 replicas per chunk per AZ, 6 total.
    Write quorum: 4 replicas. Read quorum: 3 replicas.
    Ensures writes intersect; ensures every read sees every write.
    Write availability in response to 1 full AZ outage.
    No data loss (full read availability) in response to 1 AZ outage + 1 random failure.
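To make the overlap and availability claims concrete, a quick sanity check of the
6-replica, 2-per-AZ, write-4/read-3 configuration (a toy model written for these
notes, not code from the paper):

  from itertools import combinations

  N, W, R = 6, 4, 3
  az = {0: "az1", 1: "az1", 2: "az2", 3: "az2", 4: "az3", 5: "az3"}
  replicas = set(range(N))

  # Any two write quorums intersect (4 + 4 > 6), so two concurrent writes can't both
  # "succeed" on disjoint sets of nodes.
  assert all(set(a) & set(b)
             for a in combinations(replicas, W)
             for b in combinations(replicas, W))
  # Any read quorum intersects any write quorum (4 + 3 > 6), so a read quorum is
  # guaranteed to see every write that achieved a write quorum.
  assert all(set(w) & set(r)
             for w in combinations(replicas, W)
             for r in combinations(replicas, R))

  def alive(failed_az=None, extra_failures=0):
      return len({x for x in replicas if az[x] != failed_az}) - extra_failures

  assert alive("az1") >= W                     # one whole AZ down: can still write
  assert alive("az1", extra_failures=1) >= R   # AZ down + 1 more node: can still read
  print("quorum properties hold")

With an AZ plus one more node gone, you can still form a read quorum, so no data is
lost and the missing replicas can be rebuilt; new writes, though, stall until repair.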
  Is this sufficient? We need "mean time to repair" << "mean time to fault": we want a
  very small chance of a double fault. Hard to answer yes/no without looking in detail
  at their production system.
  A 10GB chunk on a 10Gbps link -> copy the chunk in around 10s.
  Losing a disk means losing many chunks, but recovery is parallelizable, maybe; it
  depends on the allocation scheme for chunks to storage nodes.
    If 100 chunks are all replicated on the same 6 servers, then recovering from one
    fault on those 6 is slow (and another 100 chunks on the next 6, and so on).
    But if the chunks are mixed up across many servers, then recovery of a lost disk
    can draw from many different storage servers (and maybe you can spread what was on
    the lost disk across many separate storage servers) -> parallelizable -> faster.
  There can be other kinds of failure correlations as well.
    Hard disks from the same manufacturing batch might all fail at around the same
    time, so maybe the datacenter should mix disk batches.
    As we talked about earlier (e.g. Harp), a software bug might hit several replicas
    simultaneously.

Other cool things
  Storage replication makes sysadmining easier.
  Overloaded node: mark one chunk on the node as failed and bring a replica up
  elsewhere. This is handled the same way as a node outage: some automated service
  notices a chunk is missing and assigns it to another node / brings up a new storage
  node.
  Patching or upgrading a storage node's software: just take the node offline; since
  the system handles failures, this doesn't significantly affect operation.
  Important: you don't have to turn off the SQL machine to update storage or adjust
  disk usage.
  Updating the SQL engine: because persistent storage lives on the storage machines,
  not on the SQL machine, there's very little state maintained by the SQL machine --
  mostly things like lists of open connections with clients. As such, the small
  remaining state can be written to local storage very quickly.
  This enables a so-called "zero-downtime patch": wait for a moment when there are no
  active database transactions (so no pending writes), write the state to local disk,
  restart the service with the new config, reload, and continue. This introduces
  latency for connections during the upgrade, but no connections are dropped.
  Note: this is only possible because we can do it without flushing, e.g., a data page
  cache to disk. Not impossible otherwise, but in general less local state -> less
  work -> less time to do an upgrade.

Read replicas (see the sketch below)
  Backup SQL instances can serve read requests.
  The goal is for their state not to lag the primary by too much.
  Lag time: the time to download the write request from the primary plus the time to
  apply it to the local cache. (Have to be careful to apply only log entries that have
  fully committed at the primary, i.e. those at or below the volume durable LSN.)
  They claim a lag time of about 20ms. They also claim one of their clients
  occasionally saw 12 minutes (!) of replica lag under the strawman setup.
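A sketch of that replica-side apply rule, with assumed names (ReadReplica, apply_redo);
the real engine does more, but the key check is "apply a record only if its LSN is at
or below the primary's advertised VDL, and only to pages already in the cache":

  # Toy model: the primary streams (lsn, page_id, record) and piggybacks its VDL.
  class ReadReplica:
      def __init__(self, apply_redo):
          self.cache = {}          # page_id -> (page_bytes, lsn)
          self.pending = []        # records above the current VDL, held back
          self.vdl = 0
          self.apply_redo = apply_redo

      def on_log_record(self, lsn, page_id, rec, vdl):
          self.vdl = max(self.vdl, vdl)
          self.pending.append((lsn, page_id, rec))
          self.pending.sort()
          held = []
          for l, pid, r in self.pending:
              if l > self.vdl:
                  held.append((l, pid, r))            # not durable yet, don't show it
              elif pid in self.cache:
                  page, _ = self.cache[pid]
                  self.cache[pid] = (self.apply_redo(page, r), l)
              # records for uncached pages are dropped; a later read just fetches
              # the page from a storage node instead
          self.pending = held

  r = ReadReplica(lambda page, rec: page + rec)
  r.cache["p1"] = (b"", 0)
  r.on_log_record(1, "p1", b"x", vdl=0)      # VDL still 0: record is held back
  r.on_log_record(2, "p2", b"y", vdl=2)      # VDL advances to 2: lsn 1 gets applied
  assert r.cache["p1"] == (b"x", 1)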
So, main takeaway?
- Think about your failure model
- Correlated failures
- How can you reduce mean time to recovery?
- You don't necessarily need many replicated copies of a replicated storage system,
  just one replicated storage system
- The log is the database; everything else serves as a cache of the log
- Huge operational benefits to having a resilient, easily repairable datastore
- Much easier to have many chunks distributed across many servers than a few large
  servers that are exactly replicated

Digression: Chain Replication (if there's time)
- The diagram in the paper compares their replicated datastore against something
  called "chain replication", a replication strategy that's particularly bad for their
  use case but quite useful in other contexts. And it's fairly simple.

What is chain replication? (toy sketch below)
  Replicas are ordered in a chain (head -> ... -> tail).
  Writes are sent to the head. Each node commits the write to its local disk and then
  forwards the write request to the next node. So each node always knows more than its
  successor and less than its predecessor.
  So where do queries go? We want to query data that's committed, so queries go to the
  tail.
  Benefit of this technique? Simplicity.
  What happens on failure?
    Head fails -> the next node becomes the head. Any writes pending at the head are
    dropped, but that's ok: they hadn't committed yet.
    Tail fails -> its predecessor becomes the tail. Any updates the old tail hadn't
    yet received from the second-to-last server are instantly committed (since they're
    already written at the new tail).
    A middle node fails -> the adjacent servers connect to each other. Each server
    maintains a list of all the updates it has sent that haven't committed at the
    tail; on connect, it forwards all uncommitted updates. (This list is GC'd by
    having the tail periodically send out an ack of its latest update.)
  Don't need any kind of voting protocol.
  Very little network usage: just a single dedicated link between adjacent nodes.
  Downside: high latency, many sequential writes.
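A toy, single-process sketch of such a chain (names and structure are mine, not from
either paper): writes enter at the head and flow down, reads go to the tail, and each
node keeps the list of updates not yet acked by the tail so a middle-node failure can
be repaired by re-forwarding:

  class Node:
      def __init__(self):
          self.store = {}       # this node's committed key/value state ("local disk")
          self.unacked = []     # updates forwarded but not yet acked by the tail
          self.next = None
          self.prev = None

      def write(self, key, value):
          self.store[key] = value
          if self.next is not None:
              self.unacked.append((key, value))
              self.next.write(key, value)      # forward down the chain
          else:
              self.ack(key, value)             # tail: update is now committed

      def ack(self, key, value):
          # The tail's ack flows back toward the head so each node can GC its list
          # (the real protocol batches these acks periodically).
          if (key, value) in self.unacked:
              self.unacked.remove((key, value))
          if self.prev is not None:
              self.prev.ack(key, value)

      def read(self, key):
          return self.store.get(key)           # only meaningful at the tail

  def build_chain(n):
      nodes = [Node() for _ in range(n)]
      for a, b in zip(nodes, nodes[1:]):
          a.next, b.prev = b, a
      return nodes

  chain = build_chain(3)
  head, tail = chain[0], chain[-1]
  head.write("x", 1)
  assert tail.read("x") == 1 and not head.unacked

In a real deployment the forwarding and acks are messages between machines, not
synchronous calls; the recursion just keeps the sketch short.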