FAWN ==== What is the goal of this paper? Key-value store with lowest total cost of ownership (TCO) Some key observations: 50% of a system's 3-year TCO can be power + cooling! I/O is the bottleneck for many key-value stores Fast CPUs are expensive, disproportionately consume power More cycles/sec, but also more voltage required for higher clocks Speculative execution, branch prediction, out of order/superscaler Big caches consume power Low power states not nearly low enough, so idle CPUs bad Plus motherboards, power supplies still drawing power 20% computing capacity draws 50% of power still (p. 2, sec. 2) 2GB DRAM consumes as much power as 1TB disk Flash is low power and fast for random reads What is flash (NAND flash for SSDs)? Store bits by charging up "floating gates" on MOSFET transistors Solid state makes random reads way faster than spinning disk! Can read and write individual pages (e.g., 2-4KB) But only get a finite number of write cycles (e.g., 1,000-100,000) Flash translation layer (FTL) in SSDs provides *wear leveling* So writing same logical block over and over won't wear it out Must erase entire block (32-64 pages) before writing pages (slow)! So FTL keeps reserve of pre-erased blocks, uses log structure Write amplification: must erase/write whole block for small write Makes random writes disastrous What is proposed hardware configuration & power consumption? Wimpy back-end nodes store actual key value pairs PCEngine Alix 3c2: 500MHz CPU, 256MB RAM, 4GB flash, 100Mbps Ethernet Comsumes 3W idle, 6W busy One modest (but less wimpy) front end machine per ~80 back-ends Less detail, but some intel Atom with 1Gbps Ethernet, consumes 27W Ethernet switches (10W for each 16-port switch in paper's testbed) How is data assigned to nodes? Back-end servers use consistent hashing, tunable replication factor R Back-end has V virtual nodes each running a FAWN-DS data store Each FAWN-DS has a separate log file for that virtual node Also has an in-memory hash table indexing log file 6 bytes per bucket entry: 4-byte offset, 15-bit keyfrag, 1-bit valid Periodically also write checkpoint of hash table to log Front-end servers divide keyspace into equal-size wedges Single management node assignes wedges to front-end nodes Nodes forward out-of-range requests to appropriate other front-end node Apply caching at front end On join, back-end nodes responsible for notifying appropriate front-ends p. 4 (first par): "With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record." Is this correct? No: That's probability of two random keys in same bucket colliding Obviously number of false fetches depends on hash table load Reduces collisions by factor of 2^{15} over not having fragments in index What's going on in Figure 3? Confusing, because text says "hash chaining", but pseudo-code doesn't show Each entry 6 bytes, so makes sense wouldn't also want a 4-byte pointer Looks like they use NUM_HASHES separate tables to find elements Why not just use open-addressed hashing? (don't know) Alternative interpretation: maybe they really do use chaining Comment mentions chaining in code, but seems to mean next loop iteration If chaining, maybe have multiple hash tables during maintenance? How does a read from FAWN-DS work? How does a write work? Must log a record to appropriate log file on disk Then update/insert pointer in the in-memory hash table How does a delete work? Like a 0-byte write except record key as invalid Log delete record and update hash table What are semi-random writes, why is it good that FAWN-KV has them? Each of V virtual nodes appending to its own file Many flash devices seem to be good at this (fig. 9, p. 9) What is chain replication? Say you want to replicate linearizable state and have Zookeeper Use zookeeper to organize replicas into a chain Send all writes to head of chain and reads to tail of chain Both read & write replies come from tail of chain: read--\ write -> [head] -> [node] -> [tail] -> reply All updates ordered by head; read results already replicated everywhere How do you remove a node? Remove the head? Just make second node the head Remove the tail? Okay, doesn't have any info other nodes don't have Remove X in A -> X -> B? Now must ensure B has all info A pushed to X So downstream nodes ack writes, starting with the tail Upstream nodes just track unacknowledged writes so can resend Everything ordered by head, so can have sequence numbers How do you add a node? Must copy state from predecessor. E.g., pre-copy snapshot, predecessor logs and sends writes during pre-copy How does chain replication work with FAWN-KV? Chain along consistent hashing circle; any node change is R joins+leaves Leverage the fact that FAWN-DS is log-structured for joins/leaves (3.4.4) Leaves: Keep a log pointer to last write unaknowledged by downstream nodes Know exactly what you might need to retransmit after failure Joins: Walk through Figure 8 (p. 7) (insert C1 in B1->C1->D1->E1) Phase 1: Pre-copy from current tail of the chain (100s of seconds) Phase 2: Head sends "chain membership message" Causes nodes to update membership So B1 starts sending new writes to C1 But C1 keeps them in a special temporary log Finally, E1 get membership message, flushes it + all prior writes to C1 C1 now knows it has all writes until chain membership message Now C1 applies temporary log (from B1) to real data store What if joining as head or tail? Must notify appropriate front-end node What if joining as tail (bottom of page 7)? "node joins before current tail and replies to put requests ...predecessor serves as interim tail for gets" Linearizability guaranteed? Not as described--must be some other trick How do split/merge/compact work? (sec. 3.3.3, p. 5) Scan existing log(s) to create new log(s) How do you know what to write to new log? Check hash table to see if current log entry is live Then briefly lock the data store (20-30ms) to switch to new log(s) Presumably also write checkpoints to new log? Any corner cases to worry about? Node fails during split/merge? Ok, because local operation Node fails during join? Chain membership message gets stuck? Chain membership message should be no different from other writes Front-end machine fails during back-end join/leave? New front end can query back-end machines for authoritative information Non-transitive communication outages? System gets stuck (homework), but can recover Network bisected? Maybe node manager preserves consistency? Does FAWN do something useful with adequate performance/cost? Scalable key-value stores are clearly useful if performance adequate Does FAWN save power and TCO over the current approach? 330 queries/Joule at peak rate (p. 10) Conventional at most 52 queries/Joule (sec. 5.1, p. 10) FAWN hardware both cheaper and smaller datacenter footprint Could alternate approaches to FAWN do better? (Fig. 16) Why the Traditional slice in Figure 16? Assumes FAWN nodes have at most 2GB DRAM, vs. much more for conventional What explains drop from 125MB to 250MB in Table 2? 256MB RAM can fit most of 250MB DS, making reads not go to flash Figure 10: Why are writes faster than reads? Log structure makes writes fast on flash Figure 11: Why is this only 70% of the raw random reads reported in Table 2? Marshaling overhead But also load balance is not perfect Could wimpy node w. 100Mbps Ethernet become hot spot? Should be alleviated by front-end caching Keys are hashed, so hot keys uniformly distributed around ring Does Dynamo make you question consistent choice of hashing? What benefits did Dynamo get by ditching, and do we care? 1. Easier reconciliation/avoid re-scanning when nodes join+leave system Maybe don't care since we have to re-scan for compaction 2. Better load balancing Maybe Figure 15 would look better Figure 12: Why are writes more efficient than reads? (footnote 4) Flash erase+write much *more* expensive than read But FAWN-KV's log structured storage requires less work per write Will this scale? Data partitioning seems scalable What about network bandwidth? Will switches consume more power? What is "full bisection bandwidth" (sec 4.2, p. 10), fat tree (p. 11)? Fig 13: Does this tell us anything interesting? Only 3 nodes affected, but measuring whole system throughput Tail latency or specific node performance would be more interesting What experiments would you need to consider replacing Dynamo with FAWN? Normal case (non-split/join) 99.9%ile latency competitive with Dynamo Doesn't even sacrifice consistence or require buffered writes But would it scale sufficiently? Chain replication maybe not great for tail latency Replication across data centers / regions?