Dynamo: Amazon's Highly Available Key-value Store
=================================================

Brief intro to linearizability
  Suppose you had one server getting requests from many clients
  Concurrent requests can be arbitrarily ordered by the network, however:
    1. The server will execute requests in some particular order
    2. Non-concurrent requests will be executed in temporal order
  The above two properties are known as *linearizability*
  No big deal with one server, but the gold standard for distributed systems--why?
    Easy to reason about, but costly to achieve
    Danger: under network partitions, could end up with "divergent" states
  Paxos lets you achieve linearizability with very few assumptions
  In class we are studying systems that push the envelope or "cheat" a bit

Dynamo is a simple key-value store, used in a production environment
  Amazon's applications do not need the full power of SQL queries
  But all else being equal, traditional consistency would be fine...

What are traditional "ACID" database semantics (§2.1)?
  Atomicity - all or none of a transaction's actions happen
  Consistency - a transaction leaves data in a valid state (all invariants hold)
  Isolation - all actions in a transaction appear to happen before or after
    those of other transactions
  Durability - effects of committed transactions survive reboots

Why doesn't Amazon want to pursue traditional ACID semantics?
  Problem: consistency vs. availability and response time (§2.3)
  In many scenarios, it is actually better to get a wrong answer than a slow
    answer or none at all
  Exacerbating the problem is the focus on 99.9th-percentile latency--why?
    Exceeding the latency target costs Amazon serious money (blocks purchases)
    A single end-user operation may require many Dynamo operations
    Any one of them could blow the whole operation and lose a customer
    Strict SLAs ensure such failures are kept to a minimum

What does the Dynamo API look like (§4.1)?
  get(key) -> (context, list of values)
  put(key, context, value) -> void
  What is this weird context value from the client's point of view?
    Just an opaque string of bytes that must be sent back in a put
    But the idea is to help the server resolve conflicts (more on this later)

What are the issues in deciding how to spread keys amongst the servers?
  Replicate values on at least N machines (in case of failure)
  Handle heterogeneous nodes (new machines more powerful than old)
  Minimize churn when machines are added or removed

What is consistent hashing?
  Idea: assign keys to nodes so as to minimize churn on addition/deletion
  Use the ring trick shown in Figure 2, used by Akamai
  Popularized by Chord, a multi-hop key-value storage system
  Note: each server needs many tokens to spread join/leave effects
  How to handle heterogeneity?  Scale the number of tokens with the power
    of the machine
  (A small sketch of the ring and preference lists appears below)

Some good alternatives to the Chord ring to know about:
  Kademlia - store on the server with token T that minimizes T XOR key
    Popularized by trackerless BitTorrent (and some malware...)
    Easy to find servers with a tree data structure
    Nearby servers will tend to share the same key subspace in a symmetric way
  CARP - store on the server S that maximizes: machine-power * Hash(S, key)
    + No need for multiple tokens--joins/leaves spread load uniformly
    - Need to re-hash everything on joins/leaves

How does Dynamo achieve geographic replication?
  §4.6: "In essence, the preference list of a key is constructed such that
    the storage nodes are spread across multiple data centers."
  Hmm... not clear how that is compatible with consistent hashing
  Maybe they ensure this while doing token assignment to servers?
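A minimal sketch of the ring-with-tokens idea and distinct-node preference
lists, assuming MD5 as the ring hash (as in §4.3 of the paper); the class,
node names, and token counts are illustrative, not Dynamo's actual code:

  import hashlib
  from bisect import bisect_right, insort

  def ring_hash(s):
      """Place a string on the ring; the paper hashes keys with MD5."""
      return int(hashlib.md5(s.encode()).hexdigest(), 16)

  class Ring:
      """Consistent-hashing ring with virtual nodes ("tokens")."""

      def __init__(self):
          self._positions = []   # sorted token positions on the ring
          self._owner = {}       # token position -> physical node name

      def add_node(self, node, num_tokens):
          # Each physical node gets many tokens; more powerful machines get
          # more, so they take a proportionally larger share of the key space.
          for i in range(num_tokens):
              t = ring_hash("%s#%d" % (node, i))
              self._owner[t] = node
              insort(self._positions, t)

      def preference_list(self, key, n):
          # Walk clockwise from the key's position, collecting the first n
          # *distinct* physical nodes (skip tokens of nodes already chosen).
          start = bisect_right(self._positions, ring_hash(key))
          nodes = []
          for i in range(len(self._positions)):
              pos = self._positions[(start + i) % len(self._positions)]
              node = self._owner[pos]
              if node not in nodes:
                  nodes.append(node)
              if len(nodes) == n:
                  break
          return nodes

  ring = Ring()
  ring.add_node("A", num_tokens=8)
  ring.add_node("B", num_tokens=8)
  ring.add_node("C", num_tokens=16)   # e.g., twice as powerful => twice the tokens
  print(ring.preference_list("shopping-cart:12345", n=3))  # 3 distinct nodes

Note how adding or removing one node only moves the keys adjacent to that
node's tokens, which is the whole point of the ring over naive mod-N hashing.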
What is a "quorum technique"?
  Say you store a value on N servers and care about linearizability...
  A write must reach some write-set of W servers before being deemed complete
  A read must hear from some read-set of R servers before being deemed complete
  If every write-set and every read-set have a non-empty intersection:
    Guarantees a reader will see the effects of a previously completed write
    An easy way to guarantee this is to ensure R + W > N
    Convince yourself this works even with failures within the fixed N nodes
  Each individual Dynamo instance is configured with its own N, R, W

What happens when a client issues a request (§4.5)?
  A key's N successors around the Chord ring are its *preference list* (§4.3)
  Option 1: send to a generic load balancer, which sends to a Dynamo server
  Option 2: clients have the server map and know where to send (§6.4)
    Clients poll a random server every 10 seconds for membership changes
  Then: if the receiving server is not in the preference list, forward to a
    server (ideally the first) in the preference list
    Will usually only be necessary for Option 1 above
    Forwarding is required for writes, optional for reads
  Wait for W or R servers (depending on the op), then reply
  Keep waiting for stragglers, to update them and simplify anti-entropy (§5)

What happens when a server is down or partitioned (§4.6)?
  Hinted handoff: write values to the next server around the ring
    E.g., in Fig. 2 with N=3, if B is down, store key K on E
    But tell E that the write was really intended for B
    Server E stores hinted values in a separate file,
      making it easy to transfer the whole file to the target (B) when it recovers
  What does this do to our quorum algorithm!?!
    Basically breaks it--they call this a "sloppy quorum"
    But it fits the philosophy of prioritizing availability over correctness

What happens if a server is permanently down?
  An administrator must manually decide this and configure the system
  Fortunately, new writes have been going to the right place
    But old values may now be replicated at only N-1 servers
  The administrator may also add new nodes to replace old ones
  After an add/remove/failure--how to ensure state is properly replicated?
    Transfer the needed ring arcs from other servers
    A confirmation round (§4.9) ensures you don't get something you don't need
    In many cases, don't need to transfer a whole arc
      Just want to fix up a few missing entries
      Merkle trees make it easy to compare state when only small parts are missing

How does every node know the composition of the whole Chord ring?
  Only a human administrator can change ring membership
  Gossip protocol (§4.8) - every second, exchange membership info with a random peer
  When adding servers, what's a seed? (§4.8.2)
    A server known to all--want to make sure you don't get divergent parallel rings

What can lead to an update conflict?
  Servers partitioned, multiple coordinators
  Concurrent client updates

How are such conflicts resolved?
  Servers may resolve "syntactically" using vector clocks (discuss; see the
    vector-clock sketch below)
  Note Dynamo truncates vector clocks at 10 entries
  Failing that, clients get multiple answers and resolve them "semantically"
  Can also have servers just do something (last update wins, based on timestamps)

How does Dynamo modify the Chord ring for load balancing (§6.2)?
  Split the ring into fixed, equal-size arcs (Figure 7, strategy 3)
  How does this help the system?
    Store keys for each segment in a separate file--makes transfer easy
  How does this reduce the size of membership info by three orders of magnitude?
    Could draw a horizontal line on the graph in Fig. 8 and see this
      But you wouldn't necessarily run Dynamo in those two states
    Segment numbers are smaller than MD5 hashes, but not 1,000x smaller
    Not totally clear...

No global failure state (§4.8.3)--why?
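A minimal sketch of the "syntactic" reconciliation step referenced above:
each version of a value carries a vector clock (node -> counter map); a
version dominated by another can be dropped, and whatever survives is
returned to the client for semantic merging.  This is illustrative only--the
node names, value shapes, and helper functions are made up, not Dynamo's API:

  def descends(a, b):
      """True if vector clock a has seen everything clock b has (a >= b)."""
      return all(a.get(node, 0) >= count for node, count in b.items())

  def bump(clock, coordinator):
      """The write coordinator increments its own entry before storing a new
      version.  (Dynamo also truncates clocks past ~10 entries by dropping
      the oldest pair; that bookkeeping is omitted here.)"""
      clock = dict(clock)
      clock[coordinator] = clock.get(coordinator, 0) + 1
      return clock

  def reconcile(versions):
      """Drop any version strictly dominated by another; survivors are either
      a single winner or a set of concurrent versions the client must merge
      semantically (e.g., taking the union of shopping-cart items)."""
      survivors = []
      for value, clock in versions:
          dominated = any(descends(other, clock) and not descends(clock, other)
                          for _, other in versions)
          if not dominated:
              survivors.append((value, clock))
      return survivors

  v1 = (["book"],        {"Sx": 2})                # old cart
  v2 = (["book", "pen"], bump({"Sx": 2}, "Sx"))    # descends from v1
  v3 = (["mug"],         bump({"Sx": 2}, "Sz"))    # concurrent with v2
  print(reconcile([v1, v2, v3]))   # v1 dropped; v2 and v3 returned as a conflict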
Durability compromises (§6.1)

Background tasks (§6.5)

What types of task might you tune N, R, W for?  (see the illustration below)

What are the scaling bottlenecks?
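An illustration of how the N, R, W knobs trade off.  The (3,2,2) row is the
common configuration reported in §6 of the paper; the other rows are
plausible tunings for discussion, not documented Amazon settings:

  def quorums_overlap(n, r, w):
      """R + W > N guarantees every read quorum intersects every write quorum
      (before the sloppy-quorum / hinted-handoff caveat above)."""
      return r + w > n

  # (N, R, W) settings and the kind of workload you might tune them for.
  configs = [
      (3, 2, 2, "balanced; the paper's common configuration"),
      (3, 1, 3, "read-heavy: fast reads, but writes must reach every replica"),
      (3, 3, 1, "write-heavy: fast, highly available writes, weaker durability"),
      (3, 1, 1, "lowest latency, but reads may miss the latest write"),
  ]
  for n, r, w, note in configs:
      print("N=%d R=%d W=%d overlap=%-5s %s" % (n, r, w, quorums_overlap(n, r, w), note))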