Peer-to-peer systems
====================

Say you want to build a gigantic storage system
  So big each machine only knows about a small fraction of the other machines
  So big that nodes come and go all the time -- nodes owned by many people
  Napster approach -- centralized server coordinates, can be shut down
  Gnutella approach -- ad-hoc configuration, broadcast queries
    Robust to technical (and legal) attacks, but not scalable

Better approach
  Give each node a particular ID (e.g., randomly chosen)
  Have a rule for assigning keys to nodes based on node/key IDs
    Key X goes on the node with ID "closest" to X, for some notion of closest
    That node stores the data for X (e.g., Napster-style index data, not necessarily the mp3 file itself)

The lookup problem: now, given X, how do we find the node storing it?
  Need to know about some machines
  But no one can have an exact view of 1,000,000 machines coming & going
  Idea: each node tracks some small number of other nodes (e.g., log N)
    May not know who has X, but I know someone with better info than me
    So query, e.g., log N other nodes to find the successor of X

Chord lookup system is built on consistent hashing
  Each node has an ID which is a point on the consistent hashing circle
  Store key X on the successor node to X around the circle
  Routing table state:
    Finger table -- pointers half way around the circle, 1/4 of the way, 1/8, etc.
    Successor pointers -- point to the next node around the circle
    Other state -- multiple successors + a predecessor pointer (for robustness)
  Find the successor by querying other nodes
    Each query halves the distance to the key's predecessor node
    Then ask the predecessor for its successor
  So given N nodes, each node has O(log N) state
    And can locate the successor to X in O(log N) messages

What about reliability of stored data?
  In general, also need to replicate data to survive failures
  In Chord, store key/value pairs on multiple successors of the key
  Because of consistent hashing, this should survive failures well

Are O(log N) state & messages the best we can do?
  Actually, could potentially have O(1) state & O(log N) messages
    Instead of halving the distance each time, shift in bits from the right:
    E.g., node x knows node 2x and node 2x+1
    (Actually need to compensate for the fact that not every ID exists)
  But for the system to avoid partitions, nodes need Omega(log N) contacts
    Otherwise, if you lose half of the nodes, some nodes will be cut off
  Also, state is cheap while lookups are expensive
  Turns out you can have O(log N) state and O(log N / log log N) hops
    E.g., shift in roughly log log N bits at a time
  Other systems have proposed, e.g., sqrt(N) state and O(1) hops

Many so-called Distributed Hash Tables (DHTs) have been proposed
  A complete DHT algorithm has to:
  1. Define IDs and the assignment of document IDs to node IDs
  2. Define per-node routing table contents
  3. Specify the lookup algorithm that uses the routing tables
  4. Specify a join procedure to reflect new nodes in the tables
  5. Recover from node failures
     (e.g., studies have shown node half-lives can be just hours)

Today, let's talk in more detail about the Pastry system Scribe is based on

Assignment of key IDs to node IDs?
  IDs are 128-bit numbers
  Node IDs chosen by MD5 (like SHA-1) hash of the IP address
  A key is stored on the node with the numerically closest ID
  If node and key IDs are uniform, we get reasonable load balance
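A minimal Python sketch of this assignment rule (an illustration only, with a
toy 16-bit ID space and made-up names -- real Pastry uses 128-bit IDs and
tracks IP addresses too): hash node addresses and keys into one ID space, and
give each key to the node with the numerically closest ID.

  import hashlib
  from collections import Counter

  BITS = 16
  SPACE = 1 << BITS

  def make_id(s):                          # hash a node address or key name to an ID
      return int(hashlib.md5(s.encode()).hexdigest(), 16) % SPACE

  def ring_dist(a, b):                     # numeric distance with wraparound
      d = abs(a - b)
      return min(d, SPACE - d)

  def closest_node(node_ids, key):         # the node responsible for this key
      return min(node_ids, key=lambda n: ring_dist(n, key))

  node_ids = [make_id("10.0.0.%d" % i) for i in range(50)]
  load = Counter(closest_node(node_ids, make_id("key%d" % k)) for k in range(5000))
  # hashed IDs are roughly uniform, so keys spread over the nodes rather than
  # piling up on one; the balance is rough (within a log-ish factor), not perfect
  print(min(load.values()), max(load.values()))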
Routing?
  A query is at some node
  The node needs to forward the query to a node "closer" to the key
  Note that the key ID and node ID share some high bits -- an ID prefix
  Routing forwards to a node that shares more bits of prefix with the key
    Sharing more bits of prefix is the definition of "closer"
  We're descending a b-ary tree, one digit at a time
    Not the same as "numerically closer" -- edge effects (more on this below)
  How can we ensure every node always knows some node "closer" to any key?
    I.e., what's the structure of the Pastry routing table?

Routing table contents
  Divide the ID into digits
  One routing table row per digit position
  Each row contains one column per digit value
  An entry refers to a node w/ the same prefix, but a different value for this digit
    "Refers" means the entry contains the node's ID and IP address
  Example: 2 bits per digit, 3 digits per ID, ID of this node: 102
    022  102  223  312
    102  113  122  130
    100  101  102  103
  This node (102) shows up in each row
  [Note: real Pastry uses 128-bit IDs and 4-bit digits.]
  There may be many choices for each entry (e.g., 021 rather than 022)
    We can use *any* node with the right prefix
    In contrast, Chord required a specific predecessor node
    With Pastry's flexibility, we can select entries with low network latency

To forward with the routing table:
  Row index: position of the highest digit in which the key disagrees with our ID
  Column index: value of that digit in the key
  Send the query to that entry's node
  If we reach the end of the key ID, we're done: key ID == node ID
  This takes log_b(N) messages

But what if no node exists for a particular routing table entry?
  There are more possible IDs than there are nodes
  So we won't be able to fill all routing table entries
    E.g., maybe node 103 does not exist
    Indeed, only about log_b(N) rows are likely to be populated
  We can't correct any more digits: what do we do now?
  Forward to a node that's numerically closer
    We may happen to know one in our routing table
  What if the routing table contains no numerically closer node?
    Perhaps we are the right node!
    But maybe not -- maybe the node closest to the key doesn't share many bits with us
      So it isn't in our table
    Suppose 113, 122, and 130 didn't exist, and the key is 123
      Prefix routing might find node 103
      But there may be a node 200, which is numerically closer to 123
  How can we know if we're closest, or forward to the closest node?
    Easy if each node knows its numeric neighbor nodes
  Each node maintains a "leaf set"
    Refers to the L nodes with numerically closest IDs
    L/2 of them on each side of us
    Now we can tell whether a node on the other side of the key is numerically closer
    So node 103 would know about node 200
  That's the complete routing story; a small code sketch of the forwarding rule follows
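Here is a small Python sketch of that forwarding rule (illustrative only -- the
digit parameters match the toy example above, the leaf set is made up, and real
Pastry also requires fallback candidates to share at least as long a prefix and
tracks IP addresses):

  B = 2                                    # bits per digit (real Pastry uses 4)
  DIGITS = 3                               # digits per ID in this toy example
  SPACE = 1 << (B * DIGITS)

  def digit(ident, i):                     # i-th base-2^B digit, most significant first
      return (ident >> (B * (DIGITS - 1 - i))) & ((1 << B) - 1)

  def shared_prefix(a, b):                 # number of leading digits a and b share
      i = 0
      while i < DIGITS and digit(a, i) == digit(b, i):
          i += 1
      return i

  def ring_dist(a, b):                     # numeric distance with wraparound
      d = abs(a - b)
      return min(d, SPACE - d)

  def next_hop(my_id, key, table, leaf_set):
      # table[row][col] = some node sharing `row` digits with my_id whose next
      # digit is `col` (or None); leaf_set = the L numerically closest node IDs
      if key == my_id:
          return my_id
      row = shared_prefix(my_id, key)      # highest digit where the key differs
      entry = table[row][digit(key, row)]
      if entry is not None:
          return entry                     # fixes one more digit of the prefix
      # hole in the table: forward to any known node numerically closer to the
      # key, or conclude that we ourselves are closest and keep the key
      known = leaf_set + [n for r in table for n in r if n is not None]
      closer = [n for n in known if ring_dist(n, key) < ring_dist(my_id, key)]
      return min(closer, key=lambda n: ring_dist(n, key)) if closer else my_id

  def base4(s):                            # "102" -> integer ID
      return int(s, 4)

  # routing table of node 102 from the example above, but with 103 missing
  table = [[base4(x) if x else None for x in row] for row in
           [["022", "102", "223", "312"],
            ["102", "113", "122", "130"],
            ["100", "101", "102", None]]]
  leaf_set = [base4(x) for x in ["100", "101", "111", "112"]]   # made-up neighbors

  print(next_hop(base4("102"), base4("223"), table, leaf_set))  # -> 43, i.e. node 223
  print(next_hop(base4("102"), base4("103"), table, leaf_set))  # -> 18, i.e. we (102)
                                                                # keep the key: no 103,
                                                                # nothing closer known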
How does a new node acquire correct tables?
  Issues a lookup for its own key via any existing node
    Finds the numerically closest node
    Asks that node for its leaf set -- that's almost all of our leaf set
  Can't use the numerically closest node to help initialize the routing table
    We might be 133, it might be 200; no help
  Get complete tables from each node on the path of the lookup
    These are nodes that share more and more of our prefix
    We can use the first node's first row directly
    And the second node's second row, since it agrees w/ us in the first digit
  At this point the new node can forward queries correctly
  Does routing *to* us now work?
    If the new node doesn't do anything, a query will go to where it would have
      gone before we joined, i.e., to the existing node numerically closest to us
    So, for correctness, we just need to correct the leaf sets of the L
      numerically closest nodes
    That's easy: the closest node can tell us our leaf set (and we announce
      ourselves to those nodes)
  If that's all we do, lookups will take linear time
  Updating other nodes' routing tables to reflect the new node
    To preserve O(log N) lookup time
    They are either correct already, or missing an entry that we could fill
    Send our state to each node mentioned in our routing table
    And fill blank entries automatically as a side effect of lookups
    Existing nodes periodically do lookups to try to fill blank entries

What about node failures?
  Assume nodes fail w/o warning
    Strictly harder than graceful departure
  Two issues:
    Other nodes' routing tables refer to the dead node
    Other nodes' leaf sets refer to the dead node
  For routing tables: detect the timeout, treat it as an empty table entry
    I.e., route to a numerically closer entry instead
    Repair: ask any node on the same row for a copy of its corresponding entry
      Or any node on the rows below; all of these share the right prefix
  For leaf sets:
    The failed node might have been closest to the key ID!
    Need to know the next-closest
    That's why the leaf set knows of more than just the two closest nodes
    We can route around the failure, and easily update the leaf set
  The system is effectively self-correcting
    Easy to verify correctness of your own routing table (just look at prefixes)
    Easy to keep yourself in other nodes' leaf sets

What kinds of failures can the system withstand?
  If all else fails, it routes numerically using leaf sets
  This only fails if some node has lost all of its leaf set entries
  Assume independent node failure:
    Probability of all L simultaneously failing is p^L
      (p is the probability of any one node failing)
    Pretty small for reasonable values of L
  Why is independent node failure a reasonable assumption?
    Requires that nodes w/ nearby IDs not fail at the same time
      I.e., are not in the same room/building/backbone
    Probably true if nodeID = hash(IPAddress)
      And if the system is geographically big

Locality
  A lookup takes log_b(N) messages
    But they are to random nodes on the Internet!
    Will often be very far away
  Can we route through nodes close to us on the underlying network?
  This boils down to whether we have choices:
    If there are multiple correct next hops, we can try to choose the closest
    We do have choices for the routing table:
      *Any* node with the correct prefix will do for each entry
      So we can pre-choose nearby entries
    But there's no choice for leaf sets
  Pastry continuously adjusts routing table entries for locality
    Asks the current entry for that entry's complete tables
    Pings suitable nodes from the other node's tables
    Uses them instead of the current entry if the ping says they're closer
    Note that when you recurse, the new node will know of nodes close to it
      Very good if X close to Y and Y close to Z ==> X close to Z
      Which often holds on the Internet, but definitely not always
  What's the effect?
    This is the *short routes* property described in the Scribe paper
      (the proximity metric here is just ping time)
    Individual hops are lower latency
    But there's less and less choice (lower node density) as you get close in ID space
      So the last few hops are likely to be very long
    Thus you don't *end up* close to the initiating node
      You just get there quicker
  Any downside to locality routing?
    Harder to argue that failures are independent
      Probably no big deal, since there's no locality for leaf sets
    Maybe easier to trick me into using malicious nodes in my tables

How might you use something like Pastry?
  A little bit like Petal (from the Frangipani paper)
  But you don't want to trust people not to tamper with data!
  Typical trick: key = SHA-1(value)
    Build complex data structures using the SHA-1 hash as a pointer
  People have actually built file systems on DHTs, though no one uses them

Today's topic: multicast