Distributed Hash Tables (DHTs)
==============================

Goal is to build a gigantic storage system
  So big that each machine only knows about a small fraction of the other machines
  So big that nodes come and go all the time--nodes owned by many people
  No Napster-like central point of failure
  No Gnutella-like scale vs. accuracy problems

Better approach
  Give each node a particular ID (e.g., randomly chosen)
  Have a rule for assigning keys to nodes based on node/key ID
    key X goes on the node with ID "closest" to X, for some notion of closest
    That node stores the data for X (e.g., Napster index data, not necessarily the mp3 file)

The lookup problem: now, given X, how do we find the node storing it?
  Need to know about some machines
  But no one can have an exact view of 1,000,000 machines coming & going
  Idea: each node tracks some small number of other nodes
    May not know who has X, but I know someone with better info than me
    So query, e.g., log N other nodes to find the successor of X

Example: CAN (Content-Addressable Network)
  Embed each node in a d-dimensional space, say d=2 for simplicity
  Each node is responsible for a rectangular area around it
    Ratio of the rectangle's sides is 1:1, 2:1, or 1:2
  When you join:
    Pick a random point, find the closest node
    Split that node's rectangle, so the new node takes over half of its space
  Each node must know about all of its neighbors
    Neighbors are nodes whose zones overlap in d-1 dimensions and abut in one
  Now hash each key to a point in the d-dimensional space
    Store it at the node whose rectangle the point lies in
  How to find something?  Greedily route along multiple dimensions
  What is the expected cost?
    Number of hops in each dimension: O(n^{1/d})
    So we'd expect an average path of O(d * n^{1/d})
  But can we do better than O(n^{1/d})?

The Chord lookup system is built on consistent hashing
  Each node has an ID which is a point on the consistent hashing circle
  Store key X on the successor node to X around the circle
  Routing table state:
    Finger table - points half way around the circle, 1/4 way, 1/8, etc.
    Successor pointers - point to the next node
    Other state - multiple successors + a predecessor pointer (for robustness)
  Find the successor by querying other nodes
    Halve the distance to the key's predecessor node with each query
    Then ask the predecessor for its successor
  So given N nodes, each node has O(log N) state
    Can locate the successor to X in O(log N) messages

What about reliability of stored data?
  In general, also need to replicate data to survive failures
  In Chord, store pairs on multiple successors of the key
    Because of consistent hashing, this should survive failures well

Chord is a hash table, but can also be viewed as an overlay
  Internet Indirection Infrastructure (I3) uses Chord to solve
    - IP mobility
    - Multicast
    - Anycast
    - Service composition
  All require some layer of *indirection*
    In the past, each has required its own, incompatible mechanism
    How does mobile IP work?  E.g., a "home agent" handles things
    How does IP multicast work?  Clever schemes like PIM use a rendezvous point (RP)
  Can we implement a single layer of indirection suitable for all needs?
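An aside before I3: below is a minimal, runnable sketch (in Python) of the
consistent-hashing circle and the Chord-style finger-table lookup described
above. Everything is simulated locally in one process; the 16-bit ID space,
the node names, and the helper names are made up for illustration, not taken
from any real Chord implementation.

    import hashlib

    M = 16                          # bits per ID (real Chord uses 160)
    RING = 1 << M

    def chord_id(name):
        # Hash a node address or key name onto the identifier circle.
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

    def in_open(x, a, b):
        # x in the circular interval (a, b)
        if a == b:
            return x != a
        return (a < x < b) if a < b else (x > a or x < b)

    def in_half_open(x, a, b):
        # x in the circular interval (a, b]
        if a == b:
            return True
        return (a < x <= b) if a < b else (x > a or x <= b)

    class Ring:
        def __init__(self, names):
            self.nodes = sorted({chord_id(n) for n in names})
            # finger[i] of node n = successor of (n + 2^i), i.e. pointers
            # half way around the circle, 1/4 way, 1/8 way, ...
            self.finger = {n: [self.successor((n + (1 << i)) % RING)
                               for i in range(M)]
                           for n in self.nodes}

        def successor(self, x):
            # First node at or clockwise after point x (with wraparound).
            for n in self.nodes:
                if n >= x:
                    return n
            return self.nodes[0]

        def lookup(self, start, key):
            # Walk toward the key's predecessor: each step jumps to the
            # closest finger preceding the key, roughly halving the
            # remaining distance, so the number of hops is O(log N).
            n, hops = start, 0
            while not in_half_open(key, n, self.successor((n + 1) % RING)):
                nxt = n
                for f in reversed(self.finger[n]):
                    if in_open(f, n, key):
                        nxt = f
                        break
                if nxt == n:        # no closer finger known; stop early
                    break
                n, hops = nxt, hops + 1
            # Finally ask the predecessor for its successor: that node
            # stores the key.
            return self.successor((n + 1) % RING), hops + 1

    ring = Ring(f"node{i}" for i in range(100))
    node, hops = ring.lookup(ring.nodes[0], chord_id("some-key"))
    print(f"key stored on node {node}, found in {hops} messages")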
Idea: Rendezvous-based communication
  Packets are sent to generic IDs, not addresses
  To receive a packet, insert a "trigger" mapping an ID to your address
    Trigger = (ID, addr)
  IDs are m bits; let's write X.Y for an m-bit ID, where X is the first k bits
  A packet sent to ID A.B gets sent to addr from trigger (X.Y, addr) iff:
    A = X (the first k bits of the two IDs are the same), AND
    Y is the closest trigger to B

  Example: Mobile IP
    Just insert a trigger for your new IP address when you move
  Example: Multicast
    Send packets to A.B
    Everybody who wants to listen inserts a trigger on A.B
      (you can have multiple triggers on the same ID)
  Example: Anycast
    Each recipient chooses a random r and inserts a trigger for A.r
    The sender also chooses a random r and sends the packet to A.r
    One recipient for each message, roughly load-balanced

  What about large-scale multicast (many recipients)?
    Let a trigger contain another ID, so we can have (ID1, addr) or (ID1, ID2)
    This is enough to let us set up a multicast tree
  What about service composition?
    E.g., want to apply error-correcting codes to multiple flows
    Allow packets to contain stacks of IDs, not just a single target
    Triggers can now map an ID onto a stack of IDs
    Route a packet towards its first ID
      If you find a matching trigger, push its contents onto the front of the stack
      If you don't (meaning you are at the node that handles the ID), pop the front
    So by pushing non-existent IDs, you can achieve source routing
    And by putting IDs in triggers, you can achieve service composition

Are O(log N) state & messages the best we can do?
  Actually, we could potentially have O(1) state & O(log N) messages
    Instead of halving the distance each time, shift in bits from the right:
      E.g., node x knows node 2x and node 2x+1
      (Actually need to compensate for the fact that not every ID exists)
  But for the system to avoid partitions, nodes need Omega(log N) contacts
    Otherwise, if you lose 1/2 of the nodes, people will be cut off
  Also, state is cheap (except for maintenance) while lookups are expensive
  Turns out you can have O(log N) state and O(log N / log log N) hops
    E.g., shift in O(log log N) bits at a time

Many so-called Distributed Hash Tables (DHTs) have been proposed
  A complete DHT algorithm has to:
    1. Define IDs and the assignment of document IDs to node IDs
    2. Define per-node routing table contents
    3. Specify the lookup algorithm that uses the routing tables
    4. Specify the join procedure that reflects new nodes in the tables
    5. Recover from failures of nodes
       (e.g., studies have shown node half-lives can be just hours)

Let's talk in more detail about Pastry, which is in the textbook
  Next class we will see how Pastry is used for multicast

Assignment of key IDs to node IDs:
  IDs are 128-bit numbers
  Node IDs chosen by an MD5 (like SHA-1) hash of the IP address
  A key is stored on the node with the numerically closest ID
  If node and key IDs are uniform, we get reasonable load balance

Routing:
  The query is at some node
  The node needs to forward the query to a node "closer" to the key
  Note that the key ID and node ID share some high bits of ID prefix
  Routing forwards to a node that shares more bits of prefix with the key
    Sharing more bits of prefix is the definition of "closer"
    We're descending a b-ary tree, one digit at a time
    Not the same as "numerically closer" -- there are edge effects
  How can we ensure every node always knows some node "closer" to any key?
    I.e., what's the structure of the Pastry routing table?
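(Before answering, a quick sketch of the two notions of "closer" just
mentioned -- shared prefix length vs. numeric distance -- written in Python
with the toy parameters used in the example below: 2 bits per digit, 3 digits
per ID. The helper names are hypothetical.)

    BITS_PER_DIGIT = 2                 # real Pastry uses 4
    NUM_DIGITS = 3                     # real Pastry uses 128/4 = 32
    BASE = 1 << BITS_PER_DIGIT

    def digits(ident):
        # ID as a list of base-4 digits, most significant first.
        return [(ident >> (BITS_PER_DIGIT * i)) & (BASE - 1)
                for i in reversed(range(NUM_DIGITS))]

    def shared_prefix_len(a, b):
        # Number of leading digits the two IDs agree on ("prefix closeness").
        n = 0
        for da, db in zip(digits(a), digits(b)):
            if da != db:
                break
            n += 1
        return n

    def numerically_closer(a, b, key):
        # True if a is numerically closer to key than b is.
        return abs(a - key) < abs(b - key)

    # The two notions can disagree.  With key 123 (base 4), node 103 shares
    # one leading digit with the key while node 200 shares none, yet 200 is
    # numerically closer (123, 103, 200 are 27, 19, 32 in decimal).
    key, n103, n200 = int("123", 4), int("103", 4), int("200", 4)
    print(shared_prefix_len(n103, key), shared_prefix_len(n200, key))   # 1 0
    print(numerically_closer(n200, n103, key))                          # True

Pastry's routing table corrects prefix digits; its leaf set handles the cases
where the two notions disagree, as described next.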
Routing table contents
  Divide the ID into digits
  One routing table row per digit position
  Each row contains one column per digit value
  An entry refers to a node with the same prefix, but a different value for this digit
    "Refers" means it contains the node's ID and IP address

  Example: 2 bits per digit, 3 digits per ID.  ID of this node: 102

    022  102  223  312
    102  113  122  130
    100  101  102  103

  This node (102) shows up in each row.
  [Note real Pastry uses 128-bit IDs, 4-bit digits.]
  There may be many choices for each entry (e.g., 021 rather than 022)
    We can use *any* node with the right prefix
    In contrast, Chord required a specific predecessor node
    With Pastry's flexibility, we can select nodes with low network latency

To forward with the routing table:
  Row index: position of the highest digit in which the key disagrees with our ID
  Column index: value of that digit
  Send the query to that node
  If we reach the end of the key ID, we're done: key ID == node ID
  This takes log_b(N) messages

But what if no node exists for a particular routing table entry?
  There are more possible IDs than there are nodes
    So we won't be able to fill all routing table entries
    E.g., maybe node 103 does not exist
    Indeed, only about log_b(N) rows are likely to be populated
  We can't correct digits: what do we do now?
    Forward to a node that's numerically closer
    We may happen to know one in our routing table
  What if the routing table contains no numerically closer node?
    Perhaps we are the right node!
    But maybe not -- maybe the node closest to the key doesn't share many bits with us,
      so it isn't in our table
    Suppose 113, 122, and 130 didn't exist, and the key is 123
      Prefix routing might find node 103
      But there may be a node 200, which is numerically closer to 123
  How can we know if we're closest, or forward to the closest?
    Easy if each node knows its numeric neighbor nodes
    Each node maintains a "leaf set"
      Refers to the L nodes with numerically closest IDs, L/2 on each side of us
    Now we can tell if a node on the other side of the key is numerically closer
      So node 103 would know about node 200
  That's the complete routing story.
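To tie the pieces together, here is a minimal local sketch (Python) of
Pastry-style forwarding with the notes' toy parameters: a static set of nodes,
no joins or failures, and corner cases such as ID-space wraparound glossed
over. It is not the real Pastry code, and it spells out two details from the
Pastry paper that the notes skip: the leaf-set range is checked before the
routing table, and the rare-case fallback hop must share at least as long a
prefix with the key as the current node does (this is what guarantees
progress).

    BITS, NDIGITS = 2, 3               # real Pastry: 4-bit digits, 128-bit IDs
    BASE = 1 << BITS
    L = 4                              # leaf set size, L/2 per side

    def b4(s):                         # "102" -> integer ID
        return int(s, 4)

    def digits(n):                     # ID as base-4 digits, high digit first
        return [(n >> (BITS * i)) & (BASE - 1) for i in reversed(range(NDIGITS))]

    def show(n):
        return "".join(map(str, digits(n)))

    def shared_prefix(a, b):           # number of leading digits that agree
        k = 0
        for da, db in zip(digits(a), digits(b)):
            if da != db:
                break
            k += 1
        return k

    class Node:
        def __init__(self, ident, all_ids):
            self.id = ident
            ring = sorted(all_ids)
            i = ring.index(ident)
            clockwise = ring[i + 1:] + ring[:i]
            # Leaf set: the L/2 numerically closest nodes on each side.
            self.leaves = set(clockwise[:L // 2] + clockwise[-(L // 2):])
            # Routing table: row = shared prefix length, column = next digit.
            # Any node with the right prefix would do; keep the first seen.
            self.table = [[None] * BASE for _ in range(NDIGITS)]
            for o in all_ids:
                if o != ident:
                    r = shared_prefix(ident, o)
                    c = digits(o)[r]
                    if self.table[r][c] is None:
                        self.table[r][c] = o

        def next_hop(self, key):
            # Return the next node for key; self.id means we should keep it.
            group = self.leaves | {self.id}
            if min(group) <= key <= max(group):
                # Key falls within our leaf set's range: the numerically
                # closest member (possibly ourselves) is the destination.
                return min(group, key=lambda n: abs(n - key))
            r = shared_prefix(self.id, key)    # first digit that disagrees
            c = digits(key)[r]
            if self.table[r][c] is not None:
                return self.table[r][c]        # fix one more digit of prefix
            # Rare case: no such node.  Forward to any known node that is
            # numerically closer and shares at least as long a prefix.
            known = self.leaves | {x for row in self.table for x in row if x}
            cands = [n for n in known
                     if shared_prefix(n, key) >= r
                     and abs(n - key) < abs(self.id - key)]
            return min(cands, key=lambda n: abs(n - key)) if cands else self.id

    # The notes' example: 113, 122, and 130 do not exist, but 200 does.
    ids = [b4(x) for x in ("022", "100", "101", "102", "103", "200", "223", "312")]
    nodes = {i: Node(i, ids) for i in ids}

    key, cur, path = b4("123"), b4("022"), []
    for _ in range(2 * NDIGITS + 2):           # hop bound, plenty for a sketch
        path.append(cur)
        nxt = nodes[cur].next_hop(key)
        if nxt == cur:
            break
        cur = nxt
    print(" -> ".join(map(show, path)), "holds", show(key))
    # prints: 022 -> 101 -> 103 -> 200 holds 123

In this sketch the query for key 123 ends up at node 200, the numerically
closest node, because 103's leaf set knows about 200 -- matching the story
above.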