Distributed Hash Tables (DHTs)
==============================

Goal is to build a gigantic storage system
  So big that each machine only knows about a small fraction of the other machines
  So big that nodes come and go all the time--nodes owned by many people
  No Napster-like central point of failure
  No Gnutella-like scale vs. accuracy problems

Better approach
  Give each node a particular ID (e.g., randomly chosen)
  Have a rule for assigning keys to nodes based on node/key ID
    key X goes on the node with ID "closest" to X, for some notion of closest
    That node stores the data for X (e.g., Napster index data, not necessarily the mp3 file)

The lookup problem: now, given X, how do we find the node storing it?
  Need to know about some machines
  But no one can have an exact view of 1,000,000 machines coming & going
  Idea: each node tracks some small number of other nodes
    May not know who has X, but I know someone with better info than me
    So query, e.g., log N other nodes to find the successor of X

Example: CAN (Content-Addressable Network)
  Embed each node in a d-dimensional space, say d=2 for simplicity
  Each node is responsible for a rectangular area around it
    Ratio of the rectangle's sides is 1:1, 2:1, or 1:2
  When you join:
    Pick a random point, find the closest node
    Split that node's rectangle, so the new node takes over half of its space
  Each node must know about all of its neighbors
    Neighbors are nodes whose zones overlap in d-1 dimensions and abut in one
  Now hash each key to a point in the d-dimensional space
    Store it at the node whose rectangle the point lies in
  How to find something?  Greedily route along multiple dimensions
  What is the expected cost?
    Number of hops in each dimension: O(n^{1/d})
    So we'd expect an average path of O(d * n^{1/d})
  But can we do better than O(n^{1/d})?

The Chord lookup system is built on consistent hashing
  Each node has an ID which is a point on the consistent hashing circle
  Store key X on the successor node to X around the circle
  Routing table state:
    Finger table - points half way around the circle, 1/4 way, 1/8, etc.
    Successor pointers - point to the next node
    Other state - multiple successors + a predecessor pointer (for robustness)
  Find the successor by querying other nodes
    Halve the distance to the key's predecessor node with each query
    Then ask the predecessor for its successor
  So given N nodes, each node has O(log N) state
    Can locate the successor to X in O(log N) messages

What about reliability of stored data?
  In general, also need to replicate data to survive failures
  In Chord, store pairs on multiple successors of the key
    Because of consistent hashing, this should survive failures well

Chord is a hash table, but can also be viewed as an overlay
  Internet Indirection Infrastructure (I3) uses Chord to solve
    - IP mobility
    - Multicast
    - Anycast
    - Service composition
  All require some layer of *indirection*
    In the past, each has required its own, incompatible mechanism
    How does mobile IP work?  E.g., a "home agent" handles things
    How does IP multicast work?  Clever schemes like PIM use a rendezvous point (RP)
  Can we implement a single layer of indirection suitable for all needs?
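An aside before I3: below is a minimal, runnable sketch (in Python) of the
consistent-hashing circle and the Chord-style finger-table lookup described
above. Everything is simulated locally in one process; the 16-bit ID space,
the node names, and the helper names are made up for illustration, not taken
from any real Chord implementation.

    import hashlib

    M = 16                          # bits per ID (real Chord uses 160)
    RING = 1 << M

    def chord_id(name):
        # Hash a node address or key name onto the identifier circle.
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

    def in_open(x, a, b):
        # x in the circular interval (a, b)
        if a == b:
            return x != a
        return (a < x < b) if a < b else (x > a or x < b)

    def in_half_open(x, a, b):
        # x in the circular interval (a, b]
        if a == b:
            return True
        return (a < x <= b) if a < b else (x > a or x <= b)

    class Ring:
        def __init__(self, names):
            self.nodes = sorted({chord_id(n) for n in names})
            # finger[i] of node n = successor of (n + 2^i), i.e. pointers
            # half way around the circle, 1/4 way, 1/8 way, ...
            self.finger = {n: [self.successor((n + (1 << i)) % RING)
                               for i in range(M)]
                           for n in self.nodes}

        def successor(self, x):
            # First node at or clockwise after point x (with wraparound).
            for n in self.nodes:
                if n >= x:
                    return n
            return self.nodes[0]

        def lookup(self, start, key):
            # Walk toward the key's predecessor: each step jumps to the
            # closest finger preceding the key, roughly halving the
            # remaining distance, so the number of hops is O(log N).
            n, hops = start, 0
            while not in_half_open(key, n, self.successor((n + 1) % RING)):
                nxt = n
                for f in reversed(self.finger[n]):
                    if in_open(f, n, key):
                        nxt = f
                        break
                if nxt == n:        # no closer finger known; stop early
                    break
                n, hops = nxt, hops + 1
            # Finally ask the predecessor for its successor: that node
            # stores the key.
            return self.successor((n + 1) % RING), hops + 1

    ring = Ring(f"node{i}" for i in range(100))
    node, hops = ring.lookup(ring.nodes[0], chord_id("some-key"))
    print(f"key stored on node {node}, found in {hops} messages")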
Idea: Rendezvous-based communication
  Packets are sent to generic IDs, not addresses
  To receive a packet, insert a "trigger" mapping an ID to your address
    Trigger = (ID, addr)
  IDs are m bits; let's write X.Y for an m-bit ID, where X is the first k bits
  A packet sent to ID A.B gets sent to addr from trigger (X.Y, addr) iff:
    A = X (the first k bits of the two IDs are the same), AND
    Y is the closest trigger to B

  Example: Mobile IP
    Just insert a trigger for your new IP address when you move
  Example: Multicast
    Send packets to A.B
    Everybody who wants to listen inserts a trigger on A.B
      (you can have multiple triggers on the same ID)
  Example: Anycast
    Each recipient chooses a random r and inserts a trigger for A.r
    The sender also chooses a random r and sends the packet to A.r
    One recipient for each message, roughly load-balanced

  What about large-scale multicast (many recipients)?
    Let a trigger contain another ID, so we can have (ID1, addr) or (ID1, ID2)
    This is enough to let us set up a multicast tree
  What about service composition?
    E.g., want to apply error-correcting codes to multiple flows
    Allow packets to contain stacks of IDs, not just a single target
    Triggers can now map an ID onto a stack of IDs
    Route a packet towards its first ID
      If you find a matching trigger, push its contents onto the front of the stack
      If you don't (meaning you are at the node that handles the ID), pop the front
    So by pushing non-existent IDs, you can achieve source routing
    And by putting IDs in triggers, you can achieve service composition

Are O(log N) state & messages the best we can do?
  Actually, we could potentially have O(1) state & O(log N) messages
    Instead of halving the distance each time, shift in bits from the right:
      E.g., node x knows node 2x and node 2x+1
      (Actually need to compensate for the fact that not every ID exists)
  But for the system to avoid partitions, nodes need Omega(log N) contacts
    Otherwise, if you lose 1/2 of the nodes, people will be cut off
  Also, state is cheap (except for maintenance) while lookups are expensive
  Turns out you can have O(log N) state and O(log N / log log N) hops
    E.g., shift in O(log log N) bits at a time

Many so-called Distributed Hash Tables (DHTs) have been proposed
  A complete DHT algorithm has to:
    1. Define IDs and the assignment of document IDs to node IDs
    2. Define per-node routing table contents
    3. Specify the lookup algorithm that uses the routing tables
    4. Specify the join procedure that reflects new nodes in the tables
    5. Recover from failures of nodes
       (e.g., studies have shown node half-lives can be just hours)

Let's talk in more detail about Pastry, which is in the textbook
  Next class we will see how Pastry is used for multicast

Assignment of key IDs to node IDs:
  IDs are 128-bit numbers
  Node IDs chosen by an MD5 (like SHA-1) hash of the IP address
  A key is stored on the node with the numerically closest ID
  If node and key IDs are uniform, we get reasonable load balance

Routing:
  The query is at some node
  The node needs to forward the query to a node "closer" to the key
  Note that the key ID and node ID share some high bits of ID prefix
  Routing forwards to a node that shares more bits of prefix with the key
    Sharing more bits of prefix is the definition of "closer"
    We're descending a b-ary tree, one digit at a time
    Not the same as "numerically closer" -- there are edge effects
  How can we ensure every node always knows some node "closer" to any key?
    I.e., what's the structure of the Pastry routing table?
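(Before answering, a quick sketch of the two notions of "closer" just
mentioned -- shared prefix length vs. numeric distance -- written in Python
with the toy parameters used in the example below: 2 bits per digit, 3 digits
per ID. The helper names are hypothetical.)

    BITS_PER_DIGIT = 2                 # real Pastry uses 4
    NUM_DIGITS = 3                     # real Pastry uses 128/4 = 32
    BASE = 1 << BITS_PER_DIGIT

    def digits(ident):
        # ID as a list of base-4 digits, most significant first.
        return [(ident >> (BITS_PER_DIGIT * i)) & (BASE - 1)
                for i in reversed(range(NUM_DIGITS))]

    def shared_prefix_len(a, b):
        # Number of leading digits the two IDs agree on ("prefix closeness").
        n = 0
        for da, db in zip(digits(a), digits(b)):
            if da != db:
                break
            n += 1
        return n

    def numerically_closer(a, b, key):
        # True if a is numerically closer to key than b is.
        return abs(a - key) < abs(b - key)

    # The two notions can disagree.  With key 123 (base 4), node 103 shares
    # one leading digit with the key while node 200 shares none, yet 200 is
    # numerically closer (123, 103, 200 are 27, 19, 32 in decimal).
    key, n103, n200 = int("123", 4), int("103", 4), int("200", 4)
    print(shared_prefix_len(n103, key), shared_prefix_len(n200, key))   # 1 0
    print(numerically_closer(n200, n103, key))                          # True

Pastry's routing table corrects prefix digits; its leaf set handles the cases
where the two notions disagree, as described next.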
Routing table contents
  Divide the ID into digits
  One routing table row per digit position
  Each row contains one column per digit value
  An entry refers to a node with the same prefix, but a different value for this digit
    "Refers" means it contains the node's ID and IP address

  Example: 2 bits per digit, 3 digits per ID.  ID of this node: 102

    022  102  223  312
    102  113  122  130
    100  101  102  103

  This node (102) shows up in each row.
  [Note real Pastry uses 128-bit IDs, 4-bit digits.]
  There may be many choices for each entry (e.g., 021 rather than 022)
    We can use *any* node with the right prefix
    In contrast, Chord required a specific predecessor node
    With Pastry's flexibility, we can select nodes with low network latency

To forward with the routing table:
  Row index: position of the highest digit in which the key disagrees with our ID
  Column index: value of that digit
  Send the query to that node
  If we reach the end of the key ID, we're done: key ID == node ID
  This takes log_b(N) messages

But what if no node exists for a particular routing table entry?
  There are more possible IDs than there are nodes
    So we won't be able to fill all routing table entries
    E.g., maybe node 103 does not exist
    Indeed, only about log_b(N) rows are likely to be populated
  We can't correct digits: what do we do now?
    Forward to a node that's numerically closer
    We may happen to know one in our routing table
  What if the routing table contains no numerically closer node?
    Perhaps we are the right node!
    But maybe not -- maybe the node closest to the key doesn't share many bits with us,
      so it isn't in our table
    Suppose 113, 122, and 130 didn't exist, and the key is 123
      Prefix routing might find node 103
      But there may be a node 200, which is numerically closer to 123
  How can we know if we're closest, or forward to the closest?
    Easy if each node knows its numeric neighbor nodes
    Each node maintains a "leaf set"
      Refers to the L nodes with numerically closest IDs, L/2 on each side of us
    Now we can tell if a node on the other side of the key is numerically closer
      So node 103 would know about node 200
  That's the complete routing story.
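To tie the pieces together, here is a minimal local sketch (Python) of
Pastry-style forwarding with the notes' toy parameters: a static set of nodes,
no joins or failures, and corner cases such as ID-space wraparound glossed
over. It is not the real Pastry code, and it spells out two details from the
Pastry paper that the notes skip: the leaf-set range is checked before the
routing table, and the rare-case fallback hop must share at least as long a
prefix with the key as the current node does (this is what guarantees
progress).

    BITS, NDIGITS = 2, 3               # real Pastry: 4-bit digits, 128-bit IDs
    BASE = 1 << BITS
    L = 4                              # leaf set size, L/2 per side

    def b4(s):                         # "102" -> integer ID
        return int(s, 4)

    def digits(n):                     # ID as base-4 digits, high digit first
        return [(n >> (BITS * i)) & (BASE - 1) for i in reversed(range(NDIGITS))]

    def show(n):
        return "".join(map(str, digits(n)))

    def shared_prefix(a, b):           # number of leading digits that agree
        k = 0
        for da, db in zip(digits(a), digits(b)):
            if da != db:
                break
            k += 1
        return k

    class Node:
        def __init__(self, ident, all_ids):
            self.id = ident
            ring = sorted(all_ids)
            i = ring.index(ident)
            clockwise = ring[i + 1:] + ring[:i]
            # Leaf set: the L/2 numerically closest nodes on each side.
            self.leaves = set(clockwise[:L // 2] + clockwise[-(L // 2):])
            # Routing table: row = shared prefix length, column = next digit.
            # Any node with the right prefix would do; keep the first seen.
            self.table = [[None] * BASE for _ in range(NDIGITS)]
            for o in all_ids:
                if o != ident:
                    r = shared_prefix(ident, o)
                    c = digits(o)[r]
                    if self.table[r][c] is None:
                        self.table[r][c] = o

        def next_hop(self, key):
            # Return the next node for key; self.id means we should keep it.
            group = self.leaves | {self.id}
            if min(group) <= key <= max(group):
                # Key falls within our leaf set's range: the numerically
                # closest member (possibly ourselves) is the destination.
                return min(group, key=lambda n: abs(n - key))
            r = shared_prefix(self.id, key)    # first digit that disagrees
            c = digits(key)[r]
            if self.table[r][c] is not None:
                return self.table[r][c]        # fix one more digit of prefix
            # Rare case: no such node.  Forward to any known node that is
            # numerically closer and shares at least as long a prefix.
            known = self.leaves | {x for row in self.table for x in row if x}
            cands = [n for n in known
                     if shared_prefix(n, key) >= r
                     and abs(n - key) < abs(self.id - key)]
            return min(cands, key=lambda n: abs(n - key)) if cands else self.id

    # The notes' example: 113, 122, and 130 do not exist, but 200 does.
    ids = [b4(x) for x in ("022", "100", "101", "102", "103", "200", "223", "312")]
    nodes = {i: Node(i, ids) for i in ids}

    key, cur, path = b4("123"), b4("022"), []
    for _ in range(2 * NDIGITS + 2):           # hop bound, plenty for a sketch
        path.append(cur)
        nxt = nodes[cur].next_hop(key)
        if nxt == cur:
            break
        cur = nxt
    print(" -> ".join(map(show, path)), "holds", show(key))
    # prints: 022 -> 101 -> 103 -> 200 holds 123

In this sketch the query for key 123 ends up at node 200, the numerically
closest node, because 103's leaf set knows about 200 -- matching the story
above.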