Consistent hashing
  Say you want to cache web pages on a number of machines.
  Hash the URL, and choose a machine based on the hash:
    e.g., machine = hash % nservers
  What happens when you add or remove machines?
    If the URL -> server mapping changes completely, most cached data becomes invalid.
  What happens if a client hasn't yet heard of a new machine?
    It could be using an entirely wrong hash function.
  Consistent hashing: only invalidate data in proportion to the reconfiguration.
    Adding one new server should invalidate only about 1/N of the data.

Now, say you want to build a gigantic data cache.
  So big that each machine only knows about a small fraction of the other machines.
  So big that nodes come and go all the time -- nodes owned by many people.
  Napster approach: a centralized server coordinates; it can be shut down.
  Gnutella approach: ad-hoc configuration, broadcast queries.
  Better approach:
    Give each node a particular ID (e.g., randomly chosen).
    Have a rule for assigning keys to nodes based on node/key ID,
      e.g. key X goes on the node with the ID nearest to X.
    That node stores the data for X (e.g., Napster-style index data, not necessarily the mp3 file itself).
  Now, given X, how do we find that node?
    Arrange the nodes in an ID space, based on ID, i.e. use the ID as coordinates.
    Examples: 1D line, 2D square, tree based on bits, hypercube, or ID circle.
    Build routing tables to allow navigation of the ID space.
      Each node knows about its ID-space neighbors.
      Perhaps each node also knows a few farther-away nodes,
        to move long distances quickly.
  What does a complete algorithm have to do?
    1. Define IDs, and the assignment of document IDs to node IDs.
    2. Define per-node routing table contents.
    3. Lookup algorithm that uses the routing tables.
    4. Join procedure to reflect new nodes in the tables.
    5. Failure recovery.

The "Pastry" peer-to-peer lookup system
  By Rowstron and Druschel
  http://www.research.microsoft.com/~antr/pastry/
  An example system of this type.
  You've already read about Chord (in the CFS paper), which is similar.

Assignment of key IDs to node IDs?
  IDs are 128-bit numbers.
  A key is stored on the node with the numerically closest ID.
  Node IDs are chosen by (e.g.) hashing the node's IP address.
  If node and key IDs are uniform, we get reasonable load balance.

Routing?
  A query is at some node.
  The node needs to forward the query to a node "closer" to the key.
  Note that the key ID and the node ID share some high-order bits -- a common prefix.
  Routing forwards to a node that shares more bits of prefix with the key.
    Sharing more bits of prefix is the definition of "closer" here.
    We're descending a b-ary tree, one digit at a time.
    Not the same as "numerically closer" -- edge effects.
  How can we ensure every node always knows some node "closer" to any key?
    I.e., what's the structure of the Pastry routing table?

Routing table contents
  Divide the ID into digits.
  One routing table row per digit position.
  Each row contains one column per digit value.
  Each entry refers to a node with the same prefix up to this digit, but a different value for this digit.
    "Refers" means it contains the node's ID and IP address.
  Example: 2 bits per digit, 3 digits per ID.  ID of this node: 102.
    row 0:  022  102  223  312
    row 1:  102  113  122  130
    row 2:  100  101  102  103
  This node (102) shows up in each row.
  There may be many choices for each entry (e.g., 021 rather than 022).
    We can use a random choice: the entry just needs to correct that digit.
  Note that real Pastry uses 128-bit IDs.

To forward with the routing table:
  Row index: the position of the highest-order digit in which the key disagrees with our ID.
  Column index: the key's value for that digit.
  Send the query to that entry's node.
  If we reach the end of the key ID (no digit disagrees), we're done: key ID == node ID.
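
  A minimal sketch of this forwarding rule, using Python as executable pseudocode.
  It assumes the 2-bit-digit, 3-digit IDs of the example above; routing_table[row][col]
  is assumed to hold a (node ID, IP address) pair or None, and the helper names are made
  up for illustration.  Real Pastry uses 128-bit IDs and also consults a leaf set
  (described below).

    B = 4          # digit base: 2 bits per digit, as in the example above
    NDIGITS = 3    # digits per ID, as in the example above

    def digits(id_):
        """Split an ID into its base-B digits, most significant first."""
        return [(id_ >> (2 * (NDIGITS - 1 - i))) & (B - 1) for i in range(NDIGITS)]

    def next_hop(key, my_id, routing_table):
        """routing_table[row][col] -> (node_id, ip) or None.
        Row = first digit position where the key disagrees with my_id,
        column = the key's digit at that position."""
        kd, md = digits(key), digits(my_id)
        for row in range(NDIGITS):
            if kd[row] != md[row]:
                return routing_table[row][kd[row]]   # may be None if no such node is known
        return None   # all digits agree: key ID == our ID, so we're done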
There's a complete tree rooted at every node.
  It starts at that node's row 0,
    and is threaded through other nodes' row 1, etc.
  Every node acts as a root, so there's no root hotspot.
  This is *better* than simply arranging the nodes in one tree.

But what if no node exists for a particular routing table entry?
  There are more possible IDs than there are nodes,
    so we won't be able to fill all routing table entries.
    E.g., maybe node 103 does not exist.
    Indeed, only about log_b(N) rows are likely to be fillable.
  We can't correct the digit: what do we do now?
    Forward to a node that's numerically closer.
    We may happen to know one in our routing table.
  What if the routing table contains no numerically closer node?
    Perhaps we are the right node!
    But maybe not -- maybe the node closest to the key doesn't share many bits with us,
      so it isn't in our table.
      E.g., the key is 123, we are 102, and node 200 exists...
    How can we know whether we're closest, or forward to the closest?
    Easy if each node knows its numeric neighbor nodes.
  Each node maintains a "leaf set":
    It refers to the L nodes with numerically closest IDs,
      L/2 of them on each side of us.
    Now we can tell whether a node on the other side of the key is numerically closer.
  That's the complete routing story (a code sketch pulling it together appears at the end of this section).

How does a new node acquire correct tables?
  It issues a lookup for its own key via any existing node,
    which finds the numerically closest node.
  It asks that node for its leaf set -- that's almost all of the new node's leaf set.
  We can't use the numerically closest node to help initialize the routing table:
    we might be 133 and it might be 200; no help.
  Instead, get complete tables from each node on the path of the lookup.
    These are nodes that share more and more of our prefix.
    We can use the first node's first row directly,
      and the second node's second row, since it agrees with us in the first digit.
  At this point we can forward queries correctly.
  Updating other nodes' tables to reflect the new node:
    Other nodes' leaf sets are easy -- we know who they are, so we can tell them.
    Other nodes' routing tables?
      They are either correct, or missing an entry that we could fill.
      Send our state to each node mentioned in our routing table.
  Does routing *to* us now work?
    We can't fix the routing tables of all nodes that could refer to us.
    If all else fails, a query will go where it would have gone before we joined,
      i.e. to the existing node numerically closest to us.
    That node will have us in its leaf set.

What about node failures?
  Assume nodes fail without warning -- strictly harder than graceful departure.
  Two issues:
    Other nodes' routing tables refer to the dead node.
    Other nodes' leaf sets refer to the dead node.
  For routing tables: detect the timeout, and treat it as an empty table entry,
    i.e. route to a numerically closer entry instead.
    Repair: ask any node on the same row for a copy of its corresponding entry,
      or any node on the rows below -- all of these share the right prefix.
  For leaf sets:
    The failed node might have been the one closest to the key ID!
    We need to know the next-closest.
    That's why the leaf set knows of more than just the two closest nodes.
    We can route around the failure, and easily update the leaf set.
  The system is effectively self-correcting:
    It's easy to verify the correctness of your own routing table (just look at the prefixes),
    and easy to keep yourself in other nodes' leaf sets.

What kinds of failures can the system withstand?
  If all else fails, it routes numerically using leaf sets.
  This only fails if some node has lost all of its leaf set entries.
  Assume independent node failures...
    The probability of all L entries failing simultaneously is p^L,
      where p is the probability of any one node failing.
    Pretty small for reasonable values of L (e.g., p = 0.1 and L = 16 gives 10^-16).
  Independent node failure?
    Requires that nodes with nearby IDs not fail together,
      i.e. that they are not in the same room/building/backbone.
    Probably true if nodeID = hash(IPAddress),
      and if the system is geographically big.
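
  As promised above, a sketch pulling together the complete routing decision:
  leaf set first, then the routing table, then any numerically closer node.
  Python again; next_hop() is from the earlier sketch, leaf_set is assumed to be a
  list of (node ID, IP) pairs, and this is a simplification of the paper's algorithm
  (in particular it ignores wraparound at the edges of the ID space).

    def route_step(key, my_id, routing_table, leaf_set):
        """Return the (node_id, ip) to forward to, or None if we are responsible for key."""
        # 1. If the key lies within the range covered by our leaf set, deliver it
        #    to the numerically closest node among ourselves and the leaf set.
        candidates = [(my_id, None)] + leaf_set
        closest = min(candidates, key=lambda n: abs(n[0] - key))
        leaf_ids = [n for n, _ in leaf_set]
        if leaf_ids and min(leaf_ids) <= key <= max(leaf_ids):
            return None if closest[0] == my_id else closest

        # 2. Otherwise, try to correct the next digit via the routing table.
        hop = next_hop(key, my_id, routing_table)   # from the sketch above
        if hop is not None:
            return hop

        # 3. No usable table entry: forward to any known node that is numerically
        #    closer to the key than we are (the leaf set guarantees progress).
        if abs(closest[0] - key) < abs(my_id - key):
            return closest
        return None   # we are the closest node we know of: we're responsible for the key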
Locality
  A lookup takes log_b(n) messages.
  But they go to essentially random nodes on the Internet!
    They will often be very far away.
  Can we route through nodes close to us in the underlying network?
  This boils down to whether we have choices:
    If there are multiple correct next hops, we can try to choose the closest one.
    We do have choices for the routing table:
      *any* node with the correct prefix will do for each entry,
      so we can pre-choose nearby entries.
    But there's no choice for leaf sets.
  Pastry continuously adjusts routing table entries for locality:
    Ask the node in a current entry for that node's complete tables.
    Ping suitable nodes from the other node's tables.
    Use them instead of the current entry if the ping says they're closer.
    This works OK even if other nodes don't have local entries.
    Since other nodes do this too, even the initial table will be reasonable,
      assuming the initial node contacted is close to the new node.
  What's the effect?
    Individual hops are lower latency.
    But there's less and less choice (lower node density) as you get close in ID space,
      so the last few hops are likely to be very long.
    Thus you don't *end up* close to the initiating node; you just get there quicker.
  What does this assume about the underlying network topology?
    It works best if: X close to Y and Y close to Z implies X close to Z.
    Not always true: DSL, NYU, MIT (the DSL line is far from MIT).
  Any downside to locality routing?
    Harder to argue that failures are independent.
      Maybe no big deal, since there's no locality for leaf sets.
    Easier to trick me into using malicious nodes in my tables.

Costs
  Lookup and join: log(n) messages.

What about anonymity?
  Maybe client anonymity is achievable through e.g. mix-net routing of queries.
  But this system is hostile to server anonymity:
    The whole point is to store keys at predictable nodes,
    so it's easy for an attacker to target the node holding an undesirable key.

Open issues:
  Concurrent join/fail.
  Failure model: independent, fail-stop.
  Network partition?  Will a partition ever heal?
  Malicious nodes?
  Non-transitive Internet connectivity?
    Others think a node is my leaf, but I can't reach it.
  Join/leave rate.

Relationship to Chord
  Pastry's routing table is similar to Chord's finger table.
  Pastry's leaf set is similar to Chord's successor list.
  Both derive correctness from linear progress in ID space,
    via leaf sets or successors.
  The routing/finger table is just an optimization for log(n) speed:
    it's a verifiable hint.
  Chord/Pastry don't store any data!
    Are their properties useful for storage applications?
    I.e. robustness / always routing to the "correct" node despite failures.

CFS
===

Let's use the efficient location primitives (Chord, CAN, Pastry) to build a
file system to distribute a large, popular software distribution.

Assume this API for the lookup primitives:

  IP_address lookup (key k);

where the key is a large integer.
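A sketch of how a client might use this primitive (Python as pseudocode; the lookup()
stub stands in for the DHT's routine, and the wire format and port number are made up
for illustration, not part of any real API):

  import socket

  def lookup(k: int) -> str:
      """The assumed DHT primitive: return the IP address of the node
      responsible for key k.  (Stub -- provided by Chord/Pastry/CAN.)"""
      raise NotImplementedError

  def fetch(k: int) -> bytes:
      """Find the responsible node, then ask it for the data stored under k."""
      ip = lookup(k)
      with socket.create_connection((ip, 4000)) as s:   # port chosen arbitrarily
          s.sendall(k.to_bytes(20, "big"))              # e.g. a 160-bit key
          chunks = []
          while chunk := s.recv(4096):
              chunks.append(chunk)
          return b"".join(chunks)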
What are we trying to accomplish?
---------------------------------
- load balance
    Q: Why distribute?  Why not just buy a single machine?
    A: A single machine must have the resources (bw/cpu) to serve the peak load.
       As a result, average utilization will be low; most of the investment is wasted.
    Q: What can we do to improve the utilization of this machine's resources?
    A: Serve multiple content sources w/ independent load (in time)
       + each source's creator pays for the resources to serve its fraction of the load
       + problem: content creators must have administrative connections
    We want: many machines, each serving some fraction of the _average_ load.
    Also: what about balancing space?  Between hosts w/ different capacities?
- performance
    Same as TCP to... the median server?  the mean?
- reliability
    - Tolerate node failures (at what rate?)
    - lookup failures (handled by the primitive)
    - data loss --> arrange for 'good enough' reliability at reasonable cost
    - what is enough: at least as good as a single server?  better?
      + a power outage at NYU gives a single server roughly 99 percent uptime
      + 'five nines' (99.999%) -> five minutes down per year
x anonymity/authentication?
    - anonymity is ignored by CFS
    - it conflicts with the performance / reliability goals of the primitives
x indexing/keyword search?
    - a different class of application
==> Load balance is the main force in designing this system.

How will we achieve these goals?
--------------------------------
- Need a different API.  CFS provides the DHASH layer:
    void insert (key k, void *data);
    void *data = lookup (key k);
  - Is it really a layer?  Example of why it isn't: successor placement --
    the lookup is really performed in DHASH.
    --> As we design CFS we are really building DHASH: CFS is really all about DHASH.
        Chord's main job is maintaining finger tables.
- basic DHASH operation
    * to insert data d under key k:
        1. get IP = lookup (k);
        2. contact IP, send k, d
    * to fetch:
        1. get IP = lookup (k);
        2. contact IP, send k, get d back
- We will use this interface and extend it to meet our design goals.
- load balance:
    - striping
      + Large files are distributed across a large number of nodes.
      + Imposes the cost of one lookup per block (i.e. many lookups per file).
      + insert is: for (all blocks) insert (NAME (block[i]), block[i]);
      + How to track the stripes?  Akin to forming files from disk sectors
        --- SFSRO file system layout: merkle tree
      + Can add block caching for small files.
        - How to modify DHASH to do this?
          --> To cache along the lookup path, DHASH must do the lookups itself,
              or be informed of the path.
          --> recursive v. iterative operation?
    - (whole file) caching
      + employed by Freenet/Napster/Gnutella
      + distributes files around the network as they are requested
      + only one lookup per file
      + insert is: insert (NAME (file), file);
    ---> Now we just need to get DHASH right (add reliability) and we have a file system.
    --> Which is better (whole-file or block oriented)?
        - striping imposes more lookups (requires prefetch),
          but uses less disk space to achieve the same level of load balance
        - whole-file caching reduces lookup overhead
    - dealing w/ heterogeneous node capacities:
        - virtual servers (CFS)
        - redirection (Pastry)
- reliability
    - We need replication if we are to tolerate failures.
    - Where to place replicas?
        - need: replica failures must be independent
        - successors (Chord specific)
        - rehashing: replica 2 = SHA(SHA(key));
    - How to modify DHASH to provide reliability?
        - needs to place replicas
        - needs to recognize failure and know how to recover
        --> must be involved in the lookup (DHASH knows where the replicas are located)
        --> Chord only provides routing table information to DHASH, which does the lookups
    - How does striping affect reliability?
        - All blocks of a file must be found to reconstruct the file.
            p(finding a block) = 1 - p(all its replica servers are down)   [assume no lookup failures]
                               = 1 - l^r
                where l is the probability a server is down, and r is the replication degree
            p(finding a file)  = p(finding a block)^B,   where B is the # of blocks
                               = (1 - l^r)^B
                               ≈ 1 - B*(l^r)
          ==> Easy to drive this toward one by increasing r
              (a numeric check appears at the end of this section).
- performance
    - How does performance depend on the load balance scheme?
        - whole file: speed will equal that of TCP between client and server
        - striping: problem: lookup latency leads to poor performance;
          with prefetch, lookup latency is hidden by the block fetches
    - interaction w/ the network protocol?
    - What about server selection?
        Choose 'close' servers to fetch blocks from (see replication, above).
        - Explain Chord server selection?
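
A quick numeric check of the availability estimate above, in Python.  The values
l = 0.1, r = 3, B = 1000 are arbitrary illustrations, not numbers from the CFS paper.

  l, r, B = 0.1, 3, 1000           # assumed: per-server down probability, replicas, blocks per file
  p_block = 1 - l**r               # probability a given block is retrievable
  p_file_exact  = p_block ** B     # all B blocks retrievable
  p_file_approx = 1 - B * l**r     # first-order approximation from the notes

  print(p_block)                   # 0.999
  print(p_file_exact)              # ~0.37 -- r = 3 is NOT enough for B = 1000 blocks
  print(p_file_approx)             # ~0.0  -- the approximation breaks down once B*l^r ~ 1

  r = 6                            # bump r: B * l^r = 1000 * 1e-6 = 1e-3
  print((1 - l**r) ** B)           # ~0.999, i.e. increasing r drives availability toward 1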
- What about authentication/naming?
    - possible primitives for authentication:
        - public keys (PK)
        - MACs
        - content hashes
    - CFS uses SFSRO-style authentication: a merkle tree of content-hash blocks
      (a small sketch appears at the end of these notes)
    - nit: how to handle ".."?  (show the circular dependency)
- anonymity?
    - open question
    - Is the system useful w/o anonymity?
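
Finally, a minimal sketch of SFSRO/CFS-style content-hash naming (the merkle-tree
idea above), in Python.  The block layout and the in-memory 'store' dict standing in
for DHASH are illustrations, not the real CFS format.

  import hashlib

  def key_of(data: bytes) -> bytes:
      """A block's key is the hash of its contents, so any fetched
      block can be verified against the key used to look it up."""
      return hashlib.sha1(data).digest()

  store = {}   # stand-in for DHASH: insert(k, d) / lookup(k)

  def insert_file(blocks):
      """Insert the data blocks, then an 'inode' block listing their keys.
      The inode's own key authenticates the whole file (merkle style)."""
      child_keys = []
      for b in blocks:
          k = key_of(b)
          store[k] = b
          child_keys.append(k)
      inode = b"".join(child_keys)
      root = key_of(inode)
      store[root] = inode
      return root          # publish this root key (e.g. signed under a public key)

  def fetch_file(root):
      inode = store[root]
      assert key_of(inode) == root             # verify the inode block
      keys = [inode[i:i+20] for i in range(0, len(inode), 20)]
      blocks = []
      for k in keys:
          b = store[k]
          assert key_of(b) == k                # verify each data block
          blocks.append(b)
      return b"".join(blocks)

  # usage: root = insert_file([b"hello ", b"world"]); fetch_file(root) == b"hello world"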