Coral
=====

Consistent hashing
  Say you want to cache web pages on a number of machines
  Hash the URL, and choose a machine based on the hash:
    e.g., machine = hash % nservers
  What happens when you add and remove machines?
    If the URL -> server mapping changes completely, most cached data becomes invalid
  What happens if a client hasn't yet heard of a new machine?
    It could be using an entirely wrong hash function
  Consistent hashing: only invalidate data in proportion to the reconfiguration
    Adding one new server should invalidate only ~1/N of the data
  How to implement consistent hashing?
    Give each node a number of randomly chosen IDs (each, e.g., a 160-bit number)
    Compute key = Hash(URL), which lives in the same ID space
    Approach 1: Map IDs onto a circle [Chord]
      Store the URL at the node whose ID follows the key on the circle (its successor)
      [Go over what happens when you add or remove a node]
    Approach 2: Arrange node IDs in a tree of height 160 [Kademlia]
      Start with the most significant bit of the URL hash (the key)
      Go down the tree, choosing the left or right branch based on the next bit of the key
      If only one branch exists, just choose that one
      In other words, find the node whose ID x minimizes key XOR x
      [Go over what happens when you add or remove a node]
    (Both closeness metrics are illustrated in the sketch below)

Now, say you want to build a gigantic data cache
  So big that each machine only knows about a small fraction of the other machines
  So big that nodes come and go all the time--nodes owned by many people
  Napster approach--centralized server coordinates; can be shut down
  Gnutella approach--ad-hoc configuration, broadcast queries
  Better approach: use consistent hashing for *routing* as well
    Store data, as before, on the node "closest" to the key in ID space
      Chord: the closest node is the closest clockwise around the circle
      Kademlia: the closest node is the closest treating ID XOR as distance
    Build routing tables to allow ID-space navigation
      Each node knows its ID-space neighbors
      Perhaps each node also knows a few farther-away nodes
        To move long distances quickly
      For n nodes, can do this with log n RPCs and log n state at each node
    These data structures are called distributed hash tables (DHTs)
  How would you do this in the Chord circle?
    Each node must know the next node around the circle (its successor)
    Each node should also know the nodes almost 1/2 of the way around the circle, 1/4 of the way, etc.
      This is called the "finger table"
    To build it, ask the closest node you know about for its routing table
  How would you do this in the Kademlia tree?
    If a node's ID is b_0 b_1 b_2 ..., the node must know:
    - A node with first bit ~b_0 (the other half of the tree)
    - A node with prefix b_0 ~b_1 (the other quarter within this half of the tree)
    - A node with prefix b_0 b_1 ~b_2
    - etc.
    Then you can get one bit closer to the key with each RPC (see the lookup sketch below)
    (Note: to handle unbalanced trees, each node actually must know all
     nodes under some point in the tree with two children.)
  In a very large system, machines will be failing all the time
    If there are N machines and a constant fraction fail, then subsets of
     size < O(log N) stand a good chance of all failing
    Pick a "replication factor" k such that a k-sized set is unlikely to fail simultaneously
    In Chord, you want to avoid "breaking" the circle
      So each node must know its k successors around the circle, in case k-1 of them fail
      Can always re-build finger tables as long as the circle is not broken
    In Kademlia, each node should know up to k nodes with each prefix
     ~b_0..., b_0 ~b_1..., b_0 b_1 ~b_2...
      I.e., k nodes in the other half of the tree, the other quarter of this half, etc.
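Aside: a minimal sketch of the two closeness metrics above, in Python (hypothetical
helper names; a 32-bit ID space instead of 160 bits to keep it readable).  It shows
that adding one server remaps only about 1/N of the keys:

    import hashlib

    def hash_id(s, bits=32):
        # small ID space (2^32) for readability; real systems use 160-bit SHA-1 IDs
        return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** bits)

    def chord_owner(key, node_ids):
        # Chord-style: the first node ID clockwise from the key (its successor)
        candidates = [n for n in node_ids if n >= key]
        return min(candidates) if candidates else min(node_ids)   # wrap around

    def kademlia_owner(key, node_ids):
        # Kademlia-style: the node whose ID minimizes key XOR id
        return min(node_ids, key=lambda n: n ^ key)

    if __name__ == "__main__":
        urls = ["http://example.com/page%d" % i for i in range(10000)]
        ids = [hash_id("server%d" % i) for i in range(10)]
        before = {u: chord_owner(hash_id(u), ids) for u in urls}
        ids.append(hash_id("server10"))            # add one new server
        after = {u: chord_owner(hash_id(u), ids) for u in urls}
        moved = sum(before[u] != after[u] for u in urls)
        print("fraction of keys remapped:", moved / len(urls))   # ~1/11, not ~1.0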
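Aside: a sketch of Kademlia-style routing state and lookup (Python, simplified and
hypothetical: one contact per bucket, every node built locally, no real RPCs or
churn).  Each hop moves to a strictly XOR-closer node, so the lookup converges on
the closest node in roughly log n hops with roughly log n state per node:

    import random

    BITS = 16                       # small ID space for readability

    def build_table(me, all_ids):
        table = {}
        for other in all_ids:
            if other == me:
                continue
            i = (me ^ other).bit_length() - 1   # highest bit where me and other differ
            # keep one (arbitrary) contact per bucket; real Kademlia keeps up to k
            table.setdefault(i, other)
        return table

    def lookup(start, key, tables):
        node, hops = start, 0
        while True:
            # jump to the XOR-closest node among yourself and your contacts;
            # stop when none of your contacts is closer to the key than you are
            best = min([node] + list(tables[node].values()), key=lambda n: n ^ key)
            if best == node:
                return node, hops
            node, hops = best, hops + 1

    if __name__ == "__main__":
        ids = random.sample(range(2 ** BITS), 200)
        tables = {n: build_table(n, ids) for n in ids}
        key = random.randrange(2 ** BITS)
        owner, hops = lookup(ids[0], key, tables)
        assert owner == min(ids, key=lambda n: n ^ key)   # reaches the globally closest node
        print("closest node", owner, "found in", hops, "hops")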
  Must do *maintenance* to ensure you learn of dead nodes
  Note: an advantage of Kademlia is symmetry
    You receive RPCs from nodes in the same distribution as you send them to
    So you will often already know that the nodes in your routing table are alive
    Chord goes clockwise around the circle, so incoming RPCs are less useful
  Optimization
    Note that Kademlia lets you contact *any* node in the appropriate subtree
    So for better performance, cache nodes with low round-trip time (RTT)
    Note: failure causes RPC timeouts, which can seriously hurt latency
      Can trade off bandwidth for latency with concurrent requests
  Digression: bit shifting vs. bit correcting
    With bit shifting, you can actually have a constant-size routing table
      E.g., the node with ID x knows the nodes with IDs 2x and 2x+1
    But this makes failure recovery and optimization harder
  Note: Kademlia is used by trackerless BitTorrent
    Seems to scale to 1,000,000+ nodes

So what is the goal of the Coral paper?
  Most web sites have peak usage many times their average
  Pool resources to absorb flash crowds

Straw-man 1: put web objects into Kademlia
  Compute key = Hash(URL); value = contents of the web object
  Store (key, value) in Kademlia
  This just shifts the problem; nodes in Kademlia are now subject to flash crowds
  So at a minimum, should use Kademlia only for metadata

Straw-man 2: a network of web proxies
  When you have a URL's data, record that fact in Kademlia
  When you want a URL, look up Hash(URL) in Kademlia and ask a proxy that has it
  If no one has it, then get the data from the origin web server
  Problems?
  1. How to map clients to nearby proxies?
     E.g., sending a client in Asia to a proxy in the USA is very bad for performance
  2. If there are many copies, which one(s) should you download from?
  3. Could require slow RPCs to find nearby copies
     E.g., a USA node must query an Asia node to find a USA copy of the data
  4. Could overload Kademlia nodes just with metadata requests
     Imagine everyone storing the same key to Kademlia

Go over the big picture in Figure 1
  Note in particular the usage models through URL re-writing

What is the Coral abstraction (p. 3, Sec. 2.3)?  Simplified:
  put (key, val, ttl) - stores as expected
  get (key) - returns some *subset* of the values stored
    Idea: returns values stored by nodes closer to the requester
  nodes (level, target) - returns some other nodes in the cluster at that level
  levels () - returns the configuration
  (A toy sketch of put/get and the "full" behavior appears at the end of these notes)

Questions 1-3 are addressed by clustering.  How does clustering work?
  Keeps the benefits of the Kademlia optimizations (can bound timeouts by the level's RTT)

How to avoid hot spots in the indexing infrastructure?
  The full/loaded trick

Is the evaluation convincing?
  Break into 4 groups to discuss questions 1-4 in Section 6 (p. 8)
  Each group comes up with one question *not* addressed by the paper
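Aside: a toy, single-process model of Coral's sloppy put/get and the "full" idea
(hypothetical names and parameters; real Coral's two-phase put runs over the network
and also considers how *loaded* a node is, not just whether it is full).  Each node
stores at most a few live values per key, so a popular key's values spread across
several nearby nodes, and get() only promises some subset of what was put:

    import time

    MAX_VALS = 4                  # per-key capacity before a node counts as "full"

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.store = {}       # key -> list of (value, expiry_time)

        def live(self, key):
            # drop values whose TTL has expired, return what's left
            now = time.time()
            vals = [(v, e) for (v, e) in self.store.get(key, []) if e > now]
            self.store[key] = vals
            return vals

    def put(nodes, key, value, ttl):
        # walk toward the key (XOR-closest node first) and store at the first
        # node with room; once the nodes nearest a hot key fill up, later puts
        # land one step farther out, spreading the key's values around
        for node in sorted(nodes, key=lambda n: n.id ^ key):
            if len(node.live(key)) < MAX_VALS:
                node.store.setdefault(key, []).append((value, time.time() + ttl))
                return node.id
        return None               # every node full for this key

    def get(nodes, key):
        # return whatever the XOR-closest node holding the key has -- some
        # subset of everything ever put, which is all the interface promises
        for node in sorted(nodes, key=lambda n: n.id ^ key):
            vals = node.live(key)
            if vals:
                return [v for (v, _) in vals]
        return []

    if __name__ == "__main__":
        nodes = [Node(i * 37 % 256) for i in range(16)]   # arbitrary small IDs
        key = 200
        for proxy in range(10):                           # 10 proxies cache one URL
            put(nodes, key, "proxy-%d" % proxy, ttl=60)
        print(get(nodes, key))    # at most MAX_VALS of the 10 stored pointers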