Scalable, Distributed Data Structures for Internet Service Construction
Gribble, Brewer, Hellerstein, Culler
OSDI 2000

What is the purpose of this system?
  "Internet services" -- e.g., CGI scripts need stable storage, but web servers crash.
  Separate web-server functionality (DDS clients) from storage "bricks" (Fig 2).

What's the API their system (a library) provides to apps (services)?
  put(key, value)
  get(key) -> value

Why is that a useful API? What interesting properties does their implementation have?
  Highly available. High performance.
  Takes care of replication and recovery.

How does this compare to an RDBMS?
  RDBMS has ACID (atomicity, consistency, isolation, durability).
  DDS doesn't offer full transactions (can't transfer money between accounts).
  RDBMS supports SQL queries; DDS has a much simpler interface.

How does this compare to a file system?
  FS has weaker consistency.
  FS has less structured data than DDS.

Straw-man design: clients, services/DDSlibs, bricks.
  Suppose we have 100 bricks. brick = hash(key) % 100. We're done. Very scalable!

Suppose we want to support adding new bricks?
  Want to re-partition data over the new set of bricks.
  Need a level of indirection to map keys to bricks.
  This is the DP (data partitioning) map.
  The services/DDSlibs have to have the DP map to choose the right brick.
  (A client-side lookup sketch appears after these notes.)

How do we make sure everyone agrees on the DP map?
  The service/DDSlib checks on every operation.
  How do bricks agree on the DP map among themselves? I don't think the paper says.

How can we support replication?
  Have DP map entries point to sets of bricks. This is the RG (replica group) map.

What consistency model do they promise?
  They claim one-copy equivalence (p. 3).
  Presumably the same as single-copy serializability.
  So they must make sure operations happen one at a time per value,
  and they must have some all-or-nothing replica write story.

Example 1: Two replicas of x, currently zero. DDSlib D1 writes 1.
Example 2: D1 writes 2. Concurrently, D2 writes 3.
Example 3: Concurrent: D1 writes, D2 reads. D2's read happens between prepare and commit.
Example 4: A replica crashes during a write.
  p. 5: "If a replica crashes during a two-phase commit... continue onward"
Example 5: A replica crashes just before a read.
Example 6: D1 wants to write 4. D1 crashes before sending any commit.
Example 7: D1 wants to write 5. D1 crashes after sending just one commit.
  What does the other replica do? What happens if we now read?

Is this really two-phase commit?
  2PC should block when unsure; DDS bricks never block permanently.
  Reads don't participate in the atomicity plan, i.e., they aren't ordered.

When does the service/DDSlib ACK to the client? Before or after sending commit messages?

Suppose a brick reboots?
  It assumes it missed some put()s, so its replicas are worthless.
  So act just as if adding a totally new brick:
    Pick a replica dataset.
    Ask one of its owners to lock it (so no put()s).
    Copy the whole replica.
    Unlock, and add the brick to the RG map.

What if there's a power failure?
  Can we fix this by putting replicas in different buildings?
What if the network partitions?

Example 8: Like Example 7, but the replica that received the commit crashes after the read.
  Then the client reads again.

What if one particular piece of data is popular?

Why don't you need multi-object transactions?

Let's build a Hotmail back-end with DDS. Tons of users.

Version 1:
  Hash the user name to a 64-bit key.
  Store all of a user's mail in one hash-table entry.
  So message arrival is get(), append the new message, put().
  (See the delivery sketch below.)
  What if two messages arrive at the same time?
    No locking, no atomic multi-operation transactions.
    So one message will be lost.
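
Referenced from the DP-map notes above: a minimal client-side lookup sketch in Go.
The names (DPMap, RGMap, whichBricks) and the flat modulo table are made up for
illustration; the paper's DP map is a trie keyed on bits of the hash so partitions
can split, which this sketch ignores. It only shows the two levels of indirection:
key -> hash -> replica group -> bricks.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Hypothetical maps: the DP map sends a key's hash to a replica group
    // name; the RG map sends a replica group name to its current bricks.
    type DPMap []string            // index = hash(key) % len(dp) -> replica group id
    type RGMap map[string][]string // replica group id -> brick addresses

    // whichBricks sketches the client-side (DDSlib) lookup: hash the key,
    // consult the DP map for the owning replica group, then consult the
    // RG map for that group's bricks.
    func whichBricks(key string, dp DPMap, rg RGMap) []string {
        h := fnv.New64a()
        h.Write([]byte(key))
        group := dp[h.Sum64()%uint64(len(dp))]
        return rg[group]
    }

    func main() {
        dp := DPMap{"rg0", "rg1", "rg0", "rg1"} // 4 partitions over 2 replica groups
        rg := RGMap{
            "rg0": {"brick1:4000", "brick2:4000"},
            "rg1": {"brick3:4000", "brick4:4000"},
        }
        fmt.Println(whichBricks("alice", dp, rg)) // prints one group's brick list
    }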
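
A minimal Go sketch of the version-1 arrival path, to make the lost-update race
concrete. The in-memory DDS type, its Get/Put methods, and hashUser are hypothetical
stand-ins for the library's put()/get(); the point is just that two concurrent
deliveries can both read the same old mailbox, and whichever put() lands second
erases the other's message.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Hypothetical in-memory stand-in for the DDS hash-table API
    // (the real put()/get() would go over the network to bricks).
    type DDS map[uint64][]byte

    func (d DDS) Get(key uint64) []byte        { return append([]byte(nil), d[key]...) }
    func (d DDS) Put(key uint64, value []byte) { d[key] = value }

    func hashUser(user string) uint64 {
        h := fnv.New64a()
        h.Write([]byte(user))
        return h.Sum64()
    }

    // deliverMessage is the version-1 arrival path: read the whole mailbox,
    // append the new message, write the mailbox back. Two concurrent
    // deliveries can both Get the same old mailbox; whichever Put lands
    // second silently erases the other delivery's message -- there is no
    // locking and no multi-operation transaction.
    func deliverMessage(d DDS, user, msg string) {
        key := hashUser(user)
        mbox := d.Get(key)
        mbox = append(mbox, []byte(msg+"\n")...) // naive framing: one message per line
        d.Put(key, mbox)
    }

    func main() {
        d := DDS{}
        deliverMessage(d, "dm", "hello")
        deliverMessage(d, "dm", "world")
        fmt.Print(string(d.Get(hashUser("dm"))))
    }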
Version 2:
  We can tell if we lost a put() race: get() will show some other new message, but not ours.
  In that case, retry. (See the retry sketch at the end of these notes.)
  We're depending on the atomicity of put()s.
  Does this always work? Perhaps both parties' put()s will fail, depending on lock ordering?
  We still probably aren't as good as Porcupine (the big mail server from UW):
    Suppose all of my replicas crash? We can't guarantee to always accept messages.
    We can't show some messages if some bricks are down.

Version 3:
  Give each user a few mailbox fragments, with predictable keys:
    hash(dm-1), hash(dm-2), hash(dm-3), ...
  When a message arrives, append it to a random available fragment.
  Could start a new one if required.
  Reading mail requires probing to see which fragments exist.
  Now we have some soft state: the fragment list.
  Are we as good as Porcupine now?
    Still can't recover from a complete power failure.

Evaluation: what should we conclude from the performance section?
  Garbage collection sucks! (Fig 6)
  Processing packets under load by queueing them looks like a bad idea.
    Maybe they should have read the Mogul & Ramakrishnan livelock paper!
  p. 7: "Read throughput achieved... 5.3Gread/day, 1.2Gwrite/day... adequate to serve
    the hit rates of most popular web sites." So DDS is good?
    But that was an in-core benchmark! Have to look at 5.2.2 -- numbers look smaller, maybe still okay.
    But there's also no replication in the 5.2.2 experiment!
  p. 10, end of 5.2.1: "Eliminating this replication would double the throughput..."
    Do you believe the claim? It would probably more than double.

Does the interface seem useful?
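
Referenced from the version-2 notes above: a hedged sketch of the read-after-write
retry, repeating the hypothetical in-memory DDS stand-in from the version-1 sketch so
it compiles on its own. After its put(), the sender re-reads the mailbox and checks
that its own message survived; if a concurrent put() overwrote it, the sender rebuilds
from a fresh get() and retries. Everything still leans on each individual put() being
atomic, and the newline framing is purely illustrative.

    package main

    import (
        "bytes"
        "fmt"
        "hash/fnv"
    )

    // Same hypothetical in-memory stand-in for the DDS put()/get() API.
    type DDS map[uint64][]byte

    func (d DDS) Get(key uint64) []byte        { return append([]byte(nil), d[key]...) }
    func (d DDS) Put(key uint64, value []byte) { d[key] = value }

    func hashUser(user string) uint64 {
        h := fnv.New64a()
        h.Write([]byte(user))
        return h.Sum64()
    }

    // deliverV2: after writing, re-read and check that our message survived.
    // If a concurrent put() overwrote our mailbox version, rebuild from a
    // fresh get() and retry. Correctness leans entirely on each single
    // put() being atomic (all replicas or none), which is what DDS claims.
    func deliverV2(d DDS, user, msg string) {
        key := hashUser(user)
        line := []byte(msg + "\n") // naive framing: one message per line
        for {
            mbox := d.Get(key)
            d.Put(key, append(mbox, line...))
            if bytes.Contains(d.Get(key), line) {
                return // our message is there; done
            }
            // Lost the put() race: someone else's mailbox version won. Retry.
        }
    }

    func main() {
        d := DDS{}
        deliverV2(d, "dm", "hello")
        deliverV2(d, "dm", "world")
        fmt.Print(string(d.Get(hashUser("dm"))))
    }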