Caching and replication
=======================

To get good performance, we want to cache data in a network file system
Problem: ensuring network file system clients see the same data
Two interesting cases

Sequential write sharing: only one writer or multiple readers at a time
  How do NFS and AFS offer consistency under sequential write sharing?
  NFS - poll the server for freshness on every open
    Disadvantage -- heavy load on the server
    Disadvantage -- incur a round trip on every open
  AFS - have the server call you back
    Advantage -- less load on the server, fewer round trips
    Disadvantage -- server must keep state (potentially large)
    Disadvantage -- what if a client disappears?
  Another mechanism: leases - server promises a callback for a limited time
    Advantages -- same as AFS callbacks
    Advantages -- plus, can recover from a dead client
    Disadvantages -- have to wait out the lease time to recover from a dead
      client or a server reboot (since leases are kept in memory)
  Lease implementation issues (a small sketch follows below):
    The lease may include an absolute expiration time
      Valid until ExpTime - MaxClockSkew, so requires roughly synchronized clocks
    Or the lease can be relative: "N seconds from now"
      Then a conservative estimate of expiration is RPC_Issue_Time + N
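Here is a minimal sketch of the relative-lease case, in Go. The RenewLease RPC, the
30-second lease length, and the function names are made up for illustration; the only
point is that the client records the time it *issued* the renewal, so clock skew and
network delay can only make its view of the lease expire earlier than the server's,
never later.

    package main

    import (
        "fmt"
        "time"
    )

    // renewLease pretends to ask the server for a lease of leaseSecs seconds and
    // returns a conservative local expiration time: RPC issue time + N.
    func renewLease(leaseSecs int) time.Time {
        issued := time.Now() // record the time *before* sending the RPC
        // ... send a (hypothetical) RenewLease RPC to the server here (elided) ...
        return issued.Add(time.Duration(leaseSecs) * time.Second)
    }

    func main() {
        expires := renewLease(30)
        if time.Now().Before(expires) {
            fmt.Println("cached data still covered by the lease until", expires)
        } else {
            fmt.Println("lease expired; must revalidate with the server")
        }
    }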
Concurrent write sharing: multiple simultaneous accesses, at least one of which
is a write
  How do NFS and AFS provide consistency under concurrent write sharing?
  Short answer: they don't, really
  NFS - changes are written to the server asynchronously
    Even when writes are not on the server's disk (COMMITted), other clients
      will see the changes - unless maybe the server crashed
    So you don't have strict consistency
      Say you read immediately after my write returns
      The write is asynchronous (before close), so you might miss it
      If you are caching the data block, you won't check with the server
    What about mmap? (Forget it)
  AFS - even worse than NFS
    Writes are not visible to other clients until you close the file
    If two clients write a file, the last one to close it wins

How to achieve concurrent write sharing? Example: the Sprite file system
  Motivation - study of BSD file usage
    75% of files are open less than 1/2 second, 90% less than 10 seconds
    20-30% of files are gone within 30 seconds
    So they wanted to avoid writing a file through to the server on close
  Sprite write policy
    Writes first go just to the client cache, then the server cache, then disk
    Write to the next stage when unmodified for 30 seconds
    Benefit of delay -- exploit overwrites and deletes
    How useful in practice?
      Speculation: many files in /tmp, so maybe just make /tmp local...
  Sprite consistency -- put each file in one of several modes:
    Sequential write sharing
      After every modifying close, bump the version number
      The version number tells other clients to flush their caches
    Concurrent write sharing -- disable caching
      What happens when a second open puts a file in concurrent write sharing?
        Server keeps track of the last writer
        Server asks the writer to flush its dirty blocks, and waits
        Then the server lets the open succeed

So which of NFS/AFS/Leases/Sprite is preferable?
  Is concurrent write sharing a problem? Not with most applications
  Which would you want for your laptop when on an airplane?
    None - they all require communication with the server

Idea: replicate data *optimistically*
  Example: disconnected operation with laptops
    source-code repository, cooperating programmers
    pair-wise file synchronization between programmer laptops
    want to be able to modify files while disconnected
  Try to get the right answer and be consistent if you can do so easily
  But better to return an old version of a file than none at all!
  What about concurrent updates?
    Need to enforce some order of operations
    e.g. Ivy DSM ensured sequential consistency

Example 1: just focus on modifications to a single file
  H1 writes "a"
  H1 -> H2
  H2 writes "b"
  H1 -> H3
  H3 -> H2
  What is the right thing to do?
  Is it enough to compare modification times?
    Yes in this case, as long as you carry them along correctly

Example 2:
  H1 writes "a"
  H1 -> H2
  H1 writes "b"
  H2 writes "c"
  H2 -> H1
  H2's mtime will be bigger
  Does that mean we should use "c" and discard "b"?

What is the principle here?
  We're not *enforcing* a sequential update history
    No locks, no Ivy-style owners
  But *if* updates were actually sequential, we want to detect that and carry
    along the most recent copy
  And if updates were not sequential, we want to detect the problem
    (after the fact)
  This is where the "optimism" comes in

So what is the problem with using wall-clock time?
  Certainly if T1 < T2, the version written at T1 is not the latest
  *But* T2 > T1 does not imply that the write at T2 supersedes the write at T1
  The underlying problem is that time doesn't reflect causality
    That is, we care whether the write at T2 was made on a host that had seen
      the write at T1
  Also, you cannot rely on computers having synchronized clocks

How can we decide more precisely whether we're looking at a sequential history?
  We could record each file's entire modification history
    A list of hostname/localtime pairs
    And carry the history along when synchronizing between hosts
  For example 1:
    H2: H1/T1, H2/T2
    H3: H1/T1
  For example 2:
    H1: H1/T1, H1/T2
    H2: H1/T1, H2/T3
  Then it's easy to decide whether version X supersedes version Y:
    If Y's history is a prefix of X's history
  This exactly captures our desire for an *ordered* sequence of updates
  Note we're not comparing times, so this works without synchronized clocks

Why is complete history not such a great solution?
  The histories grow without bound if we modify a file a lot
  Can we compress the histories?

Proposal: vector timestamps
  Just record the latest time for each host
    So throw away all but the last history record for each host
  All we care about is whether there's a *new* update by some host that we
    hadn't heard about when we changed the file
  We assume that such an update must be *after* the last time we heard from
    that host: that the host isn't going backwards in time
  A vector timestamp is a vector of times, one entry per host
    The times are (as in the histories) local wall-clock times
    Though they could be any monotonic count
  Comparison rules (see the sketch at the end of these notes):
    a = b if they agree at every element
    a < b if a[i] <= b[i] for every i, but !(a = b)
    a > b likewise, i.e. if b < a
    a || b if a[i] < b[i] and a[j] > b[j] for some i, j
      i.e. a and b conflict
  If one history was a prefix of the other, then one vector timestamp will be
    less than the other
  If one history is not a prefix of the other, then (at least by example) the
    VTs will not be comparable

So now we can perform file synchronization
  When a pair of laptops has two versions of a file, they accept the one with
    the higher vector timestamp
  And signal an error if the VTs conflict
  Illustrate with the two examples

What if there *are* conflicting updates?
  VTs can detect them, but then what? Depends on the application
  Might be easy: a mailbox file with distinct messages, just merge
  Might be hard: changes to the same line of C source
  Reconciliation must be done manually for the hard cases
  CVS keeps extra information, equivalent to the complete history
    So it can generate diffs from the common ancestor
    Much better than diffing the two final versions against each other
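A minimal sketch of the vector-timestamp comparison rules above, in Go. The map-based
VT type, the host names, and the compare function are illustrative assumptions, not any
particular system's API; a missing entry is treated as "never heard from that host".

    package main

    import "fmt"

    // VT maps a host name to that host's last (local) modification time.
    type VT map[string]int64

    // compare returns "=", "<", ">", or "||" (concurrent, i.e. conflicting).
    func compare(a, b VT) string {
        aLess, bLess := false, false // does some entry of a (resp. b) lag behind?
        hosts := map[string]bool{}
        for h := range a {
            hosts[h] = true
        }
        for h := range b {
            hosts[h] = true
        }
        for h := range hosts {
            // a missing entry reads as 0: no update ever heard from that host
            if a[h] < b[h] {
                aLess = true
            }
            if a[h] > b[h] {
                bLess = true
            }
        }
        switch {
        case !aLess && !bLess:
            return "=" // identical: same set of updates seen
        case aLess && !bLess:
            return "<" // b supersedes a: keep b's copy
        case !aLess && bLess:
            return ">" // a supersedes b: keep a's copy
        default:
            return "||" // conflict: concurrent updates, must reconcile
        }
    }

    func main() {
        // Example 2 from above: H1's copy reflects its own writes at T1 and T2;
        // H2 wrote at T3 having only seen H1's write at T1.
        h1 := VT{"H1": 2}
        h2 := VT{"H1": 1, "H2": 3}
        fmt.Println(compare(h1, h2)) // "||" -- conflicting updates, signal an error
    }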