LBFS
====

What's the motivation for this work?
  Make network file systems work over cable modems
Why would you want this?  What are the alternatives?
  Copy files (or use CVS) - works pretty well when it fits your model
  Use ssh - edit files remotely... can be kind of a pain
Figure 10 shows ssh - what's going on here?
  At 0% packet loss, we are seeing the effects of latency
    Users (or Expect scripts) often wait for characters to echo
  By 4% packet loss, nearly 10 minutes have been added to the benchmark.  Why?
    TCP treats packet loss as a sign of congestion - slows down sending
    TCP goes into exponential back-off mode, waits 2 sec, 4 sec, etc.
    *Very* *frustrating* for users
  But LBFS goes over TCP... why doesn't it experience the same penalty?
    TCP contains a "fast retransmit" optimization
      If packet P1 is lost but P2, P3, P4 arrive, the recipient keeps ACKing P0
      After 3 duplicate ACKs, the sender cuts its rate in half and re-sends P1
    But this requires at least 4 packets in flight
      ...doesn't happen with ssh, which waits for the echo before sending the next key
So should we just use an existing file system?  Why LBFS?
  In fact, at 8% packet loss, it looks like AFS beats LBFS... what's wrong?
    AFS doesn't use TCP and doesn't back off--could cause network meltdown
    More importantly, AFS consumes a huge amount of bandwidth!
      The ed benchmark had ~2.5M key presses; AFS required ~45MB of upstream bandwidth
So what is the idea behind LBFS?
  Avoid redundant data transfers
  Assume the client has a huge on-disk cache, so we only care about consistency traffic:
    Read a file previously written by someone else, or write new contents
Why would we expect such redundant transfers?
  Edit part of a file, write back a very similar file
    What does vi do?  Truncates and re-writes the file
    What does emacs do?  Write, fsync, rename, to avoid corruption
    RCS appends to the revision log, again creating a new file for crash safety
  How about converting a GIF image to JPEG?  tar?  tar.gz?
    GIF -> JPEG would not benefit from LBFS.  Any related work that would?
      Operation-based updates can ship the convert command to a proxy
      But only good for shell commands (ImageMagick ok, gimp not)
    tar should work well
    tar.gz probably won't -- compression will change all the bytes
Key idea: Use collision-resistant hashes to elide data transfers
  H(x) has fixed size (e.g., 20 bytes); can't find any x != y s.t. H(x) = H(y)
  Say I want to send you x:  I first send you H(x)
    You have some x', compute H(x')
    If H(x) = H(x'), assume x' = x, so no need to transmit x
  What if we found H(x) = H(x') but x' != x?  An accidental collision is less likely than a hardware failure
    To give perspective: most digital signature schemes require a C-R hash
    Yes, an attacker might break a particular digital signature scheme
    But there will be other schemes, larger key sizes, etc.
  Note LBFS used SHA-1; if done today, it would use SHA-256 or SHA-512
Straw man:  Break files into aligned 8K chunks
  To transfer a file, send the hashes, then send only the missing chunks
  Problem?  Add one byte at the beginning of the file
    Every aligned chunk's contents shift, so none of the hashes match
LBFS is not the first system to elide redundant transfers.  Two examples:
  rsync [go over algorithm] - would this work for LBFS?
    Not so well, because rsync synchronizes two particular files
    LBFS sometimes uses chunks from multiple files.  Example?  tar
    Not always clear which two files--could use sketches [go over algorithm]
  Spring & Wetherall (also sim) [go over algorithm] - would this work for LBFS?
    No, because often the data is already at the server but was not sent by the same client
So how does LBFS actually work?
  Base chunk boundaries on file contents, not position
  Use a running checksum over a sliding window to identify points,
    like Spring/Wetherall anchors, but treat those points as chunk boundaries
  Look at Figure 1 (p. 4) to get intuition for this
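A minimal sketch of the content-based chunking idea, in Python.  The 48-byte
window, 13-bit test, ~8K expected chunk size, and 2K/64K clamps follow the
paper, but the simple polynomial rolling hash stands in for LBFS's actual
Rabin fingerprint, and all names here are illustrative, not LBFS's code:

  import hashlib

  WINDOW = 48            # bytes of context that determine a breakpoint
  MASK = 0x1FFF          # test low 13 bits -> expected chunk size ~8K
  MAGIC = 0x0123         # arbitrary target value for the masked hash
  MIN_CHUNK = 2 * 1024   # ignore breakpoints too close to the previous one
  MAX_CHUNK = 64 * 1024  # force a breakpoint if a chunk grows too large
  BASE, MOD = 257, (1 << 61) - 1

  def chunk_file(data: bytes):
      """Return a list of (offset, length, sha1-hex) content-defined chunks."""
      chunks, start, h = [], 0, 0
      power = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
      for i, b in enumerate(data):
          if i >= WINDOW:                  # slide: drop the outgoing byte...
              h = (h - data[i - WINDOW] * power) % MOD
          h = (h * BASE + b) % MOD         # ...and mix in the incoming byte
          length = i - start + 1
          hit = i >= WINDOW - 1 and (h & MASK) == MAGIC
          if (hit and length >= MIN_CHUNK) or length >= MAX_CHUNK:
              chunks.append((start, length,
                             hashlib.sha1(data[start:i + 1]).hexdigest()))
              start = i + 1
      if start < len(data):                # the final partial chunk
          chunks.append((start, len(data) - start,
                         hashlib.sha1(data[start:]).hexdigest()))
      return chunks

Because a breakpoint depends only on the 48 bytes around it, inserting data
near the front of a file disturbs only the chunks containing the edit; later
boundaries fall at the same content, so their hashes are typically unchanged.
That is exactly what defeats the aligned-8K straw man above.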
What about pathological cases?  Could get tiny or huge chunks.  Why is this bad?
  For small chunks, just ignore breakpoints less than 2K from the last one
    (Maybe should instead have chosen as the breakpoint the point with the minimum hash in the 8K region around it)
  For large chunks, just cap the size at 64K
  Note clamping is not the end of the world, because gzip may help, too
Different file systems have different consistency mechanisms and semantics
  Generally want at least close-to-open consistency:
    If I write and close a file, and you then open the file, you see my changes
  With NFS you may see more than this:
    A write can commit at any time before close
    All pending writes to a file are flushed before the close system call returns
    Client cache is flushed whenever (mtime, ctime) changes in the file attributes
      mtime - time when the file was last written
      ctime - time when the attributes (e.g., mtime) were last set
      (Why does ctime exist?  E.g., useful for incremental backups)
    NFSv3 open always gets attributes from the server (with an ACCESS RPC to check permissions)
  AFS provides only close-to-open consistency
    Clients cache files in a large persistent cache
    A dirty file is cached while being written, and written back to the server only on close
    With concurrent writers, the last writer wins
      Probably not what you want, so avoid concurrent updates
      E.g., might use lock files (or directories, like CVS)
    Server remembers which clients are caching which files, invalidates caches with *callbacks*
    Opening a cached file requires no interaction with the server
  What are the advantages/disadvantages of callbacks vs. polling?
    + Callbacks decrease latency for opening a cached file
    + Callbacks reduce load on the server compared to polling
    - Callbacks increase state requirements on the server
    - Callbacks must be recorded to disk to survive a crash and reboot
    - Can't have both liveness and correctness if a client fails
      (server promised to call it back before updating the file, but can't contact it)
    Good compromise between the two: leases
      Server promises to notify the client of invalidations for, e.g., 2 minutes
      If a server reboot takes more than 2 minutes, no need to write leases to disk
      After 2 minutes, the server can forget about dead clients
  What happens with concurrent writers using O_APPEND in NFS and AFS?
    NFS might work, or might produce a mishmash with appended writes not at the end
    AFS always throws one client's changes away
How does the LBFS read protocol work?
  Uses both leases and attribute-based consistency:
    If the client holds a read lease on the attributes and the file is cached, no RPCs
    Else, fetch attributes; if mtime/ctime are unchanged, use the cached data
    Else, use the LBFS file transfer protocol:
      GETHASH (fh, offset, count) -> list of (hash, size) pairs, eof flag
      READ (fd, offset, count) -> data, eof
  How should this compare to NFS in the best and worst cases?
    Best case: a single GETHASH RPC vs. many large READs for NFS
    Worst case: LBFS incurs an extra network round trip
Should we do the same for writes?  PUTHASH + WRITE RPCs?
  Bad idea:
    Applications often truncate/delete a file before re-writing it,
      which kills the best source of chunks (the file's own old contents)
    Pipelining PUTHASH would also yield vastly re-ordered writes
      E.g., temporary runs of zero bytes in the middle of text files
    Also, if the file is not in the server's cache, unaligned writes cause reads from disk
So how do writes actually work?  Atomic updates:
  MKTMPFILE (fd, fh)
  CONDWRITE (fd, offset, count, sha) -> OK, HASHNOTFOUND, or other error
  TMPWRITE (fd, offset, count, data) -> OK or error
  COMMITTMP (fd, fh)
  Best case comparison to NFS?
    Send MKTMPFILE plus a bunch of CONDWRITEs
    All CONDWRITEs succeed, send COMMITTMP
    2 round trips, very little bandwidth
    Not a win for small files: if less than 8K (uncompressed), just send a WRITE
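A minimal sketch of the client's write-back path using these four RPCs,
assuming a hypothetical `server` stub whose methods mirror the RPCs and
reusing the `chunk_file` helper sketched earlier (pipelining, retries, and
error handling are omitted):

  HASHNOTFOUND = "HASHNOTFOUND"

  def write_back(server, fh, fd, data):
      """Write a dirty file back on close via a server-side temporary file."""
      server.mktmpfile(fd, fh)          # temp file the following writes target
      missing = []
      for offset, length, sha in chunk_file(data):
          # Ask the server to fill this range from its chunk index if it can.
          if server.condwrite(fd, offset, length, sha) == HASHNOTFOUND:
              missing.append((offset, length))
      for offset, length in missing:
          # Only chunks the server did not already have are sent in full.
          server.tmpwrite(fd, offset, length, data[offset:offset + length])
      # Atomically replace the target file's contents with the temp file.
      return server.committmp(fd, fh)

In the best case `missing` stays empty, so only hashes plus COMMITTMP cross
the wire; LBFS additionally pipelines the CONDWRITEs to get this down to the
two round trips noted above.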
  Worst case?
    One extra round trip, plus a bunch of CONDWRITE RPCs,
      plus COMMITTMP, which may be expensive at the server
  Note the server keeps old temp files and deleted files (they may hold useful chunks)
How is LBFS implemented?
  LBFS client: uses the xfs device driver.  Why not an NFS loopback server?
    Because LBFS needs to see close calls (it writes back on close), and NFS has no close operation
  LBFS server: implemented as an NFS loopback client
    Advantages: portable, simplifies access control
    Disadvantage: COMMITTMP must copy files (expensive).  Why?
      Static i-number problem--applications don't expect a file's i-number to change
      Truncation--might happen before an overwrite and needs to be visible
        So it's actually good to have a copy around in a temp file
What is the atomic truncate+exchange proposal?
  Truncate A, then atomically replace B's contents with A's old contents
    Okay to throw away A after a crash, as long as B remains unchanged
  Claim: easier to implement than rename - how would you do this?
    - Set inode A to show a zero-length file with no blocks
    - Replace inode B with the old contents of inode A
    - Mark all of B's old blocks as unused
    - Note: no need to change A's old indirect blocks--they are now just part of B
  Would allow emacs/RCS to update files atomically
  Could also be used by the LBFS server to make truncate & COMMITTMP work well
Both sides: use a B-tree for the chunk index, NOT crash-recoverable
  Why?  Crash recovery would require expensive synchronous writes
    Could also run into trouble if someone changes the FS without going through LBFS
  So follow the philosophy of never assuming the database is correct (entries are just hints)
    (a small sketch of the hints idea appears at the end of these notes)
What questions should we be asking when we turn to the evaluation?
  1. Do files really contain redundant chunks?
  2. Does LBFS reduce bandwidth consumption?
  3. Would a simpler scheme have worked?
  4. Does bandwidth reduction matter to performance?
  5. Any other costs we should care about (e.g., CPU time, disk space, or disk bandwidth)?
  6. Any pathological cases?  How serious is the 2K/64K clamping?
  How well are these points addressed?
    1 -> Table 1
    2 -> Figure 6
    3 -> the leases+gzip comparison throughout
    4 -> Figure 7a
    5 -> Not completely clear, but Figure 7 suggests it's ok (gcc is CPU-bound)
    6 -> Figure 5 shows clamping does kick in; not clear how serious this is
How would you expect CVS to work on top of LBFS?
  Probably not so well, because CVS already takes advantage of revision history
What is the estimated performance impact of the static i-number problem?
  Hard to say from the paper... what experiment would you design to find out?
    Maybe replace the copy with a rename, even if not correct, just for timing
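To make the "database is just hints" point above concrete, here is a minimal
sketch of how a server might consult the chunk index when answering
CONDWRITE.  The index layout and the `read_file_range` helper are
hypothetical; LBFS's real index is the B-tree described above:

  import hashlib

  def lookup_chunk(index, read_file_range, sha):
      """Return the bytes for `sha`, or None (i.e., HASHNOTFOUND) if the hint fails.

      `index` maps sha -> (path, offset, count) and is never updated
      synchronously, so it may be stale; every hit is re-verified before use.
      `read_file_range(path, offset, count)` reads from the underlying FS.
      """
      hint = index.get(sha)
      if hint is None:
          return None                  # no hint -> HASHNOTFOUND
      path, offset, count = hint
      try:
          data = read_file_range(path, offset, count)
      except OSError:
          data = None                  # file changed or removed behind LBFS's back
      if data is None or hashlib.sha1(data).hexdigest() != sha:
          index.pop(sha, None)         # stale hint: drop it, fall back to the wire
          return None
      return data                      # contents re-verified, safe to use

A wrong or missing index entry therefore only costs bandwidth (the chunk is
sent explicitly); it can never cause the wrong data to be committed, which is
why the index needs no synchronous writes or crash recovery.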