Rethink the Sync
================

What do we think of as standard contract with FS?
  Stuff older than 30 seconds never disappears even after crash
    In particular, means file system is always recoverable after power failure
  When fsync returns, previous writes to file will not disappear
What do we actually get in practice?
  When fsync returns, data is in disk cache, not necessarily on disk!
    So data will survive a crash but might not survive a power failure
  Moreover (p. 15) commit record could be written w/o all transaction blocks!
What is a write barrier?
  Mount option for ext3 file system.
      Write barriers enforce proper on-disk ordering of journal
      commits, making volatile disk write caches safe to use, at some
      performance penalty.  The ext3 filesystem does not enable write
      barriers by default.  Be sure to enable barriers unless your
      disks are battery-backed one way or another.  Otherwise you risk
      filesystem corruption in case of power failure.
                                        - linux mount man page
Why aren't write barriers enough?
  Barriers save the file system from inconsistencies, but might:
    1) Unnecessarily hurt performance to get that consistency
    2) Require apps to pay too much performance or complexity for correctness
  Boils down to having to sprinkle fsyncs in your application.  Examples?
    Emacs wants to save file w/o losing original
      write new contents to "#file#"
      fsync "#file#"   <-- the key step
      link "file" to "file~"
      rename "#file#" "file"
    Copying a file before editing to avoid losing the original (fsync copy)
    Application-level write-ahead logging (fsync log before writing original)
      Example: dbutil (BerkeleyDB) calls fdatasync 5 times to add key to DB
  fsync/fdatasync can seriously harm performance
    Can require multiple rotations (8.3 msec each at 7200 RPM)
    Prevents batching transactions and disk arm scheduling optimization
    Overkill if you just wanted order, e.g.:
      Want copy on disk before original written, doesn't have to happen now.
Can Featherstitch patch groups replace fsync within applications?
  In some cases, yes:  Wherever fsync is used just for ordering
  In other cases not.  What's an example?
    SMTP (mail server) tells client, "OK, I've queued your message"
    Client will delete copy, so message must survive power failure on server
What is a synchronous mount?
  All IO operations force the log (like an implicit fsync after each syscall)
Who uses synchronous mounts?
  No one.  People expect apps to use fsync properly
Why does paper talk about synchronous mounts?
  Semantically, they set the gold standard--couldn't ask for more
  But performance numbers aren't so interesting, as nobody uses them
What is externally synchronous IO?
      external output produced by the computer system cannot be
      distinguished from output that could have been produced if the
      I/O had been synchronous.
                                        - p. 6:4
  Means: don't have to touch disk until some other (e.g., network) IO happens
  Most file system calls only affect buffer cache.  Fsync becomes a no-op!
  Automatically detect when a disk write is really needed
    Then commit entire batch -> good performance
How is externally synchronous IO different from a synchronous mount?
  - Catastrophic errors (media failure) cannot be reported at time of IO
  - When checkpointing long-running computation, checkpoint might not be stable
  - Cannot preserve order across file systems (each FS has its own log)
  + Can achieve disk schedule much closer to normal mount
What are output triggered commits?
  Delay flushing file writes until externally visible output (or 5 sec)
  Must flush all writes that causally precede the output
  During flush, must buffer all externally visible actions
What is causal order?
  Op A depends on result of op B, will never see second w/o first
  Example: SMTP server writes file, then responds "OK"
  Example: Execute command from shell:
    cp:   write(fd, ...) -> exit()
    bash:                          -> wait() -> printf("$ ")
  Note: Causality extends across processes
    E.g., might pipe message to separate process for spooling
How does xsyncfs track dependencies?
  Some fairly intrusive kernel changes to track IPC, buffer output
  Plus conservatively lump all processes that share memory together
    What about X server (memory maps frame buffer)?  Could be problem
How important is causality tracking?
  Could conservatively flush log on every externally visible IO
  Could be worse if machine runs two servers, and one rarely touches disk
  If both touch disk, probably not huge win, as have to flush all log records
    Would be different if you had multiple (per-application) logs
  Why did they implement dependency tracking?
    Already had machinery around for other interesting paper on Speculator
What evaluation questions should we ask (this paper makes it easy p. 6:14)?
  1. How does durability compare to current FSes?
  2. How does performance compare to current FSes?
  3. How does xsyncfs affect apps that use fsync?
  4. How do output triggered commits help?
  But what else should we ask?
    Is there a real problem in terms of performance, semantics, or app needs?
    E.g., is there evidence that programmers misuse fsync?
    Or do apps use fsync only for lack of a weaker alternative?
Figure 3: durability
  Do lots of write()s and send a network msg *after* each one
  Crash at some point
  Are all writes on disk for which net msg was received?
  Definitively addresses question 1
  Scary that fsync isn't really durable because no barriers!
Figure 4: Single-threaded read/write/create/delete benchmark, no external output
  Addresses questions 2 and 3
  Would we expect xsyncfs to be faster or slower than async ext3?
    Slower, as some overhead to tracking dependencies (and using barriers)
  Why ext3-barrier slower than ext3-sync?
    Disk cache absorbs many writes in ext3-sync
  So why is ext3-sync slower than ext3-async?
    Eventually disk cache fills up
Figure 6: mysql, vs async ext3 with write barriers
  For this experiment, async ext3 w/ write barriers (meaning proper fsync)
  mysql fsync()s log after each SQL transaction
    So both systems are arguably equally durable
  Why does xsyncfs win with few threads?
  Why does ext3 catch up with many threads?
  Would xsyncfs win if client on diff host than mysql server?
Figure 7: web server, not much difference, why?
Could we simplify dependency tracking with a better user-level API?
  How about ability to flag a patch group for external sync in Featherstitch?
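
Appendix sketch: the Emacs-style save sequence from the fsync examples
earlier in these notes (write "#file#", fsync it, link "file" to "file~",
rename "#file#" over "file"), written out in Python.  This is only an
illustration of the ordering, not Emacs's actual code: the function name
safe_save, the open flags and the 0o644 mode are assumptions made for the
example.  It follows the four listed steps only, so it does not fsync the
containing directory.

```python
import os


def safe_save(path: str, data: bytes) -> None:
    """Replace `path` with `data`, keeping the old version as `path~`.

    Hypothetical helper illustrating the four-step sequence from the notes.
    """
    path = os.path.abspath(path)
    directory, name = os.path.split(path)
    tmp = os.path.join(directory, "#" + name + "#")
    backup = path + "~"

    # Step 1: write new contents to "#file#".
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # write() may be partial
            view = view[written:]
        # Step 2 (the key step): force the new contents out before any
        # permanent name refers to them.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 3: link "file" to "file~" so the original stays reachable.
    if os.path.exists(path):
        if os.path.exists(backup):
            os.unlink(backup)  # link() fails if the target name exists
        os.link(path, backup)

    # Step 4: rename "#file#" to "file", atomically switching the name.
    os.replace(tmp, path)
```

Usage: safe_save("notes.txt", b"new contents") leaves the previous version
in "notes.txt~".  Here the fsync is wanted only for ordering (new contents
on disk before the rename), which is the case the notes single out as
overkill for fsync.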