Rethink the Sync
================

What do we think of as standard contract with FS?
  Stuff older than 30 seconds never disappears even after crash
    In particular, means file system is always recoverable after power failure
  When fsync returns, previous writes to file will not disappear
What do we actually get in practice?
  When fsync returns, data is in disk cache, not necessarily on disk!
  So data will survive a crash but might not survive a power failure
  Moreover (p. 15) commit record could be written w/o all transaction blocks!

What is a write barrier?  Mount option for ext3 file system.
        Write barriers enforce proper on-disk ordering of journal
        commits, making volatile disk write caches safe to use, at
        some performance penalty.  The ext3 filesystem does not enable
        write barriers by default.  Be sure to enable barriers unless
        your disks are battery-backed one way or another.  Otherwise
        you risk filesystem corruption in case of power failure.
            - Linux mount man page

Why aren't write barriers enough?
  Saves the file system from inconsistencies, but might:
    1) Unnecessarily hurt performance to get that consistency
    2) Require apps to pay too much performance or complexity for correctness
  Boils down to having to sprinkle fsyncs in your application.  Examples?
    Emacs wants to save file w/o losing original
      write new contents to "#file#"
      fsync "#file#"    <-- the key step
      link "file" to "file~"
      rename "#file#" "file"
    Copying a file before editing to avoid losing the original (fsync copy)
    Application-level write-ahead logging (fsync log before writing original)
  Example:  dbutil (BerkeleyDB) calls fdatasync 5 times to add key to DB
  fsync/fdatasync can seriously harm performance
    Can require multiple rotations (8.3 msec each at 7200 RPM)
    Prevents batching transactions and disk arm scheduling optimization
  Overkill if you just wanted order, E.g.:
    Want copy on disk before original written, doesn't have to happen now.
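
A minimal sketch of the Emacs-style save sequence above, in Python.  The
helper name safe_save and the file name notes.txt are made up for
illustration.  os.replace stands in for rename, and the fsync only asks the
kernel to push the data out; as noted earlier, it may stop at the disk cache.

```python
import os
import tempfile


def safe_save(path, data):
    """Replace `path` with `data`, keeping the old contents in `path~`."""
    directory = os.path.dirname(path) or "."
    tmp = os.path.join(directory, "#" + os.path.basename(path) + "#")
    backup = path + "~"

    # 1. Write the new contents to "#file#".
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer
        os.fsync(f.fileno())  # 2. the key step: push data out before the rename

    # 3. Hard-link the original to "file~" so it survives the rename.
    if os.path.exists(path):
        if os.path.exists(backup):
            os.unlink(backup)
        os.link(path, backup)

    # 4. Atomically point "file" at the new contents.
    os.replace(tmp, path)


# Usage: save twice, then check that both versions are still reachable.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "notes.txt")
    safe_save(target, b"v1\n")
    safe_save(target, b"v2\n")
    with open(target, "rb") as f:
        current = f.read()
    with open(target + "~", "rb") as f:
        previous = f.read()
```

Without the fsync, the rename could reach disk before the new data does, so
a crash could leave "file" pointing at empty or partial contents.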
  
Can Featherstitch patch groups replace fsync within applications?
  In some cases, yes:  Wherever fsync is used just for ordering
  In other cases not.  What's an example?
    SMTP (mail server) tells client, "OK, I've queued your message"
    Client will delete copy, so message must survive power failure on server
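
A sketch of the mail-server case, where fsync is needed for durability and
not just for ordering.  The function name queue_message, the spool layout,
and the reply text are assumptions for illustration, not taken from any real
mail server.  Syncing the directory through a file descriptor assumes a
POSIX system.

```python
import os
import tempfile


def queue_message(spool_dir, msg_id, body):
    """Store a message durably, then return the reply to send to the client."""
    path = os.path.join(spool_dir, msg_id)
    with open(path, "wb") as f:
        f.write(body)
        f.flush()
        os.fsync(f.fileno())  # message contents must be durable...

    # ...and so must the directory entry that names the file.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Only now is it safe to let the client delete its copy.
    return "250 OK: queued as " + msg_id


with tempfile.TemporaryDirectory() as spool:
    reply = queue_message(spool, "msg0001", b"Subject: hi\n\nhello\n")
    with open(os.path.join(spool, "msg0001"), "rb") as f:
        stored = f.read()
```

An ordering-only primitive cannot express this: the reply must wait until
the write is actually stable, not merely ordered after something else.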

What is a synchronous mount?
  All IO operations force the log (like an implicit fsync after each syscall)
  Who uses synchronous mounts?
    No one.  People expect apps to use fsync properly
  Why does paper talk about synchronous mounts?
    Semantically, they set the gold standard--couldn't ask for more
    But performance numbers aren't so interesting, as nobody uses them

What is externally synchronous IO?
	external output produced by the computer system cannot be
	distinguished from output that could have been produced if the
	I/O had been synchronous.   - p. 6:4
  Means: don't have to touch disk until some other (e.g., network) IO happens
    Most file system calls only affect buffer cache.  Fsync becomes a no-op!
    Automatically detect when a disk write is really needed
    Then commit entire batch -> good performance
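
A toy model of the idea, assuming a single process and ignoring the commit
timer.  The class ToyXsyncFS and its fields are invented for illustration
and are not the paper's implementation.

```python
class ToyXsyncFS:
    """Buffers writes and commits them only when external output is produced."""

    def __init__(self):
        self.pending = []  # writes sitting in the buffer cache
        self.durable = []  # writes committed to "disk"
        self.commits = 0   # number of batched commits performed
        self.sent = []     # external output released so far

    def write(self, data):
        self.pending.append(data)  # only touches the buffer cache

    def fsync(self):
        pass  # a no-op: durability is enforced at output time instead

    def send(self, msg):
        # External output: commit everything buffered so far, then release it.
        if self.pending:
            self.durable.extend(self.pending)
            self.pending.clear()
            self.commits += 1
        self.sent.append(msg)


fs = ToyXsyncFS()
for i in range(100):
    fs.write("record %d" % i)
    fs.fsync()
fs.send("OK")
```

In this toy, one hundred write/fsync pairs cost a single commit, because
nothing outside the machine could observe the intermediate states.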

How is externally synchronous IO different from a synchronous mount?
  - Catastrophic errors (media failure) cannot be reported at time of IO
  - When checkpointing long-running computation, checkpoint might not be stable
  - Cannot preserve order across file systems (each FS has its own log)
  + Can achieve disk schedule much closer to normal mount

What are output triggered commits?
  Delay flushing file writes until externally visible output (or 5 sec)
  Must flush all writes that causally precede the output
    During flush, must buffer all externally visible actions
  What is causal order?
    Op B depends on result of op A: must never see effects of B w/o those of A
      Example:  SMTP server writes file, then responds "OK"
      Example:  Execute command from shell:
         cp: write(fd, ...) -> exit()
       bash:                          -> wait() -> printf("$ ")
    Note: Causality extends across processes
      E.g., might pipe message to separate process for spooling
  How does xsyncfs track dependencies?
    Some fairly intrusive kernel changes to track IPC, buffer output
    Plus conservatively lump all processes that share memory together
    What about X server (memory maps frame buffer)?  Could be problem
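
A toy sketch of dependency tracking, assuming one shared log and
per-process dependency sets.  ToyTracker and the process names are
hypothetical; the real kernel changes are far more involved.

```python
class ToyTracker:
    """Tracks which uncommitted writes each process causally depends on."""

    def __init__(self):
        self.next_id = 0
        self.uncommitted = set()  # ids of writes not yet on disk
        self.deps = {}            # process name -> write ids it depends on
        self.commits = 0

    def write(self, proc):
        wid = self.next_id
        self.next_id += 1
        self.uncommitted.add(wid)
        self.deps.setdefault(proc, set()).add(wid)
        return wid

    def ipc(self, sender, receiver):
        # Receiver has observed sender's state, so it inherits its dependencies.
        inherited = self.deps.get(sender, set())
        self.deps.setdefault(receiver, set()).update(inherited)

    def output(self, proc):
        """Return True if this external output forced a commit."""
        if not (self.deps.get(proc, set()) & self.uncommitted):
            return False  # nothing uncommitted precedes this output
        # Assumption: a single shared log, so a commit flushes everything.
        self.uncommitted.clear()
        self.commits += 1
        return True


t = ToyTracker()
t.write("cp")                         # cp: write(fd, ...)
t.ipc("cp", "bash")                   # cp exits, bash's wait() returns
forced_httpd = t.output("httpd")      # unrelated process: no commit needed
forced_bash = t.output("bash")        # prompt causally follows cp's write
forced_bash_again = t.output("bash")  # already committed
```

The unrelated process never pays for cp's write, which is the case where
tracking beats flushing on every externally visible output.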

How important is causality tracking?
  Could conservatively flush log on every externally visible IO
    Could be worse if machine runs two servers, and one rarely touches disk
    If both touch disk, probably not huge win, as have to flush all log records
      Would be different if you had multiple (per-application) logs
  Why did they implement dependency tracking?
    Already had machinery around for other interesting paper on Speculator

What evaluation questions should we ask (this paper makes it easy p. 6:14)?
  1. How does durability compare to current FSes?
  2. How does performance compare to current FSes?
  3. How does xsyncfs affect apps that use fsync?
  4. How do output triggered commits help?
  But what else should we ask?
    Is there a real problem in terms of performance, semantics, or app needs?
      E.g., is there evidence that programmers misuse fsync?
      Or do apps use fsync only for lack of a weaker alternative?

Figure 3: durability
    Do lots of write()s and send a network msg *after* each one
    Crash at some point
    Are all writes on disk for which net msg was received?
  Definitively addresses question 1
  Scary that fsync isn't really durable because no barriers!
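
The pass criterion for this experiment can be stated as a set check.  The
function name and the sequence numbers below are made-up inputs for
illustration, not results from the paper.

```python
def lost_acknowledged(acked, on_disk):
    """Return writes whose network message was received but are not on disk."""
    return sorted(set(acked) - set(on_disk))


# Hypothetical run 1: message 3 was received, but write 3 did not survive.
not_durable = lost_acknowledged(acked=[0, 1, 2, 3], on_disk=[0, 1, 2])

# Hypothetical run 2: disk holds more than was acknowledged, which is fine.
durable = lost_acknowledged(acked=[0, 1, 2, 3], on_disk=[0, 1, 2, 3, 4])
```

A file system passes only if the result is empty for every crash point.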

Figure 4:
  Single-threaded read/write/create/delete benchmark, no external output
  Addresses questions 2 and 3
  Would we expect xsyncfs to be faster or slower than async ext3?
    Slower, as some overhead to tracking dependencies (and using barriers)
  Why is ext3-barrier slower than ext3-sync?
    Disk cache absorbs many writes in ext3-sync
    So why is ext3-sync slower than ext3-async?
      Eventually disk cache fills up

Figure 6: mysql, vs async ext3 with write barriers
  For this experiment, async ext3 w/ write barriers (meaning proper fsync)
  mysql fsync()s log after each SQL transaction
  So both systems are arguably equally durable
  Why does xsyncfs win with few threads?
  Why does ext3 catch up with many threads?
  Would xsyncfs win if client on different host than mysql server?

Figure 7: web server, not much difference, why?
  
Could we simplify dependency tracking with a better user-level API?
  How about ability to flag a patch group for external sync in Featherstitch?