Soft Updates
============

What is the problem?  Metadata consistency

Suppose you delete a file and subsequently crash
  * If file system imposes write order with synchronous writes (FFS)
    - Free map, count will probably be wrong
    - Inode may have link count 1 but no directory entry
    - New inode may point to block of deleted file
        Chunks of a deleted file show up in new unrelated files
    - fsck can fix 1 and 2 but not 3
  * If file system imposed no ordering (Linux ext2fs)
    - Same problems as FFS, plus:
    - Inode may have been recycled before directory written
        Old directory will contain link to new file
    - Indirect block could get reused before old inode cleared
        Random file data will get interpreted as block pointers
    - fsck cannot fix these
  * Log structured file system/shadow paging (LFS)
    - May end up with pointer to old file system state before the delete
        But always consistent--blocks never overwritten while pointed to
    - fsck/mount could roll forward log to lose less state
  * Journaling/write-ahead logging (XFS)
    - Can end up with same problems as ext2fs
    - But fsck replays log to bring file system into consistent state

Advantages/disadvantages of FFS:
  + create/delete operations always survive after a crash
  - Chunks of deleted files resurface in unexpected places
  - fsck takes a long time

What are advantages/disadvantages of LFS & Journaling
  * LFS
    - Cleaner overhead: many open questions in how to clean properly
  * Journaling
    - Must perform more metadata writes (once in log, once in file system)
  * Both
    - Contention for lock at end of log?
    - fsync must wait for other files' data
    + Most operations don't require synchronous disk writes
    + fsck/mount is fast
    + True atomic rename
    - Create/delete requires writes even for short-lived files

Without logging, what is correct order in which to write info to disk?
  1. Never write pointer before initializing the structure it points to
  2. Never reuse a resource before nullifying all pointers to it
  3. Never clear last pointer to live resource before setting new one

So what are goals of this paper?
  * Eliminate most synchronous disk writes
  * Make fsck much faster, or at least don't wait for it to restart
  * Fix resurrected data problem

Straw man:  Why not just impose partial order on disk queue?
  * Write things in correct order, but delay writes

Problem:  Crash might occur between ordered but related writes
  * E.g., summary information wrong after block freed

Problem:  Dependency cycles and false sharing
  * Several inodes or directory entries in same block
  * Example:  figure 1
    - Create file A, delete file B, same dir/inode blocks
        Can't write directory until inode A initialized
        Can't write inode B until pointer cleared in directory
  * Why can't you write inode B until pointer cleared? (see footnote 2)
    - Have to make sure it isn't reallocated
    - Fsck would have to check every directory entry before restart (slow)
    - Otherwise, might get incorrect link count
        deleting file would clear inode even when another link existed!

Problem:  Block aging
  * Block that always has dependency will never get written back

What are soft updates?
  * Data structure for each updated field or pointer, contains:
    - old value
    - new value
    - list of updates on which this update depends
  * Can write blocks in any order
    - But must temporarily undo updates with pending dependencies
    - Must lock rolled-back version so applications don't see it
    - Choose ordering based on disk arm scheduling
  * p. 134: other dependencies can "be more efficiently handled by
        postponing in-memory updates until after the updates on
        which they depend reach stable storage."

Example:  Create A delete B revisited
  * See figure 2 - requires directory to be written twice?
  * What if inode written first?
  * How many writes required in XFS?
      3 - one for log, one for inode block, one for directory block
  * How many in LFS?
      All part of one big write (+ checkpoint for many updates)

Four main structural changes requiring sequenced updates:
  1. Block allocation
       Must write:  disk block, free map, pointer
       Req: Disk block, free map must be written before pointer
       Use: Undo/redo on pointer (+ possibly file size)
  2. Block deallocation
       Must write:  previous pointer, free map
       Just update free map after pointer written
       Or, immediately deallocate blocks if pointer was never written to disk
         How do you know?
  3. Link addition
       Must write:  Directory entry, inode, and free map (if new inode)
       Req: inode and free map must be written before dir entry
       Use: Undo/redo on i-number in dir entry (ignore entries w. ino 0)
  4. Link removal
       Must write:  Dir. entry, inode and free map (if nlinks==0)
       Req: Decrement nlinks only after pointer cleared
       Use: Clear directory entry immediately
            decrement in-memory nlinks once pointer written
            If directory entry was never written, decrement immediately

Issues
  * fsync
    - Must ensure names for files are also stably on disk
    - Must ensure names of parent directories are stably on disk!
        keep data structures to track such dependencies
        recurse to higher level directories
        but parent directories can be written in any order, so still
		good disk arm scheduling
  * unmounting a file system
    - May need to flush dirty buffers multiple times
  * memory usage
    - Deleting large directory trees--memory goes faster than disk
    - Cap number of directory structures allocated
  * useless write-backs
    - syncer wrote many blocks at once--worst case
      even with circular dependencies better to write one at a time
    - LRU evection scheme tweaked to know about dependencies
  * fsck
    - Many, many inodes with non-zero link counts
    - Don't stick them all in lost+found

Performance
  * Figure 3:  Is this what we expect?  yes
    - Why the dip after 64KB?
       Write coalescing in 64K chunks
       Indirect block kicks in at 104K
    - How would you expect LFS and XFS to do here?
       LFS - probably better (no seeks)
       XFS - between conventional and soft-updates, since more writes needed
  * Figure 4:  How does soft updates beat No Order?
    - Soft updates defers the work from actually removing files
  * Figure 5:  How does soft-updates beat no order?
    - Artifact of benchmark:
        reallocation is process of coalescing writes into 64K chunks
          sometimes relocate blocks to do this.  May be farther from
          indirect block if old space freed and made available
  * Figure 6:  Pretty good, yes?
  * Figure 7:
      How does concurrency hurt convenional?  less locality
      Helps soft-updates because more flexibility in disk scheduling
        too much concurrency just adds overhead
  * Figure 8:  journaling has cost

Limitations of soft updates:
  * Not as general as logging, e.g., no atomic rename
  * Metadata updates might proceed out of order
      create /dir1/a than /dir2/b
      after crash /dir2/b might exist but not /dir1/a