Quick traditional FS background
===============================

Original Unix file system
    All data stored in 512-byte blocks
    Free blocks kept in a linked list
    Disk partitioned into three regions
    - Superblock: parameters of the file system, pointer to head of free list
    - Inodes: type/mode/size, plus pointers to file/directory blocks
        For large files, has pointers to indirect, double indirect, etc. blocks
    - Data blocks: plain old file data, also indirect blocks;
        for directories, blocks of filename->inode# mappings

Limitations of Unix FS
    Never transfers more than 512 bytes per disk access
        On old disks, required at least one rotation per 512 bytes
        @3,600 RPM that means 30 KBytes/sec!
    Many factors make for long seek times, reducing throughput even further
        After a while, allocation from the free list is random, so no locality
        Inodes are far from both the directory and the file's data
        Within a directory, inodes of related files are not near each other
    [Functionality restrictions like short file names and no rename]

Berkeley Fast File System (FFS) improved the Unix FS in the 1980s
    Larger block size of at least 4K
        (uses "fragments" when the last block of a file < block size, to avoid wasted space)
    Bitmaps replace the free list
    Cylinder groups (CGs) spread inodes throughout the disk
        Try to locate files in the same directory in the same CG
        Try to allocate file blocks in the same CG as the inode
    After a crash, run the fsck (file system check) program
        Checks every inode's fields for consistency (e.g., size, data block pointers, link count)
        Checks directory structure for consistency
        Time to run is proportional to the size of the file system (non-scalable w. trends)

What happens when you create a 1K file on FFS?
    Must make at least 5 writes:
    - directory data block: to record the name of the new file
    - directory inode: to record the new modification time
    - file inode: to initialize it with metadata,
        then (possibly again) for the new file size and data block pointer
    - file data block: to store the file contents
    Worse yet, many of these writes must be synchronous!
        Much more on this next lecture
        Basic point is you have to leave the file system in a state fsck can fix
        Requires doing writes in a particular order
            (E.g., delete a file and create a new one with the same data block:
             want to make sure you don't end up w. both files cross-allocated)
        FFS uses synchronous writes to ensure ordering
            Could imagine not doing it, but next lecture we'll see it's not so simple

Log-structured file system
==========================

What problems is this paper addressing?
    Disk seek times not improving as quickly as CPU speed, transfer rates, etc.
    Larger file caches
    - Increase the write-to-read ratio of disk traffic (reads increasingly hit the cache)
    - Potentially allow large write buffers
    FFS bad for small file accesses (especially creates)
    - Office and engineering workloads
    FFS requires many synchronous writes for metadata operations
    Crash recovery time increases linearly with disk size

How is the disk laid out?
    Superblock: #segments, segment size, etc.
    Checkpoint region
    Segments
        Segment summary
        - inumber, version, offset for each block
        Log entries:
        - Inode: mtime, data/indirect block locations
        - Inode map: inode locations, inode versions, atimes
        - Indirect block
        - Directory change log: link/unlink/rename
        - Segment usage table: #bytes free in segments, write times
    Example segment (F = file data block, D = directory data block, I = inode):

        | F1,1 | F1,2 | I1 | F2,5 | F2,8 | I2 | D7,1 | I7 |
           \--<---\--<--+

    Where do you store the time of last access?  (The inode map.)  Why?
    Within a segment, sort blocks by inode and offset

How do you find the data for a file?
    Checkpoint region gives you the inode map
    Inode map gives you the inode, which gives you the file data
    Changes made after the checkpoint are generally in memory
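The lookup path is easy to see in code.  Below is a tiny self-contained C sketch of the
idea, not Sprite LFS itself: every name (disk, imap, log_append, lfs_write_file,
lfs_read_file) and every size is made up for illustration.  It models the disk as an
in-memory append-only log and follows inode map -> inode -> data block on reads.

    /* lfs_lookup.c: toy model of the LFS read path.  Compile: cc lfs_lookup.c */
    #include <stdio.h>
    #include <string.h>

    #define BLKSZ   32        /* tiny blocks so the example stays small        */
    #define NBLKS   64        /* size of the fake "disk", in blocks            */
    #define NDIRECT 4         /* direct block pointers per inode, no indirects */

    typedef int blkno_t;                     /* a disk address = block number  */

    struct inode { int size; blkno_t direct[NDIRECT]; };

    static char    disk[NBLKS][BLKSZ];       /* the log lives in these blocks  */
    static blkno_t imap[16];                 /* inode map: inum -> inode addr  */
    static blkno_t log_head = 1;             /* next free block; block 0 plays
                                                the role of a checkpoint region */

    /* Append one block to the log and return its disk address. */
    static blkno_t log_append(const void *buf, int len)
    {
        memcpy(disk[log_head], buf, len);
        return log_head++;
    }

    /* Write a one-block file: data block first, then its inode, then repoint
       the inode map entry.  (Real LFS also logs the inode map itself and
       records its location in the checkpoint region.) */
    static void lfs_write_file(int inum, const char *data)
    {
        struct inode ino = { .size = (int)strlen(data) };
        ino.direct[0] = log_append(data, ino.size + 1);
        imap[inum]    = log_append(&ino, sizeof ino);
    }

    /* Read block 0 of a file: inode map -> inode -> data block. */
    static const char *lfs_read_file(int inum)
    {
        struct inode ino;
        memcpy(&ino, disk[imap[inum]], sizeof ino);   /* fetch inode from log */
        return disk[ino.direct[0]];                   /* then its data block  */
    }

    int main(void)
    {
        lfs_write_file(1, "hello");
        lfs_write_file(1, "hello again");   /* appends; old blocks become dead */
        printf("%s\n", lfs_read_file(1));   /* prints "hello again"            */
        return 0;
    }

Note that the second lfs_write_file call overwrites nothing in place: it appends a new
data block and inode and repoints the inode map entry, leaving the old blocks as dead
space for the cleaner (discussed below) to reclaim.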
What happens in a checkpoint?
    Write out all modified information: file data blocks, indirect blocks, inodes,
        inode map blocks, segment usage table blocks
    Write to the checkpoint region: addresses of all the inode map & segment usage
        table blocks, current time, pointer to the last written segment
    What if you crash while writing the checkpoint region?
    - Use two checkpoint regions; take the one with the later timestamp
        (must write the timestamp last & ensure the disk/driver does not reorder)
    When do you take a checkpoint?
        Every 30 seconds, or on sync
        Could do it by size of the log, but LFS doesn't

Crash recovery
    What needs to be done to use the FS again?
    - Need to recover the inode map and segment usage table
        Read current versions from the checkpoint region
    To lose less data, roll forward through the log to pick up subsequent changes:
    - What do you do when you find a file data block?
        If a new inode was also written, the blocks automatically become part of the file
        If not, the blocks are considered part of an incomplete write and ignored
    - How do you know when to stop?
        (The segment summary can keep a checksum of the entire segment)
    What about directory block / inode link count atomicity?
    - Write directory change entries (link/unlink/rename) to the log
    - What happens if the directory block and/or new inode was not written?
        Can complete the operation based on the change entry
        Exception: don't create new files if the inode was never written

Free space management
    Why not just keep writing into free portions of the disk?  Fragmentation
    Solution: thread the log at the segment level, with ~1 MB segments
    - Why 1 MB?  Transfer time >> seek time
    - Cleaner goes through and rewrites segments to compact them
    How do you know if a block is still live?
    - Segment summary tells you which file each block belongs to
    - Check the current inode/indirect block to see if the block is still pointed to
    - Inode version numbers save you the check when an inode has been truncated/recycled
    When and where to run the cleaner?

    What is "write cost"?
    - "Average time the disk is busy per byte of new data written"
    - Determined by the utilization of the segments being cleaned
      N = number of segments cleaned, u = segment utilization

        # bytes read & written     read segs + write live + write new
        ----------------------  =  -----------------------------------
          # bytes new data                  # bytes new data

           N + N*u + N*(1-u)        2
        =  -----------------  =  -------
                N*(1-u)           1 - u

    What's going on in Figure 3 (p. 6)?
        In steady state, must clean as many blocks as are written, so lower u means less cleaning
        But with variance among segments, LFS does better (Figure 4, p. 7)
        Want a bimodal distribution: empty segments very empty
    Can you do better than 2/(1-u)?
        Yes: if u = 0, you don't need to read the segment at all, so cost = 1
        Note for very sparse segments, maybe shouldn't read the whole segment either
    In Figure 4, why does hot-and-cold do worse?  Utilization decreases over time
    - "Cold" segments linger for a long time just above the cleaning threshold
    - A hot segment is likely to accumulate more free space soon
    - Figure 5 shows *worse* utilization for hot-and-cold than for uniform
    What's going on?  See Figure 5
        Free space in cold segments is more valuable than free space in hot segments
            Because you will end up re-cleaning the hot segment soon anyway
    How to fix?  Cost-benefit analysis
        Estimate the stability of data by its age
        - Older (more "stable") data is more likely to stay live
        Cleaning a segment requires 1 segment read + a write of the live data: cost (1+u)
        Free space gained is (1-u), so:

            benefit     free space gained * age of data     (1-u) * age
            -------  =  -------------------------------  =  -----------
             cost                    cost                        1+u

    How does this do?  See Figure 6 (p. 8)
    To support the policy, store each segment's last block write time in the segment usage table
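As a sanity check on the two formulas above, here is a short self-contained C sketch
(hypothetical structs and numbers, not code or data from the paper) that computes the
2/(1-u) write cost and uses the cost-benefit score to pick the next segment to clean.

    /* cleaner_policy.c: the write-cost and cost-benefit formulas in code. */
    #include <stdio.h>

    struct seg_usage {     /* one entry of the segment usage table (simplified) */
        double u;          /* fraction of the segment's bytes still live        */
        double age;        /* age of the youngest data in the segment           */
    };

    /* Bytes of I/O per byte of new data if cleaned segments have utilization u:
       read 1 segment, rewrite u live, leaving (1-u) for new data => 2/(1-u).
       (If u == 0 there is nothing to read or rewrite, so the cost is just 1.) */
    static double write_cost(double u)
    {
        if (u == 0.0) return 1.0;
        return 2.0 / (1.0 - u);
    }

    /* Cost-benefit score: free space gained * age, over cleaning cost (1+u). */
    static double cost_benefit(const struct seg_usage *s)
    {
        return (1.0 - s->u) * s->age / (1.0 + s->u);
    }

    /* Clean the segment with the highest cost-benefit score next. */
    static int pick_segment(const struct seg_usage *segs, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (cost_benefit(&segs[i]) > cost_benefit(&segs[best]))
                best = i;
        return best;
    }

    int main(void)
    {
        struct seg_usage segs[] = {
            { 0.75, 1000.0 },   /* cold: mostly live, but old       */
            { 0.60,   10.0 },   /* hot: emptier, but written lately */
        };
        printf("write cost at u=0.75: %.2f\n", write_cost(0.75));  /* 8.00 */
        printf("clean segment %d first\n", pick_segment(segs, 2)); /* 0    */
        return 0;
    }

Note that the mostly-live but old "cold" segment beats the emptier "hot" one (which a
greedy lowest-u policy would have chosen); that bias toward cold free space is exactly
what the cost-benefit policy is meant to introduce.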
Performance
    What benchmark is worse for LFS than FFS?
    - Random write, then sequential read
    - Is this important?
    Why do real-world workloads do better than the synthetic benchmarks?
    - Very large files get written and deleted as a whole
    - Many files are almost never written (/bin) -- very cold segments
    What benchmarks would you like to see that are not in the paper?
        LFS vs. some journaling file system w. write-ahead logging
        Effect of the cleaner on end-to-end application throughput & latency
    Can you think of other hardware/software techniques to enhance LFS?
        Read just the live bytes in a segment when cleaning
        "Hole-plugging"
        NVRAM
        Architecting low-level storage to benefit from large writes (RAID)

Discussion: Is LFS a good idea?
    How have the predicted technology trends played out since 1991?
    + More workloads are now seek-bound instead of CPU-bound
    - Disk capacity grew faster than main memory
        So cache sizes aren't big enough to eliminate read traffic
    How have file systems dealt with the problems LFS identified?
        Mostly by journaling metadata in a write-ahead log
        Soft updates (next class) are another technique
        But note LFS ideas are particularly good when combined with RAID
            Will see a successful product using similar ideas in the AutoRAID paper
    Big (at least perceived) limitation of LFS is the cleaner