Log-structured file system
==========================

What problems is this paper addressing?
 * Disk seek times not improving as quickly as CPU, transfer times, etc.
 * Larger file caches
   - Increase write-to-read bandwidth ratio
   - Potentially allows large write buffers
 * FFS bad for small file accesses (especially creates)
   - Office and engineering workloads
 * FFS requires many synchronous writes for metadata operations

How is disk layed out?
 * Superblock: #segments, segment size, etc.
 * Checkpoint region
 * Segments
   Segment summary
   - inumber, version, offset for each block
   Log entries:
   - Inode: mtime, data/indirect block locations
   - Inode map: inode locations, inode versions, atimes
   - Indirect block
   - Directory change log: link/unlink/rename
   - Segment usage table: #bytes free in segments, write times

F = file data block, D = directory data block
	| F1,1 | F1,2 | I1 | F2,5 | F2,8 | I2 | D7,1 | I7 |
            \--<---\--<-+

 * Where do you store time of last access (inode map).  Why?

Within segment, sort by inode and offset.

How to find the data for a file
 * Checkpoint region gives you inode map
 * Inode map gives you inode, which gives you file data
 * Changes after checkpoint, generally in memory

What happens in a checkpoint:
 * Write all modified files: data blocks, indirect blocks, inodes,
     inode map table, segment usage table.
 * Write to checkpoint region: address of all blocks in inode map &
     segment usage table, current time, pointer to last written segment
 * What if you crash while writing checkpoint region?
     - Use two checkpoint regions, take one with later time stamp
       (must write timestamp last)

When do you take a checkpoint?
 * Every 30 seconds, or on sync
 * Could do it by size of log, but LFS doesn't

Crash recovery
 * What needs to be done to use FS again?
   - Need to recover inode map, segment usage table
 * Read current versions from checkpoint region
 * Roll-forward to get subsequent changes
   - What do you do when you find a file data block?
       If new inode is written, then they are automatically part of file
       If not, file blocks considered part of an incomplete write and ignored
   - How do you know when to stop?
     (segment summary can keep a checksum of the entire segment)
 * What about directory block / inode link count atomicity?
   - Write directory change entries to the log
   - What happens if directory block and/or new inode not written?
       Can complete the operation based on change entry
       Except:  Don't create new files if inode was never written

Free space management
 * Why not just keep writing into free portions of disk? Fragmentation
 * Solution:  Thread on a segment level for ~1Meg segments
   - Why 1 Meg?  Transfer time >> seek time
   - Cleaner goes through and rewrites segments to compact them
 * How do you know if a block is still live?
   - Segment summary tells you what files the blocks belong to
   - Check current inode/indir block to see if block still pointed to
   - Inode version numbers save you the check when inode truncated/recycled

When to run cleaner
 * What is "write cost"?
   - "Avereage time disk is busy per byte of new data written"
   - Determined by utilization of segments being cleaned

            N = number of segments, u = segment utilization

	# bytes written      read segs + write live + write new
	---------------- =  -----------------------------------
	# bytes new data              #bytes new data

        N + N*u + N*(1-u)     2
      = ----------------- = -----
            N*(1-u)          1-u

 * Explain figure 3
     must clean as many blocks as written, lower u is less cleaning
 * But with variance, LFS does better (Figure 4)
 * Want a bimodal distribution:  Empty segments very empty
 * Can you do better than 2/1-u?  Yes if u=0, don't need to read, cost=1

In figure 4, why does hot-and-cold do worse?
 * Utilization decreases over time
   - "cold" segments linger for a long time just above cleaning threshold
   - Hot segment likely to accumulate more free space soon
   - Figure 5 shows *worse* utilization for hot-and-cold than for uniform
 * How to fix?  Cost-benefit analysis.
   Estimate stability by age of data
     - Older (more "stable") data more likely to stay live
   To clean segment, requires 1 seg read + write of live data (1+u)
   Free space gained is (1-u):
        benefit    free space gained * age of data     (1-u)*age
        ------- = --------------------------------- = -----------
         cost                   cost                     1+u

 * To support policy, put last write time in segment usage table

Performance
 * What benchmark is worse for LFS than FFS
   - Random write, sequential read
   - Is this important?
 * Why does real-world do better than synthetic benchmarks
   - Very large files get written and deleted as a whole
   - Many files are almost never written (/bin) -- very cold segments

Discussion:  Is LFS a good idea?

Other techniques
 * Read just live bytes in a segment when cleaning
 * "Hole-plugging"
 * NVRAM
 * Architecting low-level storage to benefit from large writes (RAID)