Log-structured file system
==========================

What problems is this paper addressing?
* Disk seek times not improving as quickly as CPU speeds, transfer rates, etc.
* Larger file caches
  - Increase the write-to-read bandwidth ratio
  - Potentially allow large write buffers
* FFS is bad for small file accesses (especially creates)
  - Office and engineering workloads
* FFS requires many synchronous writes for metadata operations

How is the disk laid out?
* Superblock: #segments, segment size, etc.
* Checkpoint region
* Segments
  Segment summary: inumber, version, offset for each block
  Log entries:
  - Inode: mtime, data/indirect block locations
  - Inode map: inode locations, inode versions, atimes
  - Indirect block
  - Directory change log: link/unlink/rename
  - Segment usage table: #bytes free in each segment, write times

  F = file data block, D = directory data block

  | F1,1 | F1,2 | I1 | F2,5 | F2,8 | I2 | D7,1 | I7 |
      \--<---\--<-+        (I1 points back to F1,1 and F1,2)

* Where do you store the time of last access? (In the inode map.) Why?
* Within a segment, blocks are sorted by inode and offset.

How to find the data for a file
* Checkpoint region gives you the inode map
* Inode map gives you the inode, which gives you the file data
* Changes after the checkpoint are generally still in memory

What happens in a checkpoint:
* Write all modified file state: data blocks, indirect blocks, inodes, inode map, segment usage table.
* Write to the checkpoint region: addresses of all blocks in the inode map & segment usage table, current time, pointer to the last written segment
* What if you crash while writing the checkpoint region?
  - Use two checkpoint regions; take the one with the later timestamp (must write the timestamp last)

When do you take a checkpoint?
* Every 30 seconds, or on sync
* Could do it by size of log, but LFS doesn't

Crash recovery
* What needs to be done to use the FS again?
  - Need to recover the inode map and segment usage table
* Read current versions from the checkpoint region
* Roll forward through the log to pick up subsequent changes
  - What do you do when you find a file data block?
    If the new inode was also written, the blocks are automatically part of the file.
    If not, the blocks are considered part of an incomplete write and ignored.
  - How do you know when to stop? (The segment summary can keep a checksum of the entire segment.)
* What about directory block / inode link count atomicity?
  - Write directory change entries to the log
  - What happens if the directory block and/or new inode was not written?
    Roll-forward can complete the operation based on the change entry.
    Except: don't create new files if the inode was never written.

Free space management
* Why not just keep writing into free portions of the disk? Fragmentation
* Solution: thread the log through the disk at the level of ~1 MB segments
  - Why 1 MB? Transfer time >> seek time
  - A cleaner goes through and rewrites segments to compact them
* How do you know if a block is still live? (See the sketch below.)
  - The segment summary tells you which file each block belongs to
  - Check the current inode/indirect block to see if the block is still pointed to
  - Inode version numbers save you the check when the inode has been truncated/recycled
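As a concrete companion to the liveness check above, here is a minimal sketch in Python (not the paper's code; `SummaryEntry`, `inode_map`, and `block_is_live` are invented names, and the inode map is modeled as a plain dict): a block is dead if its file's inode version has changed, and otherwise live only if the current inode still points at its disk address.

```python
from dataclasses import dataclass

@dataclass
class SummaryEntry:
    """One segment-summary entry: which file block lives at a given position."""
    inum: int      # file the block belongs to
    version: int   # inode version when the block was written
    offset: int    # block offset within that file

def block_is_live(entry: SummaryEntry, block_addr: int, inode_map: dict) -> bool:
    """True if the block written at block_addr is still part of its file."""
    current = inode_map.get(entry.inum)
    if current is None:
        return False        # file no longer exists
    version, block_ptrs = current
    if version != entry.version:
        return False        # inode truncated/recycled: dead, no need to read the inode
    # Otherwise the block is live only if the current inode (or indirect
    # block) still points at this exact disk address.
    return block_ptrs.get(entry.offset) == block_addr

# Toy inode map: inumber -> (version, {file offset -> current disk address}).
inode_map = {7: (2, {0: 9000})}
print(block_is_live(SummaryEntry(7, 2, 0), 8450, inode_map))  # False: superseded copy
print(block_is_live(SummaryEntry(7, 2, 0), 9000, inode_map))  # True: current copy
print(block_is_live(SummaryEntry(7, 1, 0), 9000, inode_map))  # False: stale inode version
```

The version check is what lets the cleaner skip reading inodes for files that were deleted or truncated.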
- "Avereage time disk is busy per byte of new data written" - Determined by utilization of segments being cleaned N = number of segments, u = segment utilization # bytes written read segs + write live + write new ---------------- = ----------------------------------- # bytes new data #bytes new data N + N*u + N*(1-u) 2 = ----------------- = ----- N*(1-u) 1-u * Explain figure 3 must clean as many blocks as written, lower u is less cleaning * But with variance, LFS does better (Figure 4) * Want a bimodal distribution: Empty segments very empty * Can you do better than 2/1-u? Yes if u=0, don't need to read, cost=1 In figure 4, why does hot-and-cold do worse? * Utilization decreases over time - "cold" segments linger for a long time just above cleaning threshold - Hot segment likely to accumulate more free space soon - Figure 5 shows *worse* utilization for hot-and-cold than for uniform * How to fix? Cost-benefit analysis. Estimate stability by age of data - Older (more "stable") data more likely to stay live To clean segment, requires 1 seg read + write of live data (1+u) Free space gained is (1-u): benefit free space gained * age of data (1-u)*age ------- = --------------------------------- = ----------- cost cost 1+u * To support policy, put last write time in segment usage table Performance * What benchmark is worse for LFS than FFS - Random write, sequential read - Is this important? * Why does real-world do better than synthetic benchmarks - Very large files get written and deleted as a whole - Many files are almost never written (/bin) -- very cold segments Discussion: Is LFS a good idea? Other techniques * Read just live bytes in a segment when cleaning * "Hole-plugging" * NVRAM * Architecting low-level storage to benefit from large writes (RAID)