Log-structured file system
==========================

What problems is this paper addressing?
  Disk seek times not improving as quickly as CPU speed, transfer bandwidth, etc.
  Larger file caches
  - Increase the write-to-read bandwidth ratio
  - Potentially allow large write buffers
  FFS bad for small file accesses (especially creates)
  - Office and engineering workloads
  FFS requires many synchronous writes for metadata operations
  Crash recovery time increases linearly with disk size

How is the disk laid out?
  Superblock: #segments, segment size, etc.
  Checkpoint region
  Segments
    Segment summary
    - inumber, version, offset for each block
    Log entries:
    - Inode: mtime, data/indirect block locations
    - Inode map: inode locations, inode versions, atimes
    - Indirect block
    - Directory change log: link/unlink/rename
    - Segment usage table: #bytes free in each segment, write times

  F = file data block, D = directory data block

  | F1,1 | F1,2 | I1 | F2,5 | F2,8 | I2 | D7,1 | I7 |
                          \--<---\--<-+

  Where do you store the time of last access?  (in the inode map)  Why?
  Within a segment, sort by inode and offset.

How to find the data for a file?
  Checkpoint region gives you the inode map
  Inode map gives you the inode, which gives you the file data
  Changes after the checkpoint are generally in memory

What happens in a checkpoint?
  Write all modified files: data blocks, indirect blocks, inodes,
    inode map, segment usage table
  Write to the checkpoint region: addresses of all blocks in the inode map
    & segment usage table, current time, pointer to the last written segment
  What if you crash while writing the checkpoint region?
  - Use two checkpoint regions, take the one with the later timestamp
    (must write the timestamp last & ensure the disk/driver does not reorder)

When do you take a checkpoint?
  Every 30 seconds, or on sync
  Could do it by size of log, but LFS doesn't

Crash recovery
  What needs to be done to use the FS again?
  - Need to recover the inode map and segment usage table
    Read current versions from the checkpoint region
  To lose less data, roll forward to pick up subsequent changes:
  - What do you do when you find a file data block?
    If a new inode was written, the blocks are automatically part of the file
    If not, the blocks are considered part of an incomplete write and ignored
  - How do you know when to stop?
    (the segment summary can keep a checksum of the entire segment)
  What about directory block / inode link count atomicity?
  - Write directory change entries to the log
  - What happens if the directory block and/or new inode was not written?
    Can complete the operation based on the change entry
    Except: don't create new files if the inode was never written

Free space management
  Why not just keep writing into free portions of the disk?
    Fragmentation
  Solution: thread the log at the segment level, with ~1 MB segments
  - Why 1 MB?  Transfer time >> seek time
  - Cleaner goes through and rewrites segments to compact them
  How do you know if a block is still live?
  - Segment summary tells you which files the blocks belong to
  - Check the current inode/indirect block to see if the block is still
    pointed to
  - Inode version numbers save you the check when an inode has been
    truncated/recycled

When and where to run the cleaner?
  What is "write cost"?
  - "Average time disk is busy per byte of new data written"
  - Determined by the utilization of the segments being cleaned
    N = number of segments, u = segment utilization

      # bytes read & written     read segs + write live + write new
      ----------------------  =  -----------------------------------
         # bytes new data                  # bytes new data

         N + N*u + N*(1-u)       2
      =  -----------------  =  -----
              N*(1-u)           1-u
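
  As a quick sanity check on the derivation above, here is a small Python
  sketch (mine, not from the paper) of the steady-state write cost as a
  function of segment utilization u; the name write_cost is just an
  illustrative choice.

    # Steady-state write cost from the derivation above: cleaning reads a
    # whole segment (1), rewrites its live data (u), and the reclaimed space
    # (1-u) is filled with new data, so bytes moved per byte of new data is
    # 2/(1-u).  An entirely empty segment need not be read at all.
    def write_cost(u: float) -> float:
        """Write cost for cleaning segments with live fraction u, 0 <= u < 1."""
        if not 0.0 <= u < 1.0:
            raise ValueError("utilization must be in [0, 1)")
        if u == 0.0:
            return 1.0              # nothing to read or copy
        return 2.0 / (1.0 - u)

    if __name__ == "__main__":
        for u in (0.0, 0.2, 0.5, 0.8, 0.9):
            print(f"u = {u:.1f}  ->  write cost = {write_cost(u):.2f}")

  At u = 0.8 this already gives a write cost of 10, which is why LFS wants
  the segments it cleans to be mostly empty.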

  What's going on in Figure 3?
    Must clean as many blocks as are written; lower u means less cleaning
    But with variance across segments, LFS does better (Figure 4)
    Want a bimodal distribution: empty segments very empty
  Can you do better than 2/(1-u)?
    Yes: if u = 0 you don't need to read at all, so cost = 1
    Note for very sparse segments, maybe shouldn't read the whole segment
      either

  In Figure 4, why does hot-and-cold do worse?
    Utilization of cold segments decreases only slowly over time
    - "Cold" segments linger for a long time just above the cleaning threshold
    - A hot segment is likely to accumulate more free space soon
    - Figure 5 shows *worse* utilization for hot-and-cold than for uniform
  What's going on?  See Figure 5
    Free space in cold segments is more valuable than in hot segments,
      because you will end up re-cleaning the hot segment soon anyway
  How to fix?  Cost-benefit analysis (see the sketch at the end of these notes)
    Estimate the stability of data by its age
    - Older (more "stable") data is more likely to stay live
    Cleaning a segment requires 1 segment read + a write of the live data (1+u)
    Free space gained is (1-u):

      benefit     free space gained * age of data     (1-u)*age
      -------  =  -------------------------------  =  ---------
        cost                    cost                     1+u

    How does this do?  See Figure 6
    To support the policy, put the last write time in the segment usage table

Performance
  What benchmark is worse for LFS than FFS?
  - Random write, sequential read
  - Is this important?
  Why do real-world workloads do better than the synthetic benchmarks?
  - Very large files get written and deleted as a whole
  - Many files are almost never written (/bin) -- very cold segments
  What benchmarks would you like to see that are not in the paper?
    LFS vs. some journaling file system with write-ahead logging
    Effect of the cleaner on end-to-end application throughput & latency
  Can you think of other hardware/software techniques to enhance LFS?
    Read just the live bytes in a segment when cleaning
    "Hole-plugging"
    NVRAM
    Architecting low-level storage to benefit from large writes (RAID)

Discussion: Is LFS a good idea?
  How have the predicted technology trends played out?
  + More workloads are now seek-bound instead of CPU-bound
  - Disk capacity grew faster than main memory,
    so cache sizes aren't big enough to eliminate read traffic
  How have file systems dealt with the problems LFS identified?
    Mostly by journaling metadata in a write-ahead log
    Soft updates are another technique
    But note LFS ideas are particularly good when combined with RAID
    Will see a successful product using similar ideas in the AutoRAID paper
  Big (at least perceived) limitation of LFS is the cleaner
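
  To make the cleaner discussion concrete, here is a small Python sketch
  (my own illustration, not code from the paper) of cost-benefit segment
  selection.  The Segment record, its utilization and age fields, and
  pick_segments_to_clean are hypothetical names; age stands in for the time
  since the most recent write recorded in the segment usage table.

    # Cost-benefit cleaning policy: pick the segments maximizing
    #     benefit / cost = (1 - u) * age / (1 + u)
    # where u is the fraction of live bytes and age approximates the
    # stability of the segment's data.
    from dataclasses import dataclass

    @dataclass
    class Segment:
        id: int
        utilization: float   # fraction of bytes still live, 0 <= u <= 1
        age: float           # seconds since the last write to this segment

    def cost_benefit(seg: Segment) -> float:
        """Benefit-to-cost ratio for cleaning this segment."""
        u = seg.utilization
        # Cleaning costs one segment read plus rewriting live data (1 + u);
        # it recovers (1 - u) of a segment, weighted by the data's age.
        return (1.0 - u) * seg.age / (1.0 + u)

    def pick_segments_to_clean(segments, n):
        """Return the n most attractive segments to clean."""
        return sorted(segments, key=cost_benefit, reverse=True)[:n]

    if __name__ == "__main__":
        segs = [Segment(0, 0.90, 3600.0),    # cold and fairly full
                Segment(1, 0.75, 10.0),      # hot and emptier
                Segment(2, 0.95, 86400.0)]   # very cold, nearly full
        for s in pick_segments_to_clean(segs, 2):
            print(s.id, round(cost_benefit(s), 1))

  Note how the cold, nearly full segments win over the hot, emptier one,
  which is exactly the bias the hot-and-cold discussion above argues for.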