Quick traditional FS background
===============================

Original Unix file system
    All data stored in 512-byte blocks
    Free blocks kept in a linked list
    Disk partitioned into three regions
    - Superblock: parameters of the file system, pointer to head of free list
    - Inodes: type/mode/size, plus pointers to file/directory blocks
        For large files, has pointers to indirect, double indirect, etc. blocks
    - Data blocks: plain old file data, also indirect blocks;
        for directories, blocks of filename->inode# mappings

Limitations of Unix FS
    Never transfers more than 512 bytes per disk access
        On old disks, required at least one rotation per 512 bytes
        @3,600 RPM that means 30 KBytes/sec!
    Many factors make for long seek times, reducing throughput even further
        After a while, allocation from the free list is random, so no locality
        Inodes are far from both the directory and the file's data
        Within a directory, inodes of related files are not near each other
    [Functionality restrictions like short file names and no rename]

Berkeley Fast File System (FFS) improved the Unix FS in the 1980s
    Larger block size of at least 4K
        (uses "fragments" when the last block of a file < block size, to avoid wasted space)
    Bitmaps replace the free list
    Cylinder groups (CGs) spread inodes throughout the disk
        Try to locate files in the same directory in the same CG
        Try to allocate file blocks in the same CG as the inode
    After a crash, run the fsck (file system check) program
        Checks every inode's fields for consistency (e.g., size, data block pointers, link count)
        Checks directory structure for consistency
        Time to run is proportional to the size of the file system (non-scalable w. trends)

What happens when you create a 1K file on FFS?
    Must make at least 5 writes:
    - directory data block: to record the name of the new file
    - directory inode: to record the new modification time
    - file inode: to initialize it with metadata,
        then (possibly again) for the new file size and data block pointer
    - file data block: to store the file contents
    Worse yet, many of these writes must be synchronous!
        Much more on this next lecture
        Basic point is you have to leave the file system in a state fsck can fix
        Requires doing writes in a particular order
            (E.g., delete a file and create a new one with the same data block:
             want to make sure you don't end up w. both files cross-allocated)
        FFS uses synchronous writes to ensure ordering
            Could imagine not doing it, but next lecture we'll see it's not so simple

Log-structured file system
==========================

What problems is this paper addressing?
    Disk seek times not improving as quickly as CPU speed, transfer rates, etc.
    Larger file caches
    - Increase the write-to-read ratio of disk traffic (reads increasingly hit the cache)
    - Potentially allow large write buffers
    FFS bad for small file accesses (especially creates)
    - Office and engineering workloads
    FFS requires many synchronous writes for metadata operations
    Crash recovery time increases linearly with disk size

How is the disk laid out?
    Superblock: #segments, segment size, etc.
    Checkpoint region
    Segments
        Segment summary
        - inumber, version, offset for each block
        Log entries:
        - Inode: mtime, data/indirect block locations
        - Inode map: inode locations, inode versions, atimes
        - Indirect block
        - Directory change log: link/unlink/rename
        - Segment usage table: #bytes free in segments, write times
    Example segment (F = file data block, D = directory data block, I = inode):

        | F1,1 | F1,2 | I1 | F2,5 | F2,8 | I2 | D7,1 | I7 |
           \--<---\--<--+

    Where do you store the time of last access?  (The inode map.)  Why?
    Within a segment, sort blocks by inode and offset

How do you find the data for a file?
    Checkpoint region gives you the inode map
    Inode map gives you the inode, which gives you the file data
    Changes made after the checkpoint are generally in memory
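The lookup path is easy to see in code.  Below is a tiny self-contained C sketch of the
idea, not Sprite LFS itself: every name (disk, imap, log_append, lfs_write_file,
lfs_read_file) and every size is made up for illustration.  It models the disk as an
in-memory append-only log and follows inode map -> inode -> data block on reads.

    /* lfs_lookup.c: toy model of the LFS read path.  Compile: cc lfs_lookup.c */
    #include <stdio.h>
    #include <string.h>

    #define BLKSZ   32        /* tiny blocks so the example stays small        */
    #define NBLKS   64        /* size of the fake "disk", in blocks            */
    #define NDIRECT 4         /* direct block pointers per inode, no indirects */

    typedef int blkno_t;                     /* a disk address = block number  */

    struct inode { int size; blkno_t direct[NDIRECT]; };

    static char    disk[NBLKS][BLKSZ];       /* the log lives in these blocks  */
    static blkno_t imap[16];                 /* inode map: inum -> inode addr  */
    static blkno_t log_head = 1;             /* next free block; block 0 plays
                                                the role of a checkpoint region */

    /* Append one block to the log and return its disk address. */
    static blkno_t log_append(const void *buf, int len)
    {
        memcpy(disk[log_head], buf, len);
        return log_head++;
    }

    /* Write a one-block file: data block first, then its inode, then repoint
       the inode map entry.  (Real LFS also logs the inode map itself and
       records its location in the checkpoint region.) */
    static void lfs_write_file(int inum, const char *data)
    {
        struct inode ino = { .size = (int)strlen(data) };
        ino.direct[0] = log_append(data, ino.size + 1);
        imap[inum]    = log_append(&ino, sizeof ino);
    }

    /* Read block 0 of a file: inode map -> inode -> data block. */
    static const char *lfs_read_file(int inum)
    {
        struct inode ino;
        memcpy(&ino, disk[imap[inum]], sizeof ino);   /* fetch inode from log */
        return disk[ino.direct[0]];                   /* then its data block  */
    }

    int main(void)
    {
        lfs_write_file(1, "hello");
        lfs_write_file(1, "hello again");   /* appends; old blocks become dead */
        printf("%s\n", lfs_read_file(1));   /* prints "hello again"            */
        return 0;
    }

Note that the second lfs_write_file call overwrites nothing in place: it appends a new
data block and inode and repoints the inode map entry, leaving the old blocks as dead
space for the cleaner (discussed below) to reclaim.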
What happens in a checkpoint?
    Write out all modified information: file data blocks, indirect blocks, inodes,
        inode map blocks, segment usage table blocks
    Write to the checkpoint region: addresses of all the inode map & segment usage
        table blocks, current time, pointer to the last written segment
    What if you crash while writing the checkpoint region?
    - Use two checkpoint regions; take the one with the later timestamp
        (must write the timestamp last & ensure the disk/driver does not reorder)
    When do you take a checkpoint?
        Every 30 seconds, or on sync
        Could do it by size of the log, but LFS doesn't

Crash recovery
    What needs to be done to use the FS again?
    - Need to recover the inode map and segment usage table
        Read current versions from the checkpoint region
    To lose less data, roll forward through the log to pick up subsequent changes:
    - What do you do when you find a file data block?
        If a new inode was also written, the blocks automatically become part of the file
        If not, the blocks are considered part of an incomplete write and ignored
    - How do you know when to stop?
        (The segment summary can keep a checksum of the entire segment)
    What about directory block / inode link count atomicity?
    - Write directory change entries (link/unlink/rename) to the log
    - What happens if the directory block and/or new inode was not written?
        Can complete the operation based on the change entry
        Exception: don't create new files if the inode was never written

Free space management
    Why not just keep writing into free portions of the disk?  Fragmentation
    Solution: thread the log at the segment level, with ~1 MB segments
    - Why 1 MB?  Transfer time >> seek time
    - Cleaner goes through and rewrites segments to compact them
    How do you know if a block is still live?
    - Segment summary tells you which file each block belongs to
    - Check the current inode/indirect block to see if the block is still pointed to
    - Inode version numbers save you the check when an inode has been truncated/recycled
    When and where to run the cleaner?

    What is "write cost"?
    - "Average time the disk is busy per byte of new data written"
    - Determined by the utilization of the segments being cleaned
      N = number of segments cleaned, u = segment utilization

        # bytes read & written     read segs + write live + write new
        ----------------------  =  -----------------------------------
          # bytes new data                  # bytes new data

           N + N*u + N*(1-u)        2
        =  -----------------  =  -------
                N*(1-u)           1 - u

    What's going on in Figure 3 (p. 6)?
        In steady state, must clean as many blocks as are written, so lower u means less cleaning
        But with variance among segments, LFS does better (Figure 4, p. 7)
        Want a bimodal distribution: empty segments very empty
    Can you do better than 2/(1-u)?
        Yes: if u = 0, you don't need to read the segment at all, so cost = 1
        Note for very sparse segments, maybe shouldn't read the whole segment either
    In Figure 4, why does hot-and-cold do worse?  Utilization decreases over time
    - "Cold" segments linger for a long time just above the cleaning threshold
    - A hot segment is likely to accumulate more free space soon
    - Figure 5 shows *worse* utilization for hot-and-cold than for uniform
    What's going on?  See Figure 5
        Free space in cold segments is more valuable than free space in hot segments
            Because you will end up re-cleaning the hot segment soon anyway
    How to fix?  Cost-benefit analysis
        Estimate the stability of data by its age
        - Older (more "stable") data is more likely to stay live
        Cleaning a segment requires 1 segment read + a write of the live data: cost (1+u)
        Free space gained is (1-u), so:

            benefit     free space gained * age of data     (1-u) * age
            -------  =  -------------------------------  =  -----------
             cost                    cost                        1+u

    How does this do?  See Figure 6 (p. 8)
    To support the policy, store each segment's last block write time in the segment usage table
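As a sanity check on the two formulas above, here is a short self-contained C sketch
(hypothetical structs and numbers, not code or data from the paper) that computes the
2/(1-u) write cost and uses the cost-benefit score to pick the next segment to clean.

    /* cleaner_policy.c: the write-cost and cost-benefit formulas in code. */
    #include <stdio.h>

    struct seg_usage {     /* one entry of the segment usage table (simplified) */
        double u;          /* fraction of the segment's bytes still live        */
        double age;        /* age of the youngest data in the segment           */
    };

    /* Bytes of I/O per byte of new data if cleaned segments have utilization u:
       read 1 segment, rewrite u live, leaving (1-u) for new data => 2/(1-u).
       (If u == 0 there is nothing to read or rewrite, so the cost is just 1.) */
    static double write_cost(double u)
    {
        if (u == 0.0) return 1.0;
        return 2.0 / (1.0 - u);
    }

    /* Cost-benefit score: free space gained * age, over cleaning cost (1+u). */
    static double cost_benefit(const struct seg_usage *s)
    {
        return (1.0 - s->u) * s->age / (1.0 + s->u);
    }

    /* Clean the segment with the highest cost-benefit score next. */
    static int pick_segment(const struct seg_usage *segs, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (cost_benefit(&segs[i]) > cost_benefit(&segs[best]))
                best = i;
        return best;
    }

    int main(void)
    {
        struct seg_usage segs[] = {
            { 0.75, 1000.0 },   /* cold: mostly live, but old       */
            { 0.60,   10.0 },   /* hot: emptier, but written lately */
        };
        printf("write cost at u=0.75: %.2f\n", write_cost(0.75));  /* 8.00 */
        printf("clean segment %d first\n", pick_segment(segs, 2)); /* 0    */
        return 0;
    }

Note that the mostly-live but old "cold" segment beats the emptier "hot" one (which a
greedy lowest-u policy would have chosen); that bias toward cold free space is exactly
what the cost-benefit policy is meant to introduce.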
Performance
    What benchmark is worse for LFS than FFS?
    - Random write, then sequential read
    - Is this important?
    Why do real-world workloads do better than the synthetic benchmarks?
    - Very large files get written and deleted as a whole
    - Many files are almost never written (/bin) -- very cold segments
    What benchmarks would you like to see that are not in the paper?
        LFS vs. some journaling file system w. write-ahead logging
        Effect of the cleaner on end-to-end application throughput & latency
    Can you think of other hardware/software techniques to enhance LFS?
        Read just the live bytes in a segment when cleaning
        "Hole-plugging"
        NVRAM
        Architecting low-level storage to benefit from large writes (RAID)

Discussion: Is LFS a good idea?
    How have the predicted technology trends played out since 1991?
    + More workloads are now seek-bound instead of CPU-bound
    - Disk capacity grew faster than main memory
        So cache sizes aren't big enough to eliminate read traffic
    How have file systems dealt with the problems LFS identified?
        Mostly by journaling metadata in a write-ahead log
        Soft updates (next class) are another technique
        But note LFS ideas are particularly good when combined with RAID
            Will see a successful product using similar ideas in the AutoRAID paper
    Big (at least perceived) limitation of LFS is the cleaner