A Fast File System for UNIX
===========================

What was the original Unix file system like?
    Each FS breaks the partition into three regions:
        Superblock (parameters of file system, free pointer)
        Inodes -- type/mode/size + pointers to data blocks
        File and directory data blocks
    All data blocks are 512 bytes
    Free blocks kept in a linked list

What's in an inode?
    Metadata -- type (file/dir/device), owner/group, permissions, times, size, ...
    Pointers to data blocks (5-13)
    Pointers to indirect and double indirect blocks

What was wrong with the original Unix FS?
    FS never transfers more than 512 bytes per disk access
    After a while, allocation is essentially random
        Requires a random seek every 512 bytes of file data
    Inodes far from both directory data and file data
    Within a directory, inodes not near each other
    Usability problems:
        File names limited to 14 characters
        No way to update a file atomically and guarantee it exists after a crash

How does FFS allocate file blocks?
    New block size must be at least 4K
        To avoid wasting space, use "fragments" for the ends of files
    Cylinder groups spread inodes around the disk, keep them near their data
    Bitmaps replace the free list
    FS reserves space to improve allocation
        Tunable parameter, default 10%
        Only the superuser can use space when the FS is over 90% full

What's in the FFS superblock?
    File system parameters
        Disk characteristics, block size, CG info
        Information necessary to find an inode given its i-number
    Where is it replicated?
        Once per cylinder group
        At shifting offsets, so as to span multiple platters
    How to find the replicas if the first superblock is lost?
        Contains a magic number -- scan for blocks with the number in the right place
    Superblock also contains non-replicated "summary info."  What's this?
        # blocks, fragments, inodes, directories
        Flag stating whether the FS was cleanly unmounted

What are cylinder groups?
    Group related inodes and their data
    Contain a number of inodes (set when the FS is created)
        Default one inode per 2K of data
    Contain file and directory blocks
    Contain bookkeeping information
        Block map -- bit map of available fragments
        Summary info within the CG -- # free inodes, blocks/frags, files, directories
        # free blocks by rotational position (8 positions)

How are inodes allocated?
    Allocate inodes in the same CG as their directory if possible
    New directories put in new cylinder groups
        Consider CGs with a greater than average # of free inodes
        Choose the CG with the smallest # of directories
    Within a CG, inodes allocated randomly (next free)
        Would like related inodes as close as possible
        OK, because one CG doesn't have that many inodes

How do fragments work?
    Allocate space when the user writes beyond the end of the file
    Want the last block to be a fragment if it is not full-size
        If already a fragment, it may contain space for the write -- done
        Else, must deallocate any existing fragment and allocate a new one
        If no appropriate free fragments, break up a full block
    Problem: slow for many small writes
    (Partial) solution: new stat struct field st_blksize
        Tells applications the file system block size
        stdio library can buffer this much data

How does block allocation work?
    Goal: optimize for sequential access
    If available, use a rotationally close block in the same cylinder
    Otherwise, use a block in the same CG
    If the CG is totally full, find another CG with quadratic hashing.  What's this?
        Add 1, 4, 9, 16, etc. to the cylinder group # -- why is this good?
        (See the sketch below)
    Otherwise, search all CGs for some free space
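
A minimal sketch of the probe order just described -- not actual FFS kernel
code.  The names struct cg_summary and find_free_cg are made up, and the
per-CG free counts stand in for the superblock's summary info:

    /*
     * Hypothetical sketch of the quadratic probe over cylinder groups.
     */
    struct cg_summary {
        int ncg;            /* number of cylinder groups in the FS */
        int *free_blocks;   /* free-block count for each CG */
    };

    static int
    find_free_cg(const struct cg_summary *s, int preferred)
    {
        int i, cg;

        /* First choice: the preferred cylinder group itself. */
        if (s->free_blocks[preferred] > 0)
            return preferred;

        /* Quadratic probe: try preferred + 1, + 4, + 9, + 16, ... (mod ncg),
         * so successive probes jump to progressively more distant groups. */
        for (i = 1; i * i < s->ncg; i++) {
            cg = (preferred + i * i) % s->ncg;
            if (s->free_blocks[cg] > 0)
                return cg;
        }

        /* Last resort: exhaustive scan of every cylinder group. */
        for (cg = 0; cg < s->ncg; cg++)
            if (s->free_blocks[cg] > 0)
                return cg;

        return -1;          /* no free blocks anywhere */
    }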

How to avoid one file filling up a whole CG?
    Would make other inodes have their data very far away
    Solution: break big files over many CGs
        Want large extents per CG to minimize seek cost for sequential access
    First break at 48KB -- why?
        Because that's when the first indirect block kicks in
    Then break every 1MB -- why?
        1 MB transfer time >> seek time

Directories
    Inodes like files, but with different type bits
    Contents treated as 512-byte chunks
    Each chunk holds "direct" structures with:
        32-bit inumber
        16-bit size of directory entry
        8-bit file type (NEW)
        8-bit length of file name
    Coalesce when deleting
        If the first direct in a chunk is deleted, set its inumber = 0
    Periodically compact directory chunks

Updating FFS for the 90s
    No longer want to assume rotational delay
        With disk caches, want data contiguously allocated
    Solution: cluster writes
        FS delays writing a block back in order to accumulate more blocks
        Accumulates blocks into 64K clusters, written all at once
    Allocation of clusters similar to fragments/blocks
        Summary info: cluster map has one bit per 64K, set if it is all free
    Also read in 64K chunks when doing read-ahead

Dealing with crashes
    Suppose all data is written asynchronously
    Delete/truncate a file, append to another file, crash
        New file may reuse a block from the old one
        Old inode may not have been updated
        Cross-allocation!
        Often the inode with the older mtime is the wrong one, but can't be sure
    Append to a file, allocate an indirect block, crash
        Inode points to the indirect block
        But the indirect block may contain garbage

Ordering of updates
    Must be careful about the order of updates
        Write the new inode to disk before the directory entry
        Remove the directory name before deallocating the inode
        Write the cleared inode to disk before updating the CG free map
    Solution: make many metadata updates synchronous
        Of course, this hurts performance
        Worse yet, cannot update buffers already on the disk queue
        E.g., untar runs much slower than disk bandwidth

The Ganger rules for crash recovery:
    1. Never write a pointer before initializing the structure it points to
    2. Never reuse a resource before nullifying all pointers to it
    3. Never clear the last pointer to a live resource before setting the new one

Fixing file system corruption -- fsck
    Summary info usually bad after a crash
        Scan to check the free block map and block/inode counts
    System may have corrupt inodes (not a simple crash)
        Bad block numbers, cross-allocation, etc.
        Do a sanity check, clear inodes containing garbage
    Fields in inodes may be wrong
        Count directory entries to verify the link count
            If no entries but count > 0, move the file to lost+found
        Make sure size and used-data counts match the blocks
    Directories may be bad
        Holes illegal, . and .. must be valid, etc.
        All directories must be reachable

What happens during the following operations, and after a crash:

* Create a file
    - What are the operations?  (+ = always, o = sometimes)
        + Allocate/initialize inode
        o Grow directory (allocate block, change dir inode)
        + Set directory entry to point to inode
        + Update directory mtime
        + Update inode free maps
    - What can happen after a crash?  (see the sketch after this walkthrough)
        Allocated inode has no directory entry
            Link the file into the lost+found directory
        Directory entry points to an unallocated inode
            Can this happen?  Only after a hardware failure or software bug
            Can detect it if the inode is zeroed out (file type 0)
        Allocation bitmaps are wrong
            fsck must recompute all bitmaps after reboot
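
The create ordering above has a rough user-level echo: an application that
wants a newly created file to be durable flushes the file's data and inode
before flushing the directory that names it, in the spirit of Ganger rule 1.
This is a minimal sketch using standard POSIX calls, not FFS kernel code; the
directory and file names are made up, and an application can only pick its
durability points, not the file system's internal write order:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void die(const char *msg) { perror(msg); exit(1); }

    int
    main(void)
    {
        const char *dir  = "demo";            /* hypothetical directory */
        const char *path = "demo/newfile";    /* hypothetical new file */
        const char data[] = "hello, ffs\n";
        int fd, dfd;

        if (mkdir(dir, 0755) < 0 && errno != EEXIST)
            die("mkdir");

        /* Create the file and write its contents. */
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            die("open file");
        if (write(fd, data, sizeof data - 1) != (ssize_t)(sizeof data - 1))
            die("write");

        /* Flush the file's data and inode first ("initialize the structure")... */
        if (fsync(fd) < 0)
            die("fsync file");
        close(fd);

        /* ...then flush the directory, so the name pointing at the inode is
         * only considered durable after the inode itself is safely on disk. */
        dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            die("open dir");
        if (fsync(dfd) < 0)
            die("fsync dir");
        close(dfd);

        return 0;
    }

Without the directory fsync, a crash could lose the name even though the
file's contents had already reached the disk.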

* Unlink (delete) a file
    - What are the operations?
        + Clear the directory entry (coalesce with the previous entry in the chunk)
        + Clear the inode
        + Clear all of the file's blocks/fragments/indirect blocks in the bitmap
        + Clear the inode in the bitmap
    - What might happen after a crash?
        Inode has no links from any directory
            fsck will put it in lost+found
        New file contains chunks of data from the deleted file
            This actually happens with FFS
    - After a hardware failure or software bug:
        Directory points to a cleared inode
            fsck can clear the directory entry
        Two files contain the same data block
            Very bad.  Linux used to do this; fsck had to make a copy of the block!
            Can maybe use the trick of clearing the file with the older mtime
        One file's data block is another's indirect block!

* Truncate a file

* Append a block to a file

* Mkdir

* Rename (source, target)
    - What crash recovery properties do we want?
        (Target file must exist and be either the old or the new version;
         cf. the user-level sketch at the end)
    - What are the operations?
        Bump the link count on the source file, in case it is concurrently deleted
        Update the target directory (synchronously)
        Decrement the link count of the target inode
        Update the source directory
        Update source/.. if the source is a directory
        Fix the link count and update the ctime of the source inode
        Update the mtimes of both directories
    - What happens after a crash?
        Two links to the file, under both the source and target names
            fsck will not fix this
        Two links to a DIRECTORY under the source and target names
            fsck will find it -- fix based on ..?
        Old target file deleted but has a non-zero link count => lost+found, etc.

* How might you implement true atomic rename?
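
Separate from how the file system might implement rename internally,
applications get the "target is either the old or the new version" property
by writing the new version under a temporary name and letting POSIX rename()
atomically replace the target.  A minimal sketch with made-up file names:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void die(const char *msg) { perror(msg); exit(1); }

    int
    main(void)
    {
        const char *target = "config";        /* hypothetical target file */
        const char *tmp    = "config.tmp";    /* temporary for the new version */
        const char data[]  = "new contents\n";
        int fd, dfd;

        /* Write the new version to a temporary name. */
        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            die("open");
        if (write(fd, data, sizeof data - 1) != (ssize_t)(sizeof data - 1))
            die("write");
        if (fsync(fd) < 0)                    /* new version fully on disk */
            die("fsync");
        close(fd);

        /* Atomically replace the old version with the new one. */
        if (rename(tmp, target) < 0)
            die("rename");

        /* Flush the directory if the rename itself must be durable too. */
        dfd = open(".", O_RDONLY | O_DIRECTORY);
        if (dfd >= 0) {
            fsync(dfd);
            close(dfd);
        }
        return 0;
    }

The fsync before the rename matters: without it, a crash could leave the
target name pointing at an inode whose data blocks never reached the disk.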