SGI XFS ======= What are XFS' goals? Big files Big directories Many files Near-hardware disk performance, even on arrays Instant crash recovery of enormous disk arrays Petabytes of storage Motivating applications? Uncompressed video -- 30 MB/sec, 108 GB files for 1 hour Where does FFS fall short? Crash recovery time proportional to file system size 1 minute for 8 Gig -> 20 minutes for 160 Gig! Unscalable data structures Bitmaps for free blocks, 32-bit pointers Can't achieve hardware disk throughput Too many bcopies through buffer cache Too many seeks, synchronous disk I/O for metadata operations What is a disk array? Why do you want XLV? Want file system larger than one physical disk Higher throughput if you can write to multiple disks in parallel What is a B+-Tree? Contains key -> value mappings Lookup(k1) finds entry whose key is "closest" to k1 So we assume an ordering of keys And we can find nearby keys easily XFS ordering is always numeric Lookup takes log(N) Internal organization: Each node is a fixed size disk block Fairly big, so base of log(N) is big and log(N) is small A leaf node contains just sorted key/value pairs An internal node contains: P1 K1 P2 K2 ... Kn Pn+1 Kx is a key Px is the disk address of the sub-node containing keys between Kx-1 and Kx So keys are duplicated between interior and leaf nodes Algorithms make sure every node is at least half full Keep a list of data structures on whiteboard: File extent map: file offset -> start block #, # of blocks AG free blocks: start block # -> # of contiguous blocks AG free blocks: # of free blocks -> start block # Directory contents: hash(name) -> name, i-number AG i-map: starting-i-number-within-block -> block #, free bitmap for that block of i-nodes Support for large files: The file extent map Inode contains extents, not a block list Each entry: file offset, block #, # of blocks Saves space for huge files; a form of compression If small, directly in i-node; otherwise in a per-inode b-tree B-tree makes it easy to look up particular offsets What if offset we want isn't actually mentioned in b-tree? b-tree lookup(key) finds us numerically closest keys! Contiguous allocation of big extents makes reads and writes much faster Can read and write without seeking With 8k blocks and 10ms seek times: 0.8 MBytes/s if seek between each read As opposed to 10 MBytes/s for sequential I/O But how does XFS find large contiguous free regions when creating files? Straw man: Binary buddy allocator One bitmap for each power of two block size Break up large blocks if no small blocks available Drawbacks: Extents must be aligned, else two 512-byte regions != 1K region Can keep linked list of free regions Linked lists bad idea on disk (must read free block to overwrite) Linked list gives random allocation order Can scan bitmaps, but linear when not much free space XFS approach: Allocation groups What's an allocation group? 1/2 GB - 4 GB region of disk New directories put in new AGs Within directory, inodes of files put in same AG What's the point of AGs? Parallelize allocation of blocks/inodes Use 32-bit block pointers to keep tract of extents within AG Keeps size of AG data structures small (32-bit block #s, not 64) Allows independent locking of free-space data structures How does an AG differ from an FFS CG? AGs much bigger than CGs (AGs too big for seek-minimizing locality) AGs do not have fixed number of inodes Any disadvantages of AGs? 
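To make the closest-key idea concrete, here is a minimal sketch in C of looking up the disk block for a file offset in an extent map. It is illustrative only: the struct fields and function names are assumptions, and a sorted array plus binary search stands in for XFS's per-inode b-tree of extents.

  /* Sketch: extent-map lookup by file offset (illustrative, not XFS source). */
  #include <stdint.h>
  #include <stdio.h>

  struct extent {
      uint64_t file_off;    /* first file block covered by this extent */
      uint64_t disk_block;  /* first disk block of the run */
      uint64_t nblocks;     /* length of the run, in blocks */
  };

  /* Return the disk block holding file block 'off', or -1 for a hole.
   * Like the b-tree lookup, we want the entry with the largest key <= off;
   * binary search over a sorted array stands in for the b-tree descent. */
  int64_t lookup_block(const struct extent *map, int n, uint64_t off)
  {
      int lo = 0, hi = n - 1, best = -1;
      while (lo <= hi) {
          int mid = lo + (hi - lo) / 2;
          if (map[mid].file_off <= off) { best = mid; lo = mid + 1; }
          else                          { hi = mid - 1; }
      }
      if (best < 0) return -1;                          /* before the first extent */
      const struct extent *e = &map[best];
      if (off >= e->file_off + e->nblocks) return -1;   /* falls in a hole */
      return (int64_t)(e->disk_block + (off - e->file_off));
  }

  int main(void)
  {
      /* file blocks 0-99 -> disk 1000+, 200-249 -> disk 5000+, 100-199 is a hole */
      struct extent map[] = { { 0, 1000, 100 }, { 200, 5000, 50 } };
      printf("%lld %lld %lld\n",
             (long long)lookup_block(map, 2, 42),    /* 1042 */
             (long long)lookup_block(map, 2, 150),   /* -1   */
             (long long)lookup_block(map, 2, 230));  /* 5030 */
      return 0;
  }

Note how a sparse file costs nothing here: a hole is simply an offset with no covering extent.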
How does XFS find large chunks of free space in AGs?
  Goals: be able to find large runs of contiguous free space
    Be able to find it near existing blocks of an existing file
    Or near the file's i-node, for new files
  To find space of a known size:
    The AG has a b-tree mapping size -> start block #, # of free blocks
    One entry per run of free blocks (up to 2M blocks)
  To find space near the end of a file (for append),
      or near the end of an allocated chunk in a sparse file:
    The AG has a b-tree mapping block # -> # of free blocks
    One entry per run of free blocks
  You have to keep the two b-trees in sync
    Every run of free space has an entry in each

How do you know how much space a file is going to need?
  Must applications pre-declare space needs?
  No: delayed allocation: write() just affects the cache
    Wait until many write()s, then allocate, then write to disk
  Other advantages of delayed allocation:
    Short-lived files never need disk space allocated at all
    Mmapped files often written in random order in memory
      will still be written mostly contiguously to disk
    Write clustering: often find other nearby stuff to write

How does XFS support huge directories?
  What's the issue? Explain the EFS column of Figure 4
    Linear scans of FFS directories: 10,000 files takes 10x longer than 1,000
    How do apps usually deal? Use extra directories: /cache/17/17345
  XFS solution?
    Directory content held in a b-tree
      hash(name) -> name, i-number
    Why the hash? (b-tree implementation much simpler with fixed-size keys)
  Why does XFS performance plummet from 90K to 150K files?
    IRIX metadata block cache limits the memory used for directory caching
    Directory is so big that its leaf nodes won't fit in the buffer cache

What's an i-number in XFS?
  Want to avoid a fixed, pre-determined number of i-nodes, as in FFS
    Must be able to allocate new ones
    So they can't be in pre-determined positions
  Need a level of indirection to map i-number -> disk location
  Could have one b-tree for the whole i-map: i-number -> disk location
    Why doesn't XFS do that?
      Probably to avoid lock contention
      Also, to use more space-efficient 32-bit pointers to inode blocks
  Instead, one i-map per AG
    How do you know what AG an i-number is in?
      Probably 64 bits: 32 bits of AG #, 32 bits of i-number-within-AG
    For each block of i-nodes:
      starting-i-number-within-AG -> block # (within AG), bitmap of free i-nodes in that block

How does XFS allocate an inode?
  Choose an AG for the inode, similarly to cylinder groups in FFS
    New directory goes in a new AG; new file usually in the same AG as its directory
  May need to allocate a new chunk of inodes
    Or need to scan the AG i-map to find a free inode
  Is this scalable? Linear in # of inodes per AG, but at least the AG size is fixed
  Can you ever reclaim the space used for inodes?
    Maybe, if all 64 inodes in a chunk are free -- the paper doesn't say

How does XFS handle recovery?
  Lots of delicate data structures
    Don't want to scan them all after a crash/reboot
  Goal: restore the disk to *some* consistent state after reboot
    I.e. reflect exactly the operations that happened before some time T
    But preserve atomicity of transactions
    T may not be the crash time, so we may lose the last few updates
    But we will end up with consistent meta-data
  What about file contents? Why doesn't XFS care?
    Okay to lose non-fsynced data if the machine running the app crashes
    But note you do care if the file system is NFS-exported (XFS has a special mode)

Technique: write-ahead logging
  The log lives (mostly) on disk
  For every meta-data update, append a log entry:
    meta-data identifier (block or i-number), new value
    Start/end transaction markers
    Note that the log format is independent of b-tree specifics
  The tail of the log is in memory
    Flush it (sequentially) from time to time
    Delay real writes until the corresponding part of the log is flushed
  After crash/reboot: replay the log onto the disk (see the toy sketch below)
    Don't include unfinished transactions (how to detect the end of the log?)
    Note we may have missed the unflushed end of the log
  To save time during recovery, and to reclaim log space, write a "checkpoint"
    to disk indicating that the disk is consistent with the log before a certain log point
  Why can we get away with something simpler than DO-UNDO-REDO in System R?
    Operations are idempotent
    Never need to roll back an operation (no "losers")
  What are the pros/cons of writing the log asynchronously?
    Pros: faster to group multiple records at a time
      Otherwise, will waste disk rotations between adjacent log writes
    Cons: lose more after a crash; not suitable for an NFS-exported FS
  How to help the performance of very metadata-intensive workloads?
    Dedicated log disk, or put the log in NVRAM (good for NFS)
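As a toy illustration of the replay rule above (apply only transactions whose commit record made it to the on-disk log; skip the unfinished tail), here is a minimal C sketch. The record layout, marker types, and function names are assumptions for illustration, not XFS's actual on-disk log format. It also shows why redo-only recovery is enough: applying a logged new value a second time is idempotent, so there is nothing to undo.

  /* Sketch: replaying a metadata write-ahead log after a crash (illustrative). */
  #include <stdint.h>
  #include <stdio.h>

  enum rec_type { REC_BEGIN, REC_UPDATE, REC_COMMIT };

  struct log_rec {
      enum rec_type type;
      uint64_t txid;       /* transaction id */
      uint64_t meta_id;    /* e.g. block # or i-number being updated */
      uint64_t new_value;  /* new value to install */
  };

  /* Stand-in for writing the new value to its metadata location on disk. */
  static void apply_update(uint64_t meta_id, uint64_t new_value)
  {
      printf("apply: meta %llu <- %llu\n",
             (unsigned long long)meta_id, (unsigned long long)new_value);
  }

  /* For each BEGIN, look for a matching COMMIT; only then redo its updates.
   * Safe to repeat after another crash because each redo is idempotent. */
  static void replay(const struct log_rec *log, int n)
  {
      for (int i = 0; i < n; i++) {
          if (log[i].type != REC_BEGIN) continue;
          uint64_t tx = log[i].txid;
          int committed = 0;
          for (int j = i + 1; j < n; j++)
              if (log[j].type == REC_COMMIT && log[j].txid == tx) { committed = 1; break; }
          if (!committed) continue;        /* unfinished tail: skip it */
          for (int j = i + 1; j < n; j++)
              if (log[j].type == REC_UPDATE && log[j].txid == tx)
                  apply_update(log[j].meta_id, log[j].new_value);
      }
  }

  int main(void)
  {
      struct log_rec log[] = {
          { REC_BEGIN,  1, 0, 0 },
          { REC_UPDATE, 1, 100, 7 },   /* e.g. an AG free-space b-tree block */
          { REC_COMMIT, 1, 0, 0 },
          { REC_BEGIN,  2, 0, 0 },
          { REC_UPDATE, 2, 200, 9 },   /* crash before commit: not replayed */
      };
      replay(log, (int)(sizeof log / sizeof log[0]));
      return 0;
  }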
What is direct I/O?
  DMA directly from disk to user memory
    Saves bcopies through the buffer cache
    Saves fighting with the buffer cache for memory (if its policy is bad)
  Examples where you want to do this?
    B-tree data structures, where even UBM is imperfect (need LRU-2 or 2Q)
  Restrictions:
    Must read an integral number of aligned sectors
    Application is responsible for caching
    Application is responsible for generating large I/O requests
  (a small O_DIRECT read sketch appears at the end of these notes)

Large file performance
  Very fast! Even for today; try it on your Linux box
    Their FS is faster than most I/O buses...
  Do they achieve the h/w potential?
    h/w max is 60 disks * 7 MB/s per disk = 420 MB/s
    They only get 350, but they do almost as well as raw XLV
  Are they faster than EFS?
    Not much for read
    A lot for write
  Why the 1 vs. 4 thread gap in Figure 2?
    Probably can't keep an array of disks busy with one thread
    Why not? They're using "large" requests -- but smaller than the stripe?
    Why doesn't read-ahead help?
      Only 2-3 read-ahead requests, but more disks
      Maybe no read-ahead with direct I/O
  Same 1 vs. 4 thread gap in Figure 3 (writing)
    Again, direct I/O prevents write-behind?
    Shows that fixing locks to allow parallel access to a single file is good
  Why is EFS slower at create than write, but not XFS?
    They claim more efficient space allocation code
  Why is EFS close to XFS for read, but not for write?
    Maybe the disks do lots of read-ahead and caching for us?
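The "try it on your Linux box" suggestion can be made concrete with Linux's O_DIRECT flag, the usual way to get the direct I/O behaviour described above. This is a generic Linux sketch, not SGI's interface; the 4096-byte alignment constant is an assumption (the real requirement is the device's logical block size), and it illustrates the restriction that the buffer, offset, and length must all be aligned.

  /* Sketch: a single direct I/O read on Linux, bypassing the buffer cache. */
  #define _GNU_SOURCE            /* for O_DIRECT */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define ALIGN 4096             /* assumed alignment; really the logical block size */

  int main(int argc, char **argv)
  {
      if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

      int fd = open(argv[1], O_RDONLY | O_DIRECT);
      if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

      void *buf;
      if (posix_memalign(&buf, ALIGN, ALIGN) != 0) {   /* aligned user buffer */
          perror("posix_memalign");
          return 1;
      }

      /* Data is DMAed from the disk straight into buf: no bcopy through the
       * buffer cache, and no competing with the page cache for memory.
       * The application now owns caching and should issue large requests. */
      ssize_t n = read(fd, buf, ALIGN);
      if (n < 0) perror("read");
      else printf("read %zd bytes directly\n", n);

      free(buf);
      close(fd);
      return 0;
  }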