SGI XFS ======= What are XFS' goals? Big files Big directories Many files Near-hardware disk performance, even on arrays Instant crash recovery of enormous disk arrays Petabytes of storage Motivating applications? Uncompressed video -- 30 MB/sec, 108 GB files for 1 hour Where does FFS fall short? Crash recovery time proportional to file system size 1 minute for 8 Gig -> 20 minutes for 160 Gig! Unscalable data structures Bitmaps for free blocks, 32-bit pointers Can't achieve hardware disk throughput Too many bcopies through buffer cache Too many seeks, synchronous disk I/O for metadata operations What is a disk array? Why do you want XLV? Want file system larger than one physical disk Higher throughput if you can write to multiple disks in parallel What is a B+-Tree? Contains key -> value mappings Lookup(k1) finds entry whose key is "closest" to k1 So we assume an ordering of keys And we can find nearby keys easily XFS ordering is always numeric Lookup takes log(N) Internal organization: Each node is a fixed size disk block Fairly big, so base of log(N) is big and log(N) is small A leaf node contains just sorted key/value pairs An internal node contains: P1 K1 P2 K2 ... Kn Pn+1 Kx is a key Px is the disk address of the sub-node containing keys between Kx-1 and Kx So keys are duplicated between interior and leaf nodes Algorithms make sure every node is at least half full Keep a list of data structures on whiteboard: File extent map: file offset -> start block #, # of blocks AG free blocks: start block # -> # of contiguous blocks AG free blocks: # of free blocks -> start block # Directory contents: hash(name) -> name, i-number AG i-map: starting-i-number-within-block -> block #, free bitmap for that block of i-nodes Support for large files: The file extent map Inode contains extents, not a block list Each entry: file offset, block #, # of blocks Saves space for huge files; a form of compression If small, directly in i-node; otherwise in a per-inode b-tree B-tree makes it easy to look up particular offsets What if offset we want isn't actually mentioned in b-tree? b-tree lookup(key) finds us numerically closest keys! Contiguous allocation of big extents makes reads and writes much faster Can read and write without seeking With 8k blocks and 10ms seek times: 0.8 MBytes/s if seek between each read As opposed to 10 MBytes/s for sequential I/O But how does XFS find large contiguous free regions when creating files? Straw man: Binary buddy allocator One bitmap for each power of two block size Break up large blocks if no small blocks available Drawbacks: Extents must be aligned, else two 512-byte regions != 1K region Can keep linked list of free regions Linked lists bad idea on disk (must read free block to overwrite) Linked list gives random allocation order Can scan bitmaps, but linear when not much free space XFS approach: Allocation groups What's an allocation group? 1/2 GB - 4 GB region of disk New directories put in new AGs Within directory, inodes of files put in same AG What's the point of AGs? Parallelize allocation of blocks/inodes Use 32-bit block pointers to keep tract of extents within AG Keeps size of AG data structures small (32-bit block #s, not 64) Allows independent locking of free-space data structures How does an AG differ from an FFS CG? AGs much bigger than CGs (AGs too big for seek-minimizing locality) AGs do not have fixed number of inodes Any disadvantages of AGs? 
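To make the closest-key idea concrete, here is a minimal sketch in C of looking up the disk block for a file offset in an extent map. It is illustrative only: the struct fields and function names are assumptions, and a sorted array plus binary search stands in for XFS's per-inode b-tree of extents.

  /* Sketch: extent-map lookup by file offset (illustrative, not XFS source). */
  #include <stdint.h>
  #include <stdio.h>

  struct extent {
      uint64_t file_off;    /* first file block covered by this extent */
      uint64_t disk_block;  /* first disk block of the run */
      uint64_t nblocks;     /* length of the run, in blocks */
  };

  /* Return the disk block holding file block 'off', or -1 for a hole.
   * Like the b-tree lookup, we want the entry with the largest key <= off;
   * binary search over a sorted array stands in for the b-tree descent. */
  int64_t lookup_block(const struct extent *map, int n, uint64_t off)
  {
      int lo = 0, hi = n - 1, best = -1;
      while (lo <= hi) {
          int mid = lo + (hi - lo) / 2;
          if (map[mid].file_off <= off) { best = mid; lo = mid + 1; }
          else                          { hi = mid - 1; }
      }
      if (best < 0) return -1;                          /* before the first extent */
      const struct extent *e = &map[best];
      if (off >= e->file_off + e->nblocks) return -1;   /* falls in a hole */
      return (int64_t)(e->disk_block + (off - e->file_off));
  }

  int main(void)
  {
      /* file blocks 0-99 -> disk 1000+, 200-249 -> disk 5000+, 100-199 is a hole */
      struct extent map[] = { { 0, 1000, 100 }, { 200, 5000, 50 } };
      printf("%lld %lld %lld\n",
             (long long)lookup_block(map, 2, 42),    /* 1042 */
             (long long)lookup_block(map, 2, 150),   /* -1   */
             (long long)lookup_block(map, 2, 230));  /* 5030 */
      return 0;
  }

Note how a sparse file costs nothing here: a hole is simply an offset with no covering extent.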
How does XFS find large chunks of free space in AGs?
  Goals: be able to find large runs of contiguous free space
    Be able to find it near existing blocks of an existing file
    Or near the file's i-node, for new files
  To find space of a known size:
    The AG has a b-tree mapping size -> start block #, # of free blocks
    One entry per run of free blocks (up to 2M blocks)
  To find space near the end of a file (for append),
      or near the end of an allocated chunk in a sparse file:
    The AG has a b-tree mapping block # -> # of free blocks
    One entry per run of free blocks
  You have to keep the two b-trees in sync
    Every run of free space has an entry in each

How do you know how much space a file is going to need?
  Must applications pre-declare space needs?
  No: delayed allocation: write() just affects the cache
    Wait until many write()s, then allocate, then write to disk
  Other advantages of delayed allocation:
    Short-lived files never need disk space allocated at all
    Mmapped files often written in random order in memory
      will still be written mostly contiguously to disk
    Write clustering: often find other nearby stuff to write

How does XFS support huge directories?
  What's the issue? Explain the EFS column of Figure 4
    Linear scans of FFS directories: 10,000 files takes 10x longer than 1,000
    How do apps usually deal? Use extra directories: /cache/17/17345
  XFS solution?
    Directory content held in a b-tree
      hash(name) -> name, i-number
    Why the hash? (b-tree implementation much simpler with fixed-size keys)
  Why does XFS performance plummet from 90K to 150K files?
    IRIX metadata block cache limits the memory used for directory caching
    Directory is so big that its leaf nodes won't fit in the buffer cache

What's an i-number in XFS?
  Want to avoid a fixed, pre-determined number of i-nodes, as in FFS
    Must be able to allocate new ones
    So they can't be in pre-determined positions
  Need a level of indirection to map i-number -> disk location
  Could have one b-tree for the whole i-map: i-number -> disk location
    Why doesn't XFS do that?
      Probably to avoid lock contention
      Also, to use more space-efficient 32-bit pointers to inode blocks
  Instead, one i-map per AG
    How do you know what AG an i-number is in?
      Probably 64 bits: 32 bits of AG #, 32 bits of i-number-within-AG
    For each block of i-nodes:
      starting-i-number-within-AG -> block # (within AG), bitmap of free i-nodes in that block

How does XFS allocate an inode?
  Choose an AG for the inode, similarly to cylinder groups in FFS
    New directory goes in a new AG; new file usually in the same AG as its directory
  May need to allocate a new chunk of inodes
    Or need to scan the AG i-map to find a free inode
  Is this scalable? Linear in # of inodes per AG, but at least the AG size is fixed
  Can you ever reclaim the space used for inodes?
    Maybe, if all 64 inodes in a chunk are free -- the paper doesn't say

How does XFS handle recovery?
  Lots of delicate data structures
    Don't want to scan them all after a crash/reboot
  Goal: restore the disk to *some* consistent state after reboot
    I.e. reflect exactly the operations that happened before some time T
    But preserve atomicity of transactions
    T may not be the crash time, so we may lose the last few updates
    But we will end up with consistent meta-data
  What about file contents? Why doesn't XFS care?
    Okay to lose non-fsynced data if the machine running the app crashes
    But note you do care if the file system is NFS-exported (XFS has a special mode)

Technique: write-ahead logging
  The log lives (mostly) on disk
  For every meta-data update, append a log entry:
    meta-data identifier (block or i-number), new value
    Start/end transaction markers
    Note that the log format is independent of b-tree specifics
  The tail of the log is in memory
    Flush it (sequentially) from time to time
    Delay real writes until the corresponding part of the log is flushed
  After crash/reboot: replay the log onto the disk (see the toy sketch below)
    Don't include unfinished transactions (how to detect the end of the log?)
    Note we may have missed the unflushed end of the log
  To save time during recovery, and to reclaim log space, write a "checkpoint"
    to disk indicating that the disk is consistent with the log before a certain log point
  Why can we get away with something simpler than DO-UNDO-REDO in System R?
    Operations are idempotent
    Never need to roll back an operation (no "losers")
  What are the pros/cons of writing the log asynchronously?
    Pros: faster to group multiple records at a time
      Otherwise, will waste disk rotations between adjacent log writes
    Cons: lose more after a crash; not suitable for an NFS-exported FS
  How to help the performance of very metadata-intensive workloads?
    Dedicated log disk, or put the log in NVRAM (good for NFS)
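As a toy illustration of the replay rule above (apply only transactions whose commit record made it to the on-disk log; skip the unfinished tail), here is a minimal C sketch. The record layout, marker types, and function names are assumptions for illustration, not XFS's actual on-disk log format. It also shows why redo-only recovery is enough: applying a logged new value a second time is idempotent, so there is nothing to undo.

  /* Sketch: replaying a metadata write-ahead log after a crash (illustrative). */
  #include <stdint.h>
  #include <stdio.h>

  enum rec_type { REC_BEGIN, REC_UPDATE, REC_COMMIT };

  struct log_rec {
      enum rec_type type;
      uint64_t txid;       /* transaction id */
      uint64_t meta_id;    /* e.g. block # or i-number being updated */
      uint64_t new_value;  /* new value to install */
  };

  /* Stand-in for writing the new value to its metadata location on disk. */
  static void apply_update(uint64_t meta_id, uint64_t new_value)
  {
      printf("apply: meta %llu <- %llu\n",
             (unsigned long long)meta_id, (unsigned long long)new_value);
  }

  /* For each BEGIN, look for a matching COMMIT; only then redo its updates.
   * Safe to repeat after another crash because each redo is idempotent. */
  static void replay(const struct log_rec *log, int n)
  {
      for (int i = 0; i < n; i++) {
          if (log[i].type != REC_BEGIN) continue;
          uint64_t tx = log[i].txid;
          int committed = 0;
          for (int j = i + 1; j < n; j++)
              if (log[j].type == REC_COMMIT && log[j].txid == tx) { committed = 1; break; }
          if (!committed) continue;        /* unfinished tail: skip it */
          for (int j = i + 1; j < n; j++)
              if (log[j].type == REC_UPDATE && log[j].txid == tx)
                  apply_update(log[j].meta_id, log[j].new_value);
      }
  }

  int main(void)
  {
      struct log_rec log[] = {
          { REC_BEGIN,  1, 0, 0 },
          { REC_UPDATE, 1, 100, 7 },   /* e.g. an AG free-space b-tree block */
          { REC_COMMIT, 1, 0, 0 },
          { REC_BEGIN,  2, 0, 0 },
          { REC_UPDATE, 2, 200, 9 },   /* crash before commit: not replayed */
      };
      replay(log, (int)(sizeof log / sizeof log[0]));
      return 0;
  }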
What is direct I/O?
  DMA directly from disk to user memory
    Saves bcopies through the buffer cache
    Saves fighting with the buffer cache for memory (if its policy is bad)
  Examples where you want to do this?
    B-tree data structures, where even UBM is imperfect (need LRU-2 or 2Q)
  Restrictions:
    Must read an integral number of aligned sectors
    Application is responsible for caching
    Application is responsible for generating large I/O requests
  (a small O_DIRECT read sketch appears at the end of these notes)

Large file performance
  Very fast! Even for today; try it on your Linux box
    Their FS is faster than most I/O buses...
  Do they achieve the h/w potential?
    h/w max is 60 disks * 7 MB/s per disk = 420 MB/s
    They only get 350, but they do almost as well as raw XLV
  Are they faster than EFS?
    Not much for read
    A lot for write
  Why the 1 vs. 4 thread gap in Figure 2?
    Probably can't keep an array of disks busy with one thread
    Why not? They're using "large" requests -- but smaller than the stripe?
    Why doesn't read-ahead help?
      Only 2-3 read-ahead requests, but more disks
      Maybe no read-ahead with direct I/O
  Same 1 vs. 4 thread gap in Figure 3 (writing)
    Again, direct I/O prevents write-behind?
    Shows that fixing locks to allow parallel access to a single file is good
  Why is EFS slower at create than write, but not XFS?
    They claim more efficient space allocation code
  Why is EFS close to XFS for read, but not for write?
    Maybe the disks do lots of read-ahead and caching for us?
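The "try it on your Linux box" suggestion can be made concrete with Linux's O_DIRECT flag, the usual way to get the direct I/O behaviour described above. This is a generic Linux sketch, not SGI's interface; the 4096-byte alignment constant is an assumption (the real requirement is the device's logical block size), and it illustrates the restriction that the buffer, offset, and length must all be aligned.

  /* Sketch: a single direct I/O read on Linux, bypassing the buffer cache. */
  #define _GNU_SOURCE            /* for O_DIRECT */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define ALIGN 4096             /* assumed alignment; really the logical block size */

  int main(int argc, char **argv)
  {
      if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

      int fd = open(argv[1], O_RDONLY | O_DIRECT);
      if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

      void *buf;
      if (posix_memalign(&buf, ALIGN, ALIGN) != 0) {   /* aligned user buffer */
          perror("posix_memalign");
          return 1;
      }

      /* Data is DMAed from the disk straight into buf: no bcopy through the
       * buffer cache, and no competing with the page cache for memory.
       * The application now owns caching and should issue large requests. */
      ssize_t n = read(fd, buf, ALIGN);
      if (n < 0) perror("read");
      else printf("read %zd bytes directly\n", n);

      free(buf);
      close(fd);
      return 0;
  }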