A Fast File System for UNIX
===========================

What was the original Unix file system like?
    Each FS breaks the partition into three regions:
        Superblock (parameters of file system, free pointer)
        Inodes -- type/mode/size + pointers to data blocks
        File and directory data blocks
    All data blocks are 512 bytes
    Free blocks kept in a linked list

What's in an inode?
    Metadata -- type (file/dir/device), owner/group, permissions, times, size, ...
    Pointers to data blocks (5-13)
    Pointers to indirect and double indirect blocks

What was wrong with the original Unix FS?
    FS never transfers more than 512 bytes per disk access
    After a while, allocation is essentially random
        Requires a random seek every 512 bytes of file data
    Inodes far from both directory data and file data
    Within a directory, inodes not near each other
    Usability problems:
        File names limited to 14 characters
        No way to update a file atomically and guarantee it exists after a crash

How does FFS allocate file blocks?
    New block size must be at least 4K
        To avoid wasting space, use "fragments" for the ends of files
    Cylinder groups spread inodes around the disk, keep them near their data
    Bitmaps replace the free list
    FS reserves space to improve allocation
        Tunable parameter, default 10%
        Only the superuser can use space when the FS is over 90% full

What's in the FFS superblock?
    File system parameters
        Disk characteristics, block size, CG info
        Information necessary to find an inode given its i-number
    Where is it replicated?
        Once per cylinder group
        At shifting offsets, so as to span multiple platters
    How to find the replicas if the first superblock is lost?
        Contains a magic number -- scan for blocks with the number in the right place
    Superblock also contains non-replicated "summary info."  What's this?
        # blocks, fragments, inodes, directories
        Flag stating whether the FS was cleanly unmounted

What are cylinder groups?
    Group related inodes and their data
    Contain a number of inodes (set when the FS is created)
        Default one inode per 2K of data
    Contain file and directory blocks
    Contain bookkeeping information
        Block map -- bit map of available fragments
        Summary info within the CG -- # free inodes, blocks/frags, files, directories
        # free blocks by rotational position (8 positions)

How are inodes allocated?
    Allocate inodes in the same CG as their directory if possible
    New directories put in new cylinder groups
        Consider CGs with a greater than average # of free inodes
        Choose the CG with the smallest # of directories
    Within a CG, inodes allocated randomly (next free)
        Would like related inodes as close as possible
        OK, because one CG doesn't have that many inodes

How do fragments work?
    Allocate space when the user writes beyond the end of the file
    Want the last block to be a fragment if it is not full-size
        If already a fragment, it may contain space for the write -- done
        Else, must deallocate any existing fragment and allocate a new one
        If no appropriate free fragments, break up a full block
    Problem: slow for many small writes
    (Partial) solution: new stat struct field st_blksize
        Tells applications the file system block size
        stdio library can buffer this much data

How does block allocation work?
    Goal: optimize for sequential access
    If available, use a rotationally close block in the same cylinder
    Otherwise, use a block in the same CG
    If the CG is totally full, find another CG with quadratic hashing.  What's this?
        Add 1, 4, 9, 16, etc. to the cylinder group # -- why is this good?
        (See the sketch below)
    Otherwise, search all CGs for some free space
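
A minimal sketch of the probe order just described -- not actual FFS kernel
code.  The names struct cg_summary and find_free_cg are made up, and the
per-CG free counts stand in for the superblock's summary info:

    /*
     * Hypothetical sketch of the quadratic probe over cylinder groups.
     */
    struct cg_summary {
        int ncg;            /* number of cylinder groups in the FS */
        int *free_blocks;   /* free-block count for each CG */
    };

    static int
    find_free_cg(const struct cg_summary *s, int preferred)
    {
        int i, cg;

        /* First choice: the preferred cylinder group itself. */
        if (s->free_blocks[preferred] > 0)
            return preferred;

        /* Quadratic probe: try preferred + 1, + 4, + 9, + 16, ... (mod ncg),
         * so successive probes jump to progressively more distant groups. */
        for (i = 1; i * i < s->ncg; i++) {
            cg = (preferred + i * i) % s->ncg;
            if (s->free_blocks[cg] > 0)
                return cg;
        }

        /* Last resort: exhaustive scan of every cylinder group. */
        for (cg = 0; cg < s->ncg; cg++)
            if (s->free_blocks[cg] > 0)
                return cg;

        return -1;          /* no free blocks anywhere */
    }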

How to avoid one file filling up a whole CG?
    Would make other inodes have their data very far away
    Solution: break big files over many CGs
        Want large extents per CG to minimize seek cost for sequential access
    First break at 48KB -- why?
        Because that's when the first indirect block kicks in
    Then break every 1MB -- why?
        1 MB transfer time >> seek time

Directories
    Inodes like files, but with different type bits
    Contents treated as 512-byte chunks
    Each chunk holds "direct" structures with:
        32-bit inumber
        16-bit size of directory entry
        8-bit file type (NEW)
        8-bit length of file name
    Coalesce when deleting
        If the first direct in a chunk is deleted, set its inumber = 0
    Periodically compact directory chunks

Updating FFS for the 90s
    No longer want to assume rotational delay
        With disk caches, want data contiguously allocated
    Solution: cluster writes
        FS delays writing a block back in order to accumulate more blocks
        Accumulates blocks into 64K clusters, written all at once
    Allocation of clusters similar to fragments/blocks
        Summary info: cluster map has one bit per 64K, set if it is all free
    Also read in 64K chunks when doing read-ahead

Dealing with crashes
    Suppose all data is written asynchronously
    Delete/truncate a file, append to another file, crash
        New file may reuse a block from the old one
        Old inode may not have been updated
        Cross-allocation!
        Often the inode with the older mtime is the wrong one, but can't be sure
    Append to a file, allocate an indirect block, crash
        Inode points to the indirect block
        But the indirect block may contain garbage

Ordering of updates
    Must be careful about the order of updates
        Write the new inode to disk before the directory entry
        Remove the directory name before deallocating the inode
        Write the cleared inode to disk before updating the CG free map
    Solution: make many metadata updates synchronous
        Of course, this hurts performance
        Worse yet, cannot update buffers already on the disk queue
        E.g., untar runs much slower than disk bandwidth

The Ganger rules for crash recovery:
    1. Never write a pointer before initializing the structure it points to
    2. Never reuse a resource before nullifying all pointers to it
    3. Never clear the last pointer to a live resource before setting the new one

Fixing file system corruption -- fsck
    Summary info usually bad after a crash
        Scan to check the free block map and block/inode counts
    System may have corrupt inodes (not a simple crash)
        Bad block numbers, cross-allocation, etc.
        Do a sanity check, clear inodes containing garbage
    Fields in inodes may be wrong
        Count directory entries to verify the link count
            If no entries but count > 0, move the file to lost+found
        Make sure size and used-data counts match the blocks
    Directories may be bad
        Holes illegal, . and .. must be valid, etc.
        All directories must be reachable

What happens during the following operations, and after a crash:

* Create a file
    - What are the operations?  (+ = always, o = sometimes)
        + Allocate/initialize inode
        o Grow directory (allocate block, change dir inode)
        + Set directory entry to point to inode
        + Update directory mtime
        + Update inode free maps
    - What can happen after a crash?  (see the sketch after this walkthrough)
        Allocated inode has no directory entry
            Link the file into the lost+found directory
        Directory entry points to an unallocated inode
            Can this happen?  Only after a hardware failure or software bug
            Can detect it if the inode is zeroed out (file type 0)
        Allocation bitmaps are wrong
            fsck must recompute all bitmaps after reboot
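
The create ordering above has a rough user-level echo: an application that
wants a newly created file to be durable flushes the file's data and inode
before flushing the directory that names it, in the spirit of Ganger rule 1.
This is a minimal sketch using standard POSIX calls, not FFS kernel code; the
directory and file names are made up, and an application can only pick its
durability points, not the file system's internal write order:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void die(const char *msg) { perror(msg); exit(1); }

    int
    main(void)
    {
        const char *dir  = "demo";            /* hypothetical directory */
        const char *path = "demo/newfile";    /* hypothetical new file */
        const char data[] = "hello, ffs\n";
        int fd, dfd;

        if (mkdir(dir, 0755) < 0 && errno != EEXIST)
            die("mkdir");

        /* Create the file and write its contents. */
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            die("open file");
        if (write(fd, data, sizeof data - 1) != (ssize_t)(sizeof data - 1))
            die("write");

        /* Flush the file's data and inode first ("initialize the structure")... */
        if (fsync(fd) < 0)
            die("fsync file");
        close(fd);

        /* ...then flush the directory, so the name pointing at the inode is
         * only considered durable after the inode itself is safely on disk. */
        dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            die("open dir");
        if (fsync(dfd) < 0)
            die("fsync dir");
        close(dfd);

        return 0;
    }

Without the directory fsync, a crash could lose the name even though the
file's contents had already reached the disk.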

* Unlink (delete) a file
    - What are the operations?
        + Clear the directory entry (coalesce with the previous entry in the chunk)
        + Clear the inode
        + Clear all of the file's blocks/fragments/indirect blocks in the bitmap
        + Clear the inode in the bitmap
    - What might happen after a crash?
        Inode has no links from any directory
            fsck will put it in lost+found
        New file contains chunks of data from the deleted file
            This actually happens with FFS
    - After a hardware failure or software bug:
        Directory points to a cleared inode
            fsck can clear the directory entry
        Two files contain the same data block
            Very bad.  Linux used to do this; fsck had to make a copy of the block!
            Can maybe use the trick of clearing the file with the older mtime
        One file's data block is another's indirect block!

* Truncate a file

* Append a block to a file

* Mkdir

* Rename (source, target)
    - What crash recovery properties do we want?
        (Target file must exist and be either the old or the new version;
         cf. the user-level sketch at the end)
    - What are the operations?
        Bump the link count on the source file, in case it is concurrently deleted
        Update the target directory (synchronously)
        Decrement the link count of the target inode
        Update the source directory
        Update source/.. if the source is a directory
        Fix the link count and update the ctime of the source inode
        Update the mtimes of both directories
    - What happens after a crash?
        Two links to the file, under both the source and target names
            fsck will not fix this
        Two links to a DIRECTORY under the source and target names
            fsck will find it -- fix based on ..?
        Old target file deleted but has a non-zero link count => lost+found, etc.

* How might you implement true atomic rename?
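
Separate from how the file system might implement rename internally,
applications get the "target is either the old or the new version" property
by writing the new version under a temporary name and letting POSIX rename()
atomically replace the target.  A minimal sketch with made-up file names:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void die(const char *msg) { perror(msg); exit(1); }

    int
    main(void)
    {
        const char *target = "config";        /* hypothetical target file */
        const char *tmp    = "config.tmp";    /* temporary for the new version */
        const char data[]  = "new contents\n";
        int fd, dfd;

        /* Write the new version to a temporary name. */
        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            die("open");
        if (write(fd, data, sizeof data - 1) != (ssize_t)(sizeof data - 1))
            die("write");
        if (fsync(fd) < 0)                    /* new version fully on disk */
            die("fsync");
        close(fd);

        /* Atomically replace the old version with the new one. */
        if (rename(tmp, target) < 0)
            die("rename");

        /* Flush the directory if the rename itself must be durable too. */
        dfd = open(".", O_RDONLY | O_DIRECTORY);
        if (dfd >= 0) {
            fsync(dfd);
            close(dfd);
        }
        return 0;
    }

The fsync before the rename matters: without it, a crash could leave the
target name pointing at an inode whose data blocks never reached the disk.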