Venti
=====

What is the motivation for this work?
  Storage capacity is growing faster than people can consume it
  People need archival storage (typically use tape backups)
  Observation:  If file system itself is archival, changes how people use FS
    No need to "clean up" to decide if data is important any more
    Because data will be archived in the file system anyway

How do traditional tape backups work?
  A "level 0" dump writes entire contents of the file system to tape
    Copy entire directory structure, plus contents of all inodes
    Operates on raw disk
      Good for performance--can back up inodes in order
      Good for atimes (won't artificially change atimes or ctimes of files)
      Potential race conditions if file system being changed
  "Level n" dump backs up files changed since most recent level i < n
    Copy entire directory structure
    But only files with more recent max (mtime, ctime)
    Note:  This is why we care about ctime changing when renaming file
  Typical backup scheme uses "towers of hanoi" algorithm
    Want to minimize both size of backups and number of restores needed
    E.g., 0/1 3 2 5 4 7 6 9 8 9 9
  Disadvantages of tape backups
    Getting almost as expensive per byte as disks!
      100 GB IDE disks ~$1.50/GB, 20 GB DDS4 tapes ~$1/GB
    Unreliable--after two years, reasonable chance of unrecoverable data
    More you use tape, less reliable--Make archival backups on virgin tape
    Speed--can read sequential data at a few MB/sec max
      Worse, seek takes very, very long time:
        Minutes to rewind tape
        Or tens of minutes to look through drawer and find another tape...
    Access control and semantics
      Only administrator can be allowed to access tape
      Means users don't have access to archives except in emergencies
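
  Sketch of the "level n" selection rule described above
    Illustrative Python only; not taken from any real dump program
    FileInfo, reference_time, and files_to_dump are made-up names
    Assumes dump history is a list of (level, time) pairs

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    mtime: float  # last change to file contents
    ctime: float  # last change to the inode (e.g., rename, chmod)


def reference_time(history, level):
    """Time of the most recent earlier dump whose level is below `level`.

    `history` is a list of (level, time) pairs.  A level 0 dump has no
    lower-level predecessor, so it gets 0.0 and therefore copies every file.
    """
    return max((t for lvl, t in history if lvl < level), default=0.0)


def files_to_dump(files, history, level):
    """Pick files whose max(mtime, ctime) is newer than the reference dump."""
    since = reference_time(history, level)
    return [f.path for f in files if max(f.mtime, f.ctime) > since]
```

  Note the ctime term:  a file renamed after the reference dump has an old
  mtime but a new ctime, so it is still picked up.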

What is Venti interface?
  Store a block
    Takes a data block + some metadata (user, type, etc.)
    Idempotent--multiple writes of same block have no effect
  Retrieve a block--takes hash, returns data block (like SFSRO)
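
  Sketch of the store/retrieve interface
    In-memory toy in Python, not the real Venti protocol
    BlockStore and its method names are made up for illustration
    Uses SHA-1 as the hash; user/type metadata omitted for brevity

```python
import hashlib


class BlockStore:
    """Toy content-addressed block store: the hash of a block is its address."""

    def __init__(self):
        self._blocks = {}  # fingerprint (20 bytes) -> block contents

    def write(self, data: bytes) -> bytes:
        """Store a block and return its fingerprint.

        Writing the same contents again finds the existing entry and
        changes nothing, which is what makes writes idempotent.
        """
        fp = hashlib.sha1(data).digest()
        self._blocks.setdefault(fp, data)
        return fp

    def read(self, fp: bytes) -> bytes:
        """Return the block for a fingerprint, checking it against the hash."""
        data = self._blocks[fp]
        if hashlib.sha1(data).digest() != fp:
            raise IOError("stored block does not match its fingerprint")
        return data
```

  Because the address is the hash, the client can verify what the server
  returns, and two clients writing identical data share one stored copy.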

How does Venti work?
  What is hardware platform?
    Single server (though Section 9 suggests distributing load)
    Big IDE RAID array for storing data blocks
    Smaller, fast SCSI disk for indexing blocks
  How is data laid out on IDE disks (see Fig 4)?
    Data is stored in a log, broken up into arenas
    Format of arena:
      Header--information identifying the arena
      Data Blocks
        Block header--fingerprint, type, size, user, wtime, encoding, esize
        Data block itself (up to 52 KB)
      Directory--list of <block-header, offset-in-arena> pairs
      Trailer--summarizes arena
        #blocks, size of log, if "sealed", then hash of whole arena
      Data blocks grow forwards, directory grows backwards from the end
    Indexed on SCSI disk
      Entries have:  fingerprint, type, size, address
      Map entries to buckets
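
  Sketch of mapping index entries to buckets
    Illustrative Python; the bucket function here is an assumption
      (first 4 bytes of the fingerprint, mod number of buckets)
    Index, insert, and lookup are made-up names
    Entries reduced to (fingerprint, address); type and size left out

```python
import hashlib
from bisect import bisect_left, insort


class Index:
    """Toy fingerprint index: fixed number of buckets, each kept sorted."""

    def __init__(self, nbuckets):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, fp):
        # Fingerprints are already uniformly distributed, so a prefix of
        # the fingerprint spreads entries evenly across buckets.
        return self.buckets[int.from_bytes(fp[:4], "big") % len(self.buckets)]

    def insert(self, fp, addr):
        """Add an entry; callers are expected to call lookup() first."""
        insort(self._bucket(fp), (fp, addr))

    def lookup(self, fp):
        """Return the log address for fp, or None if it is not indexed."""
        bucket = self._bucket(fp)
        i = bisect_left(bucket, (fp,))  # binary search within the bucket
        if i < len(bucket) and bucket[i][0] == fp:
            return bucket[i][1]
        return None
```

  One lookup touches exactly one bucket, i.e., one index-disk read, but
  consecutive lookups land in unrelated buckets.
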
  Why arenas?
    Sized so can be copied out to removable media
    Also helps with synchronization
      Want to store variable-sized blocks back-to-back
      If there is a problem, don't want to lose sync in the log
    Heavy redundancy checks maximize detection of errors
  What is the point of the block headers?
    Size is reasonably obvious, what about user & type?
      Might help finding "root" blocks if user lost hash
  What is the point of the directories?
    Redundancy, but most importantly speed of crash recovery
    If you lose index disk, need to reconstruct it from log
    Reading entire log much slower than just reading directories
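
  Sketch of rebuilding a lost index from arena directories
    Illustrative Python; Arena, rebuild_index, read_block are made-up names
    Simplified:  directory is a separate list rather than growing backwards
      from the end of the arena, and block headers are omitted

```python
import hashlib


class Arena:
    """Toy arena: an append-only log of blocks plus a directory."""

    def __init__(self):
        self.log = bytearray()  # variable-sized blocks stored back-to-back
        self.directory = []     # one (fingerprint, offset, size) per block

    def append(self, data: bytes) -> bytes:
        fp = hashlib.sha1(data).digest()
        self.directory.append((fp, len(self.log), len(data)))
        self.log += data
        return fp


def rebuild_index(arenas):
    """Recreate fingerprint -> (arena number, offset, size) from directories.

    Only the small directories are scanned; the data blocks in the log
    are never read, which is why recovery is fast.
    """
    index = {}
    for n, arena in enumerate(arenas):
        for fp, offset, size in arena.directory:
            index.setdefault(fp, (n, offset, size))
    return index


def read_block(arenas, index, fp):
    n, offset, size = index[fp]
    return bytes(arenas[n].log[offset:offset + size])
```

  Without directories, recovery would have to read every block header in
  the log and walk the variable-sized blocks one by one.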

Caching in venti
  Cache data blocks, cache hashes
  Cache buckets from SCSI index disk?
    No.  Why not?  No locality -> no point

Venti applications
  What is Vac?
    Like tar, but outputs a small file with a hash--data is in venti
    Two incremental options
      Based on modification times of files (like tar)
      Based on hashes in old vac archive
        Always guaranteed correct, and still saves bandwidth automatically
  Physical backup
    Can represent entire disk with a log N height tree of hashes
    Could just copy every block of the disk to Venti--why/why not?
      + Dirt simple
      + Can mount the backup as a read-only file system (w. simple driver)
      + Works on any file system (don't need to know layout)
      + Could do lazy restore--copy back blocks as they are needed
      - Uses a lot of bandwidth (server may already have blocks)
      - Unused FS blocks may have data from deleted temporary files
        These will consume storage space but not offer any benefits
    Optimization:  Only copy blocks that are in use
      Easy for many file systems (e.g., in FFS just look at the bitmaps)
      Use special NULL pointer value in hash tree for missing blocks
    Other optimization:  Compare block hashes before sending over (like vac)
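
    Sketch of a physical backup as a tree of hashes
      Illustrative Python; a dict stands in for the Venti server
      NULL (20 zero bytes), put, backup, restore, and fanout=4 are assumptions
      Unused blocks are passed in as None and become NULL pointers

```python
import hashlib

HASH_LEN = 20
NULL = b"\x00" * HASH_LEN  # assumed sentinel pointer for a block not in use


def put(store, data):
    """Write a block to the toy store; duplicates are not stored again."""
    fp = hashlib.sha1(data).digest()
    store.setdefault(fp, data)
    return fp


def backup(store, blocks, fanout=4):
    """Back up a non-empty list of disk blocks (None = unused block).

    Leaves are block hashes; each interior node is itself a block holding
    up to `fanout` child hashes.  Returns the root hash.
    """
    level = [NULL if b is None else put(store, b) for b in blocks]
    while len(level) > 1:
        level = [put(store, b"".join(level[i:i + fanout]))
                 for i in range(0, len(level), fanout)]
    return level[0]


def restore(store, root, nblocks, fanout=4):
    """Walk the tree from the root and return the original block list."""
    height, span = 0, 1
    while span < nblocks:
        span *= fanout
        height += 1
    level = [root]
    for _ in range(height):
        nxt = []
        for fp in level:
            node = store[fp]
            nxt += [node[i:i + HASH_LEN] for i in range(0, len(node), HASH_LEN)]
        level = nxt
    return [None if fp == NULL else store[fp] for fp in level[:nblocks]]
```

    A second backup of a mostly unchanged disk adds only the changed blocks
    plus the interior nodes on the path from each changed block to the root.
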
  Plan9 file system
    Used to use WORM jukebox behind a big disk cache, w. daily snapshots
    Can use Venti instead of WORM jukebox

Reliability and recovery
  What happens when IDE disk dies?  (Use RAID 5)
  Can you do offsite backup?
    Sure, just keep pushing new parts of the log off-site
    Or could store the arenas on CDRs and ship them somewhere
  
Evaluation
  How does Venti do in terms of consuming storage?
    Compared to old plan9 file system, looks pretty good!  Why? (Table 2)
      No fragmentation--blocks are stored back-to-back
      Consolidating duplicates--same blocks in different files consolidated
      Conventional compression--aided by ability to have variable-sized blocks
  What about read/write performance (Table 1)?
    Sequential reads much slower than RAID.  Why?
      They become random
      But readahead might help, if data read in same order it was written
    Both Virgin and Duplicate writes slower than RAID.  Why?
      Always need to read index disk, even for duplicate writes!
      How to improve this?
        Can spread index over many disks
        Many spindles -> more throughput w. concurrency (but same latency)