Venti
=====

What is the motivation for this work?
  Storage capacity is growing faster than people can consume it
  People need archival storage (typically use tape backups)
  Observation: If file system itself is archival, changes how people use FS
    No need to "clean up" to decide if data is important any more
    Because data will be archived in the file system anyway

How do traditional tape backups work?
  A "level 0" dump writes entire contents of the file system to tape
    Copy entire directory structure, plus contents of all inodes
    Operates on raw disk
      Good for performance--can back up inodes in order
      Good for atimes (won't artificially change atimes or ctimes of files)
      Potential race conditions if file system being changed
  "Level n" dump backs up files changed since most recent level i < n
    Copy entire directory structure
    But only files with more recent max (mtime, ctime)
      Note: This is why we care about ctime changing when renaming file
  Typical backup scheme uses "Towers of Hanoi" algorithm
    Want to minimize both size of backups and number of restores needed
    E.g., 0/1 3 2 5 4 7 6 9 8 9 9

Disadvantages of tape backups
  Getting almost as expensive per byte as disks!
    100 GB IDE disks ~$1.50/GB, 20 GB DDS4 tapes ~$1/GB
  Unreliable--after two years, reasonable chance of unrecoverable data
    More you use tape, less reliable--Make archival backups on virgin tape
  Speed--can read sequential data at a few MB/sec max
    Worse, seek takes very, very long time: Minutes to rewind tape
    Or tens of minutes to look through drawer and find another tape...
  Access control and semantics
    Only administrator can be allowed to access tape
    Means users don't have access to archives except in emergencies

What is Venti interface?
  Store a block
    Takes a data block + some metadata (user, type, etc.)
    Idempotent--multiple writes of same block have no effect
  Retrieve a block--takes hash, returns data block (like SFSRO)

How does Venti work?
  What is hardware platform?
    Single server (though Section 9 suggests distributing load)
    Big IDE RAID array for storing data blocks
    Smaller, fast SCSI disk for indexing blocks
  How is data laid out on IDE disks (see Fig 4)?
    Data is stored in a log, broken up into arenas
    Format of arena:
      Header--information identifying the arena
      Data blocks
        Block header--fingerprint, type, size, user, wtime, encoding, esize
        Data block itself (up to 52 KB)
      Directory--list of pairs
      Trailer--summarizes arena
        #blocks, size of log, if "sealed", then hash of whole arena
      Data blocks grow forwards, directory grows backwards from the end
    Indexed on SCSI disk
      Entries have: fingerprint, type, size, address
      Map entries to buckets
  Why arenas?
    Sized so can be copied out to removable media
    Also helps with synchronization
      Want to store variable-sized blocks back-to-back
      If there is a problem, don't want to lose sync in the log
      Heavy redundancy checks maximize detection of errors
  What is the point of the block headers?
    Size is reasonably obvious, what about user & type?
    Might help finding "root" blocks if user lost hash
  What is the point of the directories?
    Redundancy, but most importantly speed of crash recovery
    If you lose index disk, need to reconstruct it from log
    Reading entire log much slower than just reading directories
  Caching in Venti
    Cache data blocks, cache hashes
    Cache buckets from SCSI index disk?  No.  Why not?
      No locality -> no point

Venti applications
  What is Vac?
    Like tar, but outputs a small file with a hash--data is in Venti
    Two incremental options
      Based on modification times of files (like tar)
      Based on hashes in old vac archive
        Always guaranteed correct, but can automatically save bandwidth
  Physical backup
    Can represent entire disk with a log N-height tree of hashes
    Could just copy every block of the disk to Venti--why/why not?
      + Dirt simple
      + Can mount the backup as a read-only file system (w.
        simple driver)
      + Works on any file system (don't need to know layout)
      + Could do lazy restore--copy back blocks as they are needed
      - Uses a lot of bandwidth (server may already have blocks)
      - Unused FS blocks may have data from deleted temporary files
        These will consume storage space but not offer any benefits
    Optimization: Only copy blocks that are in use
      Easy for many file systems (e.g., in FFS just look at the bitmaps)
      Use special NULL pointer value in hash tree for missing blocks
    Other optimization: Compare block hashes before sending over (like vac)
  Plan 9 file system
    Used to use WORM jukebox behind a big disk cache, w. daily snapshots
    Can use Venti instead of WORM jukebox

Reliability and recovery
  What happens when IDE disk dies? (use RAID 5)
  Can you do off-site backup?
    Sure, just keep pushing new parts of the log off-site
    Or could store the arenas on CDRs and ship them somewhere

Evaluation
  How does Venti do in terms of consuming storage?
    Compared to old Plan 9 file system, looks pretty good!  Why?  (Table 2)
      No fragmentation--blocks are stored back-to-back
      Consolidating duplicates--same blocks in different files consolidated
      Conventional compression--aided by ability to have variable-sized blocks
  What about read/write performance (Table 1)?
    Sequential reads much slower than RAID.  Why?
      They become random
      But readahead might help, if data read in same order it was written
    Both Virgin and Duplicate writes slower than RAID?  Why?
      Always need to read index disk, even for duplicate writes!
    How to improve this?  Can spread index over many disks
      Many spindles -> more throughput w. concurrency (but same latency)
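
Sketch: the store/retrieve interface and the log + directory + index layout from
the notes above can be illustrated with a toy in-memory model. Everything here is
an assumption made for illustration: the class name ToyVenti, the method names,
and the use of SHA-1 as the fingerprint (the notes only say "hash"). It is not
Venti's actual code or on-disk format, and it leaves out arenas, block headers,
compression, and the bucketed index.

```python
import hashlib


class ToyVenti:
    """Toy content-addressed block store (illustrative only, not Venti's format)."""

    def __init__(self):
        self.log = bytearray()   # stands in for the append-only data log
        self.directory = []      # (fingerprint, offset, size) per stored block
        self.index = {}          # fingerprint -> (offset, size); the "index disk"

    def write(self, data: bytes) -> bytes:
        """Store a block and return its fingerprint; duplicate writes add nothing."""
        fp = hashlib.sha1(data).digest()   # assumed fingerprint function
        if fp not in self.index:           # index lookup happens even for duplicates
            offset = len(self.log)
            self.log += data
            self.directory.append((fp, offset, len(data)))
            self.index[fp] = (offset, len(data))
        return fp

    def read(self, fp: bytes) -> bytes:
        """Retrieve a block by fingerprint and check it against that fingerprint."""
        offset, size = self.index[fp]
        data = bytes(self.log[offset:offset + size])
        if hashlib.sha1(data).digest() != fp:
            raise ValueError("block does not match its fingerprint")
        return data

    def rebuild_index(self):
        """Recover a lost index from the directory instead of scanning the log."""
        self.index = {fp: (off, size) for fp, off, size in self.directory}


store = ToyVenti()
fp1 = store.write(b"hello")
fp2 = store.write(b"hello")     # same block again: same fingerprint, log unchanged
fp3 = store.write(b"world!")
store.index.clear()             # simulate losing the index disk
store.rebuild_index()
```

Note that write() consults the index on every call, duplicate or not, which is
the same reason the notes give for duplicate writes being slower than raw RAID.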