FS groups: Mark Mentovai (mark), Eugene Kushnir (kuemi), Stan Sagalovskiy (ysagal), Julien Wonderlick (jsw221), Jason Lee (jasonlee), Chris Kolanovic (chris), Gennadiy Vaynshteyn (gv218)

Petal
=====

* What is Petal?
  - Network storage system
  - Provides "virtual disks" with 64-bit block addressing
  - Allocates real disk space on demand, with "decommit" to free up space
  - Split across many servers, with replication
  - Can access any block as long as one copy is available & a majority of servers are up
  - Snapshot service for easy backups
  - Incremental reconfiguration

* Why build Petal rather than a big NFS server?
  - Petal's lower-level RPC interface is simpler to implement
  - Device driver at clients can be used by arbitrary file systems/databases

* Maintain only hints in the client
  - Client sends request to its best guess at the correct server
  - If wrong, server sends back an error and the client updates its state

* Do clients cache data?
  - It's like a real disk device
    Char device not cached; block device can use the buffer cache
    Applications (or file systems) must manage caching policy
    (like NFS's use of the buffer cache)

* Kinds of redundancy supported:
  - None
  - Replication w. chained declustering
    What is chained declustering?
      Every block replicated on sequentially numbered servers
      Allows geographically diverse replication to tolerate site failures
        Put even servers at one site, odd at another site
      Servers can take load from their neighbors for load balancing

* Virtual to physical translation
  VDir - virtual directory, maps virtual disks to GMaps
  GMap - maps offset -> server
    Immutable for one epoch
  PMap - per server, maps to physical blocks
    Translates 64KB at a time

* How do snapshots work?
  Create new global map with new epoch; VDir points to new epoch
  Snapshot uses original epoch
  What is a crash-consistent snapshot? Why is that not good enough?
  What do you wait for to make a snapshot consistent at the application level?
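The epoch-based snapshot scheme above can be sketched in a few lines. This is a minimal illustration, not Petal's implementation: the class and method names are assumptions, and the "GMap" here is just a dict from offset to server, copied eagerly at snapshot time rather than copy-on-write.

```python
# Sketch of epoch-based snapshots: the VDir maps a virtual disk to a
# (GMap, epoch) pair; a GMap is immutable once its epoch ends, so a
# snapshot is simply a read of the old epoch's mapping.

class VirtualDisk:
    def __init__(self):
        self.epoch = 0
        self.gmaps = {0: {}}            # epoch -> {offset: server}

    def write_mapping(self, offset, server):
        self.gmaps[self.epoch][offset] = server

    def snapshot(self):
        """Freeze the current epoch; new writes go to a fresh GMap."""
        snap_epoch = self.epoch
        self.epoch += 1
        # new epoch starts as a copy; the old GMap stays immutable
        self.gmaps[self.epoch] = dict(self.gmaps[snap_epoch])
        return snap_epoch

    def lookup(self, offset, epoch=None):
        e = self.epoch if epoch is None else epoch
        return self.gmaps[e].get(offset)

vd = VirtualDisk()
vd.write_mapping(0, "server-A")
snap = vd.snapshot()
vd.write_mapping(0, "server-B")       # post-snapshot write
print(vd.lookup(0))                   # current epoch: server-B
print(vd.lookup(0, epoch=snap))       # snapshot still sees server-A
```

Note that this shows only why an immutable per-epoch map makes snapshots cheap; it is still only crash-consistent unless the application is quiesced first, per the questions above.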
* What happens when you add a disk?
  Local server adjusts automatically
  Should redistribute in the background, but not implemented yet

* What happens when you add a server?
  New server added to membership list
  Liveness module probes new server
  Create new GMap
  Change VDir to point to new GMap
  Redistribute data according to new GMap
  While redistributing
    Any reads for data not yet moved get forwarded to the old location
    For efficiency, reconfigure in chunks to minimize forwarded reads

* How do reads and writes work with replication?
  Reads -- either server
  Writes -- must go to primary
    Primary marks data busy, forwards to secondary
    Write completes when primary and secondary have both written
    For efficiency, use write-ahead logging w. group commit
    Clear busy bits lazily (exploits overwrites)

* What happens when a server fails?
  Two neighboring servers take over servicing its requests
  Reads -- no problem
  Writes -- must flag data as stale
    When the failed server comes back, it must update all the stale data it missed
  Load balancing shifts other reads away from the substitute servers

* Performance
  Table 1
    -- why are reads slower? (network)
    -- why are writes with NVRAM slower?
       (primary has to send request to secondary)
       (still have to write data on both disks; longer seek dominates)
  Table 2
    -- Why are reads slower?
       3/4 servers -> 3/4 performance; pretty good, actually
    -- Why are writes faster after a failure? (only write to one disk)
  What can we conclude from Figure 7?
    Looks kind-of linear, but will it scale? What is the scaling bottleneck?
    (With proper load-balancing, could scale)
  Table 3
    -- Why is mkdir so slow on UFS? (Synchronous writes)
       Do we care? (could also be bad for databases with synchronous writes)

Frangipani
==========

* Goals of paper
  - Scalable storage
    Many clients, lots of disk space
    Just add more servers and everything works
  - Simplicity of design: leverage Petal

* Why not just use AdvFS on Petal?

* How does Frangipani use Petal?
  - Figure 2 shows the architecture of the system
  - Everything stored on Petal - even local server logs!
    Couldn't servers store their own logs on local disk? (would hamper recovery)

* Frangipani exploits the large virtual address space
  - Figure 4 shows the storage layout
  - 64KB allocation size allows it to have many clusters of allocated space
  - Why are inodes 512 bytes?
    locking granularity; eliminates false sharing

* How to deal with concurrency?
  - Virtual disk segments covered by shared-read/exclusive-write locks
  - Are segments contiguous? No: one lock covers an inode and its data
  - Every server has its own address space for its log (no contention)
  - Split allocation bitmap amongst servers -- no allocation contention
  - When do you have contention? concurrent write sharing, file deletes

* What about deadlock?
  - Is it a problem? e.g. rename
  - Acquire locks in two phases
    1. Figure out what locks you need (acquiring & releasing as needed)
    2. Get all the locks in increasing block order
    3. Check that step 1 is still correct; if not, start over

* How to implement locks?
  - Centralized server? Not fault-tolerant
  - Store them in Petal? Disk writes too expensive
  - Distributed lock server, with "clerks" on each Frangipani server
  - Clerks get a lease on the lock table
  - Servers check each other with "heartbeat" messages
  - Recover a crashed server's state from the clerks

* What is the hazard if messages are delayed (3rd-to-last paragraph in sec 6)?

* What happens if you trip over the ethernet cord?
  - If the network is out for more than 30 seconds, the lease will expire
  - If there is any dirty data in the cache, the file system must be unmounted and remounted

* How to maintain order of updates?
  - Always log before updating permanent locations
  - Always write permanent locations before returning a lock

* How does recovery work?
  - Detect server failure. How? (client notices, or lock lease expires)
  - Recovery daemon gains ownership of the failed server's locks
    includes log, inodes, etc.
  - Finds beginning of log. How?
    version number decreases
  - Replays log, releases locks
  - How many Frangipani servers can fail? All, as long as Petal is still up

* How to maintain update order after a crash?
  - What if the log contains changes that were already overwritten?
    Server applied change, released lock, someone else changed it, server reacquired lock
  - Never replay the same log entry more than once
    Version number in every metadata block
  - What if a metadata block is replaced by a data block w/o a version? don't do that

* How to back up a Frangipani file system?
  - Petal offers snapshots. Is that good enough? yes -- logs are in Petal
  - Restore the entire Petal snapshot (including logs), then do crash recovery
  - What about recovering individual files? Painful; search all logs
  - What's the alternate scheme? Block everyone with a global lock; also imperfect

* Security plan
  - Export Frangipani w. another network file system protocol
  - Why not just authenticate users to the Petal servers?

* Performance
  - What are the goals?
    Good single-client performance
    Scale with the number of clients
  - Figures 5, 6: good
  - Figure 7: Why are writes slower? (must write to two Petal servers)
  - Figure 8: Why is performance so bad under contention?
    Must flush cache/write back all data when giving up a lock
    Why does readahead hurt?

* How does Frangipani compare to Zebra?
  - Very easy to implement (two months)
  - No need for a central file manager
  - Frangipani has very heavy-weight sharing
    Must flush cache after returning a lock
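The crash-recovery rule above -- never replay a log entry over a metadata block that already carries a newer version -- can be sketched as follows. The record layout and names are assumptions for illustration, not Frangipani's actual on-disk format.

```python
# Sketch of version-checked log replay: every metadata block and every
# log record carries a version number, and a record is applied only if
# its version is newer than the block's. Replaying a stale entry can
# therefore never clobber a later change made by another server.

from dataclasses import dataclass

@dataclass
class LogRecord:
    block: int       # metadata block the update targets
    version: int     # version the update would install
    data: bytes

def replay(log, blocks):
    """blocks maps block number -> (current version, contents)."""
    for rec in log:
        cur_version, _ = blocks.get(rec.block, (0, b""))
        if rec.version > cur_version:          # only apply newer updates
            blocks[rec.block] = (rec.version, rec.data)
        # else: the block was already overwritten after this record was
        # logged (lock released and reacquired), so the record is skipped

# A record with a stale version is ignored during recovery:
blocks = {7: (5, b"newer")}
replay([LogRecord(block=7, version=3, data=b"old")], blocks)
print(blocks[7])   # the stale version-3 record did not overwrite version 5
```

This also makes replay idempotent: running recovery twice over the same log applies each surviving update once, which matters if the recovery daemon itself crashes mid-replay.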