Petal
=====

* Goals of Petal
  - Let's say you want a big disk
      Bigger than the biggest SCSI disk you can buy
      Bigger than max number of max-sized disks you can put on a SCSI chain
      Bigger than total number of SCSI chains you can even fit in one computer
  - You also want scalable performance and availability
      More reliable than a single server
      Faster than the network interface of a single server
  - You may also want to manage your storage centrally
      Back up one set of disks for all your application servers

* Idea: Petal is a network storage system
    On clients, provide a kernel device that looks like a big disk
    Implemented by a collection of servers, with replication
  Features
  - Provides "virtual disks" with 64-bit block addressing
  - Can access any block as long as one copy is available & a majority of servers are up
  - Allocates real disk space on demand
      Special "decommit" operation frees up blocks you have written to
      (Should solve the problem AutoRAID had of not knowing which blocks are free)
  - Snapshot service for easy backups
  - Incremental reconfiguration -- add/remove servers and disks

* Why might Petal be more desirable than a big NFS or Zebra server?
  - Petal's lower-level RPC interface is simpler to implement
  - The device driver at clients can be used by arbitrary file systems/databases

* Petal architecture
    Maintain only hints about data location in the client
    - Client sends request to its best guess of the correct server
    - If wrong, the server sends back an error and the client updates its state
    - Gives good performance in the common case, and is always correct
    Client caching scheme same as for real disks
      Char device not cached; block device can use the buffer cache
      Applications (or file systems) must manage the caching policy
        (like NFS's use of the buffer cache)

* Two levels of redundancy supported:
  - None
  - Replication with chained declustering
    What is chained declustering?
      [Draw Figure 5, p. 4 from the Petal paper on the board]
      Every block is replicated on sequentially numbered servers
      Servers can take load from their neighbors for load balancing
      Allows geographically diverse replication to tolerate site failures
        Put even-numbered servers at one site, odd-numbered at another

* Data mapped from virtual to physical locations with three maps
    (see the translation sketch at the end of these notes)
    VDir - virtual directory; maps a virtual disk ID to a GMap
    GMap - maps <virtual disk, offset> -> server(s)
      Immutable for one epoch
    PMap - per server; maps <GMap, offset> -> physical disk & offset
      Translates 64 KB at a time
    VDir and GMap are globally replicated; the PMap is per-server

* Snapshot service
    Create a new global map with a new epoch; VDir points to the new epoch
    The snapshot uses the original epoch
    What is a crash-consistent snapshot?
    Why is that not good enough?
    What do you wait for to make a snapshot consistent at the application level?

* Adding a disk to a server
    The local server adjusts automatically
    Should redistribute load in the background, but not implemented yet

* Adding a new server
    New server added to the membership list
    Liveness module probes the new server
    Create new GMap(s)
    Change VDir to point to the new GMap
    Redistribute data according to the new GMap
    While redistributing:
      Any reads for data not yet moved get forwarded to the old location
      Problem: that would require many reads to talk to two servers
      Reconfigure in chunks to minimize forwarded reads

* Replicated reads and writes (see the write-path sketch below)
    Reads -- either server; choose the one for which the client has the shortest queue
    Writes -- must go to the primary
      Primary marks the data busy, forwards to the secondary
      Write completes when both the primary and the secondary have written
      For efficiency, use write-ahead logging with group commit
      Clear busy bits lazily (exploits overwrites)
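  Aside: a minimal sketch in Go of the write path just described: pick the
  primary and secondary under chained declustering, mark the block busy on the
  primary, copy the data to both replicas, then clear the busy bit.  All names
  here (Server, replicas, nServers, etc.) are hypothetical; the real RPC to the
  secondary, the write-ahead log, and the lazy busy-bit clearing are elided.

    // Sketch of a Petal-style replicated write under chained declustering.
    package main

    import "fmt"

    const nServers = 4

    // Under chained declustering, block group g has its primary copy on
    // server g mod n and its secondary copy on the next server in the chain.
    func replicas(group int) (primary, secondary int) {
        return group % nServers, (group + 1) % nServers
    }

    // Server holds one copy of each block it stores, plus a "busy" bit per
    // block that is set while a replicated write is in flight.
    type Server struct {
        id   int
        data map[int64][]byte
        busy map[int64]bool
    }

    func newServer(id int) *Server {
        return &Server{id: id, data: map[int64][]byte{}, busy: map[int64]bool{}}
    }

    // write runs on the primary: mark the block busy, apply locally, "forward"
    // to the secondary (a stand-in for the real RPC), then clear the busy bit.
    // Petal clears busy bits lazily; this sketch clears them right away.
    func (p *Server) write(sec *Server, blk int64, val []byte) {
        p.busy[blk] = true  // the two copies may diverge from here...
        p.data[blk] = val   // primary copy
        sec.data[blk] = val // secondary copy (would be an RPC)
        p.busy[blk] = false // ...until both copies have been written
    }

    func main() {
        servers := make([]*Server, nServers)
        for i := range servers {
            servers[i] = newServer(i)
        }
        group, blk := 2, int64(42)
        pri, sec := replicas(group)
        servers[pri].write(servers[sec], blk, []byte("hello"))
        fmt.Printf("block %d: primary=s%d secondary=s%d\n", blk, pri, sec)
    }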
* Handling server failure
    The two neighboring servers take over servicing its requests
    Reads are no problem
    Writes -- must flag the data as stale
      When the failed server comes back, it must update all the stale data it missed
    Load balancing shifts other reads away from the substitute servers

* Performance summary
    Effects of chained declustering on latency
      Reads - slower than a local disk because of the network
      Writes - slower because they must go to two disks
        First must mark the block as busy
        Then write to both mirrored copies
        The seek time of the slower disk will dominate
      Can improve things with NVRAM
        But network latency still makes it slower than a local disk
    Throughput with a failed server
      Reads: roughly 3/4 throughput with 3/4 of the servers up
        (pretty good; see the load sketch below)
      Writes get faster! (only one disk to write to)
    Application performance: good except for synchronous writes
      E.g., mkdir on UFS is bad, but on AdvFS (journaled) it is good
      Possibly bad for database workloads
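  Aside: a back-of-the-envelope model (in Go, mine, not from the paper) of why
  read throughput degrades so gracefully.  When one server fails, only its two
  neighbors hold its data, but under chained declustering each neighbor can
  shed part of its own read load to the next server down the chain, so every
  survivor ends up with n/(n-1) of its normal load instead of one server being
  stuck with double load (as with simple mirrored pairs).

    // Read-load model: n servers, one fails, and the excess load cascades
    // around the chain.  For n = 4 every survivor ends at 4/3 of its normal
    // load, so aggregate read throughput is about 3/4 of normal.
    package main

    import "fmt"

    func main() {
        const n = 4      // number of Petal servers
        const failed = 0 // index of the failed server

        load := make([]float64, n)
        for i := range load {
            load[i] = 1.0 // each server normally serves 1 unit of reads
        }

        // The failed server's reads can only go to its neighbors; push them
        // to its successor, then cascade the excess around the chain.
        load[(failed+1)%n] += load[failed]
        load[failed] = 0

        target := float64(n) / float64(n-1)
        for i := (failed + 1) % n; (i+1)%n != failed; i = (i + 1) % n {
            if excess := load[i] - target; excess > 0 {
                load[i] -= excess
                load[(i+1)%n] += excess // offload onto this server's secondary copies
            }
        }

        for i, l := range load {
            fmt.Printf("server %d: %.2f units of read load\n", i, l)
        }
    }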
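  Aside: a sketch in Go of the three-map translation described earlier in
  these notes (VDir -> GMap -> PMap), with chained-declustering placement
  folded into the GMap lookup.  Everything here (gmap, pmapKey, the sample
  entries) is hypothetical and far simpler than the real data structures.

    // Sketch of Petal's three-level address translation: VDir maps a virtual
    // disk to its GMap; the GMap maps an offset to the responsible server
    // (and its chained secondary); that server's PMap maps <GMap, offset> to
    // a physical disk and offset, translating 64 KB at a time.
    package main

    import "fmt"

    const chunk = 64 * 1024 // PMap translation granularity: 64 KB

    type gmap struct {
        id      int
        epoch   int   // a snapshot keeps the old epoch; new writes get a new one
        servers []int // replica placement by chained declustering
    }

    // locate returns the primary and secondary servers for a virtual offset.
    func (g *gmap) locate(off int64) (primary, secondary int) {
        n := len(g.servers)
        i := int(off/chunk) % n
        return g.servers[i], g.servers[(i+1)%n]
    }

    type physAddr struct{ disk, offset int64 }

    type pmapKey struct {
        gmapID int
        block  int64 // virtual offset / chunk
    }

    func main() {
        // VDir: globally replicated, maps virtual disk ID -> GMap.
        vdir := map[int]*gmap{7: {id: 1, epoch: 1, servers: []int{0, 1, 2, 3}}}

        // PMap: one per server; here just server 2's map, with one entry.
        pmaps := map[int]map[pmapKey]physAddr{
            2: {{gmapID: 1, block: 2}: {disk: 0, offset: 9 * chunk}},
        }

        vdisk, off := 7, int64(2*chunk+512)
        g := vdir[vdisk]                             // VDir lookup at the client
        pri, sec := g.locate(off)                    // GMap lookup (client's hint)
        pa := pmaps[pri][pmapKey{g.id, off / chunk}] // PMap lookup at the server
        fmt.Printf("vdisk %d off %d -> primary s%d (secondary s%d), disk %d offset %d\n",
            vdisk, off, pri, sec, pa.disk, pa.offset+off%chunk)
    }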