Petal
=====

* Goals of Petal
  - Let's say you want a big disk
      Bigger than the biggest SCSI disk you can buy
      Bigger than max number of max-sized disks you can put on a SCSI chain
      Bigger than total number of SCSI chains you can even fit in one computer
  - You also want scalable performance and availability
      More reliable than a single server
      Faster than the network interface of a single server
  - You may also want to manage your storage centrally
      Back up one set of disks for all your application servers

* Idea: Petal is a network storage system
    On clients, provide a kernel device that looks like a big disk
    Implemented by a collection of servers, with replication
  Features
  - Provides "virtual disks" with 64-bit block addressing
  - Can access any block as long as one copy is available & a majority of servers are up
  - Allocates real disk space on demand
      Special "decommit" operation frees up blocks you have written to
      (Should solve the problem AutoRAID had of not knowing which blocks are free)
  - Snapshot service for easy backups
  - Incremental reconfiguration -- add/remove servers and disks

* Why might Petal be more desirable than a big NFS or Zebra server?
  - Petal's lower-level RPC interface is simpler to implement
  - The device driver at clients can be used by arbitrary file systems/databases

* Petal architecture
    Maintain only hints about data location in the client
    - Client sends request to its best guess of the correct server
    - If wrong, the server sends back an error and the client updates its state
    - Gives good performance in the common case, and is always correct
    Client caching scheme same as for real disks
      Char device not cached; block device can use the buffer cache
      Applications (or file systems) must manage the caching policy
        (like NFS's use of the buffer cache)

* Two levels of redundancy supported:
  - None
  - Replication with chained declustering
    What is chained declustering?
      [Draw Figure 5, p. 4 from the Petal paper on the board]
      Every block is replicated on sequentially numbered servers
      Servers can take load from their neighbors for load balancing
      Allows geographically diverse replication to tolerate site failures
        Put even-numbered servers at one site, odd-numbered at another

* Data mapped from virtual to physical locations with three maps
    (see the translation sketch at the end of these notes)
    VDir - virtual directory; maps a virtual disk ID to a GMap
    GMap - maps <virtual disk, offset> -> server(s)
      Immutable for one epoch
    PMap - per server; maps <GMap, offset> -> physical disk & offset
      Translates 64 KB at a time
    VDir and GMap are globally replicated; the PMap is per-server

* Snapshot service
    Create a new global map with a new epoch; VDir points to the new epoch
    The snapshot uses the original epoch
    What is a crash-consistent snapshot?
    Why is that not good enough?
    What do you wait for to make a snapshot consistent at the application level?

* Adding a disk to a server
    The local server adjusts automatically
    Should redistribute load in the background, but not implemented yet

* Adding a new server
    New server added to the membership list
    Liveness module probes the new server
    Create new GMap(s)
    Change VDir to point to the new GMap
    Redistribute data according to the new GMap
    While redistributing:
      Any reads for data not yet moved get forwarded to the old location
      Problem: that would require many reads to talk to two servers
      Reconfigure in chunks to minimize forwarded reads

* Replicated reads and writes (see the write-path sketch below)
    Reads -- either server; choose the one for which the client has the shortest queue
    Writes -- must go to the primary
      Primary marks the data busy, forwards to the secondary
      Write completes when both the primary and the secondary have written
      For efficiency, use write-ahead logging with group commit
      Clear busy bits lazily (exploits overwrites)
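  Aside: a minimal sketch in Go of the write path just described: pick the
  primary and secondary under chained declustering, mark the block busy on the
  primary, copy the data to both replicas, then clear the busy bit.  All names
  here (Server, replicas, nServers, etc.) are hypothetical; the real RPC to the
  secondary, the write-ahead log, and the lazy busy-bit clearing are elided.

    // Sketch of a Petal-style replicated write under chained declustering.
    package main

    import "fmt"

    const nServers = 4

    // Under chained declustering, block group g has its primary copy on
    // server g mod n and its secondary copy on the next server in the chain.
    func replicas(group int) (primary, secondary int) {
        return group % nServers, (group + 1) % nServers
    }

    // Server holds one copy of each block it stores, plus a "busy" bit per
    // block that is set while a replicated write is in flight.
    type Server struct {
        id   int
        data map[int64][]byte
        busy map[int64]bool
    }

    func newServer(id int) *Server {
        return &Server{id: id, data: map[int64][]byte{}, busy: map[int64]bool{}}
    }

    // write runs on the primary: mark the block busy, apply locally, "forward"
    // to the secondary (a stand-in for the real RPC), then clear the busy bit.
    // Petal clears busy bits lazily; this sketch clears them right away.
    func (p *Server) write(sec *Server, blk int64, val []byte) {
        p.busy[blk] = true  // the two copies may diverge from here...
        p.data[blk] = val   // primary copy
        sec.data[blk] = val // secondary copy (would be an RPC)
        p.busy[blk] = false // ...until both copies have been written
    }

    func main() {
        servers := make([]*Server, nServers)
        for i := range servers {
            servers[i] = newServer(i)
        }
        group, blk := 2, int64(42)
        pri, sec := replicas(group)
        servers[pri].write(servers[sec], blk, []byte("hello"))
        fmt.Printf("block %d: primary=s%d secondary=s%d\n", blk, pri, sec)
    }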
* Handling server failure
    The two neighboring servers take over servicing its requests
    Reads are no problem
    Writes -- must flag the data as stale
      When the failed server comes back, it must update all the stale data it missed
    Load balancing shifts other reads away from the substitute servers

* Performance summary
    Effects of chained declustering on latency
      Reads - slower than a local disk because of the network
      Writes - slower because they must go to two disks
        First must mark the block as busy
        Then write to both mirrored copies
        The seek time of the slower disk will dominate
      Can improve things with NVRAM
        But network latency still makes it slower than a local disk
    Throughput with a failed server
      Reads: roughly 3/4 throughput with 3/4 of the servers up
        (pretty good; see the load sketch below)
      Writes get faster! (only one disk to write to)
    Application performance: good except for synchronous writes
      E.g., mkdir on UFS is bad, but on AdvFS (journaled) it is good
      Possibly bad for database workloads
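  Aside: a back-of-the-envelope model (in Go, mine, not from the paper) of why
  read throughput degrades so gracefully.  When one server fails, only its two
  neighbors hold its data, but under chained declustering each neighbor can
  shed part of its own read load to the next server down the chain, so every
  survivor ends up with n/(n-1) of its normal load instead of one server being
  stuck with double load (as with simple mirrored pairs).

    // Read-load model: n servers, one fails, and the excess load cascades
    // around the chain.  For n = 4 every survivor ends at 4/3 of its normal
    // load, so aggregate read throughput is about 3/4 of normal.
    package main

    import "fmt"

    func main() {
        const n = 4      // number of Petal servers
        const failed = 0 // index of the failed server

        load := make([]float64, n)
        for i := range load {
            load[i] = 1.0 // each server normally serves 1 unit of reads
        }

        // The failed server's reads can only go to its neighbors; push them
        // to its successor, then cascade the excess around the chain.
        load[(failed+1)%n] += load[failed]
        load[failed] = 0

        target := float64(n) / float64(n-1)
        for i := (failed + 1) % n; (i+1)%n != failed; i = (i + 1) % n {
            if excess := load[i] - target; excess > 0 {
                load[i] -= excess
                load[(i+1)%n] += excess // offload onto this server's secondary copies
            }
        }

        for i, l := range load {
            fmt.Printf("server %d: %.2f units of read load\n", i, l)
        }
    }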
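  Aside: a sketch in Go of the three-map translation described earlier in
  these notes (VDir -> GMap -> PMap), with chained-declustering placement
  folded into the GMap lookup.  Everything here (gmap, pmapKey, the sample
  entries) is hypothetical and far simpler than the real data structures.

    // Sketch of Petal's three-level address translation: VDir maps a virtual
    // disk to its GMap; the GMap maps an offset to the responsible server
    // (and its chained secondary); that server's PMap maps <GMap, offset> to
    // a physical disk and offset, translating 64 KB at a time.
    package main

    import "fmt"

    const chunk = 64 * 1024 // PMap translation granularity: 64 KB

    type gmap struct {
        id      int
        epoch   int   // a snapshot keeps the old epoch; new writes get a new one
        servers []int // replica placement by chained declustering
    }

    // locate returns the primary and secondary servers for a virtual offset.
    func (g *gmap) locate(off int64) (primary, secondary int) {
        n := len(g.servers)
        i := int(off/chunk) % n
        return g.servers[i], g.servers[(i+1)%n]
    }

    type physAddr struct{ disk, offset int64 }

    type pmapKey struct {
        gmapID int
        block  int64 // virtual offset / chunk
    }

    func main() {
        // VDir: globally replicated, maps virtual disk ID -> GMap.
        vdir := map[int]*gmap{7: {id: 1, epoch: 1, servers: []int{0, 1, 2, 3}}}

        // PMap: one per server; here just server 2's map, with one entry.
        pmaps := map[int]map[pmapKey]physAddr{
            2: {{gmapID: 1, block: 2}: {disk: 0, offset: 9 * chunk}},
        }

        vdisk, off := 7, int64(2*chunk+512)
        g := vdir[vdisk]                             // VDir lookup at the client
        pri, sec := g.locate(off)                    // GMap lookup (client's hint)
        pa := pmaps[pri][pmapKey{g.id, off / chunk}] // PMap lookup at the server
        fmt.Printf("vdisk %d off %d -> primary s%d (secondary s%d), disk %d offset %d\n",
            vdisk, off, pri, sec, pa.disk, pa.offset+off%chunk)
    }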