FS groups: Mark Mentovai (mark), Eugene Kushnir (kuemi), Stan Sagalovskiy (ysagal), Julien Wonderlick (jsw221), Jason Lee (jasonlee), Chris Kolanovic (chris), Gennadiy Vaynshteyn (gv218)

Petal
=====

* What is Petal?
  - Network storage system
  - Provides "virtual disks" with 64-bit block addressing
  - Allocates real disk space on demand, with "decommit" to free up space
  - Split across many servers, with replication
  - Can access any block as long as one copy is available & a majority of servers are up
  - Snapshot service for easy backups
  - Incremental reconfiguration

* Why build Petal rather than a big NFS server?
  - Petal's lower-level RPC interface is simpler to implement
  - Device driver at clients can be used by arbitrary file systems/databases

* Maintain only hints in the client
  - Client sends request to its best guess at the correct server
  - If wrong, server sends back an error and the client updates its state

* Do clients cache data?
  - It's like a real disk device
    Char device not cached; block device can use the buffer cache
    Applications (or file systems) must manage caching policy
    (like NFS's use of the buffer cache)

* Kinds of redundancy supported:
  - None
  - Replication w. chained declustering
    What is chained declustering?
      Every block replicated on sequentially numbered servers
      Allows geographically diverse replication to tolerate site failures
        Put even servers at one site, odd at another site
      Servers can take load from their neighbors for load balancing

* Virtual to physical translation
  VDir - virtual directory, maps virtual disks to GMaps
  GMap - maps offset -> server
    Immutable for one epoch
  PMap - per server, maps to physical blocks
    Translates 64KB at a time

* How do snapshots work?
  Create new global map with new epoch; VDir points to new epoch
  Snapshot uses original epoch
  What is a crash-consistent snapshot? Why is that not good enough?
  What do you wait for to make a snapshot consistent at the application level?
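The epoch-based snapshot scheme above can be sketched in a few lines. This is a minimal illustration, not Petal's implementation: the class and method names are assumptions, and the "GMap" here is just a dict from offset to server, copied eagerly at snapshot time rather than copy-on-write.

```python
# Sketch of epoch-based snapshots: the VDir maps a virtual disk to a
# (GMap, epoch) pair; a GMap is immutable once its epoch ends, so a
# snapshot is simply a read of the old epoch's mapping.

class VirtualDisk:
    def __init__(self):
        self.epoch = 0
        self.gmaps = {0: {}}            # epoch -> {offset: server}

    def write_mapping(self, offset, server):
        self.gmaps[self.epoch][offset] = server

    def snapshot(self):
        """Freeze the current epoch; new writes go to a fresh GMap."""
        snap_epoch = self.epoch
        self.epoch += 1
        # new epoch starts as a copy; the old GMap stays immutable
        self.gmaps[self.epoch] = dict(self.gmaps[snap_epoch])
        return snap_epoch

    def lookup(self, offset, epoch=None):
        e = self.epoch if epoch is None else epoch
        return self.gmaps[e].get(offset)

vd = VirtualDisk()
vd.write_mapping(0, "server-A")
snap = vd.snapshot()
vd.write_mapping(0, "server-B")       # post-snapshot write
print(vd.lookup(0))                   # current epoch: server-B
print(vd.lookup(0, epoch=snap))       # snapshot still sees server-A
```

Note that this shows only why an immutable per-epoch map makes snapshots cheap; it is still only crash-consistent unless the application is quiesced first, per the questions above.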
* What happens when you add a disk?
  Local server adjusts automatically
  Should redistribute in the background, but not implemented yet

* What happens when you add a server?
  New server added to membership list
  Liveness module probes new server
  Create new GMap
  Change VDir to point to new GMap
  Redistribute data according to new GMap
  While redistributing
    Any reads for data not yet moved get forwarded to the old location
    For efficiency, reconfigure in chunks to minimize forwarded reads

* How do reads and writes work with replication?
  Reads -- either server
  Writes -- must go to primary
    Primary marks data busy, forwards to secondary
    Write completes when primary and secondary have both written
    For efficiency, use write-ahead logging w. group commit
    Clear busy bits lazily (exploits overwrites)

* What happens when a server fails?
  Two neighboring servers take over servicing its requests
  Reads -- no problem
  Writes -- must flag data as stale
    When the failed server comes back, it must update all the stale data it missed
  Load balancing shifts other reads away from the substitute servers

* Performance
  Table 1
    -- why are reads slower? (network)
    -- why are writes with NVRAM slower?
       (primary has to send request to secondary)
       (still have to write data on both disks; longer seek dominates)
  Table 2
    -- Why are reads slower?
       3/4 servers -> 3/4 performance; pretty good, actually
    -- Why are writes faster after a failure? (only write to one disk)
  What can we conclude from Figure 7?
    Looks kind-of linear, but will it scale? What is the scaling bottleneck?
    (With proper load-balancing, could scale)
  Table 3
    -- Why is mkdir so slow on UFS? (Synchronous writes)
       Do we care? (could also be bad for databases with synchronous writes)

Frangipani
==========

* Goals of paper
  - Scalable storage
    Many clients, lots of disk space
    Just add more servers and everything works
  - Simplicity of design: leverage Petal

* Why not just use AdvFS on Petal?

* How does Frangipani use Petal?
  - Figure 2 shows the architecture of the system
  - Everything stored on Petal - even local server logs!
    Couldn't servers store their own logs on local disk? (would hamper recovery)

* Frangipani exploits the large virtual address space
  - Figure 4 shows the storage layout
  - 64KB allocation size allows it to have many clusters of allocated space
  - Why are inodes 512 bytes?
    locking granularity; eliminates false sharing

* How to deal with concurrency?
  - Virtual disk segments covered by shared-read/exclusive-write locks
  - Are segments contiguous? No: one lock covers an inode and its data
  - Every server has its own address space for its log (no contention)
  - Split allocation bitmap amongst servers -- no allocation contention
  - When do you have contention? concurrent write sharing, file deletes

* What about deadlock?
  - Is it a problem? e.g. rename
  - Acquire locks in two phases
    1. Figure out what locks you need (acquiring & releasing as needed)
    2. Get all the locks in increasing block order
    3. Check that step 1 is still correct; if not, start over

* How to implement locks?
  - Centralized server? Not fault-tolerant
  - Store them in Petal? Disk writes too expensive
  - Distributed lock server, with "clerks" on each Frangipani server
  - Clerks get a lease on the lock table
  - Servers check each other with "heartbeat" messages
  - Recover a crashed server's state from the clerks

* What is the hazard if messages are delayed (3rd-to-last paragraph in sec 6)?

* What happens if you trip over the ethernet cord?
  - If the network is out for more than 30 seconds, the lease will expire
  - If there is any dirty data in the cache, the file system must be unmounted and remounted

* How to maintain order of updates?
  - Always log before updating permanent locations
  - Always write permanent locations before returning a lock

* How does recovery work?
  - Detect server failure. How? (client notices, or lock lease expires)
  - Recovery daemon gains ownership of the failed server's locks
    includes log, inodes, etc.
  - Finds beginning of log. How?
    version number decreases
  - Replays log, releases locks
  - How many Frangipani servers can fail? All, as long as Petal is still up

* How to maintain update order after a crash?
  - What if the log contains changes that were already overwritten?
    Server applied change, released lock, someone else changed it, server reacquired lock
  - Never replay the same log entry more than once
    Version number in every metadata block
  - What if a metadata block is replaced by a data block w/o a version? don't do that

* How to back up a Frangipani file system?
  - Petal offers snapshots. Is that good enough? yes -- logs are in Petal
  - Restore the entire Petal snapshot (including logs), then do crash recovery
  - What about recovering individual files? Painful; search all logs
  - What's the alternate scheme? Block everyone with a global lock; also imperfect

* Security plan
  - Export Frangipani w. another network file system protocol
  - Why not just authenticate users to the Petal servers?

* Performance
  - What are the goals?
    Good single-client performance
    Scale with the number of clients
  - Figures 5, 6: good
  - Figure 7: Why are writes slower? (must write to two Petal servers)
  - Figure 8: Why is performance so bad under contention?
    Must flush cache/write back all data when giving up a lock
    Why does readahead hurt?

* How does Frangipani compare to Zebra?
  - Very easy to implement (two months)
  - No need for a central file manager
  - Frangipani has very heavy-weight sharing
    Must flush cache after returning a lock
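The crash-recovery rule above -- never replay a log entry over a metadata block that already carries a newer version -- can be sketched as follows. The record layout and names are assumptions for illustration, not Frangipani's actual on-disk format.

```python
# Sketch of version-checked log replay: every metadata block and every
# log record carries a version number, and a record is applied only if
# its version is newer than the block's. Replaying a stale entry can
# therefore never clobber a later change made by another server.

from dataclasses import dataclass

@dataclass
class LogRecord:
    block: int       # metadata block the update targets
    version: int     # version the update would install
    data: bytes

def replay(log, blocks):
    """blocks maps block number -> (current version, contents)."""
    for rec in log:
        cur_version, _ = blocks.get(rec.block, (0, b""))
        if rec.version > cur_version:          # only apply newer updates
            blocks[rec.block] = (rec.version, rec.data)
        # else: the block was already overwritten after this record was
        # logged (lock released and reacquired), so the record is skipped

# A record with a stale version is ignored during recovery:
blocks = {7: (5, b"newer")}
replay([LogRecord(block=7, version=3, data=b"old")], blocks)
print(blocks[7])   # the stale version-3 record did not overwrite version 5
```

This also makes replay idempotent: running recovery twice over the same log applies each surviving update once, which matters if the recovery daemon itself crashes mid-replay.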