The Zebra Striped Network File System
=====================================

What are the top-level goals?
  Increase throughput and availability
  By striping data across multiple servers

Do they present evidence that these goals are valuable?
  I.e. that existing systems were lacking in these areas
  Performance: yes, evidence that NFS and Sprite are slow
  Availability: no

Top-level architecture?
  Multiple clients
  Multiple storage servers (SSes)
  One file manager (FM)

Why does this architecture increase reliability? Over what baseline?
  Presumably standard NFS, single server
  It probably does *not* with the FM described in the paper
    The FM acts as a Sprite server, with a local non-redundant disk
  But it might: the FM could store its state in the SSes, and then could be
    re-started anywhere

What kinds of performance bottlenecks might this architecture eliminate?
  [draw with just one client first]
  Client: only helps if there are multiple concurrent clients
  Server disk: helps, but probably not the best plan (multiple disks per server would also work)
  Server CPU or network interface: good
  Network: not good
  Server's ability to handle meta-data: not good, still one FM

Why do they need the FM? Why not fully symmetric -- clients talk only to SSes?
  Need to synchronize updates to meta-data
  Can you imagine a design that eliminates the FM?

Is this a network disk? That is, did they preserve the standard file system / disk
split, i.e. each SS exports a simple read/write sector interface?
  Why not?
    In general, to avoid conflicting writes
    Need the disk to manage the free list
    Need to be able to ask disks "last stripe written by client X"

How do they decide how to divide up data over servers?
  File contents striped over the SSes
  Each client stripes separately
  Meta-data only on the FM

Why this partition of data?
  Could split up the directory tree: /home/c1, /home/c2, etc.
    Each SS gets a different sub-tree
  We want load balance
    Performance depends on balanced load, to harness parallelism
    Hot spot -> low overall utilization
  Could do volume location in the style of AFS...
  When might the directory tree partition work well?
    Many independent clients
  Why did they choose to stripe every client over every SS?
    Good load balance even for a single client
    Also helps w/ availability: RAID parity across SSes
      Though this could probably be done otherwise, e.g. mirrored per-server disks

How should one choose a stripe size?
  How about one block (i.e. each fragment is block/N)?
    Doesn't this maximize throughput even for single-block operations?
    No: most of the time is in the seek for short operations
    So higher performance if different disks can seek for different ops
  They use huge stripes (512 kbytes?); does this work well?
    Small reads/writes: yes, if many concurrent ops
    Long sequential reads/writes: yes
  They use RAID; won't this wreck small-write performance?
    Looks like four disk ops per small write
    It's OK: LFS and RAID interact well
      LFS makes small writes sequential, and batches them
      Avoids small writes to random places on disk
    Is there any downside to batching? (Lose more after a crash)

Why does each client have its own log?
  To avoid the expense of synchronizing client writes
    Could imagine the FM telling you where to write next for every write
  Also to allow per-client recovery

What is the interface to the Storage Servers? (Section 3.2)
  Store - writes a fragment to disk (can also overwrite a parity fragment)
  Append - appends data to an incomplete fragment
  Retrieve - returns a fragment, or part of one, given its ID
  Delete - marks a fragment as no longer needed
  Identify - returns the most recent fragment written by a client
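
A minimal sketch of what this interface could look like, written in Go. The
five operation names come straight from the Section 3.2 summary above, but the
parameter and return types are guesses for illustration, not the paper's
actual RPC signatures.

  // Hypothetical fragment identifier: which client's log, and where in it.
  type FragmentID struct {
      Client   int // client number
      Sequence int // position in that client's log
  }

  // Sketch of the storage server operations from Section 3.2.
  // Signatures are illustrative assumptions, not the paper's RPC interface.
  type StorageServer interface {
      // Store writes a whole fragment; only parity fragments may be overwritten.
      Store(id FragmentID, data []byte) error
      // Append adds data to a fragment that is not yet full.
      Append(id FragmentID, data []byte) error
      // Retrieve returns a fragment, or a byte range within one.
      Retrieve(id FragmentID, offset, length int) ([]byte, error)
      // Delete marks a fragment as no longer needed (e.g. after cleaning).
      Delete(id FragmentID) error
      // Identify reports the most recent fragment written by a given client,
      // used to find the end of that client's log during recovery.
      Identify(client int) (FragmentID, error)
  }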
What is the interface to the FM?
  Basically the same as a regular Sprite server: directories, files, attributes
  Except the "contents" of a file are just block pointers for where to find the data
  Logged data is immutable, so Sprite cache consistency automatically applies to Zebra

What's a block pointer?
  A fragment identifier plus a position: client #, client log sequence #, and
    which block within the stripe
  What about location on disk?
    When you present a fragment ID to an SS, how does it know where to read?
    It must maintain a map from fragment IDs to disk locations

What happens during a read?
  1. Client sends a Sprite open RPC to the FM
  2. FM does cache consistency work, in case some other client has dirty data cached
  3. FM replies with the file "contents": a list of block pointers
     These may point into other clients' stripes on the SSes
  4. Client reads from the SSes in parallel

What happens during a write?
  1. Client sends open-for-write to the FM
  2. Application issues writes; they are buffered locally
  3. Client decides to flush, or the FM asks it to
  4. Client gathers *all* dirty blocks and decides how to append them to its log
  5. Client bumps the version # and generates a delta for each write:
       file, version #, file offset, new block pointer, old block pointer
     It puts the deltas in the log as well
  6. Client computes RAID parity
  7. Client appends the new data and parity to its log
     Parallel asynchronous write RPCs to the SSes
     Never over-writes existing data
     But may overwrite parity -- why?
       (Appending to a fragment requires its stripe's parity to be updated)
  8. Client sends the deltas to the FM (and the cleaner)
  9. FM applies the deltas to its meta-data -- just the per-file block lists
     FM stores this meta-data on a normal local disk (a Sprite/LFS file system)

What happens when an SS crashes?
  Can continue to operate with the other SSes (but slower)
  Recovery:
    What happens to data that was being written at the time of the crash?
      SS update RPCs have synchronous semantics, so completed updates are on disk
      SS uses a checksum on stripe units to detect incomplete writes
      + Writes are not in-place, so either the old or new version of the data is intact
    What about recovering the stripe map (fragment ID -> disk location)?
      Probably written synchronously, which is okay, since fragments are big
    How about recovering data that was written while the SS was down?
      Can contact the other servers and reconstruct the missed data using parity

What happens if a client crashes?
  Need to worry about:
  Losing recently written information
    Must ensure that fsync really syncs data to disk
    But it's okay to lose the tail of the log, provided there were no fsyncs
    ...and provided no one else has seen the written data (remember Echo?)
    So flush everything at fsync, or if someone else reads the file
  Inconsistent data between the FM and the SSes
    That's why step 7 happens before steps 8 and 9:
    never update the FM with a not-yet-valid block pointer
  Inconsistent data across the SSes
    In general, writes are multi-step operations involving multiple SSes
    Say you update a stripe unit but not the parity fragment?
      Very bad if you also lose an SS after the client crashes!
    The SSes are designed to avoid the problem:
      Clients always append to a stripe unit (except for parity)
      The SS updates atomically, so you always have either the old or new version
      The SS holding the parity fragment also records how much of the stripe was written
        So anything past that point is treated as zeros, even if it was written
  Recovery from a client crash:
    The server must read and apply any committed updates in the client's log
    It closes any files the client had open (so other clients can access them)

Why log the deltas -- why not just send them to the FM?
  If the FM crashes, it may miss deltas
  Or get them, but not have finished applying them
    It logs internally, so the tail of its log may be missing
  Why not just ignore the tails of the client logs after an FM crash?
    After all, we're allowed to ignore the tails of logs
    Because the clients didn't crash!
    Client apps are still running, and we don't want their data to disappear
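
To make the log contents concrete, here is a rough Go sketch of a block
pointer and a delta as described in the write steps above (it reuses the
FragmentID sketch from earlier). Field names and types are illustrative
guesses, not the paper's on-disk format.

  // A block pointer names a block by where it sits in some client's log.
  type BlockPointer struct {
      Frag   FragmentID // which fragment of which client's log holds the block
      Offset int        // position of the block within that fragment/stripe
  }

  // A delta describes one block update; clients append deltas to their logs
  // and also send them to the FM (and the cleaner).
  type Delta struct {
      File     int          // which file the update applies to
      Version  int          // file version, used to order replays during recovery
      Offset   int          // offset of the block within the file
      NewBlock BlockPointer // where the new data was appended in the client's log
      OldBlock BlockPointer // the block this write replaces (lets the cleaner free it)
  }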
So what happens when the FM crashes?
  FM recovery reads and replays the tails of all client logs
  But maybe multiple clients have written the same file recently
    Order the replayed deltas using file versions (see the sketch at the end of these notes)

What's a file version? What's the point?
  Multiple clients may have been writing the same file offset before an FM crash
  Need to replay the two log tails in the correct relative order
  Can two clients pick the same version number?
    No: Sprite cache consistency means a client must ask the server
    when there is concurrent write sharing

How does cleaning work?
What happens if the cleaner crashes?

Did they actually get higher performance?
  Figure 6 (large writes):
    For 1 client, why do more servers get more bandwidth?
      Limited by disk write performance
      So maybe they should have just put multiple disks on a single server
    Why do more clients get more bandwidth?
      With many servers, each client is limited by its CPU or network interface speed
    Why is NFS/Sprite performance so low?
      They claim small blocks and no async RPCs, so no disk/network overlap
      But NFS has biod, which should be able to overlap
  Figure 8 (small writes):
    Why is Sprite/Zebra so much faster than NFS?
  Figure 9 (utilization):
    What are they trying to demonstrate?
      That the FM is not a serious performance bottleneck
    What's the right number of SSes per FM for large reads/writes? Many.
    How about SSes per active client? About one to one.
    How many clients per FM for small writes? About two.
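
As flagged in the FM-crash discussion above, here is a minimal Go sketch of
replaying client log tails in file-version order. It assumes the Delta type
sketched earlier lives in the same package; the merge logic is a
reconstruction of the idea in these notes, not the paper's algorithm.

  package zebra

  import "sort"

  // ReplayClientLogs re-applies the deltas found in the tail of every
  // client's log after an FM crash. Sorting by (file, version) makes two
  // clients' concurrent writes to the same file replay in the right
  // relative order.
  func ReplayClientLogs(tails [][]Delta, apply func(Delta)) {
      var all []Delta
      for _, tail := range tails {
          all = append(all, tail...)
      }
      sort.Slice(all, func(i, j int) bool {
          if all[i].File != all[j].File {
              return all[i].File < all[j].File
          }
          return all[i].Version < all[j].Version
      })
      for _, d := range all {
          apply(d) // e.g. update the file's block list to point at d.NewBlock
      }
  }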