The Zebra Striped Network File System
=====================================

What are the top-level goals?
  Increase throughput and availability
  By striping data across multiple servers

Do they present evidence that these goals are valuable?
  I.e. that existing systems were lacking in these areas
  Performance: yes, evidence that NFS and Sprite are slow
  Availability: no

Top-level architecture?
  Multiple clients
  Multiple storage servers (SSes)
  One file manager (FM)

Why does this architecture increase reliability? Over what baseline?
  Presumably standard NFS, single server
  It probably does *not* with the FM described in the paper
    The FM acts as a Sprite server, with a local non-redundant disk
  But it might: the FM could store its state in the SSes, and then could be
    re-started anywhere

What kinds of performance bottlenecks might this architecture eliminate?
  [draw with just one client first]
  Client: only helps if there are multiple concurrent clients
  Server disk: helps, but probably not the best plan (multiple disks per server would also work)
  Server CPU or network interface: good
  Network: not good
  Server's ability to handle meta-data: not good, still one FM

Why do they need the FM? Why not fully symmetric -- clients talk only to SSes?
  Need to synchronize updates to meta-data
  Can you imagine a design that eliminates the FM?

Is this a network disk? That is, did they preserve the standard file system / disk
split, i.e. each SS exports a simple read/write sector interface?
  Why not?
    In general, to avoid conflicting writes
    Need the disk to manage the free list
    Need to be able to ask disks "last stripe written by client X"

How do they decide how to divide up data over servers?
  File contents striped over the SSes
  Each client stripes separately
  Meta-data only on the FM

Why this partition of data?
  Could split up the directory tree: /home/c1, /home/c2, etc.
    Each SS gets a different sub-tree
  We want load balance
    Performance depends on balanced load, to harness parallelism
    Hot spot -> low overall utilization
  Could do volume location in the style of AFS...
  When might the directory tree partition work well?
    Many independent clients
  Why did they choose to stripe every client over every SS?
    Good load balance even for a single client
    Also helps w/ availability: RAID parity across SSes
      Though this could probably be done otherwise, e.g. mirrored per-server disks

How should one choose a stripe size?
  How about one block (i.e. each fragment is block/N)?
    Doesn't this maximize throughput even for single-block operations?
    No: most of the time is in the seek for short operations
    So higher performance if different disks can seek for different ops
  They use huge stripes (512 kbytes?); does this work well?
    Small reads/writes: yes, if many concurrent ops
    Long sequential reads/writes: yes
  They use RAID; won't this wreck small-write performance?
    Looks like four disk ops per small write
    It's OK: LFS and RAID interact well
      LFS makes small writes sequential, and batches them
      Avoids small writes to random places on disk
    Is there any downside to batching? (Lose more after a crash)

Why does each client have its own log?
  To avoid the expense of synchronizing client writes
    Could imagine the FM telling you where to write next for every write
  Also to allow per-client recovery

What is the interface to the Storage Servers? (Section 3.2)
  Store - writes a fragment to disk (can also overwrite a parity fragment)
  Append - appends data to an incomplete fragment
  Retrieve - returns a fragment, or part of one, given its ID
  Delete - marks a fragment as no longer needed
  Identify - returns the most recent fragment written by a client
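
A minimal sketch of what this interface could look like, written in Go. The
five operation names come straight from the Section 3.2 summary above, but the
parameter and return types are guesses for illustration, not the paper's
actual RPC signatures.

  // Hypothetical fragment identifier: which client's log, and where in it.
  type FragmentID struct {
      Client   int // client number
      Sequence int // position in that client's log
  }

  // Sketch of the storage server operations from Section 3.2.
  // Signatures are illustrative assumptions, not the paper's RPC interface.
  type StorageServer interface {
      // Store writes a whole fragment; only parity fragments may be overwritten.
      Store(id FragmentID, data []byte) error
      // Append adds data to a fragment that is not yet full.
      Append(id FragmentID, data []byte) error
      // Retrieve returns a fragment, or a byte range within one.
      Retrieve(id FragmentID, offset, length int) ([]byte, error)
      // Delete marks a fragment as no longer needed (e.g. after cleaning).
      Delete(id FragmentID) error
      // Identify reports the most recent fragment written by a given client,
      // used to find the end of that client's log during recovery.
      Identify(client int) (FragmentID, error)
  }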
What is the interface to the FM?
  Basically the same as a regular Sprite server: directories, files, attributes
  Except the "contents" of a file are just block pointers for where to find the data
  Logged data is immutable, so Sprite cache consistency automatically applies to Zebra

What's a block pointer?
  A fragment identifier plus a position: client #, client log sequence #, and
    which block within the stripe
  What about location on disk?
    When you present a fragment ID to an SS, how does it know where to read?
    It must maintain a map from fragment IDs to disk locations

What happens during a read?
  1. Client sends a Sprite open RPC to the FM
  2. FM does cache consistency work, in case some other client has dirty data cached
  3. FM replies with the file "contents": a list of block pointers
     These may point into other clients' stripes on the SSes
  4. Client reads from the SSes in parallel

What happens during a write?
  1. Client sends open-for-write to the FM
  2. Application issues writes; they are buffered locally
  3. Client decides to flush, or the FM asks it to
  4. Client gathers *all* dirty blocks and decides how to append them to its log
  5. Client bumps the version # and generates a delta for each write:
       file, version #, file offset, new block pointer, old block pointer
     It puts the deltas in the log as well
  6. Client computes RAID parity
  7. Client appends the new data and parity to its log
     Parallel asynchronous write RPCs to the SSes
     Never over-writes existing data
     But may overwrite parity -- why?
       (Appending to a fragment requires its stripe's parity to be updated)
  8. Client sends the deltas to the FM (and the cleaner)
  9. FM applies the deltas to its meta-data -- just the per-file block lists
     FM stores this meta-data on a normal local disk (a Sprite/LFS file system)

What happens when an SS crashes?
  Can continue to operate with the other SSes (but slower)
  Recovery:
    What happens to data that was being written at the time of the crash?
      SS update RPCs have synchronous semantics, so completed updates are on disk
      SS uses a checksum on stripe units to detect incomplete writes
      + Writes are not in-place, so either the old or new version of the data is intact
    What about recovering the stripe map (fragment ID -> disk location)?
      Probably written synchronously, which is okay, since fragments are big
    How about recovering data that was written while the SS was down?
      Can contact the other servers and reconstruct the missed data using parity

What happens if a client crashes?
  Need to worry about:
  Losing recently written information
    Must ensure that fsync really syncs data to disk
    But it's okay to lose the tail of the log, provided there were no fsyncs
    ...and provided no one else has seen the written data (remember Echo?)
    So flush everything at fsync, or if someone else reads the file
  Inconsistent data between the FM and the SSes
    That's why step 7 happens before steps 8 and 9:
    never update the FM with a not-yet-valid block pointer
  Inconsistent data across the SSes
    In general, writes are multi-step operations involving multiple SSes
    Say you update a stripe unit but not the parity fragment?
      Very bad if you also lose an SS after the client crashes!
    The SSes are designed to avoid the problem:
      Clients always append to a stripe unit (except for parity)
      The SS updates atomically, so you always have either the old or new version
      The SS holding the parity fragment also records how much of the stripe was written
        So anything past that point is treated as zeros, even if it was written
  Recovery from a client crash:
    The server must read and apply any committed updates in the client's log
    It closes any files the client had open (so other clients can access them)

Why log the deltas -- why not just send them to the FM?
  If the FM crashes, it may miss deltas
  Or get them, but not have finished applying them
    It logs internally, so the tail of its log may be missing
  Why not just ignore the tails of the client logs after an FM crash?
    After all, we're allowed to ignore the tails of logs
    Because the clients didn't crash!
    Client apps are still running, and we don't want their data to disappear
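
To make the log contents concrete, here is a rough Go sketch of a block
pointer and a delta as described in the write steps above (it reuses the
FragmentID sketch from earlier). Field names and types are illustrative
guesses, not the paper's on-disk format.

  // A block pointer names a block by where it sits in some client's log.
  type BlockPointer struct {
      Frag   FragmentID // which fragment of which client's log holds the block
      Offset int        // position of the block within that fragment/stripe
  }

  // A delta describes one block update; clients append deltas to their logs
  // and also send them to the FM (and the cleaner).
  type Delta struct {
      File     int          // which file the update applies to
      Version  int          // file version, used to order replays during recovery
      Offset   int          // offset of the block within the file
      NewBlock BlockPointer // where the new data was appended in the client's log
      OldBlock BlockPointer // the block this write replaces (lets the cleaner free it)
  }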
So what happens when the FM crashes?
  FM recovery reads and replays the tails of all client logs
  But maybe multiple clients have written the same file recently
    Order the replayed deltas using file versions (see the sketch at the end of these notes)

What's a file version? What's the point?
  Multiple clients may have been writing the same file offset before an FM crash
  Need to replay the two log tails in the correct relative order
  Can two clients pick the same version number?
    No: Sprite cache consistency means a client must ask the server
    when there is concurrent write sharing

How does cleaning work?
What happens if the cleaner crashes?

Did they actually get higher performance?
  Figure 6 (large writes):
    For 1 client, why do more servers get more bandwidth?
      Limited by disk write performance
      So maybe they should have just put multiple disks on a single server
    Why do more clients get more bandwidth?
      With many servers, each client is limited by its CPU or network interface speed
    Why is NFS/Sprite performance so low?
      They claim small blocks and no async RPCs, so no disk/network overlap
      But NFS has biod, which should be able to overlap
  Figure 8 (small writes):
    Why is Sprite/Zebra so much faster than NFS?
  Figure 9 (utilization):
    What are they trying to demonstrate?
      That the FM is not a serious performance bottleneck
    What's the right number of SSes per FM for large reads/writes? Many.
    How about SSes per active client? About one to one.
    How many clients per FM for small writes? About two.
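
As flagged in the FM-crash discussion above, here is a minimal Go sketch of
replaying client log tails in file-version order. It assumes the Delta type
sketched earlier lives in the same package; the merge logic is a
reconstruction of the idea in these notes, not the paper's algorithm.

  package zebra

  import "sort"

  // ReplayClientLogs re-applies the deltas found in the tail of every
  // client's log after an FM crash. Sorting by (file, version) makes two
  // clients' concurrent writes to the same file replay in the right
  // relative order.
  func ReplayClientLogs(tails [][]Delta, apply func(Delta)) {
      var all []Delta
      for _, tail := range tails {
          all = append(all, tail...)
      }
      sort.Slice(all, func(i, j int) bool {
          if all[i].File != all[j].File {
              return all[i].File < all[j].File
          }
          return all[i].Version < all[j].Version
      })
      for _, d := range all {
          apply(d) // e.g. update the file's block list to point at d.NewBlock
      }
  }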