Frangipani
==========

Goals of paper
- Scalable storage
  Many clients, lots of disk space
  Just add more servers and everything works
- Simplicity of design: leverage Petal
  Why not just use AdvFS on Petal?

How does Frangipani use Petal?
- Figure 2 shows architecture of the system
- Everything stored on Petal - even local server logs!
  Couldn't servers store their own logs on local disk? (would hamper recovery)
- Frangipani exploits the large virtual address space
- Figure 4 shows storage layout
  - 64KB allocation size allows it to have many clusters of allocated space
  - Why are inodes 512 bytes? locking granularity, eliminates false sharing

How to deal with concurrency?
- Virtual disk segments covered by shared-read/exclusive-write locks
  - Are segments contiguous? No: one lock covers inode and data
- Every server has its own address space for its log (no contention)
- Split allocation bitmap amongst servers -- no allocation contention
- When do you have contention? concurrent write sharing, file deletes

How to implement locks?
- Centralized server? Not fault-tolerant
- Store them in Petal? Disk writes too expensive
- Distributed lock server, and "clerks" on each Frangipani server

How is the distributed lock server implemented?
- Split locks into ~100 different lock groups
- Clerks get a lease on the lock table
- Servers check each other with "heartbeat" messages
- How to agree on which server is responsible for which lock group?
  Danger: two servers both think they are responsible for the same locks!
  Use a consensus alg. to assign lock groups to servers -- how might this work?
  E.g., use something like the view change algorithm from Harp
  - Coordinator asks other servers to participate in view change
  - Available servers will agree if not participating in another v.c.
  - Phase 2: Coordinator tells other servers about the new view
  Slow view change protocol OK, since only rarely needed
- What about lost state when a server crashes?
  Can retrieve lost state from clerks
  Will know all clerks that have a lease on the lock table (leases replicated)
- Does the above guarantee two servers are never responsible for the same lock?
  Not quite; need to ensure the view change happens atomically
  First, servers release all locks they are losing responsibility for
  Then find out the state of new locks from clients
  What is the hazard if messages are delayed (3rd-to-last par. in sec 6)?

What about deadlock?
- Is it a problem? e.g. rename
- Acquire locks in two phases
  1. Figure out what locks you need (acquiring & releasing as needed)
  2. Get all the locks in increasing block order
  3. Figure out if step 1 was still correct; if not, start over

What happens if you trip over the Ethernet cord?
- If the network is out for more than 30 seconds, the lease will expire
- If any dirty data is in the cache, the file system must be unmounted and remounted

In normal operation, does Frangipani preserve Echo's => ordering?
- Almost (writes that change file size can still be reordered). How?
- Always log metadata before updating permanent locations
- Always write permanent locations before returning a lock

What happens when a Frangipani server crashes?
- Detect server failure. How? (client notices, or lock lease expires)
- Recovery daemon gains ownership of the failed server's locks
  includes log, inodes, etc.
- Finds beginning of log. How? version number decrease
- Replays log, releases locks
- How many Frangipani servers can fail? All, as long as Petal is still up

How to maintain update order after a crash?
- What if the log contains changes that were already overwritten?
  Server applied change, released lock, someone else changed it,
  recovery server reacquired lock, rolled back the other server's change
- Solution? Never replay the same log entry more than once
  Version number in every metadata block (a replay sketch follows this list)
  What if a metadata block is replaced by a data block w/o a version? don't do that
  How do you guarantee metadata blocks are not recycled for data?
  They don't say... several possibilities:
    Could allocate metadata from the bottom of the bitmap region, data from the top
    Could keep extra bits in the bitmap table
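To make the version-number check concrete, here is a minimal Go sketch of log replay. The MetadataBlock and LogRecord types, their fields, and replayLog are hypothetical stand-ins rather than Frangipani's actual on-disk formats; the point is only that a log record is applied if and only if its version is newer than the version already stored in the block it would overwrite.

```go
package main

import "fmt"

// Hypothetical in-memory stand-ins for metadata blocks on the shared
// virtual disk and for records in a crashed server's log.
type MetadataBlock struct {
	Version uint64 // version number stored in every metadata block
	Data    string
}

type LogRecord struct {
	BlockID int
	Version uint64 // version the block should have after this update
	Data    string
}

// replayLog applies a crashed server's log during recovery. An entry is
// applied only if its version is newer than the block's current version,
// so replay never rolls back a later change made by another server after
// the crashed server released its lock.
func replayLog(log []LogRecord, disk map[int]*MetadataBlock) {
	for _, rec := range log {
		blk, ok := disk[rec.BlockID]
		if !ok {
			continue // block not present as metadata; skip
		}
		if rec.Version > blk.Version {
			blk.Data = rec.Data
			blk.Version = rec.Version
		}
		// else: the block already reflects this or a later update.
	}
}

func main() {
	disk := map[int]*MetadataBlock{
		1: {Version: 7, Data: "old inode"},
		2: {Version: 9, Data: "dir entry rewritten later by another server"},
	}
	log := []LogRecord{
		{BlockID: 1, Version: 8, Data: "inode update from crashed server"},
		{BlockID: 2, Version: 5, Data: "stale dir update, already superseded"},
	}
	replayLog(log, disk)
	fmt.Println(disk[1].Data) // applied: 8 > 7
	fmt.Println(disk[2].Data) // skipped: 5 <= 9
}
```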
What happens if a Frangipani server and a lock server crash simultaneously?
- Recovery server is supposed to acquire the locks of the failed Frangipani server
- Lock servers are supposed to reconstruct lock state from the (crashed) clerk
- Don't know what locks to give the recovery server!
- Why is this bad? Someone else could get a lock and see an incomplete operation
- How can you deal with this? Paper doesn't say. Possibilities:
  Could give the entire set of missing locks to the recovery server
    Will guarantee it has all locks the crashed server had
    Disadvantages? Slow. Plus what if two servers and the lock server crash?
    Must wait until both Frangipani servers' logs are replayed before
      giving out any locks in the missing lock groups
    Okay, because the two crashed Frangipani servers didn't hold the same lock

How to back up a Frangipani file system?
- Petal offers snapshots. Is that good enough? yes, logs are in Petal
- Restore the entire Petal snapshot (including logs), and do crash recovery
- What about recovering individual files? Painful; search all logs
- What's the alternate scheme? Block everyone with a global lock, also imperfect

Security plan
- Export Frangipani with another network file system protocol
- Why not just authenticate users to the Petal servers?

Performance
- What are the goals?
  Good single-client performance
  Scale with number of clients
- Figures 5, 6: good
- Figure 7: Why are writes slower? (must write to two Petal servers)
- Figure 8: Why is performance so bad under contention?
  Must flush cache/write back all data when giving up a lock (see the sketch at the end)
- Why does readahead hurt?

How does Frangipani compare to Zebra?
- Very easy to implement (two months)
- No need for a central file manager
- Frangipani has very heavy-weight sharing
  Must flush cache before returning a lock
  Would Zebra do better on the Figure 8 benchmark? Why?
  Zebra consistency is on block pointers, not blocks themselves
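As a rough illustration of why contention is so costly (Figure 8) and why Frangipani's sharing is heavier-weight than Zebra's, here is a Go sketch of what a clerk might do when a conflicting request forces it to give up a lock: write back every dirty cached block to Petal and invalidate its cache before releasing. The Clerk and CachedBlock types and the writeToPetal helper are invented for this sketch and are not the paper's interfaces.

```go
package main

import "fmt"

// CachedBlock is a hypothetical cached copy of a block covered by one lock.
type CachedBlock struct {
	Data  []byte
	Dirty bool
}

// Clerk holds the cached blocks protected by a single exclusive lock.
type Clerk struct {
	cache map[int]*CachedBlock
}

// writeToPetal stands in for a write to the shared virtual disk.
func writeToPetal(blockID int, data []byte) {
	fmt.Printf("petal write: block %d (%d bytes)\n", blockID, len(data))
}

// revoke is invoked when another server requests a conflicting lock.
// The expensive part: every dirty block must reach Petal before the lock
// can be handed over, and the clean copies are discarded, which is why
// write sharing degrades so badly in the Figure 8 benchmark.
func (c *Clerk) revoke() {
	for id, blk := range c.cache {
		if blk.Dirty {
			writeToPetal(id, blk.Data)
		}
		delete(c.cache, id) // invalidate; next access re-reads from Petal
	}
	fmt.Println("lock released")
}

func main() {
	c := &Clerk{cache: map[int]*CachedBlock{
		1: {Data: []byte("dirty inode"), Dirty: true},
		2: {Data: []byte("clean data"), Dirty: false},
	}}
	c.revoke()
}
```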