Echo
====

Background: Echo is a global distributed file system
  Root and top-level directories served by DNS
  Encrypts network communications for security
  Key management done through hierarchy of trust

Echo supports replication for two purposes
  Reliability - the property that the FS will not lose your data
  Availability - the property that the system lets you get at your data

Echo supports many different configurations
  One server, multiple disks (mirrored) - good for reliability
  Multiple servers attached to same disk - good for availability
    E.g., Servers+disk on same SCSI bus (though they weren't using SCSI)
    Only one server (the primary) can touch disk at a time
    If primary dies, elect a backup to take over
      New primary is SCSI host that can connect to drive
  Multiple servers, multiple disks
    Similar to multiple servers one disk + mirroring
  Multiple servers each with one disk
    Like Harp setup -- can't get to a disk if server crashes
    Also need witness in case of network partitions

How is Echo different from Harp?
  Harp implemented the NFS protocol ==> NFS semantics
    Potential inconsistency under concurrent write sharing
    Have to write metadata operations through to server
      Primary + n backups must log operation before returning to client
      Guarantees that same client's metadata ops won't be reordered
  In absence of failure, Echo is coherent even under concurrent write sharing
    However, FS has more flexibility in reordering operations
    But also provide interface that lets applications control ordering
  Echo supports write-behind even of metadata operations
    Suggests this may offer big performance win on many workloads
    But of course write-behind weakens semantics in face of failures

Write-behind and ordering guarantees
  Every write operation can be in one of three states:
    Stable - meaning it is reflected on disk at the server(s)
    Discarded - when it will never be reflected on disk at the server
    Unstable - when it is neither stable nor discarded yet
  Informal guarantees we want from Echo
    A.
       If a write is observed by another client, it should be stable
         I.e., bad to discard writes that multiple clients have already seen
    B. Writes become stable in their logical order
         Except that file overwrites can be reordered
    C. Fsync (if it succeeds) forces writes to be stable
  More formally, define two relations
    -> "happens before":  We say o1 -> o2 iff:
         o1 is a write
         o1 happened logically before o2
         o1 was not discarded before o2
         o2 has an operand in common with o1
         both o1 and o2 were successful
    => "commits before":  We say o1 => o2 iff:
         o1 -> o2 and o1, o2 are not both overwrites
  Can now redefine informal properties formally:
    A. If o1 -> o2 and o1, o2 on different clients,
         then o1 stable when o2 logically performed
    B. If o1 => o2 and o2 is stable, then o1 must be stable
    C. fsync is always stable if it succeeds
  What does this mean for ordering semantics?
    Consider the following operations:
      mkdir d.new
      echo hello > d.new/f      (creat d.new/f; write d.new/f)
      mv d d.old                (1)
      mv d.new d                (2)
    With NFS, no client would ever see an empty dir d or empty file d/f
    With Echo, client might.  Why?
      mkdir => create => write
      create => mv (1) => mv (2)
      Write doesn't necessarily commit before mv operations
  What if you want NFS-like semantics?  New forder system call
    Takes a list of files, and counts as a write to each in the sense of "->"
      So forder (f) happens between logically previous op on f and next
    Doesn't actually modify files, or even force anything to disk
    Example:  To fix previous example, forder (d.new, d.new/f)

Implementation:
  "Clerk" is name they give to Echo client in kernel
  For each modified object, clerk keeps modified state & write-behind queue
    Unbroken series of overwrites appears as single queue element
    Operations issued in order they are queued
    Operations that change multiple files reside in multiple queues
      E.g., create, rename, etc.
      Cannot issue an operation if any operation is ahead of it in any queue
  When creating files/directories, how does clerk choose IDs (like NFS FHs)?
    Server gives client a certain number of IDs ahead of time
    Ensures malicious clients cannot pick same ID for two different files
  Does Echo suffer from same "disk full" problem as Sprite?
    No.  Clients reserve disk space on server
    Server gives clients a space estimating library
    For each op, client calls lib for conservative estimate of space req.
    When insufficient space, ask server for "required" and "desired" space
      (Desired useful if multiple ops need to write)
    If can't even get required space, then return disk full error
  How does this work for quotas?  Might not, actually.
    Harder problem--client needs to know all quotas
    Client would need to know whose quota to charge each operation to

How does coherence work?  Tokens
  What is a token?  Gives you the right to hold data in your cache
  Read token - allows you to cache copy of clean data.  Really 3 tokens:
    Info - Just the right to cache the file attributes
    Search - The right to cache directory lookup results (like NFS LOOKUP)
    Read - The right to cache file data, or directory data (NFS READ, READDIR)
  Write token - allows you to cache dirty data.  Again 3 tokens:
    Write - The right to cache dirty file blocks and size changes
    ChangeAccess - Ability to cache chown, chmod file operations
    ChangeParent - Ability to rename or delete a file
  Tokens are also used for access control
    Permission checks prevent you from getting unauthorized tokens
    When performing read/write operations, server only checks client's tokens
  Token compatibility matrix (Table 5, p.31)
  What is an open token?  How is it different from a read token?
    Problem:  When open file is deleted, it must persist on server
    So server won't really erase a deleted file until all open tokens returned
  Token revocation
    What happens if you need a token incompatible with an already granted one?
      Server asks client for it back
      Read tokens returned immediately
      Write tokens require clerk first to write back all data covered by token
        which might additionally require writing back co-fordered files
  Can you suffer deadlock in acquiring tokens?
    Some operations require multiple tokens (e.g., rename)
    Acquire tokens in order of their IDs
    Actually, use two phases.
      Phase one, acquire all tokens if you can, but release if asked
      Phase two, reacquire any you might have lost
        Plus, increment dirty counter so you don't give them back
        May need to abort in phase two
          Make sure tokens you have are still the right ones
      Perform operation

Failure recovery
  Tokens are within a session (between a particular client and server)
  There is a lease on each session
  If a client's session's lease expires, all its tokens automatically revoked
  What does this mean for write-behind cache?
    Bad news - may need to discard operations
  How do applications learn about discarded operations?
    May have closed file already before writes discarded
    Standard recovery - all operations to FS return errors
      Drastic.  Basically hope is app. will exit quickly with error
      (Maybe should even have sent signal to processes)
    Self recovery mode
      Open files return error
      "." returns error until chdir
      Absolute pathnames work
      Idea was to recover by chdir (/...), but in practice was not so useful
        Users don't think this way
        Existing applications don't really work this way
        New applications could have used, e.g., failure handles
    Null recovery (not implemented, but maybe should have been)
      Open files all return errors
      Newly opened files work fine
      Would be good for shells
      But app. may have exited before they were discarded!  Uh-oh

Discussion:
  What happened in Vesta benchmark, Table 3 on p.21?
    Only 18 files created, but with dir write-through (i.e., dir
      write-behind turned off) 143% slower
    Benchmark atomically updates files with forder + rename
    So now effectively becomes fsync + rename, since file writes must
      all commit before rename, and rename synchronous
    Prevents overlapping of computation with writing to server
    Paper concludes forder and dir write-through a bad combination
      But in, e.g., NFS or AFS you would have problem even w/o forder
      Even most local file systems would require that you call fsync
  What happened with /proj/packages (p. 25)?
    Acquired /proj write token
    fordered a whole bunch of files with /proj
    renamed directory in /proj
    Readers needed token back, but all fordered files had to be
      written through first
    Solution: fsync before acquiring /proj token
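
Sketch: the atomic-update idiom from the Vesta discussion, as it might look
on a local POSIX file system.  This is an illustration under assumptions,
not Echo or Vesta code.  Echo's forder has no POSIX counterpart, so fsync
stands in for it here, and the helper name atomic_update is made up for
this example.

```python
import os
import tempfile


def atomic_update(path, data):
    """Replace the contents of path with data (bytes), all or nothing.

    Hypothetical helper for illustration; not part of Echo or Vesta.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so the final rename stays within one file system.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # Stand-in for forder: force the file data to disk before
            # the rename can make the new name visible.
            os.fsync(f.fileno())
        # Atomically switch the name over to the new contents.
        os.replace(tmp, path)
    except BaseException:
        # On failure the old file is left in place; remove the temp file.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory too, so the rename itself is durable (POSIX only).
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Each call blocks in fsync before the rename, so the application cannot
overlap computation with writing.  That is the same cost the notes above
attribute to forder combined with dir write-through.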