Echo
====

Background:  Echo is a global distributed file system
    Root and top-level directories served by DNS
  Encrypts network communications for security
    Key management done through hierarchy of trust

Echo supports replication for two purposes
  Reliability - the property that the FS will not lose your data
  Availability - the property that the system lets you get at your data
  Echo supports many different configurations
    One server, multiple disks (mirrored) - good for reliability
    Multiple servers attached to same disk - good for availability
      E.g., Servers+disk on same SCSI bus (though they weren't using SCSI)
      Only one server (the primary) can touch disk at a time
      If primary dies, elect a backup to take over
        New primary is SCSI host that can connect to drive
    Multiple servers multiple disks
      Similar to multiple servers one disk + mirroring
    Multiple servers each with one disk
      Like Harp setup -- can't get to a disk if server crashes
      Also need witness in case of network partitions

How is Echo different from Harp?
  Harp implemented the NFS protocol ==> NFS semantics
    Potential inconsistency under concurrent write sharing
    Have to write metadata operations through to server
      Primary + n backups must log operation before returning to client
    Guarantees that same client's metadata OPs won't be reordered
  In absence of failure, Echo is coherent even under concurrent write sharing
    However, FS has more flexibility in reordering operations
    But also provide interface that lets applications control ordering
  Echo supports write-behind even of metadata operations
    Suggests this may offer big performance win on many workloads
    But of course write-behind weakens semantics in face of failures

Write-behind and ordering guarantees
  Every write operation can be in one of three states:
    Stable - meaning it is reflected on disk at the server(s)
    Discarded - when it will never be reflected on disk at the server
    Unstable - when it is neither stable nor discarded yet
  Informal guarantees we want from Echo
    A. If a write is observed by another client, it should be stable
       I.e., bad to discard writes that multiple clients have already seen
    B. Writes become stable in their logical order
       Except that file overwrites can be reordered
    C. Fsync (if it succeeds) forces writes to be stable
  More formally, define two relations
    -> "happens before":  We say o1 -> o2 iff:
      o1 is a write
      o1 happened logically before o2
      o1 was not discarded before o2
      o2 has an operand in common with o1
      both o1 and o2 were successful
    => "commits before":  We say o1 => o2 iff:
      o1 -> o2 and o1, o2 are not both overwrites
  Can now redefine informal properties formally:
    A. If o1 -> o2 and o1, o2 on different clients,
       then o1 stable when o2 logically performed
    B. If o1 => o2 and o2 is stable, then o1 must be stable
    C. fsync is always stable if it succeeds
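
The two relations above can be written down as predicates.  What follows is
a minimal sketch, not Echo code:  the Op record, its field names, and the
property_b_holds helper are all hypothetical, and "logical order" is modeled
as a plain sequence number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    # Hypothetical record of one file system operation.
    seq: int                    # logical order (smaller = earlier)
    client: str
    kind: str                   # e.g. "overwrite", "create", "rename"
    operands: frozenset         # objects the operation touches
    is_write: bool = True
    ok: bool = True             # did the operation succeed?
    discarded_at: Optional[int] = None  # logical time of discard, if any


def happens_before(o1, o2):
    """o1 -> o2, following the five conditions in the notes."""
    not_discarded_yet = o1.discarded_at is None or o1.discarded_at > o2.seq
    return (o1.is_write
            and o1.seq < o2.seq
            and not_discarded_yet
            and bool(o1.operands & o2.operands)
            and o1.ok and o2.ok)


def commits_before(o1, o2):
    """o1 => o2: o1 -> o2, unless both are overwrites."""
    both_overwrites = o1.kind == "overwrite" and o2.kind == "overwrite"
    return happens_before(o1, o2) and not both_overwrites


def property_b_holds(ops, stable):
    """B: if o1 => o2 and o2 is stable, then o1 must be stable."""
    return all(o1 in stable
               for o1 in ops for o2 in ops
               if o2 in stable and commits_before(o1, o2))


w1 = Op(1, "c1", "overwrite", frozenset({"f"}))
w2 = Op(2, "c1", "overwrite", frozenset({"f"}))
mv = Op(3, "c1", "rename", frozenset({"f", "d"}))

print(happens_before(w1, w2))   # related by "->" ...
print(commits_before(w1, w2))   # ... but two overwrites may be reordered
print(commits_before(w2, mv))   # an overwrite must commit before the rename
```

Here property_b_holds([w1, w2, mv], {w2}) is true (overwrites can become
stable out of order), while property_b_holds([w1, w2, mv], {mv}) is false,
since the rename cannot be stable before the writes that precede it.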

What does this mean for ordering semantics?
  Consider the following operations:
    mkdir d.new
    echo hello > d.new/f   (creat d.new/f; write d.new/f)
    mv d d.old  (1)
    mv d.new d  (2)
  With NFS, no client would ever see an empty dir d or file d/f
  With Echo, client might.  Why?
      mkdir => create => write
               create => mv (1) => mv (2)
    Write doesn't necessarily commit before mv operations
  What if you want NFS-like semantics?  New forder system call
    Takes a list of files, and counts as a write to each in the sense of "->"
    So forder (f) happens between logically previous op to f and next
    Doesn't actually modify files, or even force anything to disk
    Example:  To fix previous example, forder (d.new, d.new/f)
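
To see why forder helps, treat the "=>" edges drawn above as a graph and ask
whether the write is forced to commit before mv (2).  This is only a sketch:
the node names and the must_commit_before helper are made up, and the two
forder edges are an assumption based on the rule that forder sits between
the previous op on d.new/f (the write) and the next op on d.new (mv (2)).

```python
def must_commit_before(edges, a, b):
    """True if some chain a => ... => b exists in the edge list."""
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Edges exactly as drawn in the notes.
EDGES = [("mkdir", "create"), ("create", "write"),
         ("create", "mv1"), ("mv1", "mv2")]

# Assumed effect of forder (d.new, d.new/f) issued after the write.
WITH_FORDER = EDGES + [("write", "forder"), ("forder", "mv2")]

print(must_commit_before(EDGES, "write", "mv2"))        # False
print(must_commit_before(WITH_FORDER, "write", "mv2"))  # True
```

Without forder nothing orders the write before the renames, so d can become
visible while d/f is still empty.  With forder the write is pinned ahead of
mv (2), yet nothing is forced to disk.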

Implementation:  "Clerk" is name they give to Echo client in kernel
  For each modified object, clerk keeps modified state & write-behind queue
    Unbroken series of overwrites appears as single queue element
    Operations issued in order they are queued
    Operations that change multiple files reside in multiple queues
      E.g., create, rename, etc.
    Cannot issue an operation if any operation is ahead of it in any queue
  When creating files/directories, how does clerk choose IDs (like NFS FHs)?
    Server gives client a certain number of IDs ahead of time
    Ensures malicious clients cannot pick same ID for two different files
  Does Echo suffer from same "disk full" problem as Sprite?
    No.  Clients reserve disk space on server
      Server gives clients a space estimating library
      For each op, client calls lib for conservative estimate of space req.
      When insufficient space, ask server for "required" and "desired" space
        (Desired useful if multiple ops need to write)
      If can't even get required space, then return disk full error
    How does this work for quotas?
      Might not, actually.  Harder problem--client needs to know all quotas
      Client would need to know whose quota to charge each operation to
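
The per-object write-behind queues described at the start of this section
can be sketched as follows.  This is an illustration only:  the Clerk class
and its method names are hypothetical, merging a run of overwrites into a
single queue element is not modeled, and each operation is assumed to list
every object it modifies exactly once.

```python
from collections import defaultdict, deque


class Clerk:
    def __init__(self):
        # object id -> queue of pending operations touching that object
        self.queues = defaultdict(deque)

    def enqueue(self, op, objects):
        """Queue op on every object it modifies (rename touches several)."""
        entry = (op, tuple(objects))
        for obj in entry[1]:
            self.queues[obj].append(entry)
        return entry

    def can_issue(self, entry):
        """Issuable only if entry heads every queue it resides in."""
        return all(self.queues[obj][0] is entry for obj in entry[1])

    def issue(self, entry):
        if not self.can_issue(entry):
            raise RuntimeError("an earlier operation is still queued")
        for obj in entry[1]:
            self.queues[obj].popleft()
        return entry[0]


clerk = Clerk()
write = clerk.enqueue("write f", ["f"])
rename = clerk.enqueue("rename f -> g", ["f", "dir"])

print(clerk.can_issue(rename))  # False: the write is ahead of it on f
clerk.issue(write)
print(clerk.can_issue(rename))  # True: now at the head of both queues
```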

How does coherence work?  Tokens
  What is a token?  Gives you the right to hold data in your cache
  Read token - allows you to cache copy of clean data.  Really 3 tokens:
    Info - Just the right to cache the file attributes
    Search - The right to cache directory lookup results (like NFS LOOKUP)
    Read - The right to cache file data, or directory data (NFS READ, READDIR)
  Write token - allows you to cache dirty data
    Write - The right to cache dirty file blocks and size changes
    ChangeAccess - Ability to cache chown, chmod file operations
    ChangeParent - Ability to rename or delete a file
  Tokens are also used for access control
    Permission checks prevent you from getting unauthorized tokens
    When performing read/write operations, server only checks client's tokens
  Token compatibility matrix (Table 5, p.31)
  What is an open token?  How is it different from a read token?
    Problem:  When open file is deleted, it must persist on server
    So server won't really erase a deleted file until all open tokens are returned

Token revocation
  What happens if you need a token incompatible with an already granted one?
  Server asks client for it back
    Read tokens returned immediately
    Write tokens require clerk first to write back all data covered by token
      which might additionally require writing back co-fordered files
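
A rough sketch of the clerk's side of revocation.  Everything here is
hypothetical (class, fields, the flat "read"/"write" token kinds);  the real
token types and the compatibility matrix are in the paper's Table 5 and are
not reproduced.  The point is only the asymmetry:  read tokens go back at
once, write tokens go back after write-back, fordered files included.

```python
class Clerk:
    def __init__(self):
        self.tokens = {}         # file -> "read" or "write"
        self.dirty = {}          # file -> list of unwritten updates
        self.fordered_with = {}  # file -> files that must be written first
        self.server_log = []     # stand-in for the server: (file, update)

    def write_back(self, file):
        # Anything fordered ahead of this file has to reach the server first.
        for other in self.fordered_with.pop(file, ()):
            self.write_back(other)
        for update in self.dirty.pop(file, []):
            self.server_log.append((file, update))

    def revoke(self, file):
        """Server asks for the token on file back."""
        kind = self.tokens.pop(file)
        if kind == "write":
            self.write_back(file)   # must finish before the token is returned
        return kind


clerk = Clerk()
clerk.tokens = {"d": "write", "g": "read"}
clerk.dirty = {"d": ["rename"], "d/f": ["hello"]}
clerk.fordered_with = {"d": ["d/f"]}

clerk.revoke("g")        # read token: nothing to write back
clerk.revoke("d")        # write token: d/f goes first, then d
print(clerk.server_log)  # [('d/f', 'hello'), ('d', 'rename')]
```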

Can you suffer deadlock in acquiring tokens?
  Some operations require multiple tokens (e.g., rename)
  Acquire tokens in order of their IDs
  Actually, use two phases.
  Phase one, acquire all tokens if you can, but release if asked
  Phase two, reacquire any you might have lost
    Plus, increment dirty counter so you don't give them back
  May need to abort in phase two
  Make sure tokens you have are still the right ones
  Perform operation
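
The two phases can be sketched in a single-threaded toy model.  All names
are hypothetical, acquire is assumed to always succeed, and keeping the
dirty counter raised after a successful operation (until some later
write-back) is an assumption of the sketch.  A real clerk would block in
acquire;  this only shows the order of the steps.

```python
class Clerk:
    def __init__(self):
        self.held = set()   # token ids currently held
        self.dirty = {}     # token id -> dirty counter (pinned while > 0)

    def acquire(self, tok):
        self.held.add(tok)

    def revoke(self, tok):
        """Server callback: refuse while the token is pinned."""
        if self.dirty.get(tok, 0) > 0:
            return False
        self.held.discard(tok)
        return True

    def multi_token_op(self, needed, still_valid, perform,
                       between_phases=lambda: None):
        order = sorted(needed)           # fixed ID order avoids deadlock
        for tok in order:                # phase one: acquire, do not pin
            self.acquire(tok)
        between_phases()                 # revocations may arrive here
        for tok in order:                # phase two: reacquire and pin
            if tok not in self.held:
                self.acquire(tok)
            self.dirty[tok] = self.dirty.get(tok, 0) + 1
        if not still_valid():            # tokens no longer the right ones
            for tok in order:
                self.dirty[tok] -= 1     # unpin and abort
            return None
        return perform()


clerk = Clerk()
result = clerk.multi_token_op(
    {7, 3},
    still_valid=lambda: True,
    perform=lambda: "renamed",
    between_phases=lambda: clerk.revoke(3),  # token 3 lost after phase one
)
print(result)            # renamed
print(clerk.revoke(3))   # False: pinned by the dirty counter
```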

Failure recovery
  Tokens are within a session (between a particular client and server)
  There is a lease on each session
  If a client's session's lease expires, all its tokens automatically revoked
  What does this mean for write-behind cache?
    Bad news - may need to discard operations
    How do applications learn about discarded operations?
      May have closed file already before writes discarded
        Standard recovery - all operations to FS return errors
          Drastic.  Basically hope is app. will exit quickly with error
          (Maybe should even have sent signal to processes)
        Self recovery mode
          Open files return error
          "." returns error until chdir
          Absolute pathnames work
          Idea was to recover by chdir (/...), but in practice was not so useful
            Users don't think this way
            Existing applications don't really work this way
            New applications could have used, e.g., failure handles
        Null recovery (not implemented, but maybe should have been)
          Open files all return errors
          Newly opened files work fine
          Would be good for shells
      But app. may have exited before they were discarded!  Uh-oh
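
The session lease at the top of this section amounts to a timer on the
server.  A minimal sketch, with a made-up Server class and explicit "now"
arguments in place of a real clock;  the lease length and renewal protocol
are assumptions, not Echo's.

```python
class Server:
    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.expires_at = {}   # client -> time its session lease runs out
        self.tokens = {}       # client -> set of tokens held in the session

    def open_session(self, client, now):
        self.expires_at[client] = now + self.lease_seconds
        self.tokens[client] = set()

    def grant(self, client, token):
        self.tokens[client].add(token)

    def renew(self, client, now):
        """Extend the lease; fails if the session has already lapsed."""
        if client not in self.expires_at or now >= self.expires_at[client]:
            return False
        self.expires_at[client] = now + self.lease_seconds
        return True

    def expire_sessions(self, now):
        """Drop lapsed sessions; all their tokens are revoked with them."""
        lapsed = [c for c, t in self.expires_at.items() if now >= t]
        for client in lapsed:
            del self.expires_at[client]
            del self.tokens[client]
        return lapsed


server = Server(lease_seconds=30)
server.open_session("clerk-a", now=0)
server.grant("clerk-a", "write:/proj")

print(server.renew("clerk-a", now=20))   # True, lease now runs to 50
print(server.expire_sessions(now=60))    # ['clerk-a']
print(server.renew("clerk-a", now=61))   # False, session is gone
```

Once the session is gone the clerk's unstable writes have nowhere to go,
which is where the recovery modes above come in.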

Discussion:
  What happened in Vesta benchmark, Table 3 on p.21?
    Only 18 files created, but turning on dir write-through made it 143% slower
    Benchmark atomically updates files with forder + rename
    So now effectively becomes fsync + rename,
       since file writes must all commit before rename, and rename synchronous
    Prevents overlapping of computation with writing to server
    Paper concludes forder and dir write-through are a bad combination
      But in, e.g., NFS or AFS you would have problem even w/o forder
      Even most local file systems would require that you call fsync
  What happened with /proj/packages (p. 25)?
    Acquired /proj write token
    fordered a whole bunch of files with /proj
    renamed directory in /proj
    Readers needed token back,
      but all fordered files had to be written through first
    Solution: fsync before acquiring /proj token