Caching and replication
=======================

To get good performance, we want to cache data in a network file system
Problem: ensuring network file system clients see the same data
Two interesting cases

Sequential write sharing: only one writer or multiple readers at a time
  How do NFS and AFS offer consistency under sequential write sharing?
  NFS - poll the server for freshness on every open
    Disadvantage -- heavy load on the server
    Disadvantage -- incur a round trip on every open
  AFS - have the server call you back
    Advantage -- less load on the server, fewer round trips
    Disadvantage -- server must keep state (potentially large)
    Disadvantage -- what if a client disappears?
  Another mechanism: leases - server promises a callback for a limited time
    Advantages -- same as AFS callbacks
    Advantages -- plus, can recover from a dead client
    Disadvantages -- have to wait out the lease time to recover from a dead
      client or a server reboot (since leases are kept in memory)
  Lease implementation issues (a small sketch follows below):
    The lease may include an absolute expiration time
      Valid until ExpTime - MaxClockSkew, so requires roughly synchronized clocks
    Or the lease can be relative: "N seconds from now"
      Then a conservative estimate of expiration is RPC_Issue_Time + N
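Here is a minimal sketch of the relative-lease case, in Go. The RenewLease RPC, the
30-second lease length, and the function names are made up for illustration; the only
point is that the client records the time it *issued* the renewal, so clock skew and
network delay can only make its view of the lease expire earlier than the server's,
never later.

    package main

    import (
        "fmt"
        "time"
    )

    // renewLease pretends to ask the server for a lease of leaseSecs seconds and
    // returns a conservative local expiration time: RPC issue time + N.
    func renewLease(leaseSecs int) time.Time {
        issued := time.Now() // record the time *before* sending the RPC
        // ... send a (hypothetical) RenewLease RPC to the server here (elided) ...
        return issued.Add(time.Duration(leaseSecs) * time.Second)
    }

    func main() {
        expires := renewLease(30)
        if time.Now().Before(expires) {
            fmt.Println("cached data still covered by the lease until", expires)
        } else {
            fmt.Println("lease expired; must revalidate with the server")
        }
    }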
Concurrent write sharing: multiple simultaneous accesses, at least one of which
is a write
  How do NFS and AFS provide consistency under concurrent write sharing?
  Short answer: they don't, really
  NFS - changes are written to the server asynchronously
    Even when writes are not on the server's disk (COMMITted), other clients
      will see the changes - unless maybe the server crashed
    So you don't have strict consistency
      Say you read immediately after my write returns
      The write is asynchronous (before close), so you might miss it
      If you are caching the data block, you won't check with the server
    What about mmap? (Forget it)
  AFS - even worse than NFS
    Writes are not visible to other clients until you close the file
    If two clients write a file, the last one to close it wins

How to achieve concurrent write sharing? Example: the Sprite file system
  Motivation - study of BSD file usage
    75% of files are open less than 1/2 second, 90% less than 10 seconds
    20-30% of files are gone within 30 seconds
    So they wanted to avoid writing a file through to the server on close
  Sprite write policy
    Writes first go just to the client cache, then the server cache, then disk
    Write to the next stage when unmodified for 30 seconds
    Benefit of delay -- exploit overwrites and deletes
    How useful in practice?
      Speculation: many files in /tmp, so maybe just make /tmp local...
  Sprite consistency -- put each file in one of several modes:
    Sequential write sharing
      After every modifying close, bump the version number
      The version number tells other clients to flush their caches
    Concurrent write sharing -- disable caching
      What happens when a second open puts a file in concurrent write sharing?
        Server keeps track of the last writer
        Server asks the writer to flush its dirty blocks, and waits
        Then the server lets the open succeed

So which of NFS/AFS/Leases/Sprite is preferable?
  Is concurrent write sharing a problem? Not with most applications
  Which would you want for your laptop when on an airplane?
    None - they all require communication with the server

Idea: replicate data *optimistically*
  Example: disconnected operation with laptops
    source-code repository, cooperating programmers
    pair-wise file synchronization between programmer laptops
    want to be able to modify files while disconnected
  Try to get the right answer and be consistent if you can do so easily
  But better to return an old version of a file than none at all!
  What about concurrent updates?
    Need to enforce some order of operations
    e.g. Ivy DSM ensured sequential consistency

Example 1: just focus on modifications to a single file
  H1 writes "a"
  H1 -> H2
  H2 writes "b"
  H1 -> H3
  H3 -> H2
  What is the right thing to do?
  Is it enough to compare modification times?
    Yes in this case, as long as you carry them along correctly

Example 2:
  H1 writes "a"
  H1 -> H2
  H1 writes "b"
  H2 writes "c"
  H2 -> H1
  H2's mtime will be bigger
  Does that mean we should use "c" and discard "b"?

What is the principle here?
  We're not *enforcing* a sequential update history
    No locks, no Ivy-style owners
  But *if* updates were actually sequential, we want to detect that and carry
    along the most recent copy
  And if updates were not sequential, we want to detect the problem
    (after the fact)
  This is where the "optimism" comes in

So what is the problem with using wall-clock time?
  Certainly if T1 < T2, the version written at T1 is not the latest
  *But* T2 > T1 does not imply that the write at T2 supersedes the write at T1
  The underlying problem is that time doesn't reflect causality
    That is, we care whether the write at T2 was made on a host that had seen
      the write at T1
  Also, you cannot rely on computers having synchronized clocks

How can we decide more precisely whether we're looking at a sequential history?
  We could record each file's entire modification history
    A list of hostname/localtime pairs
    And carry the history along when synchronizing between hosts
  For example 1:
    H2: H1/T1, H2/T2
    H3: H1/T1
  For example 2:
    H1: H1/T1, H1/T2
    H2: H1/T1, H2/T3
  Then it's easy to decide whether version X supersedes version Y:
    If Y's history is a prefix of X's history
  This exactly captures our desire for an *ordered* sequence of updates
  Note we're not comparing times, so this works without synchronized clocks

Why is complete history not such a great solution?
  The histories grow without bound if we modify a file a lot
  Can we compress the histories?

Proposal: vector timestamps
  Just record the latest time for each host
    So throw away all but the last history record for each host
  All we care about is whether there's a *new* update by some host that we
    hadn't heard about when we changed the file
  We assume that such an update must be *after* the last time we heard from
    that host: that the host isn't going backwards in time
  A vector timestamp is a vector of times, one entry per host
    The times are (as in the histories) local wall-clock times
    Though they could be any monotonic count
  Comparison rules (see the sketch at the end of these notes):
    a = b if they agree at every element
    a < b if a[i] <= b[i] for every i, but !(a = b)
    a > b likewise, i.e. if b < a
    a || b if a[i] < b[i] and a[j] > b[j] for some i, j
      i.e. a and b conflict
  If one history was a prefix of the other, then one vector timestamp will be
    less than the other
  If one history is not a prefix of the other, then (at least by example) the
    VTs will not be comparable

So now we can perform file synchronization
  When a pair of laptops has two versions of a file, they accept the one with
    the higher vector timestamp
  And signal an error if the VTs conflict
  Illustrate with the two examples

What if there *are* conflicting updates?
  VTs can detect them, but then what? Depends on the application
  Might be easy: a mailbox file with distinct messages, just merge
  Might be hard: changes to the same line of C source
  Reconciliation must be done manually for the hard cases
  CVS keeps extra information, equivalent to the complete history
    So it can generate diffs from the common ancestor
    Much better than diffing the two final versions against each other
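A minimal sketch of the vector-timestamp comparison rules above, in Go. The map-based
VT type, the host names, and the compare function are illustrative assumptions, not any
particular system's API; a missing entry is treated as "never heard from that host".

    package main

    import "fmt"

    // VT maps a host name to that host's last (local) modification time.
    type VT map[string]int64

    // compare returns "=", "<", ">", or "||" (concurrent, i.e. conflicting).
    func compare(a, b VT) string {
        aLess, bLess := false, false // does some entry of a (resp. b) lag behind?
        hosts := map[string]bool{}
        for h := range a {
            hosts[h] = true
        }
        for h := range b {
            hosts[h] = true
        }
        for h := range hosts {
            // a missing entry reads as 0: no update ever heard from that host
            if a[h] < b[h] {
                aLess = true
            }
            if a[h] > b[h] {
                bLess = true
            }
        }
        switch {
        case !aLess && !bLess:
            return "=" // identical: same set of updates seen
        case aLess && !bLess:
            return "<" // b supersedes a: keep b's copy
        case !aLess && bLess:
            return ">" // a supersedes b: keep a's copy
        default:
            return "||" // conflict: concurrent updates, must reconcile
        }
    }

    func main() {
        // Example 2 from above: H1's copy reflects its own writes at T1 and T2;
        // H2 wrote at T3 having only seen H1's write at T1.
        h1 := VT{"H1": 2}
        h2 := VT{"H1": 1, "H2": 3}
        fmt.Println(compare(h1, h2)) // "||" -- conflicting updates, signal an error
    }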