Managing Update Conflicts in Bayou,
a Weakly Connected Replicated Storage System
=============================================

Let's build a calendar system to help understand ordering and conflicts.
  Each entry has a time and a set of participants.
  We want everyone to end up seeing the same set of entries.

Traditional approach: one calendar server.
  Ordering: only one copy, server picks the order.
  Conflict resolution: server checks for conflicts before accepting update.
    Returns error to user, who can handle in any way desired.

Why aren't we satisfied with central server?
  I want my calendar on my Palm Pilot.
    I.e. database replicated in every node.
    No master copy.
  Periodic connectivity to net.
  Periodic infrared contact with other calendar users.

Straw man 1: swap complete DBs.
  Similar to Palm Pilot sync.
  Might require lots of network b/w.
  What if there's a conflict? IE two meetings at same time.
    Palm just schedules them both!
    But we want automatic conflict resolution.
  We can't just view DB items as bits.
    Since then we can't resolve conflicts.
  We want intelligent DB items that know how to resolve conflicts.
    They are more like updates: read DB, think, make a change.
    But we have to make sure that all nodes resolve conflicts the same way.
    How?
      
Insight:
  Maintain an ordered list of updates at each node.
  Make sure every node has the same updates.
  Make sure every node applies the updates in the same order.
  Make sure the updates are deterministic functions of DB contents.
  Then a "sync" is really a merge of two ordered lists: easy.
    Not a DB merge at all.

What are the write log entries?
  Why not just "10am meeting with dm"?
    Because we can't resolve conflicts.
  Instead, "1-hour meeting w. dm at 9, otherwise 10, otherwise 11."
    Along with a unique ID: <local-time-stamp, originating-node>
    This is really instructions for doing a write, not the written data.
    So the write log contains instructions in the distributed calendar program.
      We want all nodes to run same instructions in same order. Eventually.
  Each server applies updates it gets to its state
    Order by Lamport clock total order

How do writes actually propagate?
  Unidirectional, peer-to-peer synchronization
  By what mechanism?  Any way you want
    Plug laptop into network
    IR link between to palm pilots
    Transfer floppy between two laptops on airplane (can't use wireless)

Example:
  <701,A>: Node A asks for meeting M1 to occur at 10am, otherwise 11am.
  <770,B>: Node B asks for meeting M2 to occur at 10am, otherwise 11am.
  Let's agree to sort by write ID (e.g. <701,A>).
  As "writes" spread, nodes may initially apply updates in different order.
  Each newly seen write is merged into the log,
    May have to undo updates to insert older ones (e.g., B gets <701,A>)
    Then replay the log
    May cause user's view of calendar to change!
    I.e. all entries are really "tentative", nothing is stable.
  But when everybody has seen all the writes, everybody will agree.

Global time sync is not possible.
  Does that make this particular scheme impossible?
  No. We're just using time stamps to allow agreement on order.
    Doesn't matter if node clocks are wrong.
    As long as users don't depend on reasoning about real time.
  But this scheme arbitrarily constrains order.
    You never know if there's some write from the past you haven't seen.
    So all entries must be tentative forever.
    And you have to keep the log around forever.

How can we allow a notion of committing a tentative entry?
When can you ever truncate the log?
  For an entry X to be committed, everyone must agree on:
    The total order of all previous committed entries.
    The fact that X is the next in the total order.
    The fact that all uncommitted entries are "after" X.
  
How does Bayou agree on total order of committed writes?
  One designated "primary replica".
  It marks each write it receives with a permanent CSN.
    Commit Sequence Number.
    That write is committed.
    So a complete time stamp is <CSN, local-TS, node-id>
  CSN notifications propagate with updates using the anti-entropy algorithm
  The CSNs define a total order for committed writes
    All nodes will eventually agree on it
    Uncommitted writes come after all committed writes (infinite CSN)

Do we know know enough to show user "committed" flag with entries?
  Not quite.
  Entire log up to that committed entry must be stable.
    Otherwise there might be earlier committed write we don't know about.
    And we'd have to re-run conflict resolution.
  Committed write isn't stable unless we've seen all prior committed writes.
  Bayou update propagation protocol maintains this.
    Propagates writes in order.

Now DB entries can be shown to user as committed.
  Which means everyone does (or will) agree on them.
  And a slow or disconnected node cannot prevent commits.

When can you discard log entries?
  Can never discard uncommitted
    You might need to undo that far when learning of other updates
  Can truncate any prefix of committed operations

Is is OK if primary replica can choose *any* order to commit?
  Suppose I schedule an event, then want to delete it?
    Or change attendee list?
  The create must precede the delete in the CSN order!
    And in every node's view of uncommitted part of log too.
  Total order must preserve order of writes originated at each node.
    But not necessarily order among different nodes' writes.
  How can primary replica be sure it commits each node's writes in order?
  1) Nodes actually use Lamport logical time-stamps for local TS.
  2) Everybody sends updates in order.
  So primary sees updates in causal order, and commits them in that order.
  Does CSN always preserve parial "accept order"?
  No: Sync, resolve conflict, have meeting, while primary changed order

So if things can switch around what guarantees to we have?
  Read your writes - In same session, reads always reflect writes
    Change your password, don't want to get "invalid password error"
    Don't want deleted emails to reappear
  Monotonic reads - if R1->R2 in session, R2 reflects all writes seen by R1
    Don't want meetings to come and go in calendar
    Mailbox - if you see a new message, request to read it should succeed
  Writes follow reads - Say R1 sees W (from other session) and makes W1
		        Everyone must see W1 as happening after W
    Bulletin board - want replies to appear after messages   
  Monotonic writes - If you write W1 then W2, then everyone sees W2 after W1
    In text editor, if you save, edit, save, want to see second save second!

How do I propagate if I've discarded part of my log?
  Suppose I've discarded *all* writes with CSNs.
  I keep a copy of the stable DB reflecting just discarded entries.
  First, I know I cannot receive writes that conflict.
    Could only happen if write has a CSN < one discarded.
    But I already saw it, in the right order, so can ignore.
  When I propagate to node X:
    If node X's highest CSN is less than my lowest (= Omitted Seq No, OSN),
      I can send him my stable DB reflecting just committed writes.
      Node X can use my DB as starting point.
      And X can discard all CSN log entries.
      Then play his tentative writes into that DB.
    If node X's highest CSN is greater than mine,
      X can ignore my DB.

What gets transfered in anti-entropy algorithm?
  Each server maintains a version vector V
    logical clock of last update in log of each user
Creating/deleting servers
  Accept-stamp is ID of new server
  Propagate "retirement write", that lets people remove server from v vecs

Log truncation
  Each server maintains "omitted" vector O
    logical clock of most recent update from each server omitted from the log
  Each server keeps OSN, the sequence number of latest omitted request

Evaluation
  Seems much more functional than Palm Pilot calendar.
  *Not* transparent to applications!
    Requires very strange programming practices.
    Every "write" is actually a bunch of code, not the new bits.
    Check conflicts (meeting already scheduled for any attendee?).
    Resolve conflicts (choose a different unused meeting time?).
  Not every application has appropriate semantics.
    Might actually work for a bank account!
    But conflicts can't always be automatically resolved.
      For example, changes to source code from multiple programmers.