==========================
Replication
==========================

Example application:
  Appendable file, perhaps suitable for e-mail storage.
  For simplicity, just one file.
  Must support multiple concurrent clients.
  Just two operations:
    Read
    Append(Data)

We want the replicated system to mimic the client-visible behavior of a
  single-server implementation, despite partial failures at any time.

Examples of correct behavior:
  A1(X), A1(Y), R1 -> X,Y
  A1(X) concurrent w/ A2(Y): all reads see the same X/Y order.
  This is sequential consistency.

Single-server implementation:
  Process operations one at a time.
  Write to stable storage before replying to the client.
  Client waits for the reply before sending its next message.
  Might lose the last update, but the client will re-send after reboot.

Why do programmers like this model?
  Hides the existence of concurrency.
  Hides failure recovery details.
  General tool, the same for many applications.

But programmers might also want:
  To avoid a single point of failure.
  To improve read (or even write) performance as well as availability.

Straw man (incorrect!) replicated database:
  Imagine 2 servers, with the file replicated on both.
  Clients send read operations to either server.
  Clients send update operations to both, and wait for both to reply.
  What can go wrong?
    Not available if either server is down.
    A1(X), A1(Y), R : correct; waiting for the reply forces the order.
    A1(X) | A2(Y) : not correct.
      Losses/delays may cause a different order at the two servers.

We have two problems to fix:
  Want to survive one server being down.
    The straw man is *less* available than a single server!
  Want to ensure the servers have identical replicas,
    i.e. process updates in the same order.

Can we relax the requirement that the client wait for replies from all servers?
  To tolerate one server being down.
  The naive approach (wait for just one) does not work!
    What if one of your network messages is lost?
    What if the network is partitioned?
    The client can't distinguish a server being down from a partition.

Can fix the partition problem with voting:
  Use 2n+1 replicas if you want to survive n failures.
  Only allow operations in a partition with >= n+1 members.
    There can be at most one such partition.
  Reads also need to contact n+1 members,
    to verify that the client is in the majority partition.
    (What if the client gets different answers???)

Can fix the order problem with a primary copy:
  One primary; clients send operations to it.
  The primary imposes an order on concurrent operations.
  The primary tells the slaves about operations, with a sequence number.
  Slaves perform the operation, then send an ACK to the primary.
  (What if the primary crashes???)

Second straw man: is this enough?
  Can the primary just wait for ACKs, then respond to the client?
  No! What if fewer than n slaves respond?
    I.e. the primary may be in the minority partition.
    Then the primary should abort the operation and send an error to the client.
    But some of the slaves have already performed it!
  So slaves need to defer the actual update until the primary confirms.

2-phase commit protocol:
  Phase 1: Primary sends updates to the slaves.
    Slaves append the update information to a log, but don't yet perform it.
      (Logs must be persistent???)
    Slaves ACK the first phase to the primary.
    Primary waits for n replies (so n+1 replicas hold the update).
    Primary replies "YES" to the client.
  Phase 2: In the background, the primary tells the slaves that the commit happened.
    Slaves update the real DB from the log entry.
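To make the message flow concrete, here is a minimal sketch (in Go) of the
primary's side of this protocol.  The names (Backup, Prepare, Commit, Append)
and the transport are invented for illustration; timeouts, retries, and the
per-operation ordering rules discussed below are omitted.  This is a sketch
under those assumptions, not Harp's actual interface.

  package replication

  import (
      "errors"
      "sync"
  )

  // Backup is the transport to one slave; Prepare/Commit are invented names.
  type Backup interface {
      Prepare(seq int, data string) bool // phase 1: append to log and ACK
      Commit(seq int)                    // phase 2: apply the logged op
  }

  type Primary struct {
      mu      sync.Mutex
      seq     int      // last assigned sequence number
      log     []string // primary's own copy of the log
      backups []Backup // 2n backups, so 2n+1 replicas in total
      n       int      // number of failures to tolerate
  }

  // Append is the client-visible operation.  Phase 1 runs synchronously;
  // phase 2 (commit) happens in the background after the client gets "YES".
  func (p *Primary) Append(data string) error {
      p.mu.Lock()
      p.seq++
      seq := p.seq
      p.log = append(p.log, data) // the primary itself holds one copy
      p.mu.Unlock()

      // Phase 1: send the update to every backup and count ACKs.
      acks := make(chan bool, len(p.backups))
      for _, b := range p.backups {
          go func(b Backup) { acks <- b.Prepare(seq, data) }(b)
      }
      got := 0
      for replies := 0; replies < len(p.backups) && got < p.n; replies++ {
          if <-acks {
              got++
          }
      }
      // Need n backup ACKs, so the op is on n+1 replicas (counting the primary).
      if got < p.n {
          return errors.New("no majority reachable: abort, client must retry")
      }

      // Returning nil is the "YES" to the client; phase 2 runs in the background.
      go func() {
          for _, b := range p.backups {
              b.Commit(seq)
          }
      }()
      return nil
  }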
What if the primary fails before sending the client the ACK?
  I.e. while sending phase 1 msgs or collecting slave ACKs.
  If some slave got the phase 1 message, it can re-start the operation.
  But it's OK to just abort the operation -- the client hasn't seen an ACK.

What if the primary fails after ACKing the client,
  but before sending phase 2 commit messages to all slaves?
  The operation *must* complete, because the client has seen the ACK.
  The new primary can ask the remaining replicas.
  If n+1 saw (and ACKed) the phase 1 message, the new primary can safely commit.

What about concurrent operations?
  The primary numbers them and allows them to proceed.
  Primary and slaves keep a log of all operations.
  A slave only ACKs a phase 1 msg if it has seen all prior phase 1 msgs.
  The primary only sends out a commit if it has committed all previous ones.
    Otherwise a reboot could lose a later op but commit an earlier one.
  Logs look like:
    Old committed operations
    <-- Commit Point (CP)
    Uncommitted operations

What if a slave fails?
  Doesn't matter -- can just keep going as long as > 1/2 survive.

Slave log:
  If a replica ACKs phase 1:
    It can't yet write the DB.
    But it has promised to do so when asked,
      since the primary may already have ACKed the client.
    A slave should not ACK, reboot, and change its mind.
    So it must have a stable log, in the same order as the primary's log.

Reads:
  Can't just send to any replica.
  Must make sure it's in the majority partition.
    Otherwise we may miss a committed write in the other half.
  A read must reflect all committed updates.
    So clients have to send reads to the primary.
  A read could wait for all prior writes to commit,
    but could also just use the latest committed write at the primary.
  Not much of a savings, since you have to gather a quorum anyway.

When is the real on-disk DB written (from the log)?
  The primary sends its current CP along with every message.
  Slaves remember the largest CP seen.
  Real DB entries are written only when the CP passes the log entry.
  Updates happen in the background, in log order.
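Here is the matching backup-side sketch, with the same caveats: persist and
applyToDB are placeholders for the stable log and the real on-disk DB, and
(unlike the primary sketch above) the phase 1 message also carries the
primary's current commit point.  It shows the two rules above: a backup only
ACKs a prepare once it has seen all prior ones, and it applies entries to the
real DB, in log order, only as the CP advances.

  package replication

  import "sync"

  type logEntry struct {
      seq  int
      data string
  }

  // BackupServer is the slave side of the protocol.
  type BackupServer struct {
      mu      sync.Mutex
      log     []logEntry // in sequence order; must live on stable storage
      applied int        // how many log entries have reached the real DB
      cp      int        // largest commit point (CP) heard from the primary
  }

  // Prepare handles a phase 1 message.  Every message from the primary also
  // carries its current CP, so older entries get applied in the background.
  func (b *BackupServer) Prepare(seq int, data string, primaryCP int) bool {
      b.mu.Lock()
      defer b.mu.Unlock()

      // Only ACK if we have seen all prior phase 1 messages; otherwise a
      // crash and reboot could commit a later op while losing an earlier one.
      if seq != len(b.log)+1 {
          return false
      }
      b.log = append(b.log, logEntry{seq, data})
      // The log must hit stable storage before the ACK: by ACKing we promise
      // to perform the op later, and may not change our mind after a reboot.
      persist(b.log)
      b.advanceCP(primaryCP)
      return true
  }

  // advanceCP applies committed log entries to the real DB, in log order,
  // up to the commit point.
  func (b *BackupServer) advanceCP(primaryCP int) {
      if primaryCP > b.cp {
          b.cp = primaryCP
      }
      for b.applied < b.cp && b.applied < len(b.log) {
          applyToDB(b.log[b.applied])
          b.applied++
      }
  }

  func persist(log []logEntry) {} // placeholder: write the log tail to disk
  func applyToDB(e logEntry)   {} // placeholder: apply one op to the real DB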
How do we agree on the choice of a primary and the "majority" backups?
  It's not enough for each operation to be sent to any majority of nodes,
    because then different nodes see different operation sequences,
    so the replicas will diverge.
  There must be a coherent notion of the current majority,
    and changes to that majority must be controlled:
    The new majority must have the same starting state left by the previous majority.
    So you have to know what the previous majority was.
    So the old majority must be preserved in the new majority.
    E.g. if you have just 2 of 5, perhaps they were in the minority,
      and missed some operations.
  Let's call an operating majority of nodes a "view".

What can go wrong between two views that don't share a majority of nodes?
  Assume a 5-node system.
  1. Operate with V1 = A, B, C, D, E.
  2. D and E are cut from the network, so V2 = A, B, C.
  3. C crashes, then revives w/o state.
  4. But now A & B are partitioned from C, D, & E.
  5. So V2 = C, D, E.
  But now we've lost our total order of views, and hence of operations.
  From the client's point of view, the system may claim to work,
    but may lose operations committed in the first V2.
  What went wrong: no one in the 2nd V2 knew of the first V2.

So how do we recover from view changes?
  Worst case:
    The old primary was operating with just n slaves.
    The primary crashes.
    The other n slaves simultaneously recover w/ incomplete logs.
    We must reconstruct the primary's state.
  Need to ensure the following:
  1. There must be only one view at a time.
     This is ensured by requiring n+1 nodes in any view.
  2. We must know about the previous view (so we don't miss one).
     n+1 nodes in each view guarantees we have one node from the previous view.
  3. We must know about everything that committed in the previous view.
     This is trickier.
     If n+1 nodes were in both the previous and current view,
       and none of those nodes lost state between the views,
       then at least one will have seen every committed operation.
     So the old view needed *more* than a majority:
       at least n+2 including the primary, so a dead primary leaves n+1 nodes.

Can we relax the requirement to satisfy 3?
  Say m >= n+1 nodes were in the previous view.
  You really only need m - n of them to survive:
    if n+1 nodes ACKed every operation,
    then any m - n out of the m nodes include one that saw each op.
  So that's the criterion for accepting updates into the new view.

By the way, this is not quite two-phase commit, though it is similar.
  Two-phase commit nodes are allowed to vote ABORT.
    Say the primary and one backup fail.
    All other backups voted COMMIT.
    No one received GLOBAL-COMMIT.
    No one knows how the last backup voted.
  However, in the append/read example (and in Harp), backups do not vote to abort.
  Thus we can always COMMIT an operation if any backup saw and ACKed it.

So here's an approach to view changes:
  Find a majority for the new view.
  Views form a sequence and should be totally ordered -- always a next and a previous.
  A view must include at least n+1 nodes that were in the previous view.
  Choose a new primary (perhaps the node w/ the highest node ID).
  Commit any operations that any of the surviving nodes ACKed.
    Within a view, all operations were assigned an order by the primary,
    so avoid committing things out of order.

Interesting cases:
  The primary fails.
  An operation is interrupted by a view change.
  A node fails or recovers during a view change,
    i.e. multiple view changes started at the same time.
  What happens in the minority partition?
    The view change must fail in that case!

Give every server a current view_number in stable storage.
  view_number = < sequence_number, initiating_server_id >
  The second part helps break ties.
First, the initiating server chooses a new view number:
  new view_number = < old_sequence_number + 1, initiating_server_id >
  (using its own server ID).
The initiating node broadcasts a prepare-view-change message,
  containing the new view_number.
Recipients stop processing operations.
Recipients send their log tails to the initiator.
The initiator waits for a majority of replies.
  Maybe a little longer, so that the view includes all live nodes.
The initiator decides which operations are before/after the view change,
  i.e. which have committed and must be preserved.
  Others are aborted and must be re-started by the client.
The initiator sends a phase 2 message to each replica:
  The list of nodes (including who is primary).
  The list of committed recent operations.
Now the replicas can continue.

What if there are two concurrent view changes?
  We must retain a notion of views occurring in a particular order!
    Otherwise we cannot show continuity of the majority.
  I.e. failure/recovery during a view change.
  initiating_server_id breaks the tie, so we know which is first/second.
  If the first got a majority, the second did not.
    So complete the first view change; abort and maybe re-do the second one.
  If the second got a majority, the first aborts.
  But perhaps neither got a majority.
    They must both abort, wait a random delay???, and try again.
    Maybe delays proportional to node ID, so there's a deterministic winner of the retry.

Big point:
  Always at most one functioning view -- thus at most one primary.
  Operations are totally ordered within each view.
  No operation spans a view.
  Views are also totally ordered.
  The majority carried from each view to the next brings with it
    knowledge of the order of committed operations.
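A sketch of the bookkeeping an initiator might do during a view change, again
with invented names (ViewNumber, LogTail, decideCommits).  It shows the
<sequence_number, server_id> ordering that breaks ties between concurrent
view changes, and one plausible reading of "commit any operations that any of
the surviving nodes ACKed": merge the collected log tails and commit in
sequence order up to the first gap.

  package replication

  // ViewNumber identifies a view: a higher sequence number wins, and the
  // initiating server's ID breaks ties between concurrent view changes.
  type ViewNumber struct {
      Seq         int
      InitiatorID int
  }

  // Less reports whether view a comes before view b.
  func (a ViewNumber) Less(b ViewNumber) bool {
      if a.Seq != b.Seq {
          return a.Seq < b.Seq
      }
      return a.InitiatorID < b.InitiatorID
  }

  // LogTail is what a replica sends the initiator: the operations it has
  // ACKed (logged) but not necessarily applied, keyed by sequence number.
  type LogTail map[int]string

  // decideCommits merges the log tails collected from a majority and returns,
  // in sequence order, the operations the new view must treat as committed:
  // every op that any surviving replica ACKed, stopping at the first gap.
  // Ops after a gap are aborted; clients re-submit them in the new view.
  func decideCommits(firstUncommitted int, tails []LogTail) []string {
      merged := map[int]string{}
      for _, t := range tails {
          for seq, data := range t {
              merged[seq] = data
          }
      }
      var keep []string
      for seq := firstUncommitted; ; seq++ {
          data, ok := merged[seq]
          if !ok {
              break // gap: nothing at or after this seq can be committed
          }
          keep = append(keep, data)
      }
      return keep
  }

The initiator would then put the result of decideCommits into its phase 2
message, along with the new view's node list and primary.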
What if the initiator fails during a view change?
  We either do or don't have evidence that it completed the view change.
  If it committed operations, it's a real view; we must preserve it.
    It must have had a majority.
    Make sure we get a majority of nodes from it.
  If it committed no operations, it doesn't matter.

What if a view change happens in the minority partition?
  The initiator is unable to acquire a majority of ACKs.

Data recovery procedure:
  Same as before, i.e. how to deal with operations interrupted by a view change.
  Need to decide which operations to carry over from the old view.
  Must commit any operation the old primary might have responded to.
  Can abort any operation the old primary couldn't have responded to.
    The client will re-try in the new view.
  The old primary only committed if the operation was ACKed by n+1 replicas.
  If the primary fails, and we expect to continue, a majority must survive.
    So the old view had *more* than a majority.
  Thus if the primary committed an operation,
    a majority must know about it after the view change.
  So that's the criterion for accepting updates into the new view.

Does this preserve ordering of committed operations?
  Suppose OP1 and OP2 are concurrent, but in that order.
    Remember, they were assigned sequence numbers when submitted to the primary.
  Maybe the prepare messages for OP1 were lost,
    so the primary gets n+1 ACKs for OP2 first.
  Then the primary crashes.
  Consequences:
    The primary cannot reply to the client until all previous operations commit.
    The new primary cannot commit an op unless all previous ops
      can be shown to have (possibly) committed.
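The criterion above is pigeonhole arithmetic; here is a tiny check of it,
assuming m nodes in the old view and a system that tolerates n failures.

  package replication

  // viewChangeSeesAllCommits checks the quorum-intersection argument above.
  // Any operation the old primary might have ACKed to a client was logged by
  // at least n+1 old-view nodes.  If the new view hears log tails from at
  // least `survivors` old-view nodes, it is guaranteed to see every such
  // operation exactly when the two sets must overlap: (n+1) + survivors > m.
  func viewChangeSeesAllCommits(m, n, survivors int) bool {
      return (n+1)+survivors > m
  }

  // Example: 2n+1 = 5 replicas (n = 2), old view of m = 4 nodes.
  // Hearing from survivors = m - n = 2 old-view nodes is enough: 3 + 2 > 4.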