========================== Replication ==========================

Example application: Appendable mailbox file.
  For simplicity, just one file.
  Must support multiple concurrent clients.
  Just two operations:
    Read
    Append(Data)

We want the replicated system to mimic the client-visible behavior of a
  single-server implementation, despite partial failures at any time.
  Examples of correct behavior:
    A1(X), A1(Y), R1 -> X,Y
    A1(X) concurrent w/ A2(Y): all reads see the same X/Y order
  This is sequential consistency.

Single-server implementation:
  Process operations one at a time.
  Write to stable storage before replying to the client.
  Client waits for the reply before sending the next message.
  Might lose the last update, but the client will re-send after reboot.

Why do programmers like this model?
  Hides the existence of concurrency.
  Hides failure recovery details.
  General tool, same for many applications.
But programmers might also want:
  To avoid a single point of failure.
  To improve read (or even write) performance as well as availability.

Straw man (incorrect!) replicated database:
  Imagine 2 servers, file replicated on both.
  Clients send read operations to either server.
  Clients send update operations to both; wait for both to reply.
What can go wrong?
  Not available if either server is down.
  A1(X), A1(Y), R : correct; waiting for the reply forces the order.
  A1(X) | A2(Y) : not correct.
    Losses/delays may cause a different order at the two servers.

We have two problems to fix:
  Want to survive one server being down.
    The straw man is *less* available than a single server!
  Want to ensure the servers have identical replicas.
    I.e. process updates in the same order.

Can we relax the requirement that the client wait for replies from all servers?
  To tolerate one server being down.
  The naive approach (wait for just one) does not work!
    What if one of your network messages is lost?
    What if the network is partitioned?
      Client can't distinguish a server being down from a partition.

Can fix the partition problem with voting:
  Use 2n+1 replicas if you want to survive n failures.
  Only allow operations in a partition with >= n+1 members.
    There can be at most one such partition.
  Reads also need to contact n+1 members.
    To verify that the client is in the majority partition.
    (What if the client gets different answers???)

Can fix the order problem with a primary copy:
  One primary; clients send operations to it.
  Primary imposes an order on concurrent operations.
  Primary tells slaves about operations, with sequence numbers.
  Slaves perform the operation, then send an ACK to the primary.
  (What if the primary crashes???)

Second straw man: Is this enough?
  Can the primary just wait for ACKs, then respond to the client?
  No! What if fewer than n slaves respond?
    I.e. the primary may be in the minority partition.
    Then the primary should abort the operation and send an error to the client.
    But some of the slaves have performed it!
  So slaves need to defer the actual operation until the primary confirms.

A two-phase protocol (sketched in code below):
  1: Primary sends updates to slaves.
     Slaves append the update information to a *log*, but don't yet perform it.
       (Logs must be persistent???)
     Slaves ACK the first phase to the primary.
     Primary waits for n replies (so n+1 replicas, counting the primary, have the update).
     Primary replies "YES" to the client.
  2: In the background, the primary tells the slaves that the commit happened.
     Slaves update state from the log entry.

What if the primary fails before sending the client the ACK?
  I.e. while sending phase 1 msgs or collecting slave ACKs.
  If some slave got the phase 1 message, the operation can be re-started.
  But it's OK to just abort the operation -- the client hasn't seen an ACK.
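Here is a minimal, in-memory Go sketch of the two-phase protocol just described. Direct method calls stand in for network messages, nothing is written to stable storage, and the names (Primary, Slave, Prepare, Commit) are invented for illustration; treat it as a sketch of the message flow, not a real implementation.

package main

import "fmt"

// A toy, in-memory sketch of the two-phase protocol: direct method calls
// stand in for network messages and nothing is persistent. All names here
// (Primary, Slave, Prepare, Commit) are invented for illustration.

type LogEntry struct {
    Seq  int
    Data string
}

type Slave struct {
    log  []LogEntry // phase-1 entries, not yet performed
    mbox []string   // the "real" mailbox state
}

// Prepare is phase 1: append to the log and ACK, but don't perform the append yet.
func (s *Slave) Prepare(e LogEntry) bool {
    s.log = append(s.log, e)
    return true // ACK
}

// Commit is phase 2: apply the logged entry to the mailbox.
func (s *Slave) Commit(seq int) {
    for _, e := range s.log {
        if e.Seq == seq {
            s.mbox = append(s.mbox, e.Data)
        }
    }
}

type Primary struct {
    slaves  []*Slave
    nextSeq int
    n       int // tolerate n failures; 2n+1 replicas in total
}

// Append runs one client operation through both phases.
func (p *Primary) Append(data string) bool {
    p.nextSeq++
    e := LogEntry{Seq: p.nextSeq, Data: data}
    acks := 0
    for _, s := range p.slaves {
        if s.Prepare(e) { // phase 1
            acks++
        }
    }
    if acks < p.n {
        return false // no majority: abort, client sees an error
    }
    // Reply "YES" to the client here; phase 2 can run in the background.
    for _, s := range p.slaves {
        s.Commit(e.Seq) // phase 2
    }
    return true
}

func main() {
    slaves := []*Slave{{}, {}, {}, {}}
    p := &Primary{slaves: slaves, n: 2} // primary + 4 slaves = 2n+1 = 5 replicas
    fmt.Println(p.Append("X"), slaves[0].mbox)
}

Running it prints true and a one-entry mailbox on the first slave; a real primary would of course send these messages over the network, tolerate lost ACKs, and keep its own copy of the log.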
What if the primary fails after ACKing the client,
  but before sending phase 2 commit messages to all slaves?
  The operation *must* complete because the client has seen the ACK.
  The new primary can ask the remaining replicas.
  If n+1 saw (and ACKed) the phase 1 message, the new primary can safely commit.

What about concurrent operations?
  Primary numbers them and allows them to proceed.
  Primary and slaves keep a log of all operations.
  A slave only ACKs a phase 1 msg if it has seen all prior phase 1 msgs.
  The primary only sends out a commit if it has committed all previous ones.
    Otherwise a reboot could lose a later op but commit an earlier one.
  Logs look like:
    Old committed operations
    <-- Commit Point (CP)
    Uncommitted operations

What if a slave fails?
  Doesn't matter -- can just keep going as long as > 1/2 survive.

Slave log:
  If a replica ACKs phase 1:
    It can't yet update state (write your mailbox).
    But it has promised to do so when asked,
      since the primary may already have ACKed the client.
    A slave should not ACK, reboot, and change its mind.
    So it must have a stable log, in the same order as the primary's log.

Does this preserve the ordering of committed operations?
  Suppose OP1 and OP2 are concurrent, but in that order.
    Remember, they were assigned sequence numbers when submitted to the primary.
  Maybe the phase-1 messages for OP1 were lost.
  So the primary gets n+1 ACKs for OP2 first. Then the primary crashes.
  Consequences:
    The primary cannot reply to a client until all previous operations commit.
    The new primary cannot commit unless all previous ops can be shown to have
      (possibly) committed.

Reads:
  Can't just send a read to any replica.
  Must make sure it's in the majority partition.
    Otherwise it may miss a committed write in the other half.
  A read must reflect all committed updates.
  So clients have to send reads to the primary.
    A read could wait for all prior writes to commit.
    But it could also just use the latest committed write at the primary.
    Not much of a savings, since you have to gather a quorum anyway.

When is real state (the mailbox) written to disk (from the log)?
  The primary sends the current CP along with every message.
  Slaves remember the largest CP seen.
  The real mailbox is written only when the CP passes the log entry.
  Updates happen in the background, in log order.
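Here is a small Go sketch, under the same in-memory assumptions as the earlier one, of the two slave-side rules above: only ACK a phase-1 message if every earlier one has been seen, and only apply log entries to the real mailbox once the primary's commit point has passed them. Replica, Prepare, and AdvanceCP are invented names; a real slave would keep its log in stable storage.

package main

import "fmt"

type Entry struct {
    Seq  int
    Data string
}

type Replica struct {
    log     []Entry  // in sequence order; persistent in a real system
    applied int      // highest seq applied to the mailbox
    mbox    []string // the real mailbox state
}

// Prepare ACKs a phase-1 message only if it extends the log with no gap,
// i.e. only if all prior phase-1 messages have been seen.
func (r *Replica) Prepare(e Entry) bool {
    if e.Seq != len(r.log)+1 {
        return false // missing an earlier phase-1 message; don't ACK
    }
    r.log = append(r.log, e)
    return true
}

// AdvanceCP is called with the commit point piggybacked on any message
// from the primary; it applies log entries up to cp, in log order.
func (r *Replica) AdvanceCP(cp int) {
    for r.applied < cp && r.applied < len(r.log) {
        r.mbox = append(r.mbox, r.log[r.applied].Data)
        r.applied++
    }
}

func main() {
    r := &Replica{}
    fmt.Println(r.Prepare(Entry{1, "X"})) // true
    fmt.Println(r.Prepare(Entry{3, "Z"})) // false: hasn't seen seq 2
    r.AdvanceCP(1)
    fmt.Println(r.mbox) // [X]
}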
How to agree on the choice of a primary and the "majority" backups?
  It's not enough for each operation to be sent to any majority of nodes.
    Because then different nodes see different operation sequences,
    so the replicas will diverge.
  There must be a coherent notion of the current majority.
  Changes to that majority must be controlled:
    The new majority must start from the state left by the previous majority.
    So you have to know what the previous majority was.
    So the old majority must be preserved in the new majority.
    E.g. if you have just 2 of 5, perhaps they were in the minority,
      and missed some operations.
  Let's call an operating majority of nodes a "view".

What can go wrong between two views that don't share a majority of nodes?
  Assume a 5-node system.
  1. Operate with V1 = A, B, C, D, E.
  2. D and E cut from the network, V2 = A, B, C.
  3. C crashes, revives w/o state.
  4. But now A & B partitioned from C, D, & E.
  5. So V2 = C, D, E.
  But now we've lost our total order of views and hence of operations.
  From the client's point of view, the system claims to work,
    but may lose operations committed in the first V2.
  What went wrong: no one in the 2nd V2 knew of the first V2.

So how to recover from view changes?
  Worst case:
    Old primary operating with just n slaves.
    Primary crashes.
    The other n slaves (previous minority) simultaneously recover w/ incomplete logs.
    Must reconstruct the primary's state.
  Need to ensure the following:
  1. There must be only one view at a time.
     This is ensured by requiring n+1 nodes in any view.
  2. We must know about the previous view (so we don't miss one).
     n+1 nodes in each view guarantees we have one from the previous view.
  3. We must know about everything that committed in the previous view.
     This is trickier.
     If n+1 nodes were in both the previous and current view,
       and none of those nodes lost state between the views,
       then at least one node will have seen every operation.
     So the old view needed *more* than a majority:
       at least n+2 including the primary, so a dead primary leaves n+1 nodes.
  Can we relax the requirement to satisfy 3?
    Say m >= n+1 nodes in the previous view.
    You really only need m - n of them to survive:
      if n+1 ACKed every operation, any m - n out of m nodes will have seen each op.
    So that's the criterion for accepting updates into the new view.

By the way, this is not quite the well-known two-phase commit, though similar.
  In two-phase commit, nodes are allowed to vote ABORT.
  Designate one replica as "coordinator" (primary).
  1a. Coordinator replica broadcasts the update in a "prepare" message.
      Can include a sequence number so everyone agrees on the order of updates.
  1b. Other replicas reply with VOTE-COMMIT or VOTE-ABORT.
      In the mailbox case, you never needed to abort.
  2a. Coordinator broadcasts GLOBAL-COMMIT or GLOBAL-ABORT.
      If everyone voted VOTE-COMMIT:
        coordinator durably records the transaction,
        coordinator broadcasts GLOBAL-COMMIT.
      If anyone voted VOTE-ABORT:
        coordinator broadcasts GLOBAL-ABORT.
  2b. Participants that voted VOTE-COMMIT wait for the coordinator's decision.
  Say the primary and one backup fail:
    All other backups voted COMMIT.
    No one received GLOBAL-COMMIT.
    No one knows how the last backup voted.
  However, in the append/read mailbox example, backups do not vote to abort.
    Thus we can always COMMIT an operation if any backup saw and ACKed it.

So here's an approach to view changes:
  Find a majority for the new view.
  The sequence of views should be totally ordered -- always a next and a previous.
  A view must include at least n+1 nodes that were in the previous view.
    Or, if the previous view only had m replicas, n+1 <= m < 2n+1,
      can get away with only m - n replicas in both views.
  Choose a new primary (perhaps the node w/ the highest node ID).
  Commit any operations that any of the surviving nodes ACKed.
    Within a view, all operations were assigned an order by the primary.
    So avoid committing things out of order.

Interesting cases:
  Primary fails.
  Operation interrupted by a view change.
  Node fails or recovers during a view change.
    I.e. multiple view changes started at the same time.
  What happens in the minority partition:
    The view change must fail in that case!

Give every server a current view_number in stable storage.
  view_number = < sequence_number, initiating_server_id >
  The second part helps break ties.

First, the initiating server chooses a new view number.
  new view_number = < old_sequence_number + 1, initiating_server_id >
  (Using its own server ID.)
The initiating node broadcasts a prepare-view-change message.
  Contains the new view_number.
  Recipients stop processing operations.
  Recipients send their log tail to the initiator.
The initiator waits for a majority of replies.
  Maybe a little longer, so that the view includes all live nodes.
The initiator decides which operations are before/after the view change.
  I.e. which have committed and must be preserved.
  Others are aborted and must be re-started by the client.
The initiator sends a phase 2 message to each replica:
  List of nodes (including who is the primary).
  List of committed recent operations.
Now replicas can continue.
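To make the view-change bookkeeping concrete, here is a hedged Go sketch of the initiator's side: it picks the next view number (with the server ID as a tie-breaker) and carries over, in order and without gaps, every operation that appears in any of the collected log tails. ViewNumber, Op, and NewView are invented names, and messaging, stable storage, and the phase 2 broadcast are all elided.

package main

import (
    "fmt"
    "sort"
)

type ViewNumber struct {
    Seq    int
    Server int // initiating server's ID, used to break ties
}

// Less orders view numbers: higher Seq wins; Server breaks ties.
func (a ViewNumber) Less(b ViewNumber) bool {
    if a.Seq != b.Seq {
        return a.Seq < b.Seq
    }
    return a.Server < b.Server
}

type Op struct {
    Seq  int
    Data string
}

// NewView computes the next view number and the operations to carry over,
// given the log tails returned by a majority of replicas: any operation
// that any surviving replica ACKed is committed, in sequence order.
func NewView(old ViewNumber, me int, tails [][]Op) (ViewNumber, []Op) {
    next := ViewNumber{Seq: old.Seq + 1, Server: me}
    seen := map[int]Op{}
    for _, tail := range tails {
        for _, op := range tail {
            seen[op.Seq] = op
        }
    }
    var seqs []int
    for s := range seen {
        seqs = append(seqs, s)
    }
    sort.Ints(seqs)
    var committed []Op
    for i, s := range seqs {
        if i > 0 && s != seqs[i-1]+1 {
            break // a gap: don't commit a later op ahead of a missing earlier one
        }
        committed = append(committed, seen[s])
    }
    return next, committed
}

func main() {
    tails := [][]Op{{{7, "X"}}, {{7, "X"}, {8, "Y"}}, {}}
    v, ops := NewView(ViewNumber{3, 2}, 1, tails)
    fmt.Println(v, ops) // {4 1} [{7 X} {8 Y}]
}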
What if there are two concurrent view changes?
  Must retain a notion of views occurring in a particular order!
    Otherwise we cannot show continuity of the majority.
    I.e. failure/recovery during a view change.
  initiating_server_id breaks the tie, so we know which is first/second.
  If the first got a majority, the second did not.
    So complete the first view change; abort and maybe re-do the second one.
  If the second got a majority, the first aborts.
  But perhaps neither got a majority.
    They must both abort, random delay???, and try again.
    Maybe delays proportional to node ID, so there is a deterministic winner of the retry.

Big point:
  Always at most one functioning view -- thus at most one primary.
  Operations are totally ordered within each view.
  No operation spans a view.
  Views are also totally ordered.
  The majority carried from each view to the next carries with it
    knowledge of the order of committed operations.

What if the initiator fails during a view change?
  We either do or don't have evidence that it completed the view change.
  If it committed operations, it's a real view, and we must preserve it.
    It must have had a majority.
    Make sure we get a majority of nodes from it.
  If it committed no operations, it doesn't matter.

What if a view change happens in the minority partition?
  The initiator is unable to acquire a majority of ACKs.

Data recovery procedure:
  Same as before, i.e. how to deal with operations interrupted by a view change.
  Need to decide which operations to carry over from the old view.
    Must commit any operation the old primary might have responded to.
    Can abort any operation the old primary couldn't have responded to.
      The client will re-try in the new view.
  The old primary only committed if it got ACKs from n+1 replicas.
  If the primary fails, and we expect to continue, a majority must survive.
    So the old view had *more* than a majority.
  Thus if the primary committed an operation, a majority must know about it
    after the view change.
  So that's the criterion for accepting updates into the new view
    (sketched in code below).
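A small Go sketch of the membership rule this implies, under the relaxed criterion above: with 2n+1 servers, a new view needs a majority of all servers and must keep at least m - n members of the previous view of size m, since n+1 replicas ACKed every committed operation. viewOK and its arguments are invented for illustration.

package main

import "fmt"

// viewOK checks whether a proposed new view can safely take over from the
// previous one, given n (the system tolerates n failures out of 2n+1 servers).
func viewOK(n int, prev, next map[int]bool) bool {
    if len(next) < n+1 {
        return false // not a majority of the 2n+1 servers
    }
    overlap := 0
    for id := range next {
        if prev[id] {
            overlap++
        }
    }
    // n+1 replicas ACKed every committed op in the old view of size m,
    // so any m-n of its members must include one that saw each op.
    return overlap >= len(prev)-n
}

func main() {
    prev := map[int]bool{1: true, 2: true, 3: true, 4: true} // m = 4, n = 2
    next := map[int]bool{3: true, 4: true, 5: true}          // overlap = 2 >= m-n
    fmt.Println(viewOK(2, prev, next)) // true
}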
=================== Byzantine Failures ===================

Until now, we have been assuming "fail-stop" behavior:
  a node may fail, but that means it stops doing anything.
A more serious type of failure is "Byzantine" failure,
  in which nodes can exhibit arbitrary behavior.
  In the worst case they can pretend to be OK but not do the right thing.
  For example, this could happen if an attacker has taken over your machine.

Let's look at a simple example. The problem:
  Enemy camp. Three armies, three generals, G0, G1, G2.
  They can communicate with trustworthy messengers.
  They need to agree on whether to attack at dawn.
  But one of the generals might be a traitor.
  Two loyal generals are needed for victory; defeat if only one attacks.

Straw man:
  In advance the generals designate G0 as the leader.
  G0 sends an order to each of G1 and G2.
  G1 and G2 follow the order.
  If G1 or G2 is the traitor, this procedure works:
    G0 and the non-traitor both attack (or both retreat).
  What if G0 is the traitor?
    G0 could tell G1 to attack, and G2 to retreat.

Can G1 and G2 detect G0's treachery by comparing notes?
  G1 and G2 tell each other what G0 told them.
  If G1 and G2 got the same order, they'll obey it.
  Otherwise both retreat.
  If G0 is the traitor, this procedure will detect it.
  But does it still work if G1 (or G2) is the traitor?
    Suppose G0 tells both to attack.
    G1 tells G2 "G0 told me to retreat".
    So G2 decides G0 is the traitor and retreats, while G0 attacks.

Why is this problem interesting?
  We are talking about replicating systems for fault-tolerance.
  But the replicas might not be fail-stop.
  What if the replica software malfunctions?
    Sends incorrect messages, performs incorrect operations on data.
  What if a replica is run by a malicious operator?

Byzantine Agreement is what we need if replicas can't be trusted.
  Generals == replicas.
  Orders == update operations.
  Single-copy correctness demands that all replicas perform the same ops,
    in the same order.
  So "agreement to attack" == all replicas agree on the next update.
  And traitor == a replica that wants an incorrect result:
    Hard to forge actual operations in the real world.
    A traitor might cause an inconsistent order of operations.
    A traitor might prevent any progress from being made.
  Assumption of only one traitor == withstand a single failure.
    More generally, we want to withstand some fraction of failures.

Back to the original problem.
  We ran into trouble trusting G1's claim "G0 told me to retreat".
  We can fix this with digital signatures (see the sketch below).
  Now the basic algorithm is:
    G0 sends signed orders to G1 and G2.
    G1 and G2 exchange those orders.
  If G0 is the traitor and sends different orders, G1 and G2 detect it easily.
  If G1 is the traitor, what can he do?
    He cannot forge a contrary order from G0.
    BUT G1 can just not send anything to G2!
    Then G0 will attack, but G2 will wait forever (and not attack).
  So we have the danger that network delays can cause incorrect results!
    Suppose G2 times out after waiting for G1 for a while,
      decides G1 must be a traitor, and follows G0's order.
    Then a traitorous G0 can cause incorrect operation:
      send "attack" to G2, send nothing to G1.
      G2 will time out and attack; G0 and G1 won't.
  Note that we still have a problem even if G0, G1 and G2 are all loyal:
    the adversary may delay or discard messages.

A majority of correct nodes is not sufficient any more.
  Need at least 3f+1 replicas to survive f failures.
  And that still doesn't guarantee liveness.
  In fact, it's not *possible* to guarantee liveness
    unless there is some bound on message delays.
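As a toy illustration of the signed-order exchange (not of full Byzantine agreement), here is a short Go program that uses ed25519 signatures from the standard library to play G0's role; the generals, the orders, and the message flow are all simulated in-process and are made up for illustration.

package main

import (
    "crypto/ed25519"
    "fmt"
)

func main() {
    // G0's key pair; G1 and G2 know only the public key.
    pub, priv, _ := ed25519.GenerateKey(nil)

    // G0 (possibly a traitor) signs an order for each lieutenant.
    orderToG1 := []byte("attack")
    orderToG2 := []byte("retreat")
    sig1 := ed25519.Sign(priv, orderToG1)
    sig2 := ed25519.Sign(priv, orderToG2)

    // G1 and G2 exchange the signed orders they received.
    // Each checks both signatures and compares the orders.
    ok1 := ed25519.Verify(pub, orderToG1, sig1)
    ok2 := ed25519.Verify(pub, orderToG2, sig2)
    if ok1 && ok2 && string(orderToG1) != string(orderToG2) {
        fmt.Println("G0 is the traitor: two validly signed, conflicting orders")
    }

    // Note what signatures do NOT fix: if G1 simply stays silent, G2 cannot
    // tell a slow messenger from a traitor -- which is why liveness needs
    // some bound on message delays.
}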