Paxos
=====

State machine replication review
    Boils down to a consensus problem for each slot in an operation log

FLP review: in a deterministic, asynchronous consensus system,
    pick at most two of: safety, liveness, fault-tolerance

How might you achieve safety and liveness?
    Two-phase commit--safe, and guaranteed to terminate in two phases
    Or simpler: one-phase commit (take whatever the lowest-numbered node says)

How about liveness and fault-tolerance?
    E.g., every node broadcasts its input value,
      waits 5 seconds, then chooses the lexicographically lowest value seen
    Guaranteed to terminate in 5 seconds (if any node is non-failed)
    But a network partition could prevent agreement

What about safety and fault-tolerance?
    In theory, this is what Raft (and viewstamped replication) give us
    But those give us more... how about something simpler?

Straw-man non-live consensus protocol (sketched in code below):
    Everybody broadcasts a value, waits 5 seconds
    Every node votes for the lowest value seen (and can never change its vote)
    If some value receives a majority, output it
    Otherwise, you are "stuck" and there is no liveness
      E.g., 2f+1 nodes total, f voted for 0, f voted for 1, and 1 node failed
      Or no failures, but a three-way split between values 0, 1, and 2

Neither Raft nor the straw-man has liveness.  Why is Raft better?
    The straw-man gets stuck with split votes
    Raft never gets stuck; even if it hasn't terminated, there's always hope
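To make the split-vote problem concrete, here is a minimal single-process
sketch of the straw-man in Python (the function name and input format are
purely illustrative, and the 5-second wait is abstracted into "the set of
values each node happened to see"):

    def straw_man(received):
        """received[i] = set of values node i heard before the timeout
        (None if node i failed).  Returns the decided value, or None if
        the run is permanently stuck."""
        n = len(received)                   # total nodes, n = 2f + 1
        votes = {}                          # value -> number of votes
        for seen in received:
            if not seen:                    # failed node casts no vote
                continue
            v = min(seen)                   # vote for the lowest value seen
            votes[v] = votes.get(v, 0) + 1
        for v, count in votes.items():
            if count > n // 2:              # strict majority => consensus
                return v
        return None                         # split vote: stuck forever

    # The example from above: 7 nodes, 3 vote for 0, 3 vote for 1, 1 failed.
    # No value reaches 4 votes, and no node may change its vote, so the run
    # can never recover: straw_man returns None.
    print(straw_man([{0}, {0}, {0}, {1}, {1}, {1}, None]))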
Some important conclusions from the discussion thus far:
    - Voting is a key technique providing both safety and fault-tolerance
    - But votes can split and produce permanently indeterminate outcomes
    - So better not to vote on questions you can't afford to leave
      indeterminate (e.g., what is the 5th operation in the state machine log)

Key idea in many of these protocols:
    Carefully craft the statements on which nodes vote
      Either avoid the possibility of a split vote
      Or ensure an indeterminate outcome won't permanently block log consensus
    Again, can't directly vote on log entries (e.g., "log index 5 is x")
    Two types of statement are safe to vote on:
    * Irrefutable statements: while relevant, no one ever votes against them
        In Raft, at most one possible value of v in "(log index i, term t) = v"
        So while term t is relevant, no previously cast vote will contradict
          "(i, t) = v"
        Any node learning v is always free to vote for it in (i, t)
    * Neutralizable statements: an indeterminate outcome won't prevent consensus
        In Raft, nodes can fail to elect a leader for a particular term
        No big deal, just "neutralize" that term by moving on to the next term

Viewstamped replication (and Raft) from 50,000 feet:
    Have a single leader sequentially number all operations ...
    Have nodes vote on the operation corresponding to each viewstamp
      Only one leader per view, so this mapping is irrefutable
    Hence liveness is guaranteed until you lose (or think you lost) the leader
    Lost leader?  Must vote on (new viewid, new leader, last included viewstamp)
      (By irrefutability, stale nodes can always increase the last included viewstamp)
      The leader vote could be indeterminate, but it's neutralizable!
        A failed view doesn't preclude successful views with higher viewids

Now let's try a different approach: numbered ballots
    Vote for log entries, but associate each vote with a ballot number
      E.g., "I vote that log index i contains value v *in ballot #b*"
    Approximate idea: suppose ballot b is indeterminate
      ...neutralize it by proceeding with a higher-numbered ballot
    The complication: an indeterminate ballot didn't necessarily fail
      Example: 7 nodes (n1, ..., n7)
        - n2-n4 vote for value v in ballot #b
        - n5-n7 never heard of v, voted for v' in ballot #b' (b' > b)
        - n1 died at some point, but might have voted
      Now what?  Both ballots b and b' are indeterminate.  Did they fail?
        Or did n1 vote for and externalize v?  Or did it externalize v'?
      Can't completely neutralize an indeterminate ballot!
    Solution: conservatively assume indeterminate ballots may have succeeded
      Note, if v' = v above, there is no problem
      So if ballot b is indeterminate, ensure all future ballots use the same value

The (single-decree) Paxos protocol (a code sketch follows the outline):
    * Establish an irrefutable mapping between ballot # and value
        Ballot = a referendum on one value, not a choice between candidate values
        Each ballot is proposed by a single leader and has a single value
          Embed the leader's unique machine-id inside the ballot number
    * Before voting on a value in ballot b, *prepare* b by checking previous ballots
        Leader broadcasts "PREPARE b"
        Cohorts reply "PREPARED b { NULL | (b_old, v_old) }"
          b_old is the last ballot in which the cohort voted before b
          v_old is the value voted for in ballot b_old
          The cohort promises these conditions will still be true when acted on
            by the leader
            ...implies a promise never to vote in any ballot between b_old and b
        Ballot b is prepared after getting PREPARED from a majority of nodes
          If all PREPARED messages have NULL, the leader can use any value v
          Otherwise, the highest b_old is indeterminate (or successful)
            Leader must set v = the v_old corresponding to the highest b_old
    * Now can vote on value v
        Leader broadcasts "COMMIT b v"  [a.k.a. "accept b" in the paper]
        Cohorts reply "COMMITTED b"
        After collecting COMMITTED from a majority of replicas, consensus on v
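A minimal in-memory sketch of this in Python, covering the cohort (acceptor)
side plus the leader's value-choice rule.  It uses plain integer ballot
numbers (a real system would embed the leader's machine-id as a tie-breaker),
omits networking, retries, and durable storage, and all names are illustrative:

    class Cohort:
        def __init__(self):
            self.promised = 0        # highest ballot we have answered PREPARE for
            self.voted = None        # (b_old, v_old) from our last COMMIT vote

        def on_prepare(self, b):
            """PREPARE b  ->  PREPARED b { NULL | (b_old, v_old) }, or reject."""
            if b <= self.promised:
                return None          # stale ballot; ignore (or reply STATUS)
            self.promised = b        # promise: no more votes in ballots < b
            return ("PREPARED", b, self.voted)   # self.voted is None, i.e. NULL

        def on_commit(self, b, v):
            """COMMIT b v  ->  COMMITTED b, unless we promised a higher ballot."""
            if b < self.promised:
                return None
            self.promised = b
            self.voted = (b, v)
            return ("COMMITTED", b)

    def choose_value(prepared_replies, any_value):
        """Leader rule after PREPARED from a majority: if any cohort reported a
        vote, use the v_old of the highest b_old; otherwise use any value."""
        votes = [r[2] for r in prepared_replies if r[2] is not None]
        if not votes:
            return any_value
        return max(votes, key=lambda bv: bv[0])[1]

The safety argument lives in choose_value: an indeterminate older ballot may
have succeeded, so its value is conservatively carried into every later ballot.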
Why is Paxos hard to understand?
    Conjecture: it conflates irrefutable and neutralizable messages

Another way to view Paxos: translate concrete messages to conceptual ones
    The conceptual protocol:
      * commit b v:  vote for consensus on value v in ballot #b
      * abort b v:   vote that ballot #b not reach consensus on value v
      Cannot vote for "commit b v" unless a majority previously voted for
        { abort b' v' | b' <= b && v' != v }
      When such abort messages garner a majority, we say (b, v) is prepared
    The concrete-to-conceptual mapping:
      * PREPARED b (b_old, v_old):
          { vote for "abort b' v'" | b_old < b' < b }
          { confirm "abort b' v'" got a majority | b' <= b_old && v' != v_old }
      * COMMITTED b:
          { vote for "commit b v" }
          v is the (irrefutable) value from COMMIT b v
    So Paxos uses the same tools (irrefutable/neutralizable statements)
      ... but in a much more complicated way

Simple way to use Paxos with a replicated state machine:
    One Paxos instance per log entry (slot), with the slot specified in messages:
      PREPARE i b,  PREPARED i b {NULL|...},  COMMIT i b v,  COMMITTED i b

Can we optimize multi-Paxos down to one round trip in the common case?
    Yes.  Use the same ballot number for all Paxos instances
    * PREPARE i b: prepares ballot b for all slots i' >= i
    * PREPARED i b {NULL | (b_old, v_old)} a:
        - (b_old, v_old) is from the last COMMIT the cohort voted for in slot i
        - a is a bool; if TRUE, it means use NULL for "all future slots" (i' > i)
            This is the common case; means all slots >= i are now prepared
          If a is FALSE, then must issue PREPARE (i+1) b

When can you apply a log index to the state machine and reply to the client?
    When the leader receives COMMITTED i b from a majority.
    How do the cohorts know?  Can piggy-back on new messages:
    * COMMIT i b v i_min:
        - i_min says all slots i' < i_min are committed with ballot # <= b
        So a cohort that voted COMMIT i' b v' can mark (i', v') as committed
        A cohort that voted COMMIT i' b'' v'' for some b'' < b can't commit v''
          It must contact a peer to learn the committed value; but this is not
          the common case

Convenient extensions to avoid the full protocol or other delays where unnecessary:
    * PREPARE i 0
        Special ballot 0 probes a slot without trying to become leader
    * LEARN i v: reply to a request concerning an already committed slot i
        A single LEARN message is enough to externalize v, no need to hold a vote
        Alternatively, LEARN i v =~ PREPARE i FALSE
    * STATUS i_min b: reply to a request with a current slot but a stale ballot number
        Says: sorry, I've promised to ignore all ballots <= b
    Note these are only optimizations, not required for correctness

Reconfiguration: how to add and remove servers?
    Use the log!  Have distinguished log values that add/remove nodes
    Danger: better not commit values unless you know what a majority is!
      Better not even prepare values unless you know what a majority is!
    Lamport's solution?
      Log index i changes the system composition for slot i+alpha
        Places a bound alpha on the number of uncommitted requests
        Fill the log with no-ops if you had fewer than alpha outstanding operations
    Your humble instructor's solution:
      Unlimited pipelining of normal requests
        But only commit values in order (probably want this anyway)
          Ensures you always know what a majority is before deciding a COMMIT
          has achieved a majority
      Don't pipeline *any* COMMIT messages behind a COMMIT i b v that reconfigures
        Ensures you always know the ballot is prepared before sending COMMIT

What kind of synchrony assumptions could guarantee liveness with Paxos/Raft?
    If you know the range of message latencies, set the timeout appropriately
    If you don't, make more limited assumptions
      Example: assume message latencies don't grow exponentially
        Can keep doubling the timeout before starting a new ballot/leader
        election (a minimal backoff sketch appears at the end of these notes)

Dirty little secret: most people think Paxos and VR are the same protocol
    (Your instructor wrote a paper about VR called Paxos Made Practical... oops)
    See https://www.cs.cornell.edu/fbs/publications/viveLaDifference.pdf

Does Paxos have any advantages over VR/Raft?  Maybe
    Leader change shares more code paths with normal operation
      Conjecture: may be easier to test, leaving fewer weird corner cases
    Greater symmetry makes it adaptable to less centralized settings
      Topic of ongoing research I'll tell you about in the last lecture
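As referenced in the synchrony discussion above, here is a minimal Python
sketch of the doubling-timeout idea.  The function try_election and its
timeout parameter are hypothetical stand-ins for "attempt one ballot/leader
election and report whether it completed in time"; only the backoff logic
itself is the point:

    import random

    def elect_with_backoff(try_election, initial_timeout=0.1, max_timeout=60.0):
        """Keep starting new ballots/elections, doubling the timeout each time.
        try_election(timeout) returns the elected leader, or None if the attempt
        did not complete within `timeout` (e.g., messages were too slow)."""
        timeout = initial_timeout
        while True:
            # Randomize slightly to make split votes between competing
            # candidates less likely (as Raft does with election timeouts).
            leader = try_election(timeout * random.uniform(1.0, 1.5))
            if leader is not None:
                return leader
            # If latencies don't grow exponentially, some doubled timeout
            # eventually exceeds the real round-trip time, restoring liveness.
            timeout = min(timeout * 2, max_timeout)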