Administrivia:
- Make sure the correct microphone is selected in zoom
- Sign up for group meeting using your sunet ID
- Small Group Feedback Session Monday (bring feedback)

Paxos
=====

State machine replication review
  Boils down to a consensus problem for each slot in an operation log

FLP review:
  In a deterministic, asynchronous consensus system
  Pick at most two of: safety, liveness, fault-tolerance

How might you achieve safety and liveness?
  Two-phase commit--safe, and guaranteed to terminate in two phases
  So just run 2PC across all replicas for each state machine operation

Can we repair a 2PC after one node fails?
  An ATM may have disbursed $20 bills, *externalizing* our transaction
  Externalized real-world actions irreversible, so 2PC outcome must match
  If non-coordinator node fails
    If not all cohorts sent COMMIT-VOTE, coordinator unilaterally ABORTs
    Otherwise, if coordinator already sent COMMIT, commit as usual
  What if coordinator fails?
    Did any cohort receive COMMIT or ABORT from coordinator?
      Cohort can re-broadcast coordinator's dying message, follow it
    Else, did any cohort not (yet) send COMMIT-VOTE?
      (cohort either sent ABORT-VOTE or never received PREPARE)
      Safe to ABORT, because coordinator couldn't have sent COMMIT
    What if all cohorts sent COMMIT-VOTE, none heard from coordinator?
      Coordinator could have disbursed funds or not--we are stuck

Solution: Three-phase commit
  2PC's problem was "prepared" state where cohorts can go either way
  So now after PREPARE, if everyone votes COMMIT, broadcast PREPARE-COMMIT
  Once all participants ACK PREPARE-COMMIT, broadcast COMMIT
  After failure, COMMIT if and only if any survivor saw PREPARE-COMMIT
  Why is this better?
    2PC: execute transaction when everyone is *willing* to COMMIT it
    3PC: execute transaction when everyone *knows* everyone willing to COMMIT
    ATM only disburses funds once everyone knows transaction can commit
  Why doesn't 3PC violate FLP?
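The 3PC recovery rule ("COMMIT if and only if any survivor saw PREPARE-COMMIT") can be sketched as a small decision function. This is an illustration only; the function name and state labels are assumptions, not part of any protocol spec.

```python
# Hypothetical sketch of the 3PC post-failure decision rule.
# State names are assumptions: "init" = no PREPARE seen,
# "voted-commit" = sent COMMIT-VOTE, "prepare-commit" = saw
# PREPARE-COMMIT, "committed" = saw final COMMIT.

def recover_3pc(survivor_states):
    """Decide the outcome for surviving cohorts after a failure."""
    # If anyone reached prepare-commit (or beyond), everyone *knows*
    # everyone was willing to commit, so it is safe to COMMIT.
    if any(s in ("prepare-commit", "committed") for s in survivor_states):
        return "COMMIT"
    # Otherwise no one could have externalized the transaction:
    # the coordinator never sent COMMIT, so it is safe to ABORT.
    return "ABORT"

print(recover_3pc(["voted-commit", "prepare-commit"]))  # COMMIT
print(recover_3pc(["init", "voted-commit"]))            # ABORT
```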
    Must *know* that a node has failed to initiate repair
    In asynchronous model, can't tell failed nodes from slow ones

How about liveness and fault-tolerance?
  Simple gossip protocol
  - Every node broadcasts a proposal
  - After 5 sec., output lowest value seen
  Guaranteed to terminate in 5 seconds (if any node non-faulty)
  But network partition or slow nodes could prevent agreement

What about safety and fault-tolerance?  Topic of this and next lecture...
  Almost nobody seems to understand how to implement consensus
    They know you use "Paxos" for consensus, but don't know what that is
  My theory: We've been teaching it wrong
    Notably, original Paxos paper (The Part-Time Parliament) unreadable
    Just using the word "simple" a lot in rewrite may not help understanding
  Another theory: The protocols are wrong because they are hard to teach
    Monday, we will look at a protocol intended to correct this problem

Straw man safe+fault-tolerant consensus: "gossip and vote" protocol:
  - Every node broadcasts a proposed value
  - After 5 sec., broadcast vote for lowest value seen
  - If some value receives votes from a majority, output it
  - Otherwise, you are "stuck" and there is no liveness
    E.g., 2f+1 nodes total, f voted for 0, f voted for 1, and 1 node failed
    Or no failures but three-way split between values 0, 1, and 2

Neither Paxos nor "gossip and vote" guarantees liveness.  Why is Paxos better?
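The "gossip and vote" straw man can be simulated in a few lines, including its stuck state. The function name and vote representation below are assumptions for illustration.

```python
# Sketch (an assumption, not code from the notes) of "gossip and vote":
# each node votes for the lowest value it saw within the window;
# consensus requires a strict majority of *all* n nodes.

from collections import Counter

def gossip_and_vote(n, votes):
    """n = total nodes (2f+1); votes = values voted for by nodes still
    alive (each voted for the lowest proposal it received).
    Returns the decided value, or None if permanently stuck."""
    value, count = Counter(votes).most_common(1)[0]
    return value if count > n // 2 else None

# 7 nodes, everyone saw proposal 0 -> consensus on 0
assert gossip_and_vote(7, [0] * 7) == 0
# 2f+1 = 7: f=3 voted 0, f=3 voted 1, one node failed -> stuck forever
assert gossip_and_vote(7, [0, 0, 0, 1, 1, 1]) is None
```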
  Gossip and vote gets into a stuck state with split votes/failed nodes
  Paxos never gets stuck; even if it hasn't terminated, there's always hope

 f+1 0 votes  +--------+            +---------+
 +----------->|0-valent+----------->|learned 0|
 |            +----+---+            +---------+
 |                 \
+---------+         \               +---------+
|Undecided+----------+------------->|  stuck  |
+---------+         /               +---------+
 |                 /
 |            +----+---+            +---------+
 +----------->|1-valent+----------->|learned 1|
 f+1 1 votes  +--------+            +---------+

Some important conclusions from discussion thus far:
- Voting is a key technique providing both safety and fault-tolerance
- But votes can split and produce permanently indeterminate outcomes
- Or failed cohorts + lost votes mean never learn system 0/1-valent
- So better NOT vote on questions you can't afford to leave indeterminate
  (e.g., what is the 5th operation in the state machine log)

Key idea in many consensus protocols:
  Carefully craft statements on which nodes will cast votes
    Either avoid the possibility of a split vote
    Or ensure indeterminate outcome won't permanently block consensus on log
  Again, don't directly vote on log entries (e.g., "log index 5 is x")
  Two types of statement are okay to vote on:
    Irrefutable statements: no one ever votes against, so no split votes
      E.g., vote that it's past 3pm on Wednesday, April 15, or don't vote yet
    Neutralizable statements: indeterminate outcome won't prevent consensus

Key Paxos mechanism: numbered ballots
  Vote for log entries, but associate each vote with a ballot number
    E.g., "I vote that log index i contains value v *in ballot #b*"
  Approximate idea: Suppose ballot b is indeterminate
    ...neutralize it by proceeding with a higher-numbered ballot
  The complication: An indeterminate ballot didn't necessarily fail
    A node might externalize a value and then fail
    The network could lose the last messages sent by such a node
    Yet once externalized, a value must be chosen (e.g., ATM disbursement)

Example: 7 nodes (n1, ..., n7).
- n2-n4 vote for value v in ballot #b
- n5-n7 never heard of v, voted for v' in ballot #b' (b' > b)
- n1 died at some point, but might have voted and externalized

Now what?  Both ballots b and b' are indeterminate.  Did they fail?
  Or did n1 vote for and externalize v?  or v'?
  No way to fix an indeterminate ballot!

Solution: Conservatively assume indeterminate ballots could be externalized
  Note, if v' = v above, there is no problem
  So if ballot b is indeterminate, ensure all future ballots use same value

The (single-decree) Paxos protocol:
* Establish irrefutable mapping between ballot # and value
    Ballot = referendum on one value, not choice between candidate values
    Each ballot proposed by a single leader, so has single value
    Embed leader's unique machine-id inside ballot number
* Before voting on value in ballot b, *prepare* b by checking previous ballots
    Leader broadcasts "PREPARE b"                        [Phase 1a in paper]
    Cohorts reply "PREPARED b {NULL | <b_old, v_old>}"   [Phase 1b]
      b_old is the last ballot in which cohort voted before b
      v_old is the value in ballot b_old
      Cohort promises conditions will still be true when acted on by leader
        ...implies promise never to vote for any ballot between b_old and b
    Ballot b prepared after getting PREPARED from a majority of nodes
      If all PREPARED messages have NULL, leader can use any value v
      Otherwise, highest b_old is either indeterminate (or successful)
        Leader must set v = v_old corresponding to highest b_old
* Now can vote on value v
    Leader broadcasts "COMMIT b v"                       [a.k.a.
"accept b" in paper Phase 2b] Cohorts reply "COMMITTED b" After collecting COMMITTED from majority of replicas, have consensus on v Another way to view Paxos: Translate concrete messages to conceptual ones The conceptual protocol: * commit b v: Vote for consensus on value v in ballot #b * abort b v: Vote that ballot #b did not reach consensus on value v Invariant: For each b, can only vote "commit b v" for a single value v Value is determined by leader of ballot b (whose id is embedded in b) Invariant: can only vote "commit b v" after a majority votes for { abort b' v' | b' <= b && v' != v } When such abort messages garner majority, we say (b, v) is prepared I.e., first vote you *didn't* reach consensus on any v' != v for b' <= b Can't even cast a commit vote before consensus on what didn't commit Result: all committed and indeterminate ballots have same value! The concrete to conceptual mapping * PREPARED b : { vote for "abort b' v'" | b_old < b' < b } [for all values v'] { assert "abort b' v'" got majority | b' <= b_old && v' != v_old } * COMMITTED b vote for "commit b v", { "abort b v'" | v' != v } v is value from COMMIT b v Simple way to use paxos with replicated state machine: One paxos instance per log entry (slot), specified in messages: PREPARE i b, PREPARED i b {NULL|...}, COMMIT i b, COMMITTED b Can we optimize multi-paxos for only one round trip in common case? Yes. Use same ballot number for all paxos instances * PREPARE i b: prepares ballot b for all slots i' >= i * PREPARED i b {NULL|} a: - is last COMMIT sent for ballot b - a is bool, if TRUE, means use NULL for "all future slots" (i' > i) This is the common case; means all slots >= i are now prepared If a is FALSE, then must issue PREPARE (i+1) b When can you apply an log index to state machine and reply to client? When leader receives COMMITTED i b from a majority. How do cohorts know? 
  Can piggyback on new messages:
  * COMMIT i b v i_min:
    - i_min says all slots i' < i_min committed with ballot # <= b
      So cohort that voted COMMIT i b v can mark (i, v) as committed
      Cohort that voted COMMIT i b' v' for b' < b can't commit v'
        must contact peer to learn value of v; but this is not common case

Convenient extensions to avoid full protocol or other delays where unnecessary:
* PREPARE i 0
    Special ballot 0 probes slot without trying to become leader
* LEARN i v: Reply to request concerning already committed slot i
    A single LEARN message enough to externalize v, no need to hold vote
    Alternatively, LEARN i v =~ PREPARED i FALSE
* STATUS i_min b: Reply to msg with stale ballot number; b is current ballot
    Says, "Sorry, I've promised to ignore all ballots <= b."
Note these are only optimizations, not required for correctness

Reconfiguration: How to add and remove servers?
  Use the log!  Have distinguished log values that add/remove nodes
  Danger: Better not commit values unless you know what a majority is!
    Better not even prepare values unless you know what a majority is!
  Lamport's solution?  Log index i changes system composition for slot i+alpha
    Places bound alpha on number of uncommitted requests
    Fill log with nops if you had fewer than alpha outstanding operations
  Your humble instructor's solution: Unlimited pipelining of normal requests
    But only commit values in order (probably want this anyway)
      Ensures you always know what majority is before deciding a COMMIT got majority
    Don't pipeline *any* COMMIT messages behind COMMIT i b v that reconfigures
      Ensures you always know ballot is prepared before sending COMMIT

What kind of synchrony assumptions could guarantee liveness with Paxos?
  If you know range of message latencies, set timeout appropriately
    Don't propose more than one ballot per timeout
    Or don't propose if another cohort proposed higher ballot within timeout
  If you don't know max message latency, make more limited assumptions
    Example: Assume message latencies don't grow exponentially
      Can keep doubling timeout before starting new ballot/leader election
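Putting the single-decree protocol together, here is a toy, single-process sketch of the PREPARE/COMMIT message flow. The class and function names, and the in-memory simulation, are assumptions for illustration; real ballot numbers embed the leader's machine-id for uniqueness, while this single-leader toy just uses integers.

```python
# Toy sketch (an assumption, not a reference implementation) of
# single-decree Paxos as described above.

class Cohort:
    def __init__(self):
        self.promised = 0                  # highest ballot prepared
        self.b_old, self.v_old = 0, None   # last vote cast, if any

    def prepare(self, b):
        """PREPARE b -> PREPARED b {NULL | <b_old, v_old>} or refusal."""
        if b <= self.promised:
            return None    # stale ballot; real protocol could send STATUS
        self.promised = b  # promise: no votes for ballots between b_old and b
        return (self.b_old, self.v_old)

    def commit(self, b, v):
        """COMMIT b v -> COMMITTED b, unless promised to a higher ballot."""
        if b < self.promised:
            return False
        self.promised = b
        self.b_old, self.v_old = b, v
        return True

def run_ballot(cohorts, b, v):
    """Leader: prepare b; adopt v_old of highest b_old; then call the vote."""
    replies = [r for r in (c.prepare(b) for c in cohorts) if r is not None]
    if len(replies) <= len(cohorts) // 2:
        return None                 # couldn't prepare; retry with higher b
    b_max, v_max = max(replies, key=lambda r: r[0])
    if v_max is not None:
        v = v_max                   # must keep possibly-externalized value
    acks = sum(c.commit(b, v) for c in cohorts)
    return v if acks > len(cohorts) // 2 else None

cohorts = [Cohort() for _ in range(5)]
assert run_ballot(cohorts, b=1, v="A") == "A"
# A later ballot re-discovers the earlier value and must keep it:
assert run_ballot(cohorts, b=2, v="B") == "A"
# A stale ballot cannot even be prepared:
assert run_ballot(cohorts, b=1, v="C") is None
```

Note how the prepare phase enforces the key invariant from the notes: once ballot 1 may have been externalized, ballot 2 is forced to reuse its value.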