Raft (2014)
===========

Zookeeper achieves fault-tolerance through state-machine replication
    All servers begin with the same initial state
    Reach consensus on an ordered list of deterministic operations (the log)
    All servers apply the same operations in the same order to maintain
    identical state
Today: how to achieve consensus.

Let's abstract the consensus problem:
    Bunch of nodes {V_i}, each with an input I_i
        E.g., I_i is V_i's idea of what should go at some log index
    Through the consensus protocol, node V_i produces output O_i
    In the asynchronous model, make no assumptions about timing
        No bound on message delay or on relative execution speeds of nodes
        So can't differentiate a failed node from a slow one
        But with at most f failures, can wait for f+1 messages (including own)
    Want two safety properties:
        - Agreement: All non-faulty nodes that produce output choose the same O
        - Validity: O is equal to some node V_i's input I_i
    Want liveness: eventually all non-faulty nodes produce an output
    Want fault-tolerance: works if a node fails at any point in the protocol

In the early 1980s, people kept proposing broken consensus protocols.
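The quorum logic above (wait for f+1 of N = 2f+1 replies) is safe because any two sets of f+1 nodes out of 2f+1 must share a member, so two completed waits always have a common witness.  A small brute-force check of that fact, as an illustrative sketch (function name invented):

```python
# Why waiting for f+1 of N = 2f+1 replies is safe: any two quorums of
# size f+1 drawn from 2f+1 nodes intersect, so two majority decisions
# always share at least one witness node.
from itertools import combinations

def quorums_intersect(f: int) -> bool:
    """Check that every pair of (f+1)-sized quorums of 2f+1 nodes overlaps."""
    nodes = range(2 * f + 1)
    q = f + 1
    return all(set(a) & set(b)
               for a in combinations(nodes, q)
               for b in combinations(nodes, q))

# Exhaustively verify for small cluster sizes (N = 3, 5, 7).
for f in range(1, 4):
    assert quorums_intersect(f)
```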
Then FLP: no deterministic, asynchronous protocol can achieve safety,
liveness, and fault-tolerance
    Proof overview:
        For some inputs, the output inevitably depends on the behavior of the
        network
        Must be able to complete without hearing all inputs, in case a node
        failed
            In one case might choose I_i; in another, V_i is delayed, so
            complete without it
        To terminate, some "deciding message" D must seal the fate of the
        system
            Network can't affect the output after D, even if all nodes don't
            know it yet
        Fault-tolerance means the system must be able to complete without D
            So if D is delayed, some other message must "neutralize" it so it
            is no longer deciding
        A pathological network can delay all deciding messages until
        neutralized
    Don't confuse FLP with the "CAP theorem," a much more obvious fact:
        Can't have all 3 of *C*onsistency, *A*vailability,
        *P*artition-tolerance
            Because if nodes can't communicate, they can't stay in sync
        FLP says if your protocol is fault-tolerant, it can't guarantee
        termination *even if no nodes actually fail*

Let's look at some straw-man consensus protocols.  Assume N = 2f+1

Simple vote:
    Each V_i broadcasts a vote for its input I_i
    f+1 votes for the same value?  Choose it as the output
    Agreement? yes.  Validity? yes.  Fault-tolerance? yes.
    Termination? No - gets stuck if there is no majority value

Leader vote:
    Each V_i broadcasts I_i, considers the lowest V_j it hears from to be
    leader
    Collect at least f+1 inputs, then vote for the leader's I_j
    If f+1 votes for the same value I, choose it as O
    Agreement? yes.  Validity? yes.  Fault-tolerance? yes.
    Termination? No - still gets stuck if different leaders, no majority value

Ballot sequence:
    Instead of one vote, allow a bunch of votes, each with a number B
    Let B = <n,i>, where the low bits (i) designate V_i as B's leader
        Because B includes the leader's ID, only one proposed value per ballot
    Leader V_i says, "In ballot B = <n,i>, please vote for my input I_i"
    Collect f+1 votes for O in ballot B?  Output O and announce to other nodes
    Else?
Timeout, move to a higher ballot
        Once B is increased, ignore all messages with lower-numbered ballots
    Termination? yes.  Fault-tolerance? yes.  Validity? yes.
    Agreement? No - different ballots could have different outputs

Paxos ballot sequence:
    Idea: like above, but ensure all majorities choose the same value
    V_i first asks, "I'm doing ballot B.  What's your most recent vote?"
        Each V_j promises to ignore messages with ballots < B
        Then responds with its most recent vote and that vote's ballot number
        (if any)
    If f+1 nodes have never voted, V_i asks everyone to vote for I_i
    Otherwise, asks everyone to vote for the value of the most recent
    (highest-ballot) vote learned
        So if any previous ballot got a majority, guaranteed to re-use the
        same value
    Validity? yes.  Agreement? yes.  Fault-tolerance? yes.
    Termination? no [as required by FLP], but doesn't get permanently stuck
        And if the network is ever faster than the timeout [synchrony], will
        terminate
    Given FLP, this hits the sweet spot:
        Always safe and fault-tolerant
        Never permanently stuck
        Guaranteed to terminate whenever the network is faster than timeouts
        Can even dynamically lengthen timeouts if the latency bound is unknown

How I personally analyze consensus protocols:
    Decompose them into explicit or implicit votes on statements
    To ensure you don't get stuck, each statement must be at least one of:
        Irrefutable - all nodes that vote will vote for the same thing
            E.g., one leader requesting the vote, so all that get the request
            vote the same way
        Neutralizable - can make the statement irrelevant for consensus
            E.g., the (f+1)st vote in ballot B would complete consensus
            But if delayed and the leader moves to B' > B, a vote at B becomes
            irrelevant

High-level: Paxos achieves liveness through a series of votes on a log entry
    Safety requires carefully choosing what to vote on
But for a log, can decompose the problem another way [viewstamped replication]
    One leader, and a single vote on a series of different log records
    One leader means no split votes, but might not get a majority
        Leader might fail, or a slow network might trigger
timeout
    Then decide which records will actually be part of the log
        So neutralize a stuck vote by just not including the entry in the log
    Vote for a new leader and for where it should resume extending the log
        Leader 1: op_1, op_2, op_3, op_4
        Leader 2: [vote fails on leader 2, triggering another timeout]
        Leader 3: op_4', op_5
[Maybe talk about Paxos vs. VR confusion]

What is the goal of Raft?
    Understandability--how to achieve?
    Better problem decomposition: leader election, log replication, safety
    Reduce state space and non-determinism
        No holes in logs, don't re-number committed log entries
        Do use randomness to simplify leader election
    Like VR "views", Raft uses *terms*, with a designated leader in each term

What happens to client requests in the normal case (no leader failures)?
    Client sends a request operation to the leader
    Leader assigns the operation a *log index* (next open slot in the log)
    Leader sends the operation to all other servers with the AppendEntries
    RPC (Fig. 2)
        [go over each argument in figure 2]
    Leader receives success replies from a majority (including itself)
    Leader increments commitIndex, applies the operation, returns the result
    to the client
Is AppendEntries idempotent?  Yes, can just keep retransmitting
When to stop retransmitting a log entry?  When every follower has replied
    Note this could be *after* replying to the client
    Need this to catch up followers that may have been temporarily partitioned
Look at Figure 6:
    Which log entries can be applied to the state machine?  1-7
    Which log entries can be garbage-collected?  1-2
In AppendEntries step 5, when is commitIndex set lower than the leader's?
    Leader may be catching up a follower who doesn't have the log entry yet

What happens when the leader fails?
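Before turning to leader failure: the follower-side AppendEntries behavior just described (consistency check plus idempotent retransmission) can be sketched as below.  This is an illustrative simplification, not the paper's code; `append_entries` and its argument names are invented for the sketch, and entries are (term, command) pairs.

```python
# Illustrative follower-side AppendEntries handler (simplified from Fig. 2).
# The log is 1-indexed conceptually but stored 0-indexed here.

def append_entries(log, prev_log_index, prev_log_term, entries):
    """Return True on success.  Safe to call repeatedly (idempotent)."""
    # Reject unless our log has an entry matching (prevLogIndex, prevLogTerm).
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False
    # Overwrite everything after prevLogIndex with the new entries.  (A
    # simplification of the paper's step 3, which deletes only on conflict;
    # the net effect here is the same and retransmission is idempotent.)
    del log[prev_log_index:]
    log.extend(entries)
    return True

log = [(1, "x=1"), (1, "x=2")]
assert append_entries(log, 2, 1, [(2, "y=9")])      # appends entry 3
assert append_entries(log, 2, 1, [(2, "y=9")])      # retransmit: same result
assert log == [(1, "x=1"), (1, "x=2"), (2, "y=9")]
assert not append_entries(log, 5, 2, [(2, "z=0")])  # missing prev entry: reject
```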
Followers stop getting heartbeats (empty AppendEntries RPCs)
    After an *election timeout*, a follower assumes there is no viable leader
    Follower enters *candidate* state, broadcasts RequestVote RPC
        [go over each argument in figure 2]
        Servers can only vote for one leader per term
        Also deny the vote to a candidate with a less up-to-date log
            - If the last terms in the logs differ, the higher term is more
              up to date
            - Else if the terms are the same, the higher last log index is
              more up to date
    Possible election outcomes:
        - Candidate receives votes from a majority of nodes (including
          itself)?  Transitions to *leader* state
        - Receives AppendEntries for term >= candidate's term?
          Recognizes the new leader, transitions to *follower* state
        - Stuck vote (no candidate receives a majority)?
          Increment term and restart candidate process after *election
          timeout*
    New leader sends AppendEntries RPC
Why don't we always get stuck?  Randomized timeouts (e.g., 150-300ms)
How long should the election timeout be? (p. 313)
    broadcastTime << electionTimeout << MTBF
    Too short: false positives, gratuitous elections
    Too long: large fraction of time spent in elections, not log replication

How does Raft's "strong leader" differ from Paxos leaders?
    In Raft, log entries flow only from the leader to followers
    Unlike Paxos, Raft guarantees only one leader at a time
        Specifically, a leader is not active until a majority renounce the
        old leader
Leads to the "log matching property"--what's this? (Fig. 3)
    1. A particular <index, term> pair uniquely identifies an operation, and
    2. Two logs containing the same <index, term> are identical up to that
       index--why?
        AppendEntries specifies prevLogTerm; the follower rejects on mismatch

Look at Figure 7
    What happened to replica (b)?
        Fell behind (e.g., network partition or downtime)
    What happened to replica (e)?
        Heard some term-4 entries that never made it to a majority--how?
            Could have been the leader of term 4, and no one got the
            AppendEntries RPC
            Could have been the only follower to hear the entries
                Other three followers then elected a new leader that hadn't
                heard indexes 6-7
    What happened to replica (f)?
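The up-to-date check used to deny votes above reduces to a lexicographic comparison on (last log term, last log index).  A minimal sketch of that rule (function name invented):

```python
# Voter-side "up-to-date" check from RequestVote: compare last log term
# first, then last log index, lexicographically.

def candidate_up_to_date(cand_last_term, cand_last_index,
                         my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as ours."""
    return (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)

assert candidate_up_to_date(5, 3, 4, 10)       # higher last term wins
assert not candidate_up_to_date(4, 9, 4, 10)   # same term, shorter log loses
assert candidate_up_to_date(4, 10, 4, 10)      # identical logs are acceptable
```

Tuple comparison makes the two-step rule in the notes a one-liner, which is also how implementations typically avoid getting the precedence of the two conditions wrong.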
        Term 2 leader got a majority of replies to RequestVote, but not to
        AppendEntries
            Leader failed, then the same thing happened to the term 3 leader
How do logs get back in sync from a state like this?
    If AppendEntries fails, the leader decrements nextIndex[follower] and
    tries again
    Eventually, prevLogTerm will match, truncating bad log entries at the
    follower
Paper mentions an optimization--any guesses how to optimize?
    In the rejection, the follower specifies its previous log term and the
    index where that term starts
        E.g., in Figure 7, (f) would say, "I've been at term 3 since index 7"
    Now the # of failed RPCs is the # of out-of-sync terms, not the # of
    out-of-sync log entries

When can a new leader externalize a log entry?
    Must be committed by a majority in the *current* term--why?  See Fig. 8(c)
        If S1 is leader in term 4, it can replicate the old term-2 entry to a
        majority
        But if S1 then fails, S5 can get elected leader and overwrite it with
        <2,3>
    A log entry with a higher term prevents previous entries from being
    overwritten
        In Fig. 8(e), server S5 can no longer win an election (fails the
        up-to-date check)

What is the leader completeness property?
    A log entry committed in term T will be in the leader's log for all
    terms >= T
Why does this hold? (Sec 5.4.3 / Fig.
9)
    Say leader_T commits entry E, but E is missing from leader_U's log
        Let leader_U be the first such leader missing the entry
    By majority, some node (call it S3) acked the AppendEntries for E but
    also voted for leader_U
    Two ways S3 could have voted for leader_U:
        - Same last log term, and leader_U's log at least as long as S3's
            But then by the log matching property, leader_U must have had E
            in its log
        - Leader_U's last log term greater than S3's, and hence greater than T
            Let E' be the last entry of leader_U's log, and let U' be its term
            By assumption, leader_{U'}'s log contained E (U' < U, and
            leader_U is the first leader missing E)
            By the log matching property, all logs containing E' also
            contain E
            Therefore, leader_U's log must contain E

Briefly justify all properties in Figure 3
    - Election safety: at most one leader per term
    - Leader append-only: a leader never deletes or overwrites its log
    - Log matching: [see above]
    - Leader completeness: [see above]
    - State machine safety: servers never apply divergent logs

What must be forced to disk?  All RPCs, before replying (p. 314)
What can go wrong if you don't force AppendEntries to disk?
    Leader replies to client after hearing from 3/5 servers (including
    itself)
    Leader fails and followers reboot, losing state
    New leader won't know about the operation the client already externalized
What can go wrong if you don't force RequestVote RPCs?
    Can vote for multiple candidates in the same term after a crash/reboot
    Might produce two leaders in the same term, violating many assumptions

How to change cluster membership?
    Joint consensus--need a majority from both the old and new configurations
    Add the new configuration in a special operation record, remove the old
    one in a later record
How does the new configuration stay up-to-date?
    Before even joint consensus, new servers join as non-voting members
Why doesn't Paxos alpha work? (p.
318)
    - Can append at most alpha uncommitted log entries
    - Entry at index I can agree to a configuration change at index I+alpha
    Say L is leader at index I+alpha-1, and L is not in the new configuration
        L will stop and trigger a leader election at log index I+alpha
        But now leader completeness of the new leader may not be guaranteed

What evaluation questions should we ask?
    Is it safe?
        TLA+ machine-checked proof of log completeness, with hand-proved
        lemmas
        Informal (English-language) proof of state machine safety (no forks)
        Peer review
    Is performance good?
        Analytically--same number of normal-case messages/disk writes as
        Paxos, VR
        Experiments (Fig. 14) show fail-over time seems reasonable
        In production use, so we know it's good enough for many applications
    What about understandability?
        Ask students--how convincing is this?  Maybe better than nothing?
        Who found the paper easy to understand? [vote]

How adaptable is Raft to different network conditions?
    Fig. 14 taken with a broadcast time of 15ms
        What if servers are spread across continents?  Easily 10x the latency
    Most protocols have lesser synchrony requirements, use adaptive timing
        Keep increasing the timeout if you can't elect a leader
        So the protocol adapts to any maximum latency
    Would this work with Raft?  Probably
        Elections could be similar to Ethernet's randomized exponential
        backoff
        Maybe during a term, the leader can broadcast an adaptive timeout
        based on RTTs?

Does Paxos have any advantages over VR/Raft?  Maybe
    Leader change shares more code paths with normal operation
        Conjecture: may be easier to test, leave fewer weird corner cases
    Greater symmetry makes it adaptable to less centralized situations
        E.g., I wanted SCP to be like VR, but it ended up like Paxos

How would you adapt Zookeeper and Raft to run together (replacing Zab)?
    To compute future state (when translating requests to transactions),
    would need to expose non-committed log entries (change the API slightly)
        Must ensure subsequent operations are invalidated if previous ones
        change
    Leader append-only + leader completeness quite useful here
        C.f. Paxos, which requires an application-level check that previous
        operations happened
What about the combined logging optimization?
    Let the application delay persistently incrementing lastApplied
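The lastApplied point is easier to see against the standard commitIndex/lastApplied apply loop, sketched below with simplified, invented names.  The comment marks the combined-logging idea as an assumption about the application keeping its own durable record of applied effects; this is one reading of the optimization, not the paper's code.

```python
# Sketch of the apply loop every replica runs: apply newly committed
# entries, in log order, to the state machine.

def apply_committed(log, commit_index, last_applied, state, apply_fn):
    """Apply log entries last_applied+1 .. commit_index; return new last_applied."""
    while last_applied < commit_index:
        last_applied += 1
        apply_fn(state, log[last_applied - 1])  # log stored 0-indexed here
        # Combined-logging assumption: if apply_fn durably records its own
        # effects (e.g., in the application's write-ahead log), last_applied
        # itself need not be forced to disk on every iteration.
    return last_applied

state = []
la = apply_committed([(1, "a"), (1, "b"), (2, "c")], 2, 0, state,
                     lambda s, e: s.append(e[1]))
assert la == 2 and state == ["a", "b"]   # entry 3 not yet committed
```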