Raft
====

What is the goal of Raft?  Understandability--how to achieve?
  Better problem decomposition:  leader election, log replication, safety
  Reduce state space and non-determinism
    No holes in logs, don't re-number committed log entries
    Do use randomness to simplify leader election

Like VR "views", Raft uses *terms*, with designated leader in each term

What happens to client requests in normal-case (no leader failures)?
  Clients sends request operation to the leader
  Leader assigns the operation a *log index* (next open slot in log)
  Leader sends operation to all other servers with AppendEntries RPC (Fig. 1)
    [go over each argument in figure]
  Leader receives success reply from a majority (including itself)
  Leader increments commitIndex, applies operation, returns result to client
Is AppendEntries idempotent?  Yes, can just keep retransmitting
When to stop retransmitting log entry?  When every follower has replied
  Note this could be *after* replying to the client
  Need this to catch up followers that may have been temporarily partitioned
Look at Figure 6:
  Which log entries can can be applied to state? 1-7
  Which log entries can be garbage-collected? 1-2

What happens when the leader fails?
  Followers stop getting heartbeats (empty AppendEntries RPCs)
  After *election timeout*, follower assumes no viable leader
  Follower enters *candidate* state, broadcasts RequestVote RPC (Fig. 1)
    Servers can only vote for one leader per term
    Also deny vote to candidate with less up-to-date log
    - If last term in log different, higher term is more up to date
    - Else if same term, higher log entry is more up to date
  Possible election outcomes:
    - Candidate receives votes from a majority of nodes (including itself)?
      Transitions to *leader* state
    - Receive AppendEntries for term >= candidate's term
      Recognize new leader, transition to *follower* state
    - Stuck vote (no candidate receives a majority)
      Increment term and restart candidate process after *election timeout*
  New leader sends AppendEntries RPC
  Why don't we always get stuck?  Randomize timeout (e.g., 150-300ms)
  How long should election timeout be?  (p. 313)
      broadcastTime << electionTimeout << MTBF
    Too short: false positives, gratuitous elections
    Too long: large fraction of time spent in elections, not log replication

How does Raft's "strong leader" differ from Paxos leaders?
  In Raft, log entries flow only from leader to followers
  Unlike Paxos, Raft guarantees only one leader at a time
    Specifically, leader not active until majority renounce old leader
  Leads to "log matching property"--what's this?
    1. Particular <log-index,term> uniquely identifies operation, and
    2. Two logs containing same <log-index,term> identical up to index--why?
       AppendEntries specifies prevLogTerm; follower rejects if mismatch

Look at figure 7
  What happened to replica (b)?
    Fell behind (e.g., network partition or downtime)
  What happened to replica (e)?
    Heard some term-4 entries that never made it to majority
    Majority that hadn't heard of indexes 6 and 7 voted for new leader
  What happened to replica (f)?
    Leader for term 2 got majority RequestVote replies, but not AppendEntries
    Leader failed, then same thing happened to term 3 leader
  How do logs get back in sync from a state like this?
    If AppendEntries fails, decrement logIndex[follower], try again
    Eventually, prevLogTerm will match and truncate bad log entries at follower
    Paper mentions optimization--any guesses how to optimize?
      In rejection, specify follower's prev log term, and index where it starts
      E.g., in figure 7, f would say, "I've been at 3 since index 7"
      Now failed RPCs is # out-of-sync terms, not # out-of-sync log entries

When can a new leader externalize a log entry?
  Must be committed by majority in *current* term--Why?  See Fig 8
    If S5 is leader in term 5, can replicate <index 2, term 3> to majority
    But if S5 then fails, S1 can get elected leader, overwrite w. <2,2>
  Log entry w. higher term prevents prev. entries from being overwritten
    In Fig 8(e), server S5 can no longer win an election (fails up-to-date)

What is leader completeness property?
  Log entry is committed in term T will be in leader log for all terms >= T
  Why does this hold?  (Sec 5.4.3 / Fig. 9)
    Say leader_T commits entry E but E is missing from leader_U's log
      w.l.o.g., let leader_U be first such leader missing entry
    By majority, some node (S3) acked AppendEntry E but voted for leader_U
      Two ways S3 could have voted for leader_U:
      - Shared same last log term and leader_U's log at least as long as S3's
        But then by log matching property, leader_U must have had E in log
      - Leader_U's last log term greater than S3's and hence greater than T
        But then leader of term of leader_U's last log entry must have had E
	So again by log matching property leader_U's log must contain E

What must be forced to disk?  all RPCs before replying (p. 314)
  What can go wrong if you don't force AppendEntries to disk?
    Leader replies to client after hearing from 3/5 servers (including itself)
    Leader fails and followers reboot, losing state
    New leader won't know about operation that client externalized
  What can go wrong if you don't force RequestVote RPCs?
    Can vote for multiple candidates in same term after crash/reboot
    Might produce two leaders in the same term, violating many assumptions

How to change cluster membership?
  Joint consensus--need majority from both old and new configuration
    Add new configuration in special operation record, remove old in later one
    How does new configuration stay up-to-date?
      Before even joint consensus, new servers join as non-voting members
  Why doesn't Paxos alpha work?
    Paxos assumes can reach consensus w/o leader (p. 318)

What evaluation questions should we ask?
  Is it safe?
    TLA+ machine-checked proof of log completeness, with hand-proved lemmas
    Informal (English-language) proof of state machine safety (no forks)
    Peer review
  Is performance good?
    Analytically--same number of normal-case messages/disk writes as Paxos, VR
    Experiments (Fig. 14) show fail-over time seems reasonable
    In production use, so we know it's good enough for many applications
  What about understandability?
    Ask students--how convincing is this?  Maybe better than nothing?
    How about this class? [run zoom poll]

Does Paxos have any advantages over VR/Raft?  Maybe
  Leader change shares more code paths with normal operation
    Conjecture:  may be easier to test, leave fewer weird corner cases
  Greater symmetry makes it adaptable to less centralized situations
    Topic of research I'll tell you about in last lecture