Sign up for AWS credit by tomorrow night

Raft
====
What is the goal of Raft?
  Understandability--how to achieve?
    Better problem decomposition: leader election, log replication, safety
    Reduce state space and non-determinism
      No holes in logs, don't re-number committed log entries
      Do use randomness to simplify leader election
  Like VR "views", Raft uses *terms*, with a designated leader in each term

What happens to client requests in the normal case (no leader failures)?
  Client sends request operation to the leader
  Leader assigns the operation a *log index* (next open slot in log)
  Leader sends operation to all other servers with AppendEntries RPC (Fig. 2)
    [go over each argument in figure 2]
  Leader receives success reply from a majority (including itself)
  Leader increments commitIndex, applies operation, returns result to client

Is AppendEntries idempotent?  Yes, can just keep retransmitting
  When to stop retransmitting a log entry?  When every follower has replied
    Note this could be *after* replying to the client
    Need this to catch up followers that may have been temporarily partitioned

Look at Figure 6:
  Which log entries can be applied to state?  1-7
  Which log entries can be garbage-collected?  1-2
  In AppendEntries step 5, when is commitIndex set lower than the leader's?
    Leader may be catching up a follower who doesn't have the log entry yet

What happens when the leader fails?
  Followers stop getting heartbeats (empty AppendEntries RPCs)
  After *election timeout*, follower assumes no viable leader
  Follower enters *candidate* state, broadcasts RequestVote RPC
    [go over each argument in figure 2]
    Servers can only vote for one leader per term
    Also deny vote to candidate with less up-to-date log
      - If last term in log different, higher term is more up to date
      - Else if same term, longer log is more up to date
  Possible election outcomes:
    - Candidate receives votes from a majority of nodes (including itself)?
      Transitions to *leader* state
    - Receives AppendEntries for term >= candidate's term?
      Recognizes new leader, transitions to *follower* state
    - Split vote (no candidate receives a majority)?
      Increment term and restart candidate process after *election timeout*
  New leader sends AppendEntries RPC
  Why don't we always get stuck?  Randomize timeout (e.g., 150-300ms)
  How long should election timeout be? (p. 313)
    broadcastTime << electionTimeout << MTBF
    Too short: false positives, gratuitous elections
    Too long: large fraction of time spent in elections, not log replication

How does Raft's "strong leader" differ from Paxos leaders?
  In Raft, log entries flow only from leader to followers
  Unlike Paxos, Raft guarantees only one leader at a time
    Specifically, leader not active until a majority renounce the old leader
  Leads to "log matching property"--what's this? (Fig. 3)
    1. A particular <index, term> pair uniquely identifies an operation, and
    2. Two logs containing the same <index, term> are identical up to that
       index--why?
       AppendEntries specifies prevLogTerm; follower rejects if mismatch

Look at figure 7
  What happened to replica (b)?
    Fell behind (e.g., network partition or downtime)
  What happened to replica (e)?
    Heard some term-4 entries that never made it to a majority--how?
      Could have been leader of term 4, no one got AppendEntries RPC
      Could have been only follower to hear entries
    Other three followers elected a new leader that hadn't heard indexes 6-7
  What happened to replica (f)?
    Term 2 leader got majority of replies to RequestVote but not AppendEntries
    Leader failed, then same thing happened to term 3 leader
  How do logs get back in sync from a state like this?
    If AppendEntries fails, decrement nextIndex[follower], try again
    Eventually, prevLogTerm will match and truncate bad log entries at follower
  Paper mentions an optimization--any guesses how to optimize?
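The basic back-up-by-one repair loop just described can be sketched in code. This is a toy in-memory model, not the paper's implementation: `Entry`, `append_entries`, and `catch_up` are illustrative names, and the follower unconditionally truncates at prevLogIndex (the paper only deletes conflicting entries, but the leader resends the whole suffix here, so the result is the same).

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int
    op: str

def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower side: reject on prevLogTerm mismatch, else truncate+append."""
    if prev_index > 0:
        if prev_index > len(follower_log) or \
           follower_log[prev_index - 1].term != prev_term:
            return False
    del follower_log[prev_index:]          # drop (possibly conflicting) suffix
    follower_log.extend(entries)
    return True

def catch_up(leader_log, follower_log):
    """Leader side: back up one entry per rejection (unoptimized)."""
    next_index = len(leader_log) + 1       # 1-based indexes, as in the paper
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1].term if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return
        next_index -= 1                    # retry with an earlier prefix
```

Running `catch_up` with a leader log of terms [1,1,1,4,4,5,5,6,6,6] against a figure-7(f)-style follower with terms [1,1,1,2,2,2,3,3,3,3,3] issues one failed RPC per out-of-sync *entry* before the logs match, which is what the optimization below improves.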
    In rejection, follower specifies its last log term, and the index where
    that term starts
      E.g., in figure 7, f would say, "I've been at 3 since index 7"
    Now # of failed RPCs is # of out-of-sync terms, not # of out-of-sync
    log entries

When can a new leader externalize a log entry?
  Must be committed by a majority in the *current* term--why?  See Fig. 8(c)
    If S5 is leader in term 5, can replicate to majority
    But if S5 then fails, S1 can get elected leader, overwrite with <2,2>
    Log entry with higher term prevents previous entries from being
    overwritten
    In Fig. 8(e), server S5 can no longer win an election (fails the
    up-to-date check)

What is the leader completeness property?
  A log entry committed in term T will be in the leader's log for all
  terms >= T
  Why does this hold? (Sec 5.4.3 / Fig. 9)
    Say leader_T commits entry E but E is missing from leader_U's log
      Let leader_U be the first such leader missing the entry
    By majority overlap, some node (S3) acked AppendEntries for E and also
    voted for leader_U
    Two ways S3 could have voted for leader_U:
    - Shared same last log term, and leader_U's log at least as long as S3's
      But then by log matching property, leader_U must have had E in its log
    - Leader_U's last log term greater than S3's, and hence greater than T
      But then the leader of the term of leader_U's last log entry must have
      had E
      So again by log matching property leader_U's log must contain E

Briefly justify all properties in Figure 3
  - Election safety: at most one leader per term
  - Leader append-only: leader never deletes or overwrites its log
  - Log matching: [see above]
  - Leader completeness: [see above]
  - State machine safety: servers never apply divergent logs

What must be forced to disk?  All RPCs before replying (p. 314)
  What can go wrong if you don't force AppendEntries to disk?
    Leader replies to client after hearing from 3/5 servers (including itself)
    Leader fails and followers reboot, losing state
    New leader won't know about the operation that the client externalized
  What can go wrong if you don't force RequestVote RPCs?
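The force-to-disk rule can be made concrete with a toy vote handler that persists currentTerm and votedFor before replying. Everything here (the `VoterState` class, the JSON file format) is made up for illustration, and the up-to-date log check is omitted for brevity:

```python
import json, os

class VoterState:
    """Toy persistent election state (illustrative names, not Raft's API)."""
    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            with open(path) as f:
                st = json.load(f)
        else:
            st = {"currentTerm": 0, "votedFor": None}
        self.current_term = st["currentTerm"]
        self.voted_for = st["votedFor"]

    def _persist(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"currentTerm": self.current_term,
                       "votedFor": self.voted_for}, f)
            f.flush()
            os.fsync(f.fileno())           # force to disk *before* replying
        os.replace(tmp, self.path)         # atomic rename

    def request_vote(self, term, candidate_id):
        if term < self.current_term:
            return False
        if term > self.current_term:       # new term: forget old vote
            self.current_term, self.voted_for = term, None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            self._persist()                # skipping this fsync means a
            return True                    # reboot could vote twice per term
        return False

```

If `_persist` were skipped, a crash and reboot would reload `votedFor = None` and happily grant a second vote in the same term, which is exactly the failure mode in the next question.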
    Can vote for multiple candidates in the same term after crash/reboot
    Might produce two leaders in the same term, violating many assumptions

How to change cluster membership?
  Joint consensus--need a majority from both old and new configurations
  Add new configuration in a special operation record, remove old in a
  later one
  How does the new configuration stay up to date?
    Before even joint consensus, new servers join as non-voting members
  Why doesn't Paxos alpha work?
    Paxos assumes can reach consensus w/o a leader (p. 318)

What evaluation questions should we ask?
  Is it safe?
    TLA+ machine-checked proof of log completeness, with hand-proved lemmas
    Informal (English-language) proof of state machine safety (no forks)
    Peer review
  Is performance good?
    Analytically--same number of normal-case messages/disk writes as Paxos, VR
    Experiments (Fig. 14) show fail-over time seems reasonable
    In production use, so we know it's good enough for many applications
  What about understandability?
    Ask students--how convincing is this?  Maybe better than nothing?
    How about this class? [vote]

How adaptable is Raft to different network conditions?
  Fig. 14 taken with broadcast time of 15ms
    What if servers spread across continents?  Easily 10x the latency
  Most protocols have lesser synchrony requirements, use adaptive timing
    Keep increasing the timeout if you can't elect a leader
    So protocol adapts to any maximum latency
  Would this work with Raft?  Probably
    Elections could be similar to Ethernet's randomized exponential backoff
    Maybe during a term, leader can broadcast adaptive timeout based on RTTs?

Does Paxos have any advantages over VR/Raft?  Maybe
  Leader change shares more code paths with normal operation
    Conjecture: may be easier to test, leave fewer weird corner cases
  Greater symmetry makes it adaptable to less centralized situations
    Topic of research I'll tell you about at end of quarter

How would you adapt Zookeeper and Raft to run together (replacing Zab)?
  To compute future state (when translating requests to transactions),
  would need to expose non-committed log entries (change API slightly)
    Must ensure subsequent operations are invalidated if previous ones change
    Leader append-only + leader completeness quite useful here
    C.f. Paxos, which requires an application-level check that previous
    operations happened
  What about the combined logging optimization?
    Let the application delay persistently incrementing lastApplied
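One way to picture the delayed-lastApplied idea: the state machine applies committed entries eagerly but only checkpoints lastApplied lazily, so after a crash the committed-but-uncheckpointed suffix is re-applied. This is a toy sketch under the assumption that apply() is deterministic (so replaying is harmless); all names are illustrative:

```python
def recover(log, checkpointed_last_applied, commit_index, state, apply):
    """Re-apply the committed suffix of the log that was applied before the
    crash but whose lastApplied checkpoint never reached disk."""
    for i in range(checkpointed_last_applied + 1, commit_index + 1):
        apply(state, log[i - 1])           # 1-based log indexes
    return commit_index                    # the new lastApplied

# Example: a counter state machine whose ops are increments
def apply_increment(state, op):
    state["total"] += op

state = {"total": 0}
log = [3, 4, 5]                            # committed operations
recover(log, 0, 3, state, apply_increment)
```

A fresh replica (checkpoint 0) replays all three entries; a replica that crashed after checkpointing lastApplied = 2 replays only the last one, and determinism guarantees both end in the same state.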