Raft (2014)
===========

Zookeeper achieves fault-tolerance through state-machine replication
    All servers begin with the same initial state
    Reach consensus on an ordered list of deterministic operations (the log)
    All servers apply the same operations in the same order to maintain
    identical state
Today: how to achieve consensus.

Let's abstract the consensus problem:
    Bunch of nodes {V_i}, each with an input I_i
        E.g., I_i is V_i's idea of what should go at some log index
    Through the consensus protocol, node V_i produces output O_i
    In the asynchronous model, make no assumptions about timing
        No bound on message delay or on relative execution speeds of nodes
        So can't differentiate a failed node from a slow one
        But with at most f failures, can wait for f+1 messages (including own)
    Want two safety properties:
        - Agreement: All non-faulty nodes that produce output choose the same O
        - Validity: O is equal to some node V_i's input I_i
    Want liveness: eventually all non-faulty nodes produce an output
    Want fault-tolerance: works if a node fails at any point in the protocol

In the early 1980s, people kept proposing broken consensus protocols.
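The quorum logic above (wait for f+1 of N = 2f+1 replies) is safe because any two sets of f+1 nodes out of 2f+1 must share a member, so two completed waits always have a common witness.  A small brute-force check of that fact, as an illustrative sketch (function name invented):

```python
# Why waiting for f+1 of N = 2f+1 replies is safe: any two quorums of
# size f+1 drawn from 2f+1 nodes intersect, so two majority decisions
# always share at least one witness node.
from itertools import combinations

def quorums_intersect(f: int) -> bool:
    """Check that every pair of (f+1)-sized quorums of 2f+1 nodes overlaps."""
    nodes = range(2 * f + 1)
    q = f + 1
    return all(set(a) & set(b)
               for a in combinations(nodes, q)
               for b in combinations(nodes, q))

# Exhaustively verify for small cluster sizes (N = 3, 5, 7).
for f in range(1, 4):
    assert quorums_intersect(f)
```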
Then FLP: no deterministic, asynchronous protocol can achieve safety,
liveness, and fault-tolerance
    Proof overview:
        For some inputs, the output inevitably depends on the behavior of the
        network
        Must be able to complete without hearing all inputs, in case a node
        failed
            In one case might choose I_i; in another, V_i is delayed, so
            complete without it
        To terminate, some "deciding message" D must seal the fate of the
        system
            Network can't affect the output after D, even if all nodes don't
            know it yet
        Fault-tolerance means the system must be able to complete without D
            So if D is delayed, some other message must "neutralize" it so it
            is no longer deciding
        A pathological network can delay all deciding messages until
        neutralized
    Don't confuse FLP with the "CAP theorem," a much more obvious fact:
        Can't have all 3 of *C*onsistency, *A*vailability,
        *P*artition-tolerance
            Because if nodes can't communicate, they can't stay in sync
        FLP says if your protocol is fault-tolerant, it can't guarantee
        termination *even if no nodes actually fail*

Let's look at some straw-man consensus protocols.  Assume N = 2f+1

Simple vote:
    Each V_i broadcasts a vote for its input I_i
    f+1 votes for the same value?  Choose it as the output
    Agreement? yes.  Validity? yes.  Fault-tolerance? yes.
    Termination? No - gets stuck if there is no majority value

Leader vote:
    Each V_i broadcasts I_i, considers the lowest V_j it hears from to be
    leader
    Collect at least f+1 inputs, then vote for the leader's I_j
    If f+1 votes for the same value I, choose it as O
    Agreement? yes.  Validity? yes.  Fault-tolerance? yes.
    Termination? No - still gets stuck if different leaders, no majority value

Ballot sequence:
    Instead of one vote, allow a bunch of votes, each with a number B
    Let B = <n,i>, where the low bits (i) designate V_i as B's leader
        Because B includes the leader's ID, only one proposed value per ballot
    Leader V_i says, "In ballot B = <n,i>, please vote for my input I_i"
    Collect f+1 votes for O in ballot B?  Output O and announce to other nodes
    Else?
Timeout, move to a higher ballot
        Once B is increased, ignore all messages with lower-numbered ballots
    Termination? yes.  Fault-tolerance? yes.  Validity? yes.
    Agreement? No - different ballots could have different outputs

Paxos ballot sequence:
    Idea: like above, but ensure all majorities choose the same value
    V_i first asks, "I'm doing ballot B.  What's your most recent vote?"
        Each V_j promises to ignore messages with ballots < B
        Then responds with its most recent vote and that vote's ballot number
        (if any)
    If f+1 nodes have never voted, V_i asks everyone to vote for I_i
    Otherwise, asks everyone to vote for the value of the most recent
    (highest-ballot) vote learned
        So if any previous ballot got a majority, guaranteed to re-use the
        same value
    Validity? yes.  Agreement? yes.  Fault-tolerance? yes.
    Termination? no [as required by FLP], but doesn't get permanently stuck
        And if the network is ever faster than the timeout [synchrony], will
        terminate
    Given FLP, this hits the sweet spot:
        Always safe and fault-tolerant
        Never permanently stuck
        Guaranteed to terminate whenever the network is faster than timeouts
        Can even dynamically lengthen timeouts if the latency bound is unknown

How I personally analyze consensus protocols:
    Decompose them into explicit or implicit votes on statements
    To ensure you don't get stuck, each statement must be at least one of:
        Irrefutable - all nodes that vote will vote for the same thing
            E.g., one leader requesting the vote, so all that get the request
            vote the same way
        Neutralizable - can make the statement irrelevant for consensus
            E.g., the (f+1)st vote in ballot B would complete consensus
            But if delayed and the leader moves to B' > B, a vote at B becomes
            irrelevant

High-level: Paxos achieves liveness through a series of votes on a log entry
    Safety requires carefully choosing what to vote on
But for a log, can decompose the problem another way [viewstamped replication]
    One leader, and a single vote on a series of different log records
    One leader means no split votes, but might not get a majority
        Leader might fail, or a slow network might trigger
timeout
    Then decide which records will actually be part of the log
        So neutralize a stuck vote by just not including the entry in the log
    Vote for a new leader and for where it should resume extending the log
        Leader 1: op_1, op_2, op_3, op_4
        Leader 2: [vote fails on leader 2, triggering another timeout]
        Leader 3: op_4', op_5
[Maybe talk about Paxos vs. VR confusion]

What is the goal of Raft?
    Understandability--how to achieve?
    Better problem decomposition: leader election, log replication, safety
    Reduce state space and non-determinism
        No holes in logs, don't re-number committed log entries
        Do use randomness to simplify leader election
    Like VR "views", Raft uses *terms*, with a designated leader in each term

What happens to client requests in the normal case (no leader failures)?
    Client sends a request operation to the leader
    Leader assigns the operation a *log index* (next open slot in the log)
    Leader sends the operation to all other servers with the AppendEntries
    RPC (Fig. 2)
        [go over each argument in figure 2]
    Leader receives success replies from a majority (including itself)
    Leader increments commitIndex, applies the operation, returns the result
    to the client
Is AppendEntries idempotent?  Yes, can just keep retransmitting
When to stop retransmitting a log entry?  When every follower has replied
    Note this could be *after* replying to the client
    Need this to catch up followers that may have been temporarily partitioned
Look at Figure 6:
    Which log entries can be applied to the state machine?  1-7
    Which log entries can be garbage-collected?  1-2
In AppendEntries step 5, when is commitIndex set lower than the leader's?
    Leader may be catching up a follower who doesn't have the log entry yet

What happens when the leader fails?
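Before turning to leader failure: the follower-side AppendEntries behavior just described (consistency check plus idempotent retransmission) can be sketched as below.  This is an illustrative simplification, not the paper's code; `append_entries` and its argument names are invented for the sketch, and entries are (term, command) pairs.

```python
# Illustrative follower-side AppendEntries handler (simplified from Fig. 2).
# The log is 1-indexed conceptually but stored 0-indexed here.

def append_entries(log, prev_log_index, prev_log_term, entries):
    """Return True on success.  Safe to call repeatedly (idempotent)."""
    # Reject unless our log has an entry matching (prevLogIndex, prevLogTerm).
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False
    # Overwrite everything after prevLogIndex with the new entries.  (A
    # simplification of the paper's step 3, which deletes only on conflict;
    # the net effect here is the same and retransmission is idempotent.)
    del log[prev_log_index:]
    log.extend(entries)
    return True

log = [(1, "x=1"), (1, "x=2")]
assert append_entries(log, 2, 1, [(2, "y=9")])      # appends entry 3
assert append_entries(log, 2, 1, [(2, "y=9")])      # retransmit: same result
assert log == [(1, "x=1"), (1, "x=2"), (2, "y=9")]
assert not append_entries(log, 5, 2, [(2, "z=0")])  # missing prev entry: reject
```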
Followers stop getting heartbeats (empty AppendEntries RPCs)
    After an *election timeout*, a follower assumes there is no viable leader
    Follower enters *candidate* state, broadcasts RequestVote RPC
        [go over each argument in figure 2]
        Servers can only vote for one leader per term
        Also deny the vote to a candidate with a less up-to-date log
            - If the last terms in the logs differ, the higher term is more
              up to date
            - Else if the terms are the same, the higher last log index is
              more up to date
    Possible election outcomes:
        - Candidate receives votes from a majority of nodes (including
          itself)?  Transitions to *leader* state
        - Receives AppendEntries for term >= candidate's term?
          Recognizes the new leader, transitions to *follower* state
        - Stuck vote (no candidate receives a majority)?
          Increment term and restart candidate process after *election
          timeout*
    New leader sends AppendEntries RPC
Why don't we always get stuck?  Randomized timeouts (e.g., 150-300ms)
How long should the election timeout be? (p. 313)
    broadcastTime << electionTimeout << MTBF
    Too short: false positives, gratuitous elections
    Too long: large fraction of time spent in elections, not log replication

How does Raft's "strong leader" differ from Paxos leaders?
    In Raft, log entries flow only from the leader to followers
    Unlike Paxos, Raft guarantees only one leader at a time
        Specifically, a leader is not active until a majority renounce the
        old leader
Leads to the "log matching property"--what's this? (Fig. 3)
    1. A particular <index, term> pair uniquely identifies an operation, and
    2. Two logs containing the same <index, term> are identical up to that
       index--why?
        AppendEntries specifies prevLogTerm; the follower rejects on mismatch

Look at Figure 7
    What happened to replica (b)?
        Fell behind (e.g., network partition or downtime)
    What happened to replica (e)?
        Heard some term-4 entries that never made it to a majority--how?
            Could have been the leader of term 4, and no one got the
            AppendEntries RPC
            Could have been the only follower to hear the entries
                Other three followers then elected a new leader that hadn't
                heard indexes 6-7
    What happened to replica (f)?
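The up-to-date check used to deny votes above reduces to a lexicographic comparison on (last log term, last log index).  A minimal sketch of that rule (function name invented):

```python
# Voter-side "up-to-date" check from RequestVote: compare last log term
# first, then last log index, lexicographically.

def candidate_up_to_date(cand_last_term, cand_last_index,
                         my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as ours."""
    return (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)

assert candidate_up_to_date(5, 3, 4, 10)       # higher last term wins
assert not candidate_up_to_date(4, 9, 4, 10)   # same term, shorter log loses
assert candidate_up_to_date(4, 10, 4, 10)      # identical logs are acceptable
```

Tuple comparison makes the two-step rule in the notes a one-liner, which is also how implementations typically avoid getting the precedence of the two conditions wrong.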
        Term 2 leader got a majority of replies to RequestVote, but not to
        AppendEntries
            Leader failed, then the same thing happened to the term 3 leader
How do logs get back in sync from a state like this?
    If AppendEntries fails, the leader decrements nextIndex[follower] and
    tries again
    Eventually, prevLogTerm will match, truncating bad log entries at the
    follower
Paper mentions an optimization--any guesses how to optimize?
    In the rejection, the follower specifies its previous log term and the
    index where that term starts
        E.g., in Figure 7, (f) would say, "I've been at term 3 since index 7"
    Now the # of failed RPCs is the # of out-of-sync terms, not the # of
    out-of-sync log entries

When can a new leader externalize a log entry?
    Must be committed by a majority in the *current* term--why?  See Fig. 8(c)
        If S1 is leader in term 4, it can replicate the old term-2 entry to a
        majority
        But if S1 then fails, S5 can get elected leader and overwrite it with
        <2,3>
    A log entry with a higher term prevents previous entries from being
    overwritten
        In Fig. 8(e), server S5 can no longer win an election (fails the
        up-to-date check)

What is the leader completeness property?
    A log entry committed in term T will be in the leader's log for all
    terms >= T
Why does this hold? (Sec 5.4.3 / Fig.
9)
    Say leader_T commits entry E, but E is missing from leader_U's log
        Let leader_U be the first such leader missing the entry
    By majority, some node (call it S3) acked the AppendEntries for E but
    also voted for leader_U
    Two ways S3 could have voted for leader_U:
        - Same last log term, and leader_U's log at least as long as S3's
            But then by the log matching property, leader_U must have had E
            in its log
        - Leader_U's last log term greater than S3's, and hence greater than T
            Let E' be the last entry of leader_U's log, and let U' be its term
            By assumption, leader_{U'}'s log contained E (U' < U, and
            leader_U is the first leader missing E)
            By the log matching property, all logs containing E' also
            contain E
            Therefore, leader_U's log must contain E

Briefly justify all properties in Figure 3
    - Election safety: at most one leader per term
    - Leader append-only: a leader never deletes or overwrites its log
    - Log matching: [see above]
    - Leader completeness: [see above]
    - State machine safety: servers never apply divergent logs

What must be forced to disk?  All RPCs, before replying (p. 314)
What can go wrong if you don't force AppendEntries to disk?
    Leader replies to client after hearing from 3/5 servers (including
    itself)
    Leader fails and followers reboot, losing state
    New leader won't know about the operation the client already externalized
What can go wrong if you don't force RequestVote RPCs?
    Can vote for multiple candidates in the same term after a crash/reboot
    Might produce two leaders in the same term, violating many assumptions

How to change cluster membership?
    Joint consensus--need a majority from both the old and new configurations
    Add the new configuration in a special operation record, remove the old
    one in a later record
How does the new configuration stay up-to-date?
    Before even joint consensus, new servers join as non-voting members
Why doesn't Paxos alpha work? (p.
318)
    - Can append at most alpha uncommitted log entries
    - Entry at index I can agree to a configuration change at index I+alpha
    Say L is leader at index I+alpha-1, and L is not in the new configuration
        L will stop and trigger a leader election at log index I+alpha
        But now leader completeness of the new leader may not be guaranteed

What evaluation questions should we ask?
    Is it safe?
        TLA+ machine-checked proof of log completeness, with hand-proved
        lemmas
        Informal (English-language) proof of state machine safety (no forks)
        Peer review
    Is performance good?
        Analytically--same number of normal-case messages/disk writes as
        Paxos, VR
        Experiments (Fig. 14) show fail-over time seems reasonable
        In production use, so we know it's good enough for many applications
    What about understandability?
        Ask students--how convincing is this?  Maybe better than nothing?
        Who found the paper easy to understand? [vote]

How adaptable is Raft to different network conditions?
    Fig. 14 taken with a broadcast time of 15ms
        What if servers are spread across continents?  Easily 10x the latency
    Most protocols have lesser synchrony requirements, use adaptive timing
        Keep increasing the timeout if you can't elect a leader
        So the protocol adapts to any maximum latency
    Would this work with Raft?  Probably
        Elections could be similar to Ethernet's randomized exponential
        backoff
        Maybe during a term, the leader can broadcast an adaptive timeout
        based on RTTs?

Does Paxos have any advantages over VR/Raft?  Maybe
    Leader change shares more code paths with normal operation
        Conjecture: may be easier to test, leave fewer weird corner cases
    Greater symmetry makes it adaptable to less centralized situations
        E.g., I wanted SCP to be like VR, but it ended up like Paxos

How would you adapt Zookeeper and Raft to run together (replacing Zab)?
    To compute future state (when translating requests to transactions),
    would need to expose non-committed log entries (change the API slightly)
        Must ensure subsequent operations are invalidated if previous ones
        change
    Leader append-only + leader completeness quite useful here
        C.f. Paxos, which requires an application-level check that previous
        operations happened
What about the combined logging optimization?
    Let the application delay persistently incrementing lastApplied
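The lastApplied point is easier to see against the standard commitIndex/lastApplied apply loop, sketched below with simplified, invented names.  The comment marks the combined-logging idea as an assumption about the application keeping its own durable record of applied effects; this is one reading of the optimization, not the paper's code.

```python
# Sketch of the apply loop every replica runs: apply newly committed
# entries, in log order, to the state machine.

def apply_committed(log, commit_index, last_applied, state, apply_fn):
    """Apply log entries last_applied+1 .. commit_index; return new last_applied."""
    while last_applied < commit_index:
        last_applied += 1
        apply_fn(state, log[last_applied - 1])  # log stored 0-indexed here
        # Combined-logging assumption: if apply_fn durably records its own
        # effects (e.g., in the application's write-ahead log), last_applied
        # itself need not be forced to disk on every iteration.
    return last_applied

state = []
la = apply_committed([(1, "a"), (1, "b"), (2, "c")], 2, 0, state,
                     lambda s, e: s.append(e[1]))
assert la == 2 and state == ["a", "b"]   # entry 3 not yet committed
```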