Paxos
=====

State machine replication review
    Boils down to a consensus problem for each slot in an operation log

FLP review: in a deterministic, asynchronous consensus system,
    pick at most two of: safety, liveness, fault-tolerance

How might you achieve safety and liveness?
    Two-phase commit--safe, and guaranteed to terminate in two phases
    Or simpler: one-phase commit (take whatever the lowest-numbered node says)

How about liveness and fault-tolerance?
    E.g., every node broadcasts its input value,
      waits 5 seconds, then chooses the lexicographically lowest value seen
    Guaranteed to terminate in 5 seconds (if any node is non-failed)
    But a network partition could prevent agreement

What about safety and fault-tolerance?
    In theory, this is what Raft (and viewstamped replication) give us
    But those give us more... how about something simpler?

Straw-man non-live consensus protocol (sketched in code below):
    Everybody broadcasts a value, waits 5 seconds
    Every node votes for the lowest value seen (and can never change its vote)
    If some value receives a majority, output it
    Otherwise, you are "stuck" and there is no liveness
      E.g., 2f+1 nodes total, f voted for 0, f voted for 1, and 1 node failed
      Or no failures, but a three-way split between values 0, 1, and 2

Neither Raft nor the straw-man has liveness.  Why is Raft better?
    The straw-man gets stuck with split votes
    Raft never gets stuck; even if it hasn't terminated, there's always hope
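To make the split-vote problem concrete, here is a minimal single-process
sketch of the straw-man in Python (the function name and input format are
purely illustrative, and the 5-second wait is abstracted into "the set of
values each node happened to see"):

    def straw_man(received):
        """received[i] = set of values node i heard before the timeout
        (None if node i failed).  Returns the decided value, or None if
        the run is permanently stuck."""
        n = len(received)                   # total nodes, n = 2f + 1
        votes = {}                          # value -> number of votes
        for seen in received:
            if not seen:                    # failed node casts no vote
                continue
            v = min(seen)                   # vote for the lowest value seen
            votes[v] = votes.get(v, 0) + 1
        for v, count in votes.items():
            if count > n // 2:              # strict majority => consensus
                return v
        return None                         # split vote: stuck forever

    # The example from above: 7 nodes, 3 vote for 0, 3 vote for 1, 1 failed.
    # No value reaches 4 votes, and no node may change its vote, so the run
    # can never recover: straw_man returns None.
    print(straw_man([{0}, {0}, {0}, {1}, {1}, {1}, None]))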
Some important conclusions from the discussion thus far:
    - Voting is a key technique providing both safety and fault-tolerance
    - But votes can split and produce permanently indeterminate outcomes
    - So better not to vote on questions you can't afford to leave
      indeterminate (e.g., what is the 5th operation in the state machine log)

Key idea in many of these protocols:
    Carefully craft the statements on which nodes vote
      Either avoid the possibility of a split vote
      Or ensure an indeterminate outcome won't permanently block log consensus
    Again, can't directly vote on log entries (e.g., "log index 5 is x")
    Two types of statement are safe to vote on:
    * Irrefutable statements: while relevant, no one ever votes against them
        In Raft, at most one possible value of v in "(log index i, term t) = v"
        So while term t is relevant, no previously cast vote will contradict
          "(i, t) = v"
        Any node learning v is always free to vote for it in (i, t)
    * Neutralizable statements: an indeterminate outcome won't prevent consensus
        In Raft, nodes can fail to elect a leader for a particular term
        No big deal, just "neutralize" that term by moving on to the next term

Viewstamped replication (and Raft) from 50,000 feet:
    Have a single leader sequentially number all operations ...
    Have nodes vote on the operation corresponding to each viewstamp
      Only one leader per view, so this mapping is irrefutable
    Hence liveness is guaranteed until you lose (or think you lost) the leader
    Lost leader?  Must vote on (new viewid, new leader, last included viewstamp)
      (By irrefutability, stale nodes can always increase the last included viewstamp)
      The leader vote could be indeterminate, but it's neutralizable!
        A failed view doesn't preclude successful views with higher viewids

Now let's try a different approach: numbered ballots
    Vote for log entries, but associate each vote with a ballot number
      E.g., "I vote that log index i contains value v *in ballot #b*"
    Approximate idea: suppose ballot b is indeterminate
      ...neutralize it by proceeding with a higher-numbered ballot
    The complication: an indeterminate ballot didn't necessarily fail
      Example: 7 nodes (n1, ..., n7)
        - n2-n4 vote for value v in ballot #b
        - n5-n7 never heard of v, voted for v' in ballot #b' (b' > b)
        - n1 died at some point, but might have voted
      Now what?  Both ballots b and b' are indeterminate.  Did they fail?
        Or did n1 vote for and externalize v?  Or did it externalize v'?
      Can't completely neutralize an indeterminate ballot!
    Solution: conservatively assume indeterminate ballots may have succeeded
      Note, if v' = v above, there is no problem
      So if ballot b is indeterminate, ensure all future ballots use the same value

The (single-decree) Paxos protocol (a code sketch follows the outline):
    * Establish an irrefutable mapping between ballot # and value
        Ballot = a referendum on one value, not a choice between candidate values
        Each ballot is proposed by a single leader and has a single value
          Embed the leader's unique machine-id inside the ballot number
    * Before voting on a value in ballot b, *prepare* b by checking previous ballots
        Leader broadcasts "PREPARE b"
        Cohorts reply "PREPARED b { NULL | (b_old, v_old) }"
          b_old is the last ballot in which the cohort voted before b
          v_old is the value voted for in ballot b_old
          The cohort promises these conditions will still be true when acted on
            by the leader
            ...implies a promise never to vote in any ballot between b_old and b
        Ballot b is prepared after getting PREPARED from a majority of nodes
          If all PREPARED messages have NULL, the leader can use any value v
          Otherwise, the highest b_old is indeterminate (or successful)
            Leader must set v = the v_old corresponding to the highest b_old
    * Now can vote on value v
        Leader broadcasts "COMMIT b v"  [a.k.a. "accept b" in the paper]
        Cohorts reply "COMMITTED b"
        After collecting COMMITTED from a majority of replicas, consensus on v
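A minimal in-memory sketch of this in Python, covering the cohort (acceptor)
side plus the leader's value-choice rule.  It uses plain integer ballot
numbers (a real system would embed the leader's machine-id as a tie-breaker),
omits networking, retries, and durable storage, and all names are illustrative:

    class Cohort:
        def __init__(self):
            self.promised = 0        # highest ballot we have answered PREPARE for
            self.voted = None        # (b_old, v_old) from our last COMMIT vote

        def on_prepare(self, b):
            """PREPARE b  ->  PREPARED b { NULL | (b_old, v_old) }, or reject."""
            if b <= self.promised:
                return None          # stale ballot; ignore (or reply STATUS)
            self.promised = b        # promise: no more votes in ballots < b
            return ("PREPARED", b, self.voted)   # self.voted is None, i.e. NULL

        def on_commit(self, b, v):
            """COMMIT b v  ->  COMMITTED b, unless we promised a higher ballot."""
            if b < self.promised:
                return None
            self.promised = b
            self.voted = (b, v)
            return ("COMMITTED", b)

    def choose_value(prepared_replies, any_value):
        """Leader rule after PREPARED from a majority: if any cohort reported a
        vote, use the v_old of the highest b_old; otherwise use any value."""
        votes = [r[2] for r in prepared_replies if r[2] is not None]
        if not votes:
            return any_value
        return max(votes, key=lambda bv: bv[0])[1]

The safety argument lives in choose_value: an indeterminate older ballot may
have succeeded, so its value is conservatively carried into every later ballot.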
Why is Paxos hard to understand?
    Conjecture: it conflates irrefutable and neutralizable messages

Another way to view Paxos: translate concrete messages to conceptual ones
    The conceptual protocol:
      * commit b v:  vote for consensus on value v in ballot #b
      * abort b v:   vote that ballot #b not reach consensus on value v
      Cannot vote for "commit b v" unless a majority previously voted for
        { abort b' v' | b' <= b && v' != v }
      When such abort messages garner a majority, we say (b, v) is prepared
    The concrete-to-conceptual mapping:
      * PREPARED b (b_old, v_old):
          { vote for "abort b' v'" | b_old < b' < b }
          { confirm "abort b' v'" got a majority | b' <= b_old && v' != v_old }
      * COMMITTED b:
          { vote for "commit b v" }
          v is the (irrefutable) value from COMMIT b v
    So Paxos uses the same tools (irrefutable/neutralizable statements)
      ... but in a much more complicated way

Simple way to use Paxos with a replicated state machine:
    One Paxos instance per log entry (slot), with the slot specified in messages:
      PREPARE i b,  PREPARED i b {NULL|...},  COMMIT i b v,  COMMITTED i b

Can we optimize multi-Paxos down to one round trip in the common case?
    Yes.  Use the same ballot number for all Paxos instances
    * PREPARE i b: prepares ballot b for all slots i' >= i
    * PREPARED i b {NULL | (b_old, v_old)} a:
        - (b_old, v_old) is from the last COMMIT the cohort voted for in slot i
        - a is a bool; if TRUE, it means use NULL for "all future slots" (i' > i)
            This is the common case; means all slots >= i are now prepared
          If a is FALSE, then must issue PREPARE (i+1) b

When can you apply a log index to the state machine and reply to the client?
    When the leader receives COMMITTED i b from a majority.
    How do the cohorts know?  Can piggy-back on new messages:
    * COMMIT i b v i_min:
        - i_min says all slots i' < i_min are committed with ballot # <= b
        So a cohort that voted COMMIT i' b v' can mark (i', v') as committed
        A cohort that voted COMMIT i' b'' v'' for some b'' < b can't commit v''
          It must contact a peer to learn the committed value; but this is not
          the common case

Convenient extensions to avoid the full protocol or other delays where unnecessary:
    * PREPARE i 0
        Special ballot 0 probes a slot without trying to become leader
    * LEARN i v: reply to a request concerning an already committed slot i
        A single LEARN message is enough to externalize v, no need to hold a vote
        Alternatively, LEARN i v =~ PREPARE i FALSE
    * STATUS i_min b: reply to a request with a current slot but a stale ballot number
        Says: sorry, I've promised to ignore all ballots <= b
    Note these are only optimizations, not required for correctness

Reconfiguration: how to add and remove servers?
    Use the log!  Have distinguished log values that add/remove nodes
    Danger: better not commit values unless you know what a majority is!
      Better not even prepare values unless you know what a majority is!
    Lamport's solution?
      Log index i changes the system composition for slot i+alpha
        Places a bound alpha on the number of uncommitted requests
        Fill the log with no-ops if you had fewer than alpha outstanding operations
    Your humble instructor's solution:
      Unlimited pipelining of normal requests
        But only commit values in order (probably want this anyway)
          Ensures you always know what a majority is before deciding a COMMIT
          has achieved a majority
      Don't pipeline *any* COMMIT messages behind a COMMIT i b v that reconfigures
        Ensures you always know the ballot is prepared before sending COMMIT

What kind of synchrony assumptions could guarantee liveness with Paxos/Raft?
    If you know the range of message latencies, set the timeout appropriately
    If you don't, make more limited assumptions
      Example: assume message latencies don't grow exponentially
        Can keep doubling the timeout before starting a new ballot/leader
        election (a minimal backoff sketch appears at the end of these notes)

Dirty little secret: most people think Paxos and VR are the same protocol
    (Your instructor wrote a paper about VR called Paxos Made Practical... oops)
    See https://www.cs.cornell.edu/fbs/publications/viveLaDifference.pdf

Does Paxos have any advantages over VR/Raft?  Maybe
    Leader change shares more code paths with normal operation
      Conjecture: may be easier to test, leaving fewer weird corner cases
    Greater symmetry makes it adaptable to less centralized settings
      Topic of ongoing research I'll tell you about in the last lecture
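As referenced in the synchrony discussion above, here is a minimal Python
sketch of the doubling-timeout idea.  The function try_election and its
timeout parameter are hypothetical stand-ins for "attempt one ballot/leader
election and report whether it completed in time"; only the backoff logic
itself is the point:

    import random

    def elect_with_backoff(try_election, initial_timeout=0.1, max_timeout=60.0):
        """Keep starting new ballots/elections, doubling the timeout each time.
        try_election(timeout) returns the elected leader, or None if the attempt
        did not complete within `timeout` (e.g., messages were too slow)."""
        timeout = initial_timeout
        while True:
            # Randomize slightly to make split votes between competing
            # candidates less likely (as Raft does with election timeouts).
            leader = try_election(timeout * random.uniform(1.0, 1.5))
            if leader is not None:
                return leader
            # If latencies don't grow exponentially, some doubled timeout
            # eventually exceeds the real round-trip time, restoring liveness.
            timeout = min(timeout * 2, max_timeout)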