Administrivia:
- Make sure the correct microphone is selected in zoom
- Sign up for group meeting using your sunet ID
- Small Group Feedback Session Monday (bring feedback)

Paxos
=====

State machine replication review
  Boils down to a consensus problem for each slot in an operation log

FLP review:
  In a deterministic, asynchronous consensus system
  Pick at most two of: safety, liveness, fault-tolerance

How might you achieve safety and liveness?
  Two-phase commit--safe, and guaranteed to terminate in two phases
  So just run 2PC across all replicas for each state machine operation

Can we repair a 2PC after one node fails?
  An ATM may have disbursed $20 bills, *externalizing* our transaction
  Externalized real-world actions irreversible, so 2PC outcome must match
  If non-coordinator node fails
    If not all cohorts sent COMMIT-VOTE, coordinator unilaterally ABORTs
    Otherwise, if coordinator already sent COMMIT, commit as usual
  What if coordinator fails?
    Did any cohort receive COMMIT or ABORT from coordinator?
      Cohort can re-broadcast coordinator's dying message, follow it
    Else, did any cohort not (yet) send COMMIT-VOTE?
      (cohort either sent ABORT-VOTE or never received PREPARE)
      Safe to ABORT, because coordinator couldn't have sent COMMIT
    What if all cohorts sent COMMIT-VOTE, none heard from coordinator?
      Coordinator could have disbursed funds or not--we are stuck

Solution: Three-phase commit
  2PC's problem was "prepared" state where cohorts can go either way
  So now after PREPARE, if everyone votes COMMIT, broadcast PREPARE-COMMIT
  Once all participants ACK PREPARE-COMMIT, broadcast COMMIT
  After failure, COMMIT if and only if any survivor saw PREPARE-COMMIT
  Why is this better?
    2PC: execute transaction when everyone is *willing* to COMMIT it
    3PC: execute transaction when everyone *knows* everyone willing to COMMIT
    ATM only disburses funds once everyone knows transaction can commit
  Why doesn't 3PC violate FLP?
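The 3PC recovery rule ("COMMIT if and only if any survivor saw PREPARE-COMMIT") can be sketched as a small decision function. This is an illustration only; the function name and state labels are assumptions, not part of any protocol spec.

```python
# Hypothetical sketch of the 3PC post-failure decision rule.
# State names are assumptions: "init" = no PREPARE seen,
# "voted-commit" = sent COMMIT-VOTE, "prepare-commit" = saw
# PREPARE-COMMIT, "committed" = saw final COMMIT.

def recover_3pc(survivor_states):
    """Decide the outcome for surviving cohorts after a failure."""
    # If anyone reached prepare-commit (or beyond), everyone *knows*
    # everyone was willing to commit, so it is safe to COMMIT.
    if any(s in ("prepare-commit", "committed") for s in survivor_states):
        return "COMMIT"
    # Otherwise no one could have externalized the transaction:
    # the coordinator never sent COMMIT, so it is safe to ABORT.
    return "ABORT"

print(recover_3pc(["voted-commit", "prepare-commit"]))  # COMMIT
print(recover_3pc(["init", "voted-commit"]))            # ABORT
```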
    Must *know* that a node has failed to initiate repair
    In asynchronous model, can't tell failed nodes from slow ones

How about liveness and fault-tolerance?
  Simple gossip protocol
  - Every node broadcasts a proposal
  - After 5 sec., output lowest value seen
  Guaranteed to terminate in 5 seconds (if any node non-faulty)
  But network partition or slow nodes could prevent agreement

What about safety and fault-tolerance?  Topic of this and next lecture...
  Almost nobody seems to understand how to implement consensus
    They know you use "Paxos" for consensus, but don't know what that is
  My theory: We've been teaching it wrong
    Notably, original Paxos paper (The Part-Time Parliament) unreadable
    Just using the word "simple" a lot in rewrite may not help understanding
  Another theory: The protocols are wrong because they are hard to teach
    Monday, we will look at a protocol intended to correct this problem

Straw man safe+fault-tolerant consensus: "gossip and vote" protocol:
  - Every node broadcasts a proposed value
  - After 5 sec., broadcast vote for lowest value seen
  - If some value receives votes from a majority, output it
  - Otherwise, you are "stuck" and there is no liveness
    E.g., 2f+1 nodes total, f voted for 0, f voted for 1, and 1 node failed
    Or no failures but three-way split between values 0, 1, and 2

Neither Paxos nor "gossip and vote" guarantees liveness.  Why is Paxos better?
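The "gossip and vote" straw man can be simulated in a few lines, including its stuck state. The function name and vote representation below are assumptions for illustration.

```python
# Sketch (an assumption, not code from the notes) of "gossip and vote":
# each node votes for the lowest value it saw within the window;
# consensus requires a strict majority of *all* n nodes.

from collections import Counter

def gossip_and_vote(n, votes):
    """n = total nodes (2f+1); votes = values voted for by nodes still
    alive (each voted for the lowest proposal it received).
    Returns the decided value, or None if permanently stuck."""
    value, count = Counter(votes).most_common(1)[0]
    return value if count > n // 2 else None

# 7 nodes, everyone saw proposal 0 -> consensus on 0
assert gossip_and_vote(7, [0] * 7) == 0
# 2f+1 = 7: f=3 voted 0, f=3 voted 1, one node failed -> stuck forever
assert gossip_and_vote(7, [0, 0, 0, 1, 1, 1]) is None
```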
  Gossip and vote gets into a stuck state with split votes/failed nodes
  Paxos never gets stuck; even if it hasn't terminated, there's always hope

 f+1 0 votes  +--------+            +---------+
 +----------->|0-valent+----------->|learned 0|
 |            +----+---+            +---------+
 |                 \
+---------+         \               +---------+
|Undecided+----------+------------->|  stuck  |
+---------+         /               +---------+
 |                 /
 |            +----+---+            +---------+
 +----------->|1-valent+----------->|learned 1|
 f+1 1 votes  +--------+            +---------+

Some important conclusions from discussion thus far:
- Voting is a key technique providing both safety and fault-tolerance
- But votes can split and produce permanently indeterminate outcomes
- Or failed cohorts + lost votes mean never learn system 0/1-valent
- So better NOT vote on questions you can't afford to leave indeterminate
  (e.g., what is the 5th operation in the state machine log)

Key idea in many consensus protocols:
  Carefully craft statements on which nodes will cast votes
    Either avoid the possibility of a split vote
    Or ensure indeterminate outcome won't permanently block consensus on log
  Again, don't directly vote on log entries (e.g., "log index 5 is x")
  Two types of statement are okay to vote on:
    Irrefutable statements: no one ever votes against, so no split votes
      E.g., vote that it's past 3pm on Wednesday, April 15, or don't vote yet
    Neutralizable statements: indeterminate outcome won't prevent consensus

Key Paxos mechanism: numbered ballots
  Vote for log entries, but associate each vote with a ballot number
    E.g., "I vote that log index i contains value v *in ballot #b*"
  Approximate idea: Suppose ballot b is indeterminate
    ...neutralize it by proceeding with a higher-numbered ballot
  The complication: An indeterminate ballot didn't necessarily fail
    A node might externalize a value and then fail
    The network could lose the last messages sent by such a node
    Yet once externalized, a value must be chosen (e.g., ATM disbursement)

Example: 7 nodes (n1, ..., n7).
- n2-n4 vote for value v in ballot #b
- n5-n7 never heard of v, voted for v' in ballot #b' (b' > b)
- n1 died at some point, but might have voted and externalized

Now what?  Both ballots b and b' are indeterminate.  Did they fail?
  Or did n1 vote for and externalize v?  or v'?
  No way to fix an indeterminate ballot!

Solution: Conservatively assume indeterminate ballots could be externalized
  Note, if v' = v above, there is no problem
  So if ballot b is indeterminate, ensure all future ballots use same value

The (single-decree) Paxos protocol:
* Establish irrefutable mapping between ballot # and value
    Ballot = referendum on one value, not choice between candidate values
    Each ballot proposed by a single leader, so has single value
    Embed leader's unique machine-id inside ballot number
* Before voting on value in ballot b, *prepare* b by checking previous ballots
    Leader broadcasts "PREPARE b"                        [Phase 1a in paper]
    Cohorts reply "PREPARED b {NULL | <b_old, v_old>}"   [Phase 1b]
      b_old is the last ballot in which cohort voted before b
      v_old is the value in ballot b_old
      Cohort promises conditions will still be true when acted on by leader
        ...implies promise never to vote for any ballot between b_old and b
    Ballot b prepared after getting PREPARED from a majority of nodes
      If all PREPARED messages have NULL, leader can use any value v
      Otherwise, highest b_old is either indeterminate (or successful)
        Leader must set v = v_old corresponding to highest b_old
* Now can vote on value v
    Leader broadcasts "COMMIT b v"                       [a.k.a.
"accept b" in paper Phase 2b] Cohorts reply "COMMITTED b" After collecting COMMITTED from majority of replicas, have consensus on v Another way to view Paxos: Translate concrete messages to conceptual ones The conceptual protocol: * commit b v: Vote for consensus on value v in ballot #b * abort b v: Vote that ballot #b did not reach consensus on value v Invariant: For each b, can only vote "commit b v" for a single value v Value is determined by leader of ballot b (whose id is embedded in b) Invariant: can only vote "commit b v" after a majority votes for { abort b' v' | b' <= b && v' != v } When such abort messages garner majority, we say (b, v) is prepared I.e., first vote you *didn't* reach consensus on any v' != v for b' <= b Can't even cast a commit vote before consensus on what didn't commit Result: all committed and indeterminate ballots have same value! The concrete to conceptual mapping * PREPARED b : { vote for "abort b' v'" | b_old < b' < b } [for all values v'] { assert "abort b' v'" got majority | b' <= b_old && v' != v_old } * COMMITTED b vote for "commit b v", { "abort b v'" | v' != v } v is value from COMMIT b v Simple way to use paxos with replicated state machine: One paxos instance per log entry (slot), specified in messages: PREPARE i b, PREPARED i b {NULL|...}, COMMIT i b, COMMITTED b Can we optimize multi-paxos for only one round trip in common case? Yes. Use same ballot number for all paxos instances * PREPARE i b: prepares ballot b for all slots i' >= i * PREPARED i b {NULL|} a: - is last COMMIT sent for ballot b - a is bool, if TRUE, means use NULL for "all future slots" (i' > i) This is the common case; means all slots >= i are now prepared If a is FALSE, then must issue PREPARE (i+1) b When can you apply an log index to state machine and reply to client? When leader receives COMMITTED i b from a majority. How do cohorts know? 
  Can piggyback on new messages:
  * COMMIT i b v i_min:
    - i_min says all slots i' < i_min committed with ballot # <= b
      So cohort that voted COMMIT i b v can mark (i, v) as committed
      Cohort that voted COMMIT i b' v' for b' < b can't commit v'
        must contact peer to learn value of v; but this is not common case

Convenient extensions to avoid full protocol or other delays where unnecessary:
* PREPARE i 0
    Special ballot 0 probes slot without trying to become leader
* LEARN i v: Reply to request concerning already committed slot i
    A single LEARN message enough to externalize v, no need to hold vote
    Alternatively, LEARN i v =~ PREPARED i FALSE
* STATUS i_min b: Reply to msg with stale ballot number; b is current ballot
    Says, "Sorry, I've promised to ignore all ballots <= b."
Note these are only optimizations, not required for correctness

Reconfiguration: How to add and remove servers?
  Use the log!  Have distinguished log values that add/remove nodes
  Danger: Better not commit values unless you know what a majority is!
    Better not even prepare values unless you know what a majority is!
  Lamport's solution?  Log index i changes system composition for slot i+alpha
    Places bound alpha on number of uncommitted requests
    Fill log with nops if you had fewer than alpha outstanding operations
  Your humble instructor's solution: Unlimited pipelining of normal requests
    But only commit values in order (probably want this anyway)
      Ensures you always know what majority is before deciding a COMMIT got majority
    Don't pipeline *any* COMMIT messages behind COMMIT i b v that reconfigures
      Ensures you always know ballot is prepared before sending COMMIT

What kind of synchrony assumptions could guarantee liveness with Paxos?
  If you know range of message latencies, set timeout appropriately
    Don't propose more than one ballot per timeout
    Or don't propose if another cohort proposed higher ballot within timeout
  If you don't know max message latency, make more limited assumptions
    Example: Assume message latencies don't grow exponentially
      Can keep doubling timeout before starting new ballot/leader election
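Putting the single-decree protocol together, here is a toy, single-process sketch of the PREPARE/COMMIT message flow. The class and function names, and the in-memory simulation, are assumptions for illustration; real ballot numbers embed the leader's machine-id for uniqueness, while this single-leader toy just uses integers.

```python
# Toy sketch (an assumption, not a reference implementation) of
# single-decree Paxos as described above.

class Cohort:
    def __init__(self):
        self.promised = 0                  # highest ballot prepared
        self.b_old, self.v_old = 0, None   # last vote cast, if any

    def prepare(self, b):
        """PREPARE b -> PREPARED b {NULL | <b_old, v_old>} or refusal."""
        if b <= self.promised:
            return None    # stale ballot; real protocol could send STATUS
        self.promised = b  # promise: no votes for ballots between b_old and b
        return (self.b_old, self.v_old)

    def commit(self, b, v):
        """COMMIT b v -> COMMITTED b, unless promised to a higher ballot."""
        if b < self.promised:
            return False
        self.promised = b
        self.b_old, self.v_old = b, v
        return True

def run_ballot(cohorts, b, v):
    """Leader: prepare b; adopt v_old of highest b_old; then call the vote."""
    replies = [r for r in (c.prepare(b) for c in cohorts) if r is not None]
    if len(replies) <= len(cohorts) // 2:
        return None                 # couldn't prepare; retry with higher b
    b_max, v_max = max(replies, key=lambda r: r[0])
    if v_max is not None:
        v = v_max                   # must keep possibly-externalized value
    acks = sum(c.commit(b, v) for c in cohorts)
    return v if acks > len(cohorts) // 2 else None

cohorts = [Cohort() for _ in range(5)]
assert run_ballot(cohorts, b=1, v="A") == "A"
# A later ballot re-discovers the earlier value and must keep it:
assert run_ballot(cohorts, b=2, v="B") == "A"
# A stale ballot cannot even be prepared:
assert run_ballot(cohorts, b=1, v="C") is None
```

Note how the prepare phase enforces the key invariant from the notes: once ballot 1 may have been externalized, ballot 2 is forced to reuse its value.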