The short-term plan: Last Thursday we talked about need for consensus in replication Went over why 2PC, 3PC alone don't work, and also covered basic Paxos Today's agenda: Use Paxos to build a replicated system for availability Extend Paxos to deal with untrustworthy replicas Today's discussion will likely continue into next Tuesday's lecture Note: Later in term, will discuss replication for performance Recall Paxos algorithm. To propose a value: 1. Broadcast PREPARE(n) [where low-order bits of n are your machine ID] - Nodes ignore message if they have seen a higher value of n - Nodes reply with PREPARE-RESULT(n', v') or PREPARE-RESULT(0, nil) n' is highest numbered proposal seen, v' is value proposed PREPARE-RESULT is promise not to accept value w. lower numbered n 2. If majority accepts PREPARE, Broadcast PROPOSE(n, v) - v is v' from PREPARE-RESULT(n', v') with highest v' - if n' is nil, then can propose any value 3. If majority accepts PROPOSE, then done - broadcast DECIDE(n, v) to announce decision Big picture goal: Structure your system as a state machine Re-implement your system as distributed system w. replicated cohorts Cohorts use Paxos to agree on order of all operations But Paxos is not guaranteed to terminate How can a replicated system hope to the same system? Section 4.1 of last week's Lampson paper, defines "Y *implements* X" as: 1. X's safety property implies Y's safety property, and 2. Y's liveness property implies X's liveness property Let's examine what this means in more detail... [This portion of notes from Alpern & Schneider] One way to view a system is as sequence of externally-visible states An *execution* is an infinite sequence of states \sigma = s_0, s_1, s_2, ... where a single atomic action moves from s_i to s_{i+1} if a program terminates, last state just repeats infinitely A *property* is a set of executions. Examples: 1. Value of variable clock never decreases 2. Once variable done is true, variable result doesn't change A *program* also defines a set of executions A property *holds* for a program if its executions are a subset of property E.g., both example properties hold for program "for (;;);" Property 2 doesn't hold for program "done = true; result++;" What is a safety property? Informally, means some "bad thing" doesn't happen during execution What are examples of safety properties? - mutual exclusion - freedom from deadlock - partial correctness - what's this? Means if program produces output, output is correct function of input But program might not terminate on some inputs Given undecidability, this is often the best one can hope to verify - first-come-first-served Formally, P is a safety property iff: If execution s_0, s_1, ... is not in P, then there is some i >= 0 such that P holds for no execution beginning s_0, ..., s_i In other words, something bad happened by state s_i What is a liveness property? Informally, means some "good thing" happens during execution What are examples of liveness properties? - starvation freedom (make progress infinitely often) - termination - responsiveness (every request eventually satisfied) - leads to (event of type e_1 must eventually be followed by one of e_2) E.g., received RPC message followed by sent reply Formally, P is a liveness property iff: For any partial execution s_0, ..., s_i, there is some extension s_0, ..., s_i, s_{i+1}, s_{i+2}, ... for which P holds I.e., any partial execution can be "fixed" by making good thing happen later Note: Any property is the intersection of a safety and liveness property Argument for property P: - Note intersection of any two safety properties is a safety property You are just unioning all the "bad" prefixes - Let E be set of all possible executions (which is a safety property) - Let P' be smallest safety property containing P - Let L = E - (P' - P) [i.e., anything in P or not in P'] - Notice P = L \cap P' - Now suppose L is not a liveness property: Must be some prefix X = s_1, ..., s_i such that X|anything not in L Let O be set of executions starting X E - O is a safety property Also, since O and L disjoint, O \subseteq E-L = (P' - P) But O \subseteq (P' - P) implies P \subseteq (P' \cap (E-O)) (P' \cap (E-O)) is a safety property, contradicting P' smallest Example: "total correctness" safety property = partial correctness liveness property = termination Example: "until" means eventually E_2 will happen, and all preceding events are E_1 safety property = No trace with something other than E_1 before first E_2 liveness property = E_2 evantually happens Now look at Lampson's definition of "Y *implements* X" iff: 1. X's safety property implies Y's safety property, and 2. Y's liveness property implies X's liveness property What does it mean for one property to imply another? Recall properties are just sets of executions So P_1 implies P_2 just means P_1 contains P_2 Can you explain definition of implements? Any program can produce a set of executions, so a program is a property That property can be written as intersection of Safety \cap Liveness X's safety property = complement of executions with "disallowed prefixes" X's liveness property = all executions in X or not in X's safety property 1. means Y doens't execute any sequence containing a "bad prefix" for X equivalently, any prefix of an execution of Y, could also be of X 2. means Y's executions contain the "good things" X always does Let's look at an example: program_A () { program_B () { x = input (); x = input (); while (random () & 1) if (random () & 1) output ("thinking"); for (;;) output (x*x) output ("thinking"); } output (x*x) } What are program A's safety and liveness properties? (Just like Until above) Safety: Only outputs "thinking" before outputting x^2 Liveness: Eventually outputs x^2 What about program B? Safety: Outputs either x^2 or series of "thinking" Liveness: Will generate some output Does program A implement program B? No, because sequence "thinking", x^2 doesn't satisfy B's safety property Does program B implement program A? Yes, because two things hold: 1. B's safety property is a subset of A's looking at B's output, can't know for sure that you aren't executing A 2. A's liveness property is s superset of B's Why requirement 2? Why not just say Y implements X if X's safety implies Y's? Then trivial program "for (;;);" would implement everything 2 Says Y must be capable of doing what X does, even if it not every time Now back to replication... Want replicated system to implement unreplicated What is safety of unreplicated state machine? Goal is *linearizability*. What's this? Means 1. State is always result of executing list of requests from clients, and 2. List preserves temporal order of non-concurrent operations Linearizability is a handy property for distributed systems components - Totally local property Can implement/check w/o using 2PC or other protocols across components - Can also implement mutexes that work across components If everyone acquires mutex, then accesses object, then frees mutex, then as long as mutex and object linearizable, you are okay What liveness property do we want for replicated state machine? If, after series of requests, unreplicated system can transition to state s then replicated system should also be capable of transitioning to s What we really want is: Replicated system always transitions to s What we can hope for, given FLP, is: Unless you are really *unlucky*, replicates system will transition to s Here "luck" captures something outside of the asynchronous model What unlucky mean in BFT paper? Message delay doesn't grow exponentially with time, E.g., - dead machine never responds (lucky) - slow network suddenly adds 2 minutes to each packet (lucky) - each packet takes twice as long as previous (unlucky) Paxos Made Practical ==================== So now let's replicate a state machine assuming fail-stop cohorts Fail-stop means failed cohort stops sending any network messages As opposed to Byzantine, where can behave arbitrarily First, establish consensus on a view, which includes: view-id, one primary cohort, set of backup cohorts Assuming we have a view, execute an operations: 0. Client sends operation (including client's timestamp) to primary 1. Primary assigns a *viewstamp* to operation: ts = number of operation within this view (1, 2, 3, ...) viewstamp = Logs operation, broadcasts it to backups 2. Backups log operations (in oder), and confirm to primary 3. Primary waits for *submajority* of backups to confirm submajority = min x such that (x+1) > n/2 Primary + submajority of backups = majority of cohorts Executes operation, sends result to client Important points: - All cohorts agree on order of operations by comparing viewstamps - Client doesn't see result until at least majority of cohorts log operation When a cohort (possibly primary) fails, form new view. Must ensure: 1. Don't "forget" any past operations for which you replied to client 2. Make sure two competing, divergent views cannot form Solve both problems using Paxos: Instance of Paxos is current view-id Paxos proposal number is proposed new view-id Value you are agreeing on is: Configuration of new view (with cohorts are primary, which backups) Last operation executed in previous active view How to solve problems 1 and 2 above: 1. Make sure new view has old primary or a majority of old cohorts 2. Guaranteed by Paxos if set of possible cohorts fixed over time Slight trickiness when adding new cohorts--see paper for details Practical Byzantine Fault Tolerance =================================== Now what happens if a cohort is bad? Can become primary, return arbitrary results to clients Straw-man, 2f+1 cohorts, require f+1 to reply to client: 0. Client sends signed request to primary (so bad primary can't forge) 1. Primary assigns viewstamp to request, and broadcasts to backups 2. Backups execute operation, sign { viewstamp, result } send result to primary and to client 3. Client accepts results after f+1 matching replies The good news about this protocol: If at most f cohorts fail, client receives >=1 reply from honest cohort What is the bad news? Fails to achieve safety property (linearizability) Suppose cohorts (replicas) r_1 .. r_f malicious, r_{f+1} .. r_{2f+1} honest Let r_1 be the malicious primary client u_1 issues operation o_1: r_1 (primary) sends it to r_2 .. r_{f+1} only u_1 gets f+1 replies, assumes operation is complete client u_2 issues operation o_2: Now r_1 sends it to r_2 .. r_f and r_{f+2} only r_{f+2} returns as if o_1 never executed (it doesn't know about o_1) r_1 .. r_f, because malicious, also return as if o_1 never executed u_2 sees bad result--not linearizable Why is this bad? E.g., o_1 was acquiring a mutex or I commit change call you on phone say get new state of object client u_2 issues operation o_3: Now r_1 (primary) sends operations (o_2, o_1, o_3) to r_{f+3} r_{f+3} replies as if o_2 executed before o_1 other bad r_1 .. r_f send same reply as r_{f+3} basically malicious cohorts are completely violating safety What is the minimum replicas that we will need to recover from f failures? 3f+1. Why? f cohorts may fail and stop responding to network messages So must make progress after 2f+1 replies But if get 2f+1 replies, might just have f slow cohorts Which means f of remaining 2f+1 could be malicious But w. 3f+1 a majority (f+1) of 2f+1 replies guaranteed from honest cohorts