Practical Byzantine Fault Tolerance
===================================

What is the minimum number of replicas we need to recover from f failures?
  3f+1.  Why?
  f cohorts may fail and stop responding to network messages
    So must be able to make progress after 2f+1 replies
  But if you get 2f+1 replies, they might just be from f slow cohorts
    Which means f of the 2f+1 that replied could be malicious
  But with 3f+1 replicas, a majority (f+1) of any 2f+1 replies is
    guaranteed to come from honest cohorts

So how does this protocol work?
  Number replica cohorts 1, 2, 3, ..., 3f+1
  Number requests with consecutive sequence numbers (not viewstamps)
  System goes through a series of views
    In view v, replica number v mod (3f+1) is designated the primary
    Primary is responsible for selecting the order of operations
      Assigns an increasing sequence number to each operation
  In normal-case operation, use a two-round protocol for request r:
    Round 1 (pre-prepare, prepare) goal:
      Ensure at least f+1 honest replicas agree that
        if request r executes in view v, it will execute with seqno n
    Round 2 (commit) goal:
      Ensure at least f+1 honest replicas agree that
        request r has executed in view v with seqno n
  Note analogy with three-phase commit:
    In first round, cohort agrees that operation can execute
      (in this case, agrees not to accept any other operation
       with seqno n in view v)
    In second round, learns that other cohorts have agreed on operation
    In 3PC, don't reply to client until everyone knows that everyone
      has agreed to execute the operation
    In BFT, don't reply to client until f+1 honest cohorts know that
      f+1 honest cohorts have agreed to execute the operation
    Note that hearing from f+1 honest cohorts means hearing from 2f+1
      cohorts total (at most f of which you assume may be faulty)

Protocol for normal-case operation
  Let c be the client, r_i replica i, p the primary, b_i backup i,
    R the set of all replicas
  c -> p:   m = {REQUEST, o, t, c}_Kc
  p -> R:   {PRE-PREPARE, v, n, d}_Kp, m     (note d = H(m))
  b_i -> R: {PREPARE, v, n, d, i}_Ki
  Replica r_i now waits for the PRE-PREPARE + 2f matching PREPAREs
    Puts these messages in its log
    Then we say prepared (m, v, n, i) is TRUE
  Note: if prepared (m, v, n, i) is TRUE for honest replica r_i,
    then prepared (m', v, n, j) is FALSE for any m' != m and any
    honest r_j
    So no other operation can execute with view v, sequence number n
  Are we done?  Just reply to client?
    No--think of the 3PC analogy.  Just because some other m' won't
      execute at (v, n) doesn't mean m will
    Suppose r_i is compromised right after prepared (m, v, n, i)
    Suppose no other replica received r_i's PREPARE message
    Suppose f replicas are slow and never even received the PRE-PREPARE
    Then no other honest replica knows the request prepared!
      In particular, if p fails, the request might never get executed!
  So we say the operation doesn't execute until
      prepared (m, v, n, i) is TRUE for f+1 non-faulty replicas r_i
    We say committed (m, v, n) is TRUE when this property holds
  So how does a replica *know* committed (m, v, n) holds?
    Add one more message:
    r_i -> R: {COMMIT, v, n, d, i}_Ki
    Replica r_i waits for 2f+1 matching COMMITs (including its own)
  committed-local (m, v, n, i) is TRUE when:
    prepared (m, v, n, i) is TRUE, and
    r_i has 2f+1 matching COMMITs in its log
  Note: if committed-local (m, v, n, i) is TRUE for any non-faulty r_i,
    then committed (m, v, n) is TRUE
    r_i knows when committed-local is TRUE
    So committed-local is a replica's way of knowing committed is TRUE
  r_i replies to the client when committed-local (m, v, n, i) is TRUE
  Client waits for f+1 matching replies, then returns the result to
    the application
  Why f+1 and not 2f+1?
    With f+1 matching replies, at least one is from a non-faulty r_i
    So client knows committed-local (m, v, n, i) held at that replica
    Which in turn implies committed (m, v, n)
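To make the quorum bookkeeping concrete, here is a minimal sketch in
Python (class and field names are invented here, not from the paper) of
a replica's log and the two predicates; it checks only message counts,
ignoring signatures, retransmission, and the network:

    # Quorum arithmetic for n = 3f+1 replicas: any two quorums of size
    # 2f+1 intersect in at least f+1 replicas, hence at least one of
    # them is honest.
    def quorum(f):
        return 2 * f + 1

    class ReplicaLog:
        def __init__(self, f):
            self.f = f
            self.pre_prepare = {}  # (v, n) -> digest d from primary
            self.prepares = {}     # (v, n, d) -> set of backup ids
            self.commits = {}      # (v, n, d) -> set of replica ids

        def on_pre_prepare(self, v, n, d):
            # Accept at most one digest per (v, n); a conflicting
            # PRE-PREPARE for the same slot is ignored.
            self.pre_prepare.setdefault((v, n), d)

        def on_prepare(self, v, n, d, i):
            self.prepares.setdefault((v, n, d), set()).add(i)

        def on_commit(self, v, n, d, i):
            self.commits.setdefault((v, n, d), set()).add(i)

        def prepared(self, v, n, d):
            # The PRE-PREPARE plus 2f matching PREPAREs
            return (self.pre_prepare.get((v, n)) == d and
                    len(self.prepares.get((v, n, d), set())) >= 2 * self.f)

        def committed_local(self, v, n, d):
            # prepared, plus 2f+1 matching COMMITs (counting our own)
            return (self.prepared(v, n, d) and
                    len(self.commits.get((v, n, d), set())) >= quorum(self.f))

E.g., with f = 1 (four replicas), prepared needs the PRE-PREPARE plus
two PREPAREs, and committed-local additionally needs three COMMITs.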
Note the tentative-reply optimization:
  r_i can send a tentative reply to the client after prepared (m, v, n, i)
  Client can accept the result after 2f+1 matching tentative replies

Garbage collecting the message log
  Make periodic checkpoints
  Broadcast {CHECKPOINT, n, d, i}_Ki, where d = digest of state
  When 2f+1 signed CHECKPOINTs are received, the checkpoint is stable
  Restrict sequence numbers to be between h and H
    h = sequence number of last stable checkpoint
    H = h + k (e.g., k might be twice a checkpoint interval of 100)
  Delete all messages below sequence number of stable checkpoint

View changes
  When a client doesn't get an answer, it broadcasts the request to
    all replicas
  If a backup notices the primary is slow/unresponsive:
    - broadcast {VIEW-CHANGE, v+1, n, C, P, i}_Ki
      n is the seqno of the last stable checkpoint
      C is 2f+1 signed CHECKPOINT messages for last stable checkpoint
      P = {P_m}, where each P_m is a signed PRE-PREPARE + 2f PREPAREs
        i.e., P is the set of all messages PREPAREd since the
        checkpoint, plus proof that the messages really are prepared
  When the primary of view v+1 sees 2f signed VIEW-CHANGE messages
    from other replicas:
    - new primary broadcasts {NEW-VIEW, v+1, V, O}_Kp
      V is a set of at least 2f+1 VIEW-CHANGE messages
        (including the new primary's own)
      O is a set of PRE-PREPARE messages for operations that are:
        - after the last stable checkpoint, and
        - in the set P of one of the VIEW-CHANGE messages
      O also contains dummy messages to fill sequence-number gaps
  Replicas may obtain any missing state from each other
    (e.g., stable checkpoint data, or a missing operation, since the
     reissued PRE-PREPARE messages contain only the request's digest)
  What happens if the primary creates an incorrect O in NEW-VIEW?
    E.g., it might send null requests for operations that prepared
    Other replicas can compute O from V (sketched below), and can
      reject the NEW-VIEW message
  What happens if the primary sends different V's to different backups?
    Still okay, because any committed operation appears in 2f+1
      VIEW-CHANGE messages, of which f+1 are from honest replicas,
      so at least one member of V will contain the operation
    So the new primary cannot cause committed operations to be dropped
      Only operations whose answer the client has not yet seen
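A rough sketch, again in Python with invented names, of how the new
primary builds O from V -- and how any backup can recompute it as a
check.  It just applies the two rules above, and for simplicity assumes
at most one digest per sequence number (the paper's rule picks the
prepared certificate from the highest view):

    def compute_O(V, h):
        # V: VIEW-CHANGE messages, each carrying a set vc["P"] of
        #    (n, d) pairs proven prepared at that replica
        # h: seqno of the latest stable checkpoint proven in V
        # Returns a map n -> digest, None marking a dummy null request.
        O, max_n = {}, h
        for vc in V:
            for (n, d) in vc["P"]:
                if n > h:              # only ops after the checkpoint
                    O[n] = d           # reissue PRE-PREPARE for digest d
                    max_n = max(max_n, n)
        for n in range(h + 1, max_n + 1):
            O.setdefault(n, None)      # fill seqno gaps with nulls
        return O

Because O is a deterministic function of V, a backup receiving
{NEW-VIEW, v+1, V, O}_Kp can recompute O and reject the message if the
primary's O disagrees, e.g., if the primary substituted null requests
for operations that actually prepared.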
Discussion
  What problem does BFS solve?
    - Is IS going to run BFS to deal with Byzantine failures?
    - What failures are we talking about?  Compromised servers
    - What about compromised clients?  Authentication and authorization
  How can we extend the system to allow more than (n-1)/3 failures over
    its lifetime?
    - Detect failed replicas using proactive recovery
    - Recover the system periodically, no matter what
      - Makes bad nodes good again
    - Tricky stuff
      - An attacker might steal a compromised replica's keys
  With how many replicas will BFS work reasonably well?

HQ Replication
==============

So far we have been talking about state machines implementing entire
  services
Another way to view the world is in terms of objects
  What is the difference?
    With state machines, any operation's result might depend on all
      previous operations
    With objects, two operations on different objects are independent
  Much less likely to have contention for objects than for state machines
    E.g., a credit card or bank account object vs. an entire bank
If there's no contention, let's have the client drive the entire protocol

First, let's assume no contention and no failures:
  Client -> Servers: {WRITE-1, cid, oid, op#, op}_Kc
    cid - client ID
    oid - object ID
    op# - operation number, local to client
    op  - operation to perform on object
  Servers -> Client: {WRITE-1-OK, grantTS, currentC}_Kr
    grantTS - {cid, oid, op#, h, vs, ts, rid}
      h   - Hash(op)
      vs  - viewstamp from BFS
      ts  - timestamp at which operation should be executed
      rid - replica ID
    currentC - 2f+1 signed grantTS structures from previous operation
  Client -> Servers: {WRITE-2, writeC}_Kc
    writeC - 2f+1 matching grantTS structures, proving the client has
      the right to execute operation op at timestamp ts
  Servers -> Client: {WRITE-2-ANS, result, currentC, rid}_Kr
  After 2f+1 matching replies, we are done
Note reads are simpler:
  C -> S: {READ, cid, oid, op, nonce}_Kc
  S -> C: {READ-ANS, result, nonce, currentC, rid}_Kr
  After 2f+1 matching replies, client returns data to application

What might go wrong even without malicious replicas?
  Client might fail or be slow between WRITE-1 and WRITE-2.  What happens?
    Replicas allow only one pending operation per object
    After a second WRITE-1, reply {WRITE-1-REFUSED, grantTS, cid, op#}_Kr
      grantTS  - the grant sent to the first client
      cid, op# - ID and operation number of the second client
        (to prevent replays)
    At this point, the second client can just finish the first client's
      operation
      The WriteBackWrite and WriteBackRead operations accomplish this
  Two clients could issue simultaneous writes and get conflicting
    certificates
    E.g., f+1 honest replicas granted ts to one client, f+1 to another
    How to fix?  (see the sketch below)
      Client assembles 2f+1 non-matching grants; call this conflictC
      Broadcasts {RESOLVE, conflictC, w1}_Kc
        (w1 is the WRITE-1 that detected the conflict)
      Now replicas invoke the BFT protocol to agree on the order of
        operations, then execute them
      Note that the BFT protocol increases vs, which avoids conflicting
        certificates
What can a faulty client do?
  Can force replicas into conflicting states, which must be RESOLVEd
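A small sketch in the same vein (Python; names invented here) of the
client-side rule after collecting 2f+1 WRITE-1 replies: all-matching
grants form writeC for the WRITE-2, anything else becomes conflictC for
a RESOLVE.  It assumes the rid field has been stripped from each grant
so that grants from different replicas can compare equal:

    from collections import Counter

    def classify_grants(grants, f):
        # grants: 2f+1 grantTS tuples (cid, oid, op#, h, vs, ts)
        # from WRITE-1-OK replies.  Returns the client's next step.
        assert len(grants) == 2 * f + 1
        _, count = Counter(grants).most_common(1)[0]
        if count == 2 * f + 1:
            return ("WRITE-2", grants)   # writeC: all grants match
        return ("RESOLVE", grants)       # conflictC: conflicting grants

In the contention scenario above (f+1 honest replicas grant ts to one
client, f+1 to another), neither client can collect 2f+1 matching
grants, so at least one assembles conflictC and triggers the RESOLVE
path.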