Practical Byzantine Fault Tolerance
===================================

Suppose you have N replicas, f of which might crash (non-Byzantine failure).
What quorum size Q do you need to guarantee liveness and safety?

* Liveness (or pseudo-liveness, i.e., avoiding stuck states):
    There must be a non-failed quorum (*quorum availability*)
    Hence: Q <= N - f
* Safety:
    Any two quorums must intersect at one or more nodes
    Otherwise, two quorums could independently accept operations and diverge
    This property is often known as the *quorum intersection* property
    Hence: 2Q - N > 0

So: N < 2Q <= 2(N - f)
  Highest possible f: N < 2N - 2f, so f < N/2
  And if N = 2f + 1, the smallest Q satisfies 2Q > 2f + 1, so Q = f + 1

Now say we throw in Byzantine failures.  One view:
  say you have N nodes, f of which might experience Byzantine failure.

First, how can Byzantine failures be worse than non-Byzantine?
  Byzantine nodes can vote for both a statement and its contradiction
    (i.e., make different statements to different nodes)
  Consequences:
    Risks driving non-failed nodes into divergent states
    Risks driving non-failed nodes into "stuck states"
      E.g., cause a split vote on a seemingly irrefutable statement
      Paxos example:
        You think a majority aborted some ballot b with value v
        You vote to commit ballot b' with value v' (where b' > b, v' != v)
        Can't convince other nodes it is safe to vote for b'

What quorum size Q do we need in the Byzantine setting?
* Liveness: Q <= N - f
    As in the non-Byzantine case, failed nodes might not reply
* Safety: any quorum intersection must contain at least one non-faulty node
    Idea: of any f+1 nodes, at most f can be faulty, so one must be honest
    Hence: 2Q - N > f (since up to f nodes in the intersection could be malicious)

So: N + f < 2Q <= 2(N - f)
  Highest f: N + f < 2N - 2f, so 3f < N, i.e., f < N/3
  And if N = 3f + 1, the smallest Q is:
    N + f < 2Q;  4f + 1 < 2Q;  Q > 2f + 1/2;  Q_min = 2f + 1

So how does the PBFT protocol work?
  Number replica cohorts 1, 2, 3, ..., 3f+1
  Number requests with consecutive sequence numbers (not viewstamps)
  System goes through a series of views
    In view v, replica number v mod (3f+1) is designated the primary
    Primary is responsible for selecting the order of operations
      Assigns an increasing sequence number to each operation

In normal-case operation, use a two-round protocol for request r:
  Round 1 (pre-prepare, prepare) goal:
    ensure at least f+1 honest replicas agree that
    if request r executes in view v, it will execute with sequence no. n
  Round 2 (commit) goal:
    ensure at least f+1 honest replicas agree that
    request r has executed in view v with sequence no. n

Protocol for normal-case operation
  Let c be the client, r_i replica i (p the primary, b_i backup i),
  and R the set of all replicas:
    c   -> p:  m = {REQUEST, o, t, c}_Kc
    p   -> R:  {PRE-PREPARE, v, n, d}_Kp, m        (note d = H(m))
    b_i -> R:  {PREPARE, v, n, d, i}_K{r_i}
  [Note all messages are signed, so we omit signatures and use < > henceforth.]
  Replica r_i now waits for the PRE-PREPARE + 2f matching PREPARE messages
    Puts these messages in its log
    Then we say prepared(m, v, n, i) is TRUE (see the code sketch below)

Note: if prepared(m, v, n, i) is TRUE for an honest replica r_i,
  then prepared(m', v, n, j) is FALSE for any m' != m at any honest r_j
  So no other operation can execute with view v and sequence number n

Are we done?  Can we just reply to the client?  No!
  Just because no other m' will execute at (v, n) doesn't mean m will
  Suppose r_i is compromised right after prepared(m, v, n, i) becomes TRUE
  Suppose no other replica received r_i's PREPARE message
  Suppose f replicas are slow and never even received the PRE-PREPARE
  Then no other honest replica will know the request prepared!
    In particular, if p fails, the request might never get executed!
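To make the prepared predicate concrete, here is a minimal Python sketch of the prepare-phase bookkeeping a replica might do.  It only illustrates the counting rule above, and is not code from the paper; the names (MessageLog, on_pre_prepare, on_prepare) are hypothetical, and signature verification is assumed to have already happened.

```python
from collections import defaultdict

class MessageLog:
    """Hypothetical sketch of a replica's log for the prepare phase."""

    def __init__(self, f):
        self.f = f                        # max faulty replicas tolerated
        self.pre_prepares = {}            # (v, n) -> digest d from the primary
        self.prepares = defaultdict(set)  # (v, n, d) -> ids of PREPARE senders

    def on_pre_prepare(self, v, n, d):
        # Accept at most one PRE-PREPARE per (v, n): two different digests
        # for the same slot would prove the primary is faulty.
        if self.pre_prepares.get((v, n), d) != d:
            raise ValueError("conflicting PRE-PREPARE: faulty primary")
        self.pre_prepares[(v, n)] = d

    def on_prepare(self, v, n, d, i):
        self.prepares[(v, n, d)].add(i)

    def prepared(self, v, n):
        # prepared(m, v, n, i): a PRE-PREPARE for m plus 2f PREPAREs from
        # distinct replicas, all matching m's digest.  Counting the primary,
        # that is 2f+1 of N = 3f+1 replicas, so any two such quorums share
        # at least f+1 replicas, one of them honest -- which is why no
        # conflicting m' can also prepare at (v, n).
        d = self.pre_prepares.get((v, n))
        return d is not None and len(self.prepares[(v, n, d)]) >= 2 * self.f
```

For example, with f = 1 (N = 4), prepared(v, n) becomes TRUE once the PRE-PREPARE and two matching PREPAREs from distinct backups are in the log.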
So we say the operation doesn't execute until
    prepared(m, v, n, i) is TRUE for f+1 non-faulty replicas r_i
  We say committed(m, v, n) is TRUE when this property holds

So how does a replica *know* committed(m, v, n) holds?
  Add one more message, sent only after prepared(m, v, n, i):
    r_i -> R:  <COMMIT, v, n, d, i>
  Replica r_i waits for 2f+1 matching COMMIT messages (including its own)
  committed-local(m, v, n, i) is TRUE when (sketched in code below):
    prepared(m, v, n, i) is TRUE, and
    r_i has 2f+1 matching COMMITs in its log

Note: if committed-local(m, v, n, i) is TRUE for any non-faulty r_i,
    then committed(m, v, n) is TRUE
  r_i knows when committed-local is TRUE
  So committed-local is a replica's way of knowing that committed is TRUE

r_i replies to the client when committed-local(m, v, n, i) is TRUE
  Client waits for f+1 matching replies, then returns the result
  Why f+1 and not 2f+1?
    Of f+1 replies, at least one is from a non-faulty replica r_i
    So the client knows committed-local(m, v, n, i) holds
    Which in turn implies committed(m, v, n)

Note tentative reply optimization:
  r_i can send a tentative reply to the client after prepared(m, v, n, i)
  Client can accept the result after 2f+1 matching tentative replies.  Why?
    f+1 of those replies must be from honest nodes
    And at least 1 of those f+1 will be part of the 2f+1 forming a new view
    So that 1 node will make sure the operation makes it into the new view

Garbage collecting the message log
  Make periodic checkpoints
  Broadcast <CHECKPOINT, n, d, i>, where d = digest of state
  When 2f+1 signed CHECKPOINT messages are received, the checkpoint is stable
  Restrict sequence numbers to be between h and H
    h = sequence number of last stable checkpoint
    H = h + k (e.g., k might be 2 * a checkpoint interval of 100)
  Delete all messages below the sequence number of the stable checkpoint

View changes
  When a client doesn't get an answer, it broadcasts its request to all replicas
  If a backup notices the primary is slow/unresponsive:
    broadcast <VIEW-CHANGE, v+1, n, C, P, i>
      C is 2f+1 signed CHECKPOINT messages for the last stable checkpoint
        (which has sequence number n)
      P = {P_m} where each P_m is a signed PRE-PREPARE + 2f signed PREPAREs
        i.e., P is the set of all PREPAREd messages since the checkpoint,
        plus proof that the messages really are prepared
  When the primary of view v+1 sees 2f signed VIEW-CHANGE messages from others:
    new primary broadcasts <NEW-VIEW, v+1, V, O>
      V is a set of at least 2f+1 VIEW-CHANGE messages
        (including the new primary's own)
      O is a set of PRE-PREPARE messages, for operations that are:
        - after the last stable checkpoint, and
        - in the set P of at least one of the VIEW-CHANGE messages
      O also contains dummy messages to fill in sequence number gaps
  Replicas may obtain any missing state from each other
    (e.g., stable checkpoint data, or a missing operation, since reissued
    PRE-PREPARE messages contain only the digest of the request)

What happens if the primary creates an incorrect O in the NEW-VIEW message?
  E.g., it might send null requests for operations that actually prepared
  Other replicas can compute O from V themselves, and reject the NEW-VIEW message

What happens if the primary sends different V's to different backups?
  Still okay: any committed operation prepared at f+1 honest replicas,
    and any valid V contains 2f+1 VIEW-CHANGE messages, f+1 of them honest;
    since 2(f+1) > 2f+1, at least one honest message in V reports the operation
  So the new primary cannot cause committed operations to be dropped
    Only operations for which the client has not yet seen the answer can be lost
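Continuing the sketch above (same caveats: hypothetical names, signatures assumed verified), the commit phase and the client's acceptance rule reduce to two more counting checks:

```python
from collections import defaultdict

class CommitTracker:
    """Hypothetical sketch of the commit phase bookkeeping."""

    def __init__(self, f):
        self.f = f
        self.commits = defaultdict(set)   # (v, n, d) -> ids of COMMIT senders

    def on_commit(self, v, n, d, i):
        self.commits[(v, n, d)].add(i)

    def committed_local(self, v, n, d, prepared):
        # committed-local(m, v, n, i): prepared(m, v, n, i) holds (the caller
        # passes in the result of the prepare-phase check) and the replica
        # has 2f+1 matching COMMITs, its own included, in its log.
        return prepared and len(self.commits[(v, n, d)]) >= 2 * self.f + 1


def client_accepts(replies, f, tentative=False):
    """Client-side rule: f+1 matching replies suffice for a committed result
    (at least one is honest, and honest replicas reply only after
    committed-local); 2f+1 are needed for the tentative-reply optimization,
    so that f+1 honest replicas vouch for it and at least one of those
    survives into any new view."""
    needed = (2 * f + 1) if tentative else (f + 1)
    voters = defaultdict(set)             # result -> replica ids backing it
    for replica_id, result in replies:
        voters[result].add(replica_id)
        if len(voters[result]) >= needed:
            return result
    return None                           # keep waiting (or retransmit)
```

Note that client_accepts counts distinct replica ids per result, so a faulty replica cannot inflate a count by repeating itself.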
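The garbage-collection rule can be sketched in the same style (again a hedged illustration, not the paper's code; CheckpointTracker and its methods are made up for this example):

```python
from collections import defaultdict

class CheckpointTracker:
    """Hypothetical sketch of checkpointing and log garbage collection."""

    def __init__(self, f, k):
        self.f, self.k = f, k
        self.h = 0                        # seq no of last stable checkpoint
        self.votes = defaultdict(set)     # (n, d) -> ids of CHECKPOINT senders
        self.log = {}                     # seq no -> logged protocol messages

    def on_checkpoint(self, n, d, i):
        self.votes[(n, d)].add(i)
        # 2f+1 matching signed CHECKPOINTs make checkpoint n stable:
        # advance the low watermark h and discard older log entries.
        if len(self.votes[(n, d)]) >= 2 * self.f + 1 and n > self.h:
            self.h = n
            for seq in [s for s in self.log if s <= n]:
                del self.log[seq]

    def in_window(self, n):
        # Replicas only accept messages with h < n <= H = h + k.
        return self.h < n <= self.h + self.k
```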
Discussion
  What problem does BFS solve?
    - Is IS going to run BFS to deal with Byzantine failures?
    - What failures are we talking about?  Compromised servers.
    - What about compromised clients?  That's authentication and authorization.
  How can we extend the system to tolerate more than (N-1)/3 failures over
      its lifetime?
    - Detect and repair failed replicas using proactive recovery
        (see the sketch below)
      - Recover each replica periodically, no matter what
      - Makes bad nodes good again
      - Tricky stuff: an attacker might steal a compromised replica's keys
  With how many replicas will BFS work reasonably well?
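As a rough illustration of the proactive-recovery idea, here is a sketch; the Replica methods are stubs standing in for the real steps (this is not BFS's actual recovery code, and the staggering policy is an assumption):

```python
import time

class Replica:
    """Stub replica: method bodies stand in for the real recovery steps."""
    def reboot_from_clean_image(self):
        pass   # reload code from read-only media, discarding any compromise
    def refresh_keys(self):
        pass   # re-key, so keys an attacker stole earlier become useless
    def resync_state(self):
        pass   # catch up from the latest stable checkpoint

def proactive_recovery(replicas, window_s):
    """Recover replicas one at a time, staggered across the window, so the
    f-failure budget applies per recovery window rather than per lifetime."""
    while True:
        for r in replicas:
            r.reboot_from_clean_image()
            r.refresh_keys()
            r.resync_state()
            time.sleep(window_s / len(replicas))
```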