Byzantine Agreement

The problem: enemy camp. Three armies, three generals: G0, G1, G2.
They can communicate via trustworthy messengers.
They need to agree on whether to attack at dawn, but one of the
generals might be a traitor. Two loyal generals are needed for
victory; defeat if only one attacks.

Straw man: in advance the generals designate G0 as the leader.
  G0 sends an order to each of G1 and G2.
  G1 and G2 follow the order.
If G1 or G2 is the traitor, this procedure works: G0 and the
non-traitor both attack (or both retreat).
What if G0 is the traitor? G0 could tell G1 to attack and G2 to retreat.

Can G1 and G2 detect G0's treachery by comparing notes?
  G1 and G2 tell each other what G0 told them.
  If G1 and G2 got the same order, they obey it; otherwise both retreat.
If G0 is the traitor, this procedure will detect it.
But does it still work if G1 (or G2) is the traitor?
  Suppose G0 tells both to attack.
  G1 tells G2 "G0 told me to retreat."
  So G2 decides G0 is the traitor and retreats, while G0 attacks.

Why is this problem interesting?
We've talked a lot about replicas: Ficus, Bayou, DDS.
We've assumed replicas are fail-stop, i.e., a typical failure is a
crash, power failure, or network failure.
What if replica software malfunctions? Sends incorrect messages,
performs incorrect operations on data.
What if a replica is run by a malicious operator?
Byzantine Agreement is what we need if replicas can't be trusted.
  Generals == replicas.
  Orders == update operations.
  Single-copy correctness demands that all replicas perform the same
  ops, in the same order.
  So "agreement to attack" == all replicas agree on the next update.
  And traitor == a replica that wants an incorrect result:
    Hard to forge actual operations in the real world.
    A traitor might cause an inconsistent order of operations.
    A traitor might prevent any progress from being made.
  Assumption of only one traitor == withstand a single failure.
  More generally, we want to withstand some fraction of failures.

Back to the original problem. We ran into trouble trusting G1's claim
"G0 told me to retreat." We can fix this with digital signatures.
Now the basic algorithm is:
  G0 sends signed orders to G1 and G2.
  G1 and G2 exchange those orders.
If G0 is the traitor and sends different orders, G1 and G2 detect it easily.
If G1 is the traitor, what can he do?
  He cannot forge a contrary order from G0.
  BUT G1 can just not send anything to G2!
  Then G0 will attack, but G2 will wait forever (and not attack).
So we have the danger that network delays can cause incorrect results!
Suppose G2 times out after waiting for G1 for a while, decides G1 must
be a traitor, and follows G0's order. Then a traitorous G0 can cause
incorrect operation:
  Send "attack" to G2; send nothing to G1.
  G2 will time out and attack; G0 and G1 won't.
Note that we still have a problem even if G0, G1, and G2 are all
loyal: the adversary may delay or discard messages.

**** Practical byzantine fault tolerance

What is an asynchronous system?
- no relation to asynchronous programming
- def: no bound on message delay, i.e., you can't tell the difference
  between a stopped replica and a slow one

What is the state machine approach to replication?
- n replicas execute operations deterministically
- n replicas start in the same initial state

Is it okay to choose any order for operations? No.
  Say C1 completes operation O1, then C2 completes O2.
  If O2 is initiated after O1 completes, O2 should reflect O1's effects.
  But if O1 and O2 are concurrent, we don't care about the order, so
  long as everyone agrees.
This property is called "linearizability".
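As a minimal sketch of why agreeing on a single total order is enough
(assuming a toy dictionary state and made-up operation names, not
anything from the paper):

    # Sketch: replicated state machine. If every replica starts in the
    # same state and applies the same operations in the same order,
    # every replica ends in the same state -- so replication reduces
    # to agreeing on one total order of operations.

    def apply_op(state, op):
        """Deterministically apply one operation to the state."""
        kind, key, value = op          # hypothetical op encoding
        if kind == "put":
            state[key] = value
        elif kind == "delete":
            state.pop(key, None)
        return state

    def run_replica(ordered_log):
        """Each replica independently replays the agreed-upon log."""
        state = {}                     # same initial state everywhere
        for op in ordered_log:
            state = apply_op(state, op)
        return state

    log = [("put", "x", 1), ("put", "y", 2), ("delete", "x", None)]
    replicas = [run_replica(log) for _ in range(4)]   # n = 4 replicas
    assert all(r == replicas[0] for r in replicas)    # identical states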
What kinds of attack are possible?
- compromised replicas may delete messages, delay them, change them,
  send different replies to the client, etc.

Claim of the paper: the system provides safety and liveness with up to
floor((n-1)/3) faulty replicas.
- safety: linearizability
- liveness: clients eventually receive a reply

Assumptions:
1. The replicas fail independently.
   - how can we ensure that?
   - paper: N-version programming
   - practice: ping of death, or the DDS bug
2. Messages are eventually received.
   - keep retransmitting
   - network, client, and server will heal at some point

What is the minimum number of replicas we need to survive f failures?
- 3f+1. Why?
- we need to deal with partitions
- we must proceed after communicating with n - f replicas
- the f that didn't respond might be non-faulty!
- f of the n - f that did respond might be faulty
- need n - 2f > f to ensure a majority of correct responses,
  i.e., n > 3f, so n = 3f+1 suffices
- in practice:
  f = 1  => 4 replicas
  f = 2  => 7 replicas
  f = 3  => 10 replicas
  f = 10 => 31 replicas

The algorithm
- a primary-backup protocol to achieve total order
- the primary sets the order, the backups agree
- we need 2f backups to agree so that 2f+1 replicas agree in total
  (the 2f backups plus the primary)
- each replica keeps a message log (in-core)
- 3 phases: propose, agree, commit

c -----------------------------------------------
p -----------------------------------------------
b -----------------------------------------------
b -----------------------------------------------
b -----------------------------------------------

- c -> p: <REQUEST, o, t, c>
- p -> b's: "in view v, I propose sequence number n for message m"
  (signed by p) [PRE-PREPARE]
- b's -> b's: "in view v, I agree on n for message m"
  (signed by b) [PREPARE]
  * m is PREPARED at b when b has 2f agrees plus the propose
- b's -> b's: "in view v, I commit message m with sequence number n"
  (signed by b) [COMMIT]
  - if a non-faulty b commits m, then eventually m will be committed
  * when b has 2f+1 commits (including its own), the request is
    COMMITTED-LOCAL
- b sends the reply to c

An operation is (globally) COMMITTED if it is PREPARED at f+1
non-faulty replicas.
Note that COMMITTED-LOCAL at b is a stronger property than COMMITTED:
if any non-faulty replica is COMMITTED-LOCAL, then the operation is
globally COMMITTED.
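To make the quorum counting concrete, here is a rough sketch of the
PREPARED and COMMITTED-LOCAL checks at a single replica. The flat
message tuples are a simplification invented for the sketch; real
messages also carry signatures (or MACs) that must be verified. The
primary-selection rule p = v mod n follows the paper.

    # Sketch: quorum predicates at one replica, for n = 3f+1.
    # A "message" here is (phase, view, seqno, digest, sender).

    F = 1                      # tolerated faults
    N = 3 * F + 1              # total replicas

    def prepared(log, view, seqno, digest):
        """PREPARED: a matching PRE-PREPARE from the view's primary
        plus 2f matching PREPAREs from distinct replicas."""
        has_preprepare = any(
            m == ("PRE-PREPARE", view, seqno, digest, view % N)
            for m in log)
        preparers = {m[4] for m in log
                     if m[:4] == ("PREPARE", view, seqno, digest)}
        return has_preprepare and len(preparers) >= 2 * F

    def committed_local(log, view, seqno, digest):
        """COMMITTED-LOCAL: PREPARED plus 2f+1 matching COMMITs
        (possibly including this replica's own)."""
        committers = {m[4] for m in log
                      if m[:4] == ("COMMIT", view, seqno, digest)}
        return (prepared(log, view, seqno, digest)
                and len(committers) >= 2 * F + 1)

    log = [("PRE-PREPARE", 0, 5, "d1", 0),
           ("PREPARE", 0, 5, "d1", 1), ("PREPARE", 0, 5, "d1", 2)]
    assert prepared(log, 0, 5, "d1")         # 1 propose + 2f agrees
    assert not committed_local(log, 0, 5, "d1")   # no COMMITs yet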
Garbage collecting the message log
- sequence numbers are between a low-water mark h and a high-water mark H
- make periodic checkpoints
- broadcast <CHECKPOINT, n, d, i>, where d = digest of the state
- when 2f+1 signed CHECKPOINTs are received, the checkpoint is stable:
  delete all messages below the sequence number of the stable checkpoint

View changes
When a client doesn't get an answer, it broadcasts its request <o, t>
to all replicas.
If a backup notices the primary is slow/unresponsive:
- broadcast <VIEW-CHANGE, v+1, n, C, P, i>
  - C is 2f+1 signed CHECKPOINT messages for the last stable checkpoint
  - P = {P_m}, where each P_m is a signed PRE-PREPARE plus 2f PREPAREs,
    i.e., P is the set of all PREPARED messages since the checkpoint,
    plus proof that the messages really are prepared
When the primary of view v+1 sees 2f signed VIEW-CHANGE messages:
- the new primary broadcasts <NEW-VIEW, v+1, V, O>
  - V is a set of at least 2f+1 VIEW-CHANGE messages (including the
    new primary's own)
  - O is a set of PRE-PREPARE messages, for operations that:
    - come after the last stable checkpoint
    - appear in the set P of one of the VIEW-CHANGE messages
  - O also contains dummy messages to fill in sequence-number gaps
Replicas may obtain any missing state from each other (e.g., stable
checkpoint data, or a missing operation, since reissued PRE-PREPARE
messages only contain the digest of the request).

Optimizations:
1. backups send digests to the client, except one, which sends the
   whole reply
2. send replies after the prepared predicate becomes true
3. read-only requests go straight to the backups instead of through
   the primary
   - be careful: respond only after all tentative requests have been
     committed
4. avoid public-key crypto (a sketch of MAC-based authenticators
   appears at the end of these notes)
   - I probably should present the algorithm based on MACs first; it
     is easier to understand

BFS - byzantine file system
- speaks NFS v2
- in-core file system (replication to ensure stability)
- non-deterministic operations, e.g., setting the last-modified-time

Performance evaluation
- n = 4
- microbenchmarks (dominated by message cost)
- Andrew benchmark (BFS does well compared to NFS because of the lack
  of synchronous writes)

Discussion
- what problem does BFS solve?
- is IS going to run BFS to deal with byzantine failures?
- what failures are we talking about? compromised servers
- what about compromised clients? authentication and authorization
- how can we extend the system to allow for more than floor((n-1)/3)
  failures over its lifetime?
  - detect failed replicas using proactive recovery
  - recover the system periodically, no matter what; makes bad nodes
    good again
  - tricky stuff: an attacker might steal a compromised replica's keys
- with how many replicas will BFS work reasonably well?
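Referenced from optimization 4 above: a rough illustration of MAC
authenticators, not the paper's exact construction. The idea is that
each pair of replicas shares a secret key, and a sender "signs" a
message with an authenticator: a vector of one MAC per receiver. The
key-setup scheme and names below are made up for the sketch.

    import hashlib
    import hmac

    # Sketch: pairwise shared keys; keys[i][j] is shared by replicas
    # i and j. A real system would establish these via key exchange.
    N = 4
    keys = {i: {j: f"secret-{min(i, j)}-{max(i, j)}".encode()
                for j in range(N) if j != i}
            for i in range(N)}

    def authenticator(sender, message):
        """A vector of MACs, one per receiver -- much cheaper to
        compute than a public-key signature."""
        return {j: hmac.new(keys[sender][j], message,
                            hashlib.sha256).digest()
                for j in keys[sender]}

    def verify(receiver, sender, message, auth):
        """A receiver checks only its own entry of the authenticator."""
        expected = hmac.new(keys[receiver][sender], message,
                            hashlib.sha256).digest()
        return hmac.compare_digest(auth.get(receiver, b""), expected)

    msg = b"PRE-PREPARE v=0 n=5 d=..."
    a = authenticator(0, msg)
    assert verify(1, 0, msg, a) and verify(2, 0, msg, a)

The catch: unlike a signature, replica 1 cannot use an authenticator
to prove to replica 2 that replica 0 sent the message, since each MAC
is verifiable only by its key holders. That is why the MAC-based
variant needs more care during view changes.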