Sing a song of Simplex (2023)
=============================

Let's formalize "single-decree" Paxos from Monday.  N = 2f+1 nodes.

    Leader -> All:  PREPARE B
    All -> Leader:  PREPARED B lastvote = { NULL | <B_old, x_old> }
      [wait for f+1 PREPARED]
    Leader -> All:  COMMIT B x = (lastvote != NULL ? x_old : input(Leader))
    All -> Leader:  COMMITTED B
      [wait for f+1 COMMITTED]

But we assume "fail-stop": nodes follow the protocol or just stop sending messages
What if failed nodes are "Byzantine"--i.e., don't obey the protocol?
  1. Any 2 quorums must share 1 *honest* node, not just any one node
  2. Can't trust any single node (need f+1 nodes to have at least one honest)
     - Leader might propose different values to different nodes for the same B
     - Followers could send bad lastvote values

Solution 1: Set N to 3f+1 and quorum size to 2f+1
  View any set of 2f+1 nodes as (worst case) f+1 "real" + f Byzantine nodes
Solution 2: Nodes digitally sign messages, always prove what they say

Byzantine Paxos-like protocol (cf. HotStuff [Yin'19])
  Let <m> denote a signed message m
  Let QC = quorum certificate (identical message signed by 2f+1 nodes)

    Leader -> All:  QUERY B = (n, id(Leader))
    All -> Leader:  STATUS B { NULL | QC_prev }
      [wait for 2f+1 STATUS]
      Let x = value from QC_prev with highest ballot if any, else input(Leader)
    Leader -> All:  PREPARE B x { QC_prev | NULL }
    All -> Leader:  QC_new = 2f+1 signed SAFE messages
    Leader -> All:  COMMIT B QC_new
    All -> Leader:  QC_com = 2f+1 identical LOCKED messages
    Leader -> All:  QC_com

Key ideas:
  Don't send SAFE B x unless
    1) You haven't sent LOCKED, or
    2) The value you LOCKED most recently was x, or
    3) QC_prev says x was safe after the last ballot you LOCKED
       #3 differs from Paxos, where you just take the leader's word for it
  Don't send LOCKED unless a quorum says the value is SAFE
    whereas in Paxos, you just believe the leader that the value is safe to vote for

But there's an easier decomposition than Paxos for logs composed of slots
  Paxos: vote on 1) what's safe to vote on for the slot, 2) the actual value
  VR:    vote on 1) what goes in
         the slot, 2) whether to include the slot in the log
  Unresolved slot contents are fine if we don't include the slot in the log

What is voting, really?  Like a bad consensus protocol that can get stuck
  Ask a question (e.g., should we commit value x in ballot B?)
  You might get a definitive answer (quorum agrees)
    Then you can eventually prove the answer to all honest nodes
  Or you might get "maybe"
    Because maybe a quorum will agree and you just haven't heard yet
    Or maybe some nodes failed and you will never complete the quorum
  A consensus protocol needs to reach agreement through a set of questions
    without getting stuck if any single question ends up a maybe
  Useful fact: If all honest nodes vote the same way, a maybe will eventually resolve

Example maybe outcome: N=4, f=1, quorum size 3, voting on a yes/no question

    Node 1     Node 2     Node 3     Node 4
    vote yes   vote yes   ...        vote no

  Could the answer be yes?  maybe
    Node 3 could have voted yes, message delayed, still in transit
  Could the answer be no?  maybe
    Node 2 could have failed and double-voted for both yes and no
    Node 3 could have voted for no, message still in transit
  Could the answer be permanently maybe?
    If Node 3 failed, will never get a quorum for yes or no
  So maybe could turn into yes or no, or could be a permanent maybe
  So we can't assume yes or no... but we also can't keep waiting forever

Let's devise a simple protocol where the leader of slot v is Node (v mod N)
  Leader proposes a value x for slot v, with "parent" slot v'
    Means include v' and v in the log
    If v > v'+1, exclude slots (v'..v) from the log (i.e., make them noops)
    If v' = 0, then v is a genesis slot (first valid slot in log)
  Let's ask 2 questions about each log slot and issue signed votes on answers
    Q1 (YES/maybe):    Is it safe to put x in slot v with parent v'?
    Q2 (YES/NO/maybe): Did we get a good YES answer to Q1 for v in a timely way?
    [honest nodes can vote at most once per slot on each question]
  If at any point you get 2f+1 matching signed answers to Q1 or Q2...
    2f+1 signatures constitute a *certificate*; broadcast it to all nodes
    Note: threshold crypto lets you combine 2f+1 signatures into one
  If Q2 = YES, slot is *committed*: log must include slot v and all v's ancestors

What did we gain by asking a question about a question?
  If Q1 is maybe, we can get unstuck by answering NO to Q2
  If Q2 is maybe, we can get unstuck by answering YES to Q1!
  Notice what Q2 *didn't* ask: Should we include slot v in the log?
    If Q1=YES and Q2=NO, it's safe to include OR exclude v from the log
    So if Q1=YES and Q2=maybe, including slot v in the log gets us unstuck

Fill in chart during lecture:

                                      Q2
            +-----------------+-----------------+-----------------
            |      maybe      |       YES       |       NO
   +--------+-----------------+-----------------+-----------------
   | maybe  | wait            | Q1 will be yes  | exclude
 Q1+--------+-----------------+-----------------+-----------------
   | YES    | OK as parent    | commit, must    | OK as parent,
   |        |                 | be next parent  | OK to exclude

Can we end up with Q1=maybe and Q2=maybe?
  Recall if everyone votes NO on Q2, can't get permanently stuck at maybe
    To get stuck, some honest node must vote YES to split the vote
  But if *any* honest node voted YES on Q2, it had a Q1 certificate
    So it will broadcast that Q1 certificate to all nodes
    And all nodes will eventually agree that Q1=YES
  So if Q1 = Q2 = maybe, just keep waiting for at least one answer
    Until 1 honest node has a Q1 certificate, all honest Q2 votes (if any) are NO
      So Q2 cannot (yet) be a permanent maybe
    Once an honest node has a Q1 cert, all will eventually learn Q1=YES
      So now we don't care whether or not Q2 is a permanent maybe

What are the implications of Q1 = YES, Q2 = maybe?
  Means it would be safe to commit v, but v isn't actually committed yet
    Because Q2 could turn from maybe into NO
    And if f+1 honest nodes first see Q1=maybe, Q2=NO, they could exclude v
  But since v is safe, it can be the parent of a future slot

When is it okay for a node to vote YES in Q1 for (v, v', x)?
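The conditions the notes list next can be collected into one predicate.  Here is a sketch in Python; the state fields (`q1_voted`, `q1_certs`, `q2_no_certs`) and the `is_valid` callback are assumed names for this illustration, not part of the protocol spec:

```python
from types import SimpleNamespace

def can_vote_yes_q1(v, v_parent, x, state, is_valid):
    """Sketch: may this node vote YES on Q1 for (v, v_parent, x)?"""
    if v in state.q1_voted:                 # at most one Q1 vote per slot
        return False
    if v_parent != 0 and v_parent not in state.q1_certs:
        return False                        # parent must be proven safe
    for s in range(v_parent + 1, v):        # vacuous if v == v_parent + 1
        if s not in state.q2_no_certs:      # skipped slots need Q2 = NO certs
            return False
    return is_valid(x, v_parent)            # syntactic/semantic validity of x

# Example: slot 3 with parent 1, where slot 2 has a Q2 = NO certificate.
state = SimpleNamespace(q1_voted=set(), q1_certs={1}, q2_no_certs={2})
```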
  Must not have previously voted for Q1 in slot v
  Needs a Q1 certificate for v' proving v' is safe
  Needs a definitive NO on Q2 for all slots in (v'..v) [vacuous if v = v'+1]
    Otherwise, risk Q2 maybe becoming YES for some skipped slot
  Also, x must be syntactically & semantically valid following block v'

Is this protocol safe?
  Validity:
    Honest leader will propose its actual input x
    Honest nodes will only vote for the leader's x
    So can't get 2f+1 Q1 votes to produce a certificate for x' != x
    Validity with Byzantine leaders not really meaningful
      But protocol not trivial since will sometimes have honest leaders
  Agreement:
    Can two honest nodes get two different (v',x) values in the same slot v?
      No, because two Q1 quorums for the same slot share at least one honest node
        An honest node can only vote once on Q1
    Can two honest nodes disagree about whether to include slot v in the log?
      From quorum intersection, cannot get both YES and NO answers for Q2
      But with Q1=YES, Q2=NO, v might or might not be included in the log
      Suppose: a node commits v_y with v as an ancestor
               another node commits v_n > v without v as an ancestor
      Contradiction if v_n > v_y
        - No Q2 NO cert for v_y means v_n can't exclude v_y from its ancestry
        - But v is an ancestor of v_y, so it would be an ancestor of v_n
      But also contradiction if v_y > v_n
        - v_y would need v_n as an ancestor
        - But v_y > v_n > v means v couldn't be an ancestor of v_y
      Hence nodes cannot disagree about whether to include a slot in the log

Is this protocol live during periods of network synchrony?
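Backing up to the agreement argument: the quorum-intersection arithmetic it rests on can be sanity-checked for any f.  Two quorums of size 2f+1 among N = 3f+1 nodes overlap in at least 2(2f+1) - (3f+1) = f+1 nodes, so at least one node in the overlap is honest:

```python
# Minimum possible overlap of two quorums of size q among n nodes is 2q - n.
for f in range(1, 1000):
    n, q = 3 * f + 1, 2 * f + 1
    min_overlap = 2 * q - n
    assert min_overlap == f + 1   # more than f, so at least one honest node
```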
  With an honest leader, will get Q1=yes then Q2=yes with bounded delay
  A dishonest leader might solicit votes on divergent x or v' in Q1
    But when Q1 != yes, will eventually get Q2=no after timeout and delay

By now you may have guessed this is a simplified non-dispersed Simplex
  SuppShare is like a YES vote on Q1
  CommitShare is like a YES vote on Q2
  ComplaintShare is like a NO vote on Q2
  Using threshold crypto, shares are combined into a single-signature certificate

Does our simplified protocol make good use of network bandwidth if x is large?
  No, the burden is entirely on the leader to push out 3f copies of x
  Meanwhile, 2f honest replicas are not using their upstream bandwidth

Background: erasure codes
  A (k,d)-erasure code divides a message M into k fragments f_1, ..., f_k
    Each fragment is 1/d the size of M
    Any d fragments allow reconstruction of M
  Example: polynomial interpolation
    Break M up into d fragments m_1, ..., m_d
    Treat each m_i as an element of a finite field
      If M is large, just encode it in pieces the size of field elements
    Interpolate a degree d-1 polynomial f such that f(1) = m_1, ..., f(d) = m_d
    Let f_i = f(i).  Any d points suffice to interpolate f

Let's have the leader encode x with a (3f+1, f+1)-erasure code
  For Q1, leader sends to node i: v, v', f_i [fragment], H(x)
    In Q1, vote on H(x)--a small collision-resistant hash of x
  Honest nodes broadcast their fragments to each other
    Given 2f+1 Q1 YES votes, at least f+1 will be honest, so forward fragments

What's the problem here?
  A malicious leader could distribute fragments that don't correspond to H(x)
  So nodes need to reconstruct x before voting on Q1
    That's going to add unfortunate latency
  Also need to help other honest nodes that might be missing x
    E.g., maybe only f+1 valid shares were fully distributed, with the rest
    incompletely distributed or held by faulty nodes

What can we do?
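As a concrete sketch of the polynomial-interpolation erasure code above: encode d data words as points on a degree d-1 polynomial over a prime field, then recover them from any d fragments.  The field choice and word size here are illustrative assumptions, not what a real deployment would use:

```python
# Sketch of a (k, d)-erasure code via Lagrange interpolation over GF(P).
P = 2**31 - 1  # a Mersenne prime; data words must be < P

def interpolate_at(points, x):
    """Evaluate at x the unique degree <= d-1 polynomial through d points."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat's little theorem)
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

def encode(data_words, k):
    """data_words: d field elements m_1..m_d; returns k fragments (i, f(i))."""
    points = list(enumerate(data_words, start=1))  # f(1)=m_1, ..., f(d)=m_d
    return [(i, interpolate_at(points, i)) for i in range(1, k + 1)]

def reconstruct(fragments, d):
    """Any d fragments (i, f(i)) recover m_1..m_d = f(1)..f(d)."""
    points = fragments[:d]
    return [interpolate_at(points, i) for i in range(1, d + 1)]
```

Note the code is systematic: the first d fragments are the data words themselves.  A real system would use a Reed-Solomon code over a small binary field rather than interpolating word by word.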
Use another tool: Merkle trees
  Treat fragment hashes H(f_i) as leaves of a binary tree
  Each parent node is the hash of both child nodes
  Root hash r lets you verify any f_i with a proof \pi_i of size O(log(3f+1))
  Don't vote on H(x); vote on \tau = (r, \beta) [\beta is the length of value x]
  Leader sends to node i: \tau, f_i, \pi_i
  Node i votes on Q1 with: sign(v, v', \tau), f_i, \pi_i
  Q1 certificates only include v, v', and \tau, no fragments or proofs
    (Q2 certificates already only included the slot number)
  But now we need some extra checks when voting
    Ignore received Q1 votes if \pi_i doesn't prove that f_i belongs to \tau
    Must verify \tau before voting YES on Q2
      Re-encode x and verify that the leader committed to all the fragments
      If \tau is invalid, that means a faulty leader, so vote NO in Q2

Say we want a stable leader
  Allows pipelining of transactions that depend on each other
  Naive approach: Same leader for 1000-slot *epochs*
    i.e., Leader(v) = (v/1000) mod (3f+1)--what goes wrong?
    A failed leader incurs 1,000 timeouts before switching to another node
  How do we fix this?
    If you get a Q2 NO certificate, skip directly to the next epoch
      v = ((v/1000)+1)*1000
    Now only one timeout per bad leader, 1000 transactions per good leader
      assuming a period of synchrony
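To make the Merkle commitment concrete, here is a minimal sketch.  SHA-256 and duplicate-last-node padding are assumptions for illustration; the lecture doesn't pin down the hash function or the padding rule:

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """leaves: list of fragment hashes H(f_i); returns the root hash r."""
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:                 # odd level: duplicate the last node
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    """Sibling hashes from leaf idx up to the root: the proof \\pi_i."""
    proof, level = [], leaves[:]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[idx ^ 1], idx % 2))  # (sibling, am-I-right-child)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def merkle_verify(leaf_hash, proof, root):
    """Check that leaf_hash is committed under root, given its proof."""
    cur = leaf_hash
    for sibling, is_right in proof:
        cur = h(sibling + cur) if is_right else h(cur + sibling)
    return cur == root
```

A node would run merkle_verify on a received (f_i, \pi_i) pair against \tau's root before counting the accompanying Q1 vote.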