Sing a song of Simplex (2023)
=============================

Let's formalize "single-decree" Paxos from Monday.  N = 2f+1 nodes.

    Leader -> All:  PREPARE B
    All -> Leader:  PREPARED B lastvote = { NULL | <B_old, x_old> }
      [wait for f+1 PREPARED]
    Leader -> All:  COMMIT B x = (lastvote != NULL ? x_old : input(Leader))
    All -> Leader:  COMMITTED B
      [wait for f+1 COMMITTED]

But we assume "fail-stop": nodes follow the protocol or just stop sending messages
What if failed nodes are "Byzantine"--i.e., don't obey the protocol?
  1. Any 2 quorums must share 1 *honest* node, not just any one node
  2. Can't trust any single node (need f+1 nodes to have at least one honest)
     - Leader might propose different values to different nodes for the same B
     - Followers could send bad lastvote values

Solution 1: Set N to 3f+1 and quorum size to 2f+1
  View any set of 2f+1 nodes as (worst case) f+1 "real" + f Byzantine nodes
Solution 2: Nodes digitally sign messages, always prove what they say

Byzantine Paxos-like protocol (cf. HotStuff [Yin'19])
  Let <m> denote a signed message m
  Let QC = quorum certificate (identical message signed by 2f+1 nodes)

    Leader -> All:  QUERY B = (n, id(Leader))
    All -> Leader:  STATUS B { NULL | QC_prev }
      [wait for 2f+1 STATUS]
      Let x = value from QC_prev with highest ballot if any, else input(Leader)
    Leader -> All:  PREPARE B x { QC_prev | NULL }
    All -> Leader:  QC_new = 2f+1 signed SAFE messages
    Leader -> All:  COMMIT B QC_new
    All -> Leader:  QC_com = 2f+1 identical LOCKED messages
    Leader -> All:  QC_com

Key ideas:
  Don't send SAFE B x unless
    1) You haven't sent LOCKED, or
    2) The value you LOCKED most recently was x, or
    3) QC_prev says x was safe after the last ballot you LOCKED
       #3 differs from Paxos, where you just take the leader's word for it
  Don't send LOCKED unless a quorum says the value is SAFE
    whereas in Paxos, you just believe the leader that the value is safe to vote for

But there's an easier decomposition than Paxos for logs composed of slots
  Paxos: vote on 1) what's safe to vote on for the slot, 2) the actual value
  VR:    vote on 1) what goes in
         the slot, 2) whether to include the slot in the log
  Unresolved slot contents are fine if we don't include the slot in the log

What is voting, really?  Like a bad consensus protocol that can get stuck
  Ask a question (e.g., should we commit value x in ballot B?)
  You might get a definitive answer (quorum agrees)
    Then you can eventually prove the answer to all honest nodes
  Or you might get "maybe"
    Because maybe a quorum will agree and you just haven't heard yet
    Or maybe some nodes failed and you will never complete the quorum
  A consensus protocol needs to reach agreement through a set of questions
    without getting stuck if any single question ends up a maybe
  Useful fact: If all honest nodes vote the same way, a maybe will eventually resolve

Example maybe outcome: N=4, f=1, quorum size 3, voting on a yes/no question

    Node 1     Node 2     Node 3     Node 4
    vote yes   vote yes   ...        vote no

  Could the answer be yes?  maybe
    Node 3 could have voted yes, message delayed, still in transit
  Could the answer be no?  maybe
    Node 2 could have failed and double-voted for both yes and no
    Node 3 could have voted for no, message still in transit
  Could the answer be permanently maybe?
    If Node 3 failed, will never get a quorum for yes or no
  So maybe could turn into yes or no, or could be a permanent maybe
  So we can't assume yes or no... but we also can't keep waiting forever

Let's devise a simple protocol where the leader of slot v is Node (v mod N)
  Leader proposes a value x for slot v, with "parent" slot v'
    Means include v' and v in the log
    If v > v'+1, exclude slots (v'..v) from the log (i.e., make them noops)
    If v' = 0, then v is a genesis slot (first valid slot in log)
  Let's ask 2 questions about each log slot and issue signed votes on answers
    Q1 (YES/maybe):    Is it safe to put x in slot v with parent v'?
    Q2 (YES/NO/maybe): Did we get a good YES answer to Q1 for v in a timely way?
    [honest nodes can vote at most once per slot on each question]
  If at any point you get 2f+1 matching signed answers to Q1 or Q2...
    2f+1 signatures constitute a *certificate*; broadcast it to all nodes
    Note: threshold crypto lets you combine 2f+1 signatures into one
  If Q2 = YES, slot is *committed*: log must include slot v and all v's ancestors

What did we gain by asking a question about a question?
  If Q1 is maybe, we can get unstuck by answering NO to Q2
  If Q2 is maybe, we can get unstuck by answering YES to Q1!
  Notice what Q2 *didn't* ask: Should we include slot v in the log?
    If Q1=YES and Q2=NO, it's safe to include OR exclude v from the log
    So if Q1=YES and Q2=maybe, including slot v in the log gets us unstuck

Fill in chart during lecture:

                                      Q2
            +-----------------+-----------------+-----------------
            |      maybe      |       YES       |       NO
   +--------+-----------------+-----------------+-----------------
   | maybe  | wait            | Q1 will be yes  | exclude
 Q1+--------+-----------------+-----------------+-----------------
   | YES    | OK as parent    | commit, must    | OK as parent,
   |        |                 | be next parent  | OK to exclude

Can we end up with Q1=maybe and Q2=maybe?
  Recall if everyone votes NO on Q2, can't get permanently stuck at maybe
    To get stuck, some honest node must vote YES to split the vote
  But if *any* honest node voted YES on Q2, it had a Q1 certificate
    So it will broadcast that Q1 certificate to all nodes
    And all nodes will eventually agree that Q1=YES
  So if Q1 = Q2 = maybe, just keep waiting for at least one answer
    Until 1 honest node has a Q1 certificate, all honest Q2 votes (if any) are NO
      So Q2 cannot (yet) be a permanent maybe
    Once an honest node has a Q1 cert, all will eventually learn Q1=YES
      So now we don't care whether or not Q2 is a permanent maybe

What are the implications of Q1 = YES, Q2 = maybe?
  Means it would be safe to commit v, but v isn't actually committed yet
    Because Q2 could turn from maybe into NO
    And if f+1 honest nodes first see Q1=maybe, Q2=NO, they could exclude v
  But since v is safe, it can be the parent of a future slot

When is it okay for a node to vote YES in Q1 for (v, v', x)?
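The conditions the notes list next can be collected into one predicate.  Here is a sketch in Python; the state fields (`q1_voted`, `q1_certs`, `q2_no_certs`) and the `is_valid` callback are assumed names for this illustration, not part of the protocol spec:

```python
from types import SimpleNamespace

def can_vote_yes_q1(v, v_parent, x, state, is_valid):
    """Sketch: may this node vote YES on Q1 for (v, v_parent, x)?"""
    if v in state.q1_voted:                 # at most one Q1 vote per slot
        return False
    if v_parent != 0 and v_parent not in state.q1_certs:
        return False                        # parent must be proven safe
    for s in range(v_parent + 1, v):        # vacuous if v == v_parent + 1
        if s not in state.q2_no_certs:      # skipped slots need Q2 = NO certs
            return False
    return is_valid(x, v_parent)            # syntactic/semantic validity of x

# Example: slot 3 with parent 1, where slot 2 has a Q2 = NO certificate.
state = SimpleNamespace(q1_voted=set(), q1_certs={1}, q2_no_certs={2})
```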
  Must not have previously voted for Q1 in slot v
  Needs a Q1 certificate for v' proving v' is safe
  Needs a definitive NO on Q2 for all slots in (v'..v) [vacuous if v = v'+1]
    Otherwise, risk Q2 maybe becoming YES for some skipped slot
  Also, x must be syntactically & semantically valid following block v'

Is this protocol safe?
  Validity:
    Honest leader will propose its actual input x
    Honest nodes will only vote for the leader's x
    So can't get 2f+1 Q1 votes to produce a certificate for x' != x
    Validity with Byzantine leaders not really meaningful
      But protocol not trivial since will sometimes have honest leaders
  Agreement:
    Can two honest nodes get two different (v',x) values in the same slot v?
      No, because two Q1 quorums for the same slot share at least one honest node
        An honest node can only vote once on Q1
    Can two honest nodes disagree about whether to include slot v in the log?
      From quorum intersection, cannot get both YES and NO answers for Q2
      But with Q1=YES, Q2=NO, v might or might not be included in the log
      Suppose: a node commits v_y with v as an ancestor
               another node commits v_n > v without v as an ancestor
      Contradiction if v_n > v_y
        - No Q2 NO cert for v_y means v_n can't exclude v_y from its ancestry
        - But v is an ancestor of v_y, so it would be an ancestor of v_n
      But also contradiction if v_y > v_n
        - v_y would need v_n as an ancestor
        - But v_y > v_n > v means v couldn't be an ancestor of v_y
      Hence nodes cannot disagree about whether to include a slot in the log

Is this protocol live during periods of network synchrony?
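Backing up to the agreement argument: the quorum-intersection arithmetic it rests on can be sanity-checked for any f.  Two quorums of size 2f+1 among N = 3f+1 nodes overlap in at least 2(2f+1) - (3f+1) = f+1 nodes, so at least one node in the overlap is honest:

```python
# Minimum possible overlap of two quorums of size q among n nodes is 2q - n.
for f in range(1, 1000):
    n, q = 3 * f + 1, 2 * f + 1
    min_overlap = 2 * q - n
    assert min_overlap == f + 1   # more than f, so at least one honest node
```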
  With an honest leader, will get Q1=yes then Q2=yes with bounded delay
  A dishonest leader might solicit votes on divergent x or v' in Q1
    But when Q1 != yes, will eventually get Q2=no after timeout and delay

By now you may have guessed this is a simplified non-dispersed Simplex
  SuppShare is like a YES vote on Q1
  CommitShare is like a YES vote on Q2
  ComplaintShare is like a NO vote on Q2
  Using threshold crypto, shares are combined into a single-signature certificate

Does our simplified protocol make good use of network bandwidth if x is large?
  No, the burden is entirely on the leader to push out 3f copies of x
  Meanwhile, 2f honest replicas are not using their upstream bandwidth

Background: erasure codes
  A (k,d)-erasure code divides a message M into k fragments f_1, ..., f_k
    Each fragment is 1/d the size of M
    Any d fragments allow reconstruction of M
  Example: polynomial interpolation
    Break M up into d fragments m_1, ..., m_d
    Treat each m_i as an element of a finite field
      If M is large, just encode it in pieces the size of field elements
    Interpolate a degree d-1 polynomial f such that f(1) = m_1, ..., f(d) = m_d
    Let f_i = f(i).  Any d points suffice to interpolate f

Let's have the leader encode x with a (3f+1, f+1)-erasure code
  For Q1, leader sends to node i: v, v', f_i [fragment], H(x)
    In Q1, vote on H(x)--a small collision-resistant hash of x
  Honest nodes broadcast their fragments to each other
    Given 2f+1 Q1 YES votes, at least f+1 will be honest, so forward fragments

What's the problem here?
  A malicious leader could distribute fragments that don't correspond to H(x)
  So nodes need to reconstruct x before voting on Q1
    That's going to add unfortunate latency
  Also need to help other honest nodes that might be missing x
    E.g., maybe only f+1 valid shares were fully distributed, with the rest
    incompletely distributed or held by faulty nodes

What can we do?
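As a concrete sketch of the polynomial-interpolation erasure code above: encode d data words as points on a degree d-1 polynomial over a prime field, then recover them from any d fragments.  The field choice and word size here are illustrative assumptions, not what a real deployment would use:

```python
# Sketch of a (k, d)-erasure code via Lagrange interpolation over GF(P).
P = 2**31 - 1  # a Mersenne prime; data words must be < P

def interpolate_at(points, x):
    """Evaluate at x the unique degree <= d-1 polynomial through d points."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat's little theorem)
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

def encode(data_words, k):
    """data_words: d field elements m_1..m_d; returns k fragments (i, f(i))."""
    points = list(enumerate(data_words, start=1))  # f(1)=m_1, ..., f(d)=m_d
    return [(i, interpolate_at(points, i)) for i in range(1, k + 1)]

def reconstruct(fragments, d):
    """Any d fragments (i, f(i)) recover m_1..m_d = f(1)..f(d)."""
    points = fragments[:d]
    return [interpolate_at(points, i) for i in range(1, d + 1)]
```

Note the code is systematic: the first d fragments are the data words themselves.  A real system would use a Reed-Solomon code over a small binary field rather than interpolating word by word.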
Use another tool: Merkle trees
  Treat fragment hashes H(f_i) as leaves of a binary tree
  Each parent node is the hash of both child nodes
  Root hash r lets you verify any f_i with a proof \pi_i of size O(log(3f+1))
  Don't vote on H(x); vote on \tau = (r, \beta) [\beta is the length of value x]
  Leader sends to node i: \tau, f_i, \pi_i
  Node i votes on Q1 with: sign(v, v', \tau), f_i, \pi_i
  Q1 certificates only include v, v', and \tau, no fragments or proofs
    (Q2 certificates already only included the slot number)
  But now we need some extra checks when voting
    Ignore received Q1 votes if \pi_i doesn't prove that f_i belongs to \tau
    Must verify \tau before voting YES on Q2
      Re-encode x and verify that the leader committed to all the fragments
      If \tau is invalid, that means a faulty leader, so vote NO in Q2

Say we want a stable leader
  Allows pipelining of transactions that depend on each other
  Naive approach: Same leader for 1000-slot *epochs*
    i.e., Leader(v) = (v/1000) mod (3f+1)--what goes wrong?
    A failed leader incurs 1,000 timeouts before switching to another node
  How do we fix this?
    If you get a Q2 NO certificate, skip directly to the next epoch
      v = ((v/1000)+1)*1000
    Now only one timeout per bad leader, 1000 transactions per good leader
      assuming a period of synchrony
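To make the Merkle commitment concrete, here is a minimal sketch.  SHA-256 and duplicate-last-node padding are assumptions for illustration; the lecture doesn't pin down the hash function or the padding rule:

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """leaves: list of fragment hashes H(f_i); returns the root hash r."""
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:                 # odd level: duplicate the last node
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    """Sibling hashes from leaf idx up to the root: the proof \\pi_i."""
    proof, level = [], leaves[:]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[idx ^ 1], idx % 2))  # (sibling, am-I-right-child)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def merkle_verify(leaf_hash, proof, root):
    """Check that leaf_hash is committed under root, given its proof."""
    cur = leaf_hash
    for sibling, is_right in proof:
        cur = h(sibling + cur) if is_right else h(cur + sibling)
    return cur == root
```

A node would run merkle_verify on a received (f_i, \pi_i) pair against \tau's root before counting the accompanying Q1 vote.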