Practical Byzantine Fault Tolerance
===================================

What is the minimum number of replicas we need to recover from f failures?
  3f+1.  Why?
  f cohorts may fail and stop responding to network messages
    So must be able to make progress after 2f+1 replies
  But if you get 2f+1 replies, they might just be from f slow cohorts
    Which means f of the 2f+1 that replied could be malicious
  But with 3f+1 replicas, a majority (f+1) of any 2f+1 replies is
    guaranteed to come from honest cohorts

So how does this protocol work?
  Number replica cohorts 1, 2, 3, ..., 3f+1
  Number requests with consecutive sequence numbers (not viewstamps)
  System goes through a series of views
    In view v, replica number v mod (3f+1) is designated the primary
    Primary is responsible for selecting the order of operations
      Assigns an increasing sequence number to each operation
  In normal-case operation, use a two-round protocol for request r:
    Round 1 (pre-prepare, prepare) goal:
      Ensure at least f+1 honest replicas agree that
        if request r executes in view v, it will execute with seqno n
    Round 2 (commit) goal:
      Ensure at least f+1 honest replicas agree that
        request r has executed in view v with seqno n
  Note analogy with three-phase commit:
    In first round, cohort agrees that operation can execute
      (in this case, agrees not to accept any other operation
       with seqno n in view v)
    In second round, learns that other cohorts have agreed on operation
    In 3PC, don't reply to client until everyone knows that everyone
      has agreed to execute the operation
    In BFT, don't reply to client until f+1 honest cohorts know that
      f+1 honest cohorts have agreed to execute the operation
    Note that hearing from f+1 honest cohorts means hearing from 2f+1
      cohorts total (at most f of which you assume may be faulty)

Protocol for normal-case operation
  Let c be the client, r_i replica i, p the primary, b_i backup i,
    R the set of all replicas
  c -> p:   m = {REQUEST, o, t, c}_Kc
  p -> R:   {PRE-PREPARE, v, n, d}_Kp, m     (note d = H(m))
  b_i -> R: {PREPARE, v, n, d, i}_Ki
  Replica r_i now waits for the PRE-PREPARE + 2f matching PREPAREs
    Puts these messages in its log
    Then we say prepared (m, v, n, i) is TRUE
  Note: if prepared (m, v, n, i) is TRUE for honest replica r_i,
    then prepared (m', v, n, j) is FALSE for any m' != m and any
    honest r_j
    So no other operation can execute with view v, sequence number n
  Are we done?  Just reply to client?
    No--think of the 3PC analogy.  Just because some other m' won't
      execute at (v, n) doesn't mean m will
    Suppose r_i is compromised right after prepared (m, v, n, i)
    Suppose no other replica received r_i's PREPARE message
    Suppose f replicas are slow and never even received the PRE-PREPARE
    Then no other honest replica knows the request prepared!
      In particular, if p fails, the request might never get executed!
  So we say the operation doesn't execute until
      prepared (m, v, n, i) is TRUE for f+1 non-faulty replicas r_i
    We say committed (m, v, n) is TRUE when this property holds
  So how does a replica *know* committed (m, v, n) holds?
    Add one more message:
    r_i -> R: {COMMIT, v, n, d, i}_Ki
    Replica r_i waits for 2f+1 matching COMMITs (including its own)
  committed-local (m, v, n, i) is TRUE when:
    prepared (m, v, n, i) is TRUE, and
    r_i has 2f+1 matching COMMITs in its log
  Note: if committed-local (m, v, n, i) is TRUE for any non-faulty r_i,
    then committed (m, v, n) is TRUE
    r_i knows when committed-local is TRUE
    So committed-local is a replica's way of knowing committed is TRUE
  r_i replies to the client when committed-local (m, v, n, i) is TRUE
  Client waits for f+1 matching replies, then returns the result to
    the application
  Why f+1 and not 2f+1?
    With f+1 matching replies, at least one is from a non-faulty r_i
    So client knows committed-local (m, v, n, i) held at that replica
    Which in turn implies committed (m, v, n)
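To make the quorum bookkeeping concrete, here is a minimal sketch in
Python (class and field names are invented here, not from the paper) of
a replica's log and the two predicates; it checks only message counts,
ignoring signatures, retransmission, and the network:

    # Quorum arithmetic for n = 3f+1 replicas: any two quorums of size
    # 2f+1 intersect in at least f+1 replicas, hence at least one of
    # them is honest.
    def quorum(f):
        return 2 * f + 1

    class ReplicaLog:
        def __init__(self, f):
            self.f = f
            self.pre_prepare = {}  # (v, n) -> digest d from primary
            self.prepares = {}     # (v, n, d) -> set of backup ids
            self.commits = {}      # (v, n, d) -> set of replica ids

        def on_pre_prepare(self, v, n, d):
            # Accept at most one digest per (v, n); a conflicting
            # PRE-PREPARE for the same slot is ignored.
            self.pre_prepare.setdefault((v, n), d)

        def on_prepare(self, v, n, d, i):
            self.prepares.setdefault((v, n, d), set()).add(i)

        def on_commit(self, v, n, d, i):
            self.commits.setdefault((v, n, d), set()).add(i)

        def prepared(self, v, n, d):
            # The PRE-PREPARE plus 2f matching PREPAREs
            return (self.pre_prepare.get((v, n)) == d and
                    len(self.prepares.get((v, n, d), set())) >= 2 * self.f)

        def committed_local(self, v, n, d):
            # prepared, plus 2f+1 matching COMMITs (counting our own)
            return (self.prepared(v, n, d) and
                    len(self.commits.get((v, n, d), set())) >= quorum(self.f))

E.g., with f = 1 (four replicas), prepared needs the PRE-PREPARE plus
two PREPAREs, and committed-local additionally needs three COMMITs.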
Note the tentative-reply optimization:
  r_i can send a tentative reply to the client after prepared (m, v, n, i)
  Client can accept the result after 2f+1 matching tentative replies

Garbage collecting the message log
  Make periodic checkpoints
  Broadcast {CHECKPOINT, n, d, i}_Ki, where d = digest of state
  When 2f+1 signed CHECKPOINTs are received, the checkpoint is stable
  Restrict sequence numbers to be between h and H
    h = sequence number of last stable checkpoint
    H = h + k (e.g., k might be twice a checkpoint interval of 100)
  Delete all messages below sequence number of stable checkpoint

View changes
  When a client doesn't get an answer, it broadcasts the request to
    all replicas
  If a backup notices the primary is slow/unresponsive:
    - broadcast {VIEW-CHANGE, v+1, n, C, P, i}_Ki
      n is the seqno of the last stable checkpoint
      C is 2f+1 signed CHECKPOINT messages for last stable checkpoint
      P = {P_m}, where each P_m is a signed PRE-PREPARE + 2f PREPAREs
        i.e., P is the set of all messages PREPAREd since the
        checkpoint, plus proof that the messages really are prepared
  When the primary of view v+1 sees 2f signed VIEW-CHANGE messages
    from other replicas:
    - new primary broadcasts {NEW-VIEW, v+1, V, O}_Kp
      V is a set of at least 2f+1 VIEW-CHANGE messages
        (including the new primary's own)
      O is a set of PRE-PREPARE messages for operations that are:
        - after the last stable checkpoint, and
        - in the set P of one of the VIEW-CHANGE messages
      O also contains dummy messages to fill sequence-number gaps
  Replicas may obtain any missing state from each other
    (e.g., stable checkpoint data, or a missing operation, since the
     reissued PRE-PREPARE messages contain only the request's digest)
  What happens if the primary creates an incorrect O in NEW-VIEW?
    E.g., it might send null requests for operations that prepared
    Other replicas can compute O from V (sketched below), and can
      reject the NEW-VIEW message
  What happens if the primary sends different V's to different backups?
    Still okay, because any committed operation appears in 2f+1
      VIEW-CHANGE messages, of which f+1 are from honest replicas,
      so at least one member of V will contain the operation
    So the new primary cannot cause committed operations to be dropped
      Only operations whose answer the client has not yet seen
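A rough sketch, again in Python with invented names, of how the new
primary builds O from V -- and how any backup can recompute it as a
check.  It just applies the two rules above, and for simplicity assumes
at most one digest per sequence number (the paper's rule picks the
prepared certificate from the highest view):

    def compute_O(V, h):
        # V: VIEW-CHANGE messages, each carrying a set vc["P"] of
        #    (n, d) pairs proven prepared at that replica
        # h: seqno of the latest stable checkpoint proven in V
        # Returns a map n -> digest, None marking a dummy null request.
        O, max_n = {}, h
        for vc in V:
            for (n, d) in vc["P"]:
                if n > h:              # only ops after the checkpoint
                    O[n] = d           # reissue PRE-PREPARE for digest d
                    max_n = max(max_n, n)
        for n in range(h + 1, max_n + 1):
            O.setdefault(n, None)      # fill seqno gaps with nulls
        return O

Because O is a deterministic function of V, a backup receiving
{NEW-VIEW, v+1, V, O}_Kp can recompute O and reject the message if the
primary's O disagrees, e.g., if the primary substituted null requests
for operations that actually prepared.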
Discussion
  What problem does BFS solve?
    - Is IS going to run BFS to deal with Byzantine failures?
    - What failures are we talking about?  Compromised servers
    - What about compromised clients?  Authentication and authorization
  How can we extend the system to allow more than (n-1)/3 failures over
    its lifetime?
    - Detect failed replicas using proactive recovery
    - Recover the system periodically, no matter what
      - Makes bad nodes good again
    - Tricky stuff
      - An attacker might steal a compromised replica's keys
  With how many replicas will BFS work reasonably well?

HQ Replication
==============

So far we have been talking about state machines implementing entire
  services
Another way to view the world is in terms of objects
  What is the difference?
    With state machines, any operation's result might depend on all
      previous operations
    With objects, two operations on different objects are independent
  Much less likely to have contention for objects than for state machines
    E.g., a credit card or bank account object vs. an entire bank
If there's no contention, let's have the client drive the entire protocol

First, let's assume no contention and no failures:
  Client -> Servers: {WRITE-1, cid, oid, op#, op}_Kc
    cid - client ID
    oid - object ID
    op# - operation number, local to client
    op  - operation to perform on object
  Servers -> Client: {WRITE-1-OK, grantTS, currentC}_Kr
    grantTS - {cid, oid, op#, h, vs, ts, rid}
      h   - Hash(op)
      vs  - viewstamp from BFS
      ts  - timestamp at which operation should be executed
      rid - replica ID
    currentC - 2f+1 signed grantTS structures from previous operation
  Client -> Servers: {WRITE-2, writeC}_Kc
    writeC - 2f+1 matching grantTS structures, proving the client has
      the right to execute operation op at timestamp ts
  Servers -> Client: {WRITE-2-ANS, result, currentC, rid}_Kr
  After 2f+1 matching replies, we are done
Note reads are simpler:
  C -> S: {READ, cid, oid, op, nonce}_Kc
  S -> C: {READ-ANS, result, nonce, currentC, rid}_Kr
  After 2f+1 matching replies, client returns data to application

What might go wrong even without malicious replicas?
  Client might fail or be slow between WRITE-1 and WRITE-2.  What happens?
    Replicas allow only one pending operation per object
    After a second WRITE-1, reply {WRITE-1-REFUSED, grantTS, cid, op#}_Kr
      grantTS  - the grant sent to the first client
      cid, op# - ID and operation number of the second client
        (to prevent replays)
    At this point, the second client can just finish the first client's
      operation
      The WriteBackWrite and WriteBackRead operations accomplish this
  Two clients could issue simultaneous writes and get conflicting
    certificates
    E.g., f+1 honest replicas granted ts to one client, f+1 to another
    How to fix?  (see the sketch below)
      Client assembles 2f+1 non-matching grants; call this conflictC
      Broadcasts {RESOLVE, conflictC, w1}_Kc
        (w1 is the WRITE-1 that detected the conflict)
      Now replicas invoke the BFT protocol to agree on the order of
        operations, then execute them
      Note that the BFT protocol increases vs, which avoids conflicting
        certificates
What can a faulty client do?
  Can force replicas into conflicting states, which must be RESOLVEd
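A small sketch in the same vein (Python; names invented here) of the
client-side rule after collecting 2f+1 WRITE-1 replies: all-matching
grants form writeC for the WRITE-2, anything else becomes conflictC for
a RESOLVE.  It assumes the rid field has been stripped from each grant
so that grants from different replicas can compare equal:

    from collections import Counter

    def classify_grants(grants, f):
        # grants: 2f+1 grantTS tuples (cid, oid, op#, h, vs, ts)
        # from WRITE-1-OK replies.  Returns the client's next step.
        assert len(grants) == 2 * f + 1
        _, count = Counter(grants).most_common(1)[0]
        if count == 2 * f + 1:
            return ("WRITE-2", grants)   # writeC: all grants match
        return ("RESOLVE", grants)       # conflictC: conflicting grants

In the contention scenario above (f+1 honest replicas grant ts to one
client, f+1 to another), neither client can collect 2f+1 matching
grants, so at least one assembles conflictC and triggers the RESOLVE
path.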