Practical Byzantine Fault Tolerance
===================================

So how does the PBFT protocol work?
  Number the replicas 0, 1, 2, ..., 3f (3f+1 replicas in total)
  Number requests with consecutive sequence numbers (not viewstamps)
  System goes through a series of views
    In view v, replica number v mod (3f+1) is designated the primary
    Primary is responsible for selecting the order of operations
      Assigns an increasing sequence number to each operation
  In normal-case operation, use a two-round protocol for request r:
    Round 1 (pre-prepare, prepare) goal:  Ensure at least f+1 honest replicas agree that
      if request r executes in view v, it will execute with sequence no. n
    Round 2 (commit) goal:  Ensure at least f+1 honest replicas agree that
      request r has executed in view v with sequence no. n

Protocol for normal-case operation
  Let c be the client, r_i be replica i (p the primary, b_i backup i),
    and R the set of all replicas
  c -> p:    m = <REQUEST, o, t, c>_Kc
  p -> R:    <PRE-PREPARE, v, n, d>_Kp, m      (note d = H(m))
  b_i -> R:  <PREPARE, v, n, d, i>_K{r_i}
  [Note all messages are signed, so henceforth < > means implicit signature.]
  Replica r_i now waits for the PRE-PREPARE + 2f matching PREPARE messages,
    puts these messages in its log,
    and then we say prepared(m, v, n, i) is TRUE
  Note: If prepared(m, v, n, i) is TRUE for honest replica r_i, then
        prepared(m', v, n, j) where m' != m is FALSE for any honest r_j
    Why?  Because each of r_i, r_j has 2f+1 signed [pre-]prepare messages
      Given 3f+1 replicas, the messages at r_i and r_j must share f+1 sources
      But at most f nodes are malicious, so one of those f+1 won't double-vote
      So the two sets of 2f+1 agreeing messages must agree with one another
    Means no other operation can execute with view v, sequence number n
  Are we done?  Can we just reply to the client?  No
    Just because some other m' won't execute at (v,n) doesn't mean m will
    Suppose r_i is partitioned right after prepared(m, v, n, i)
      Suppose no other replica received r_i's PREPARE message
      Suppose f replicas are slow and never even received the PRE-PREPARE
      No other honest replica will know the request prepared!
      If p fails, some m' != m might get executed at (v+1, n)
    Would be bad if r_i had told the client m committed in slot n when it didn't
      f bad nodes could "join" r_i, falsely convincing the client m committed at n
      f nodes could crash, leaving no 2f+1 to agree on sequence number n
  So we say the operation doesn't execute until
    prepared(m, v, n, i) is TRUE for f+1 non-faulty replicas r_i
    We say committed(m, v, n) is TRUE when this property holds
  So how does a replica *know* committed(m, v, n) holds?
    Add one more message:
      r_i -> R:  <COMMIT, v, n, d, i>    (sent only after prepared(m, v, n, i))
    Replica r_i waits for 2f+1 identical COMMIT messages (including its own)
    committed-local(m, v, n, i) is TRUE when:
      prepared(m, v, n, i) is TRUE, and
      r_i has 2f+1 matching COMMITs in its log
    Note: If committed-local(m, v, n, i) is TRUE for any non-faulty r_i,
      then committed(m, v, n) is TRUE
      r_i knows when committed-local is TRUE
      So committed-local is a replica's way of knowing that committed is TRUE
  r_i replies to the client when committed-local(m, v, n, i) is TRUE
    r_i -> c:  <REPLY, v, t, c, i, r>    (r is the result)
  Client waits for f+1 matching replies, then returns the result to the caller
    Why f+1 and not 2f+1?
      Because out of the f+1 replies, at least one is from a non-faulty replica r_i
      So the client knows committed-local(m, v, n, i) (even if it doesn't know which i)
      Which in turn implies committed(m, v, n)
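To make the quorum arithmetic above concrete, here is a minimal Python sketch (invented names, not the paper's implementation) of one replica's message log and the prepared / committed-local predicates, assuming f = 1 and that "matching" means same view v, sequence number n, and request digest d:

    F = 1                # assumed number of faulty replicas tolerated
    N = 3 * F + 1        # total number of replicas

    class ReplicaLog:
        def __init__(self, my_id):
            self.my_id = my_id
            # (v, n, d) -> sets of senders seen for each message type
            self.slots = {}

        def record(self, phase, v, n, d, sender):
            slot = self.slots.setdefault(
                (v, n, d), {"pre-prepare": set(), "prepare": set(), "commit": set()})
            slot[phase].add(sender)

        def prepared(self, v, n, d):
            # PRE-PREPARE from the primary plus 2f PREPAREs from distinct replicas
            slot = self.slots.get((v, n, d))
            return bool(slot and slot["pre-prepare"] and len(slot["prepare"]) >= 2 * F)

        def committed_local(self, v, n, d):
            # prepared, plus 2f+1 matching COMMITs (this replica's own COMMIT counts)
            slot = self.slots.get((v, n, d))
            return self.prepared(v, n, d) and len(slot["commit"]) >= 2 * F + 1

    # Example: replica 2 collects messages for request digest "d1" in view 0, seqno 7
    log = ReplicaLog(my_id=2)
    log.record("pre-prepare", 0, 7, "d1", sender=0)      # from the primary
    for backup in (1, 3):                                # 2f = 2 matching PREPAREs
        log.record("prepare", 0, 7, "d1", sender=backup)
    assert log.prepared(0, 7, "d1")
    for replica in (0, 1, 2):                            # 2f+1 = 3 matching COMMITs
        log.record("commit", 0, 7, "d1", sender=replica)
    assert log.committed_local(0, 7, "d1")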
Garbage collecting the message log
  Make periodic checkpoints
    Broadcast <CHECKPOINT, n, d, i>, where d = digest of the state
    A stable checkpoint has the state + a proof comprising 2f+1 signed CHECKPOINTs
  Restrict sequence numbers to be between h and H
    h = sequence number of the last stable checkpoint
    H = h + k (e.g., k might be 2x the checkpoint interval of 100)
    Why?  Don't want a bad primary exhausting the sequence number space
  Delete all messages below the sequence number of the stable checkpoint

View changes
  When the client doesn't get an answer, it broadcasts the request to all replicas
  If a backup notices the primary is slow/unresponsive:
    - broadcast <VIEW-CHANGE, v+1, n, C, P, i>
      C is 2f+1 signed CHECKPOINT messages for the last stable checkpoint
      P = {P_m} where each P_m is a signed PRE-PREPARE + 2f signed PREPAREs
        i.e., P is the set of all PREPAREd messages since the checkpoint,
        plus proof that the messages really are prepared
  When the primary of view v+1 sees 2f signed VIEW-CHANGE messages from others:
    - New primary broadcasts <NEW-VIEW, v+1, V, O>
      V is a set of at least 2f+1 VIEW-CHANGE messages (including the new primary's own)
      O is a set of PRE-PREPARE messages, for operations that are:
        - after the last stable checkpoint, and
        - appear in the set P of one of the VIEW-CHANGE messages
      O also contains dummy messages to fill in sequence number gaps
  Replicas may obtain any missing state from each other
    (e.g., stable checkpoint data, or a missing operation, since reissued
     PRE-PREPARE messages only contain the digest of the request)
  What happens if the primary creates an incorrect O in the NEW-VIEW message?
    E.g., it might send null requests for operations that prepared
    Other replicas can compute O from V, and can reject the NEW-VIEW message
      (a sketch of this computation appears below, after the tentative-reply discussion)
  What happens if the primary sends different V's to different backups?
    Still okay, because any committed operation will be in 2f+1 VIEW-CHANGE msgs,
      of which f+1 must be honest, so at least one member of V will have the operation
    So the new primary cannot cause committed operations to be dropped
    It can only drop operations for which the client has not yet seen the answer

How does PBFT deal with non-determinism (Sec. 4.6)?  Two options
  Option 1: Primary selects the value, piggybacks it on the PRE-PREPARE
    "replicas must be able to decide deterministically whether... correct"
  Option 2: Add an extra round to the protocol for backups to determine the value
    E.g., backups nominate non-deterministic candidate values
    Value chosen is the average, or the one with the highest hash, out of 2f+1 nominations
    Primary must include the 2f+1 nominations as proof the value is valid
  What does BFS use for file mtime?  Option 1
    Weird, because not actually deterministically checkable
    But it turns out this works anyway
      If f+1 honest replicas reject, timeout and view-change

What is the tentative reply optimization?
  Want to shave one network transmission latency from end-to-end client time
  r_i tentatively executes m and sends a tentative reply once prepared(m,v,n,i),
    which is before committed-local(m,v,n,i)
  Client accepts 2f+1 matching tentative replies
    So f+1 of those replies must be from honest nodes
    And at least 1 of those f+1 will be part of the 2f+1 forming any new view
    So that 1 node will make sure the operation makes it into the new view
    [cf. last slide of intro: 2f+1 = f_S + f_L + 1]
  So why do we even need COMMIT messages?
    Tentative replies convince the client that committed(m,v,n)
    But the replicas don't know this!
      So they can't determine the outcome of subsequent dependent operations,
      can't compute checkpoints, etc.
  Can tentatively executed requests be undone?
    Yes, if there's a view change
    But then the client won't have 2f+1 tentative replies, so it's okay to undo
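Since tentative replies are only undone across a view change, it helps to make the NEW-VIEW construction above concrete. Below is a hedged Python sketch (illustrative data shapes, not the paper's wire format or code) of how the new primary, or any backup verifying a NEW-VIEW, can deterministically derive O from the VIEW-CHANGE set V. It is simplified: the real protocol also handles the case where the same sequence number prepared in more than one view by preferring the prepare with the highest view number.

    NULL_REQUEST = None     # stands in for the dummy pre-prepare that fills a gap

    def compute_O(V, new_view):
        # Each view-change vc is modeled as {"h": seqno of last stable checkpoint,
        #                                    "P": {n: digest of request prepared at n}}
        min_s = max(vc["h"] for vc in V)            # latest stable checkpoint in V
        prepared = {}                               # seqno -> digest, from all P sets
        for vc in V:
            for n, d in vc["P"].items():
                if n > min_s:
                    prepared[n] = d
        max_s = max(prepared, default=min_s)        # highest prepared seqno in V
        return [("PRE-PREPARE", new_view, n, prepared.get(n, NULL_REQUEST))
                for n in range(min_s + 1, max_s + 1)]

    # Example with f = 1: 2f+1 = 3 view-change messages, one of which prepared seqno 102
    V = [{"h": 100, "P": {102: "d42"}},
         {"h": 100, "P": {}},
         {"h": 100, "P": {}}]
    print(compute_O(V, new_view=1))
    # -> [('PRE-PREPARE', 1, 101, None), ('PRE-PREPARE', 1, 102, 'd42')]

Because this computation depends only on V, every replica that receives the NEW-VIEW can recompute O and reject the message if the primary lied.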
How are read-only (RO) requests optimized?
  Client sends the request directly to the replicas; replicas reply directly to the client
  Client waits for 2f+1 matching replies, or restarts it as a read-write request
    (see the client-side reply-matching sketch at the end of these notes)
  Replicas must serialize the RO request after all prior tentative requests.  Why?
    For linearizability, the RO request must not overlap a tentatively executed one
  What if replicas serialize at different points?
    Might not get 2f+1 matching replies, but that's okay
  How would this break if the REPLY didn't include the view number?
    Could get 2f+1 matching replies from different views
    Problem if all of the replies result from tentative operations and none commit
    Saved by the fact that only one op can prepare for a single view/seq#

What about compromised clients? (p. 2)
  Results of faulty client operations must be consistent
    Can't violate server-side invariants
    So can rely on authentication and authorization
  Just like faulty clients with a non-replicated service

How do the authors evaluate the system?
  Microbenchmark: null op with empty or 4K request/response size
  End-to-end file system benchmark vs. NFS and non-replicated BFS
  Are these the right things to measure?
    End-to-end benchmarks are good, and the microbenchmark helps understand them
    Could ask for more benchmarks, or more services than BFS
    But compared to prior work, these results are already very convincing

What's the deal with r/o lookup?
  On the NFS lookup RPC, BFS updates the directory atime, which is expensive
  Is that right?  No!
    Only need "x" permission on a directory for lookup
    Need to read/getdents ("r") a directory or file to change its atime
    Verified on Linux and BSD: lookup does not affect directory atime
  So the authors violated semantics and needlessly penalized themselves!

How might we extend PBFT to tolerate more than (n-1)/3 failures over its lifetime?
  - detect failed replicas using proactive recovery
  - recover the system periodically, no matter what
    - makes bad nodes good again
  - tricky stuff - an attacker might steal a compromised replica's keys
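Returning to the reply-matching discussion above: here is a small Python sketch (invented helper names, not the authors' code) of client-side reply collection. Replies are grouped by (view, timestamp, result) and counted by distinct replica id, which is why dropping the view number from REPLY would be dangerous for tentative and read-only replies:

    from collections import defaultdict

    F = 1   # assumed number of tolerated faults

    def collect_replies(replies, quorum):
        """replies: iterable of (replica_id, view, timestamp, result).
        Returns the result once `quorum` distinct replicas sent matching replies."""
        votes = defaultdict(set)                 # (view, timestamp, result) -> replica ids
        for replica_id, view, t, result in replies:
            votes[(view, t, result)].add(replica_id)
            if len(votes[(view, t, result)]) >= quorum:
                return result
        return None                              # not enough matching replies yet

    # Normal replies need f+1 matches; tentative or read-only replies need 2f+1
    replies = [(0, 3, 17, "ok"), (1, 3, 17, "ok"), (2, 2, 17, "ok")]
    print(collect_replies(replies, quorum=F + 1))      # -> "ok" (two matches in view 3)
    print(collect_replies(replies, quorum=2 * F + 1))  # -> None (views differ, only 2 match)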