The short-term plan:
  Last Thursday we talked about need for consensus in replication
  Went over why 2PC, 3PC alone don't work, and also covered basic Paxos
  Today's agenda:
    Use Paxos to build a replicated system for availability
    Extend Paxos to deal with untrustworthy replicas
  Today's discussion will likely continue into next Tuesday's lecture
  Note:  Later in term, will discuss replication for performance

Recall Paxos algorithm.  To propose a value:
  1. Broadcast PREPARE(n)  [where low-order bits of n are your machine ID]
     - Nodes ignore message if they have seen a higher value of n
     - Nodes reply with PREPARE-RESULT(n', v') or PREPARE-RESULT(0, nil)
         n' is highest numbered proposal seen, v' is value proposed
       PREPARE-RESULT is promise not to accept value w. lower numbered n
  2. If majority accepts PREPARE, Broadcast PROPOSE(n, v)
     - v is v' from PREPARE-RESULT(n', v') with highest v'
     - if n' is nil, then can propose any value
  3. If majority accepts PROPOSE, then done
     - broadcast DECIDE(n, v) to announce decision

Big picture goal:
  Structure your system as a state machine
  Re-implement your system as distributed system w. replicated cohorts
    Cohorts use Paxos to agree on order of all operations
  But Paxos is not guaranteed to terminate
    How can a replicated system hope to the same system?
Section 4.1 of last week's Lampson paper, defines "Y *implements* X" as:
    1. X's safety property implies Y's safety property, and
    2. Y's liveness property implies X's liveness property
  Let's examine what this means in more detail...

[This portion of notes from Alpern & Schneider]
One way to view a system is as sequence of externally-visible states
  An *execution* is an infinite sequence of states
    \sigma = s_0, s_1, s_2, ...
    where a single atomic action moves from s_i to s_{i+1}
    if a program terminates, last state just repeats infinitely
  A *property* is a set of executions.  Examples:
    1. Value of variable clock never decreases
    2. Once variable done is true, variable result doesn't change
  A *program* also defines a set of executions
  A property *holds* for a program if its executions are a subset of property
    E.g., both example properties hold for program "for (;;);"
    Property 2 doesn't hold for program "done = true; result++;"

What is a safety property?
  Informally, means some "bad thing" doesn't happen during execution
  What are examples of safety properties?
    - mutual exclusion
    - freedom from deadlock
    - partial correctness - what's this?
        Means if program produces output, output is correct function of input
          But program might not terminate on some inputs
        Given undecidability, this is often the best one can hope to verify
    - first-come-first-served
  Formally, P is a safety property iff:
    If execution s_0, s_1, ... is not in P, then there is some i >= 0
    such that P holds for no execution beginning s_0, ..., s_i
  In other words, something bad happened by state s_i

What is a liveness property?
  Informally, means some "good thing" happens during execution
  What are examples of liveness properties?
    - starvation freedom (make progress infinitely often)
    - termination
    - responsiveness (every request eventually satisfied)
    - leads to (event of type e_1 must eventually be followed by one of e_2)
        E.g., received RPC message followed by sent reply
  Formally, P is a liveness property iff:
    For any partial execution s_0, ..., s_i, there is some
    extension s_0, ..., s_i, s_{i+1}, s_{i+2}, ... for which P holds
  I.e., any partial execution can be "fixed" by making good thing happen later

Note:  Any property is the intersection of a safety and liveness property
  Argument for property P:
    - Note intersection of any two safety properties is a safety property
        You are just unioning all the "bad" prefixes
    - Let E be set of all possible executions (which is a safety property)
    - Let P' be smallest safety property containing P
    - Let L = E - (P' - P)  [i.e., anything in P or not in P']
    - Notice P = L \cap P'
    - Now suppose L is not a liveness property:
        Must be some prefix X = s_1, ..., s_i such that X|anything not in L
        Let O be set of executions starting X
          E - O is a safety property
          Also, since O and L disjoint, O \subseteq E-L = (P' - P)
        But O \subseteq (P' - P) implies P \subseteq (P' \cap (E-O))
        (P' \cap (E-O)) is a safety property, contradicting P' smallest
  Example:  "total correctness"
    safety property = partial correctness
    liveness property = termination
  Example:  "until"
      means eventually E_2 will happen, and all preceding events are E_1
    safety property = No trace with something other than E_1 before first E_2
    liveness property = E_2 evantually happens

Now look at Lampson's definition of "Y *implements* X" iff:
    1. X's safety property implies Y's safety property, and
    2. Y's liveness property implies X's liveness property
  What does it mean for one property to imply another?
    Recall properties are just sets of executions
    So P_1 implies P_2 just means P_1 contains P_2
  Can you explain definition of implements?
    Any program can produce a set of executions, so a program is a property
    That property can be written as intersection of Safety \cap Liveness
    X's safety property = complement of executions with "disallowed prefixes"
    X's liveness property = all executions in X or not in X's safety property
    1. means Y doens't execute any sequence containing a "bad prefix" for X
       equivalently, any prefix of an execution of Y, could also be of X
    2. means Y's executions contain the "good things" X always does

Let's look at an example:

program_A () {                  program_B () {
  x = input ();                   x = input ();
  while (random () & 1)           if (random () & 1)
    output ("thinking");            for (;;)
  output (x*x)                        output ("thinking");
}                                 output (x*x)
                                }

What are program A's safety and liveness properties?  (Just like Until above)
  Safety:  Only outputs "thinking" before outputting x^2
  Liveness:  Eventually outputs x^2
What about program B?
  Safety:  Outputs either x^2 or series of "thinking"
  Liveness:  Will generate some output
Does program A implement program B?
  No, because sequence "thinking", x^2 doesn't satisfy B's safety property
Does program B implement program A?  Yes, because two things hold:
  1. B's safety property is a subset of A's
     looking at B's output, can't know for sure that you aren't executing A
  2. A's liveness property is s superset of B's
Why requirement 2?  Why not just say Y implements X if X's safety implies Y's?
  Then trivial program "for (;;);" would implement everything
  2 Says Y must be capable of doing what X does, even if it not every time


Now back to replication... Want replicated system to implement unreplicated
What is safety of unreplicated state machine?
  Goal is *linearizability*.  What's this?  Means
    1. State is always result of executing list of requests from clients, and
    2. List preserves temporal order of non-concurrent operations
  Linearizability is a handy property for distributed systems components
    - Totally local property
      Can implement/check w/o using 2PC or other protocols across components
    - Can also implement mutexes that work across components
        If everyone acquires mutex, then accesses object, then frees mutex,
	then as long as mutex and object linearizable, you are okay
What liveness property do we want for replicated state machine?
  If, after series of requests, unreplicated system can transition to state s
    then replicated system should also be capable of transitioning to s
  What we really want is:  Replicated system always transitions to s
  What we can hope for, given FLP, is:
    Unless you are really *unlucky*, replicates system will transition to s
    Here "luck" captures something outside of the asynchronous model
    What unlucky mean in BFT paper?
      Message delay doesn't grow exponentially with time, E.g.,
      - dead machine never responds (lucky)
      - slow network suddenly adds 2 minutes to each packet (lucky)
      - each packet takes twice as long as previous (unlucky)


Paxos Made Practical
====================

So now let's replicate a state machine assuming fail-stop cohorts
  Fail-stop means failed cohort stops sending any network messages
  As opposed to Byzantine, where can behave arbitrarily

First, establish consensus on a view, which includes:
    view-id, one primary cohort, set of backup cohorts

Assuming we have a view, execute an operations:
  0. Client sends operation (including client's timestamp) to primary
  1. Primary assigns a *viewstamp* to operation:
        ts = number of operation within this view (1, 2, 3, ...)
        viewstamp = <view-id, ts>
     Logs operation, broadcasts it to backups
  2. Backups log operations (in oder), and confirm to primary
  3. Primary waits for *submajority* of backups to confirm
       submajority = min x such that (x+1) > n/2
       Primary + submajority of backups = majority of cohorts
     Executes operation, sends result to client

Important points:
  - All cohorts agree on order of operations by comparing viewstamps
  - Client doesn't see result until at least majority of cohorts log operation

When a cohort (possibly primary) fails, form new view.  Must ensure:
  1. Don't "forget" any past operations for which you replied to client
  2. Make sure two competing, divergent views cannot form
  Solve both problems using Paxos:
    Instance of Paxos is current view-id
    Paxos proposal number is proposed new view-id
    Value you are agreeing on is:
      Configuration of new view (with cohorts are primary, which backups)
      Last operation executed in previous active view
  How to solve problems 1 and 2 above:
    1. Make sure new view has old primary or a majority of old cohorts
    2. Guaranteed by Paxos if set of possible cohorts fixed over time
       Slight trickiness when adding new cohorts--see paper for details


Practical Byzantine Fault Tolerance
===================================

Now what happens if a cohort is bad?
  Can become primary, return arbitrary results to clients

Straw-man, 2f+1 cohorts, require f+1 to reply to client:
  0. Client sends signed request to primary (so bad primary can't forge)
  1. Primary assigns viewstamp to request, and broadcasts to backups
  2. Backups execute operation, sign { viewstamp, result }
     send result to primary and to client
  3. Client accepts results after f+1 matching replies
The good news about this protocol:
  If at most f cohorts fail, client receives >=1 reply from honest cohort
What is the bad news?  Fails to achieve safety property (linearizability)
  Suppose cohorts (replicas) r_1 .. r_f malicious, r_{f+1} .. r_{2f+1} honest
  Let r_1 be the malicious primary
  client u_1 issues operation o_1:
    r_1 (primary) sends it to r_2 .. r_{f+1} only
    u_1 gets f+1 replies, assumes operation is complete
  client u_2 issues operation o_2:
    Now r_1 sends it to r_2 .. r_f and r_{f+2} only
    r_{f+2} returns as if o_1 never executed (it doesn't know about o_1)
    r_1 .. r_f, because malicious, also return as if o_1 never executed
    u_2 sees bad result--not linearizable
    Why is this bad?
      E.g., o_1 was acquiring a mutex
         or I commit change call you on phone say get new state of object
  client u_2 issues operation o_3:
    Now r_1 (primary) sends operations (o_2, o_1, o_3) to r_{f+3}
    r_{f+3} replies as if o_2 executed before o_1
    other bad r_1 .. r_f send same reply as r_{f+3}
    basically malicious cohorts are completely violating safety

What is the minimum replicas that we will need to recover from f failures?
      3f+1.  Why?
  f cohorts may fail and stop responding to network messages
  So must make progress after 2f+1 replies
  But if get 2f+1 replies, might just have f slow cohorts
  Which means f of remaining 2f+1 could be malicious
  But w. 3f+1 a majority (f+1) of 2f+1 replies guaranteed from honest cohorts