How to Build a Highly Available System Using Consensus
======================================================

The short term plan:
  We've learned about distributed programming models, including RPC
  Now we are talking about consensus
  We are interested in consensus for two reasons:
    1. Consistency in a distributed system
       This is what we talked about last lecture
       E.g., do consistent bank transfer between two databases
    2. Fault-tolerance through replication
       I.e., have N machines run replicas of your system and agree on state
       If you lose a replica, no big deal as long as other copies survive
  Today's lecture:  Consistency in an Asynchronous replicated system
  Next lecture:  Consistency in a Synchronous replicated system
  Following two lectures:  Consistency in the presence of malicious attackers

One way to replicate a server is to build it as a deterministic state machine:
  - System starts in some initial state
  - Clients send requests to the server
  - Server deterministically computes replies and updates its state
  - If N servers reach consensus on list of requests, will have identical state
Is it hard to implement a server as a state machine?  How would you do this?
  No.  A deterministic RPC server is already a state machine.
    Requests = procedure number and marshaled arguments
       Reply = marshaled results
       State = memory of your program and files it reads/writes
  (This is why RPC fits into the plan...)
  One catch:  You must process RPCs deterministically
  Is NFS deterministic (if you've seen NFS protocol)?
    No.  When you write file, server sticks modification time in inode
    Multiple replicas would probably use slightly different times
  How might you fix this?
    Include time in write request
    Or have replicas reach consensus on time value for request as well

Let's construct an example of a useful replicated system:
  Credit card payment server implemented as RPC-driven state machine:
  * CHARGE (transaction-id, CC #, merchant-to-pay) -> {approved, denied} 
    - Merchant issues RPC when customer pays for item
    - If system says approved, customer can leave with merchandise
  Obviously don't want to forget about transactions if disk fails!
  So replicating in case of server/disk failure is a good idea
  Note also that we care about order of transactions
    If CC# gets two large charges, only first executed one might succeed

Our server will be *asynchronous*.  What's an asynchronous system?
  Makes no assumptions about relative speeds of machines
  Makes no assumptions about delay in message delivery
  Basically means you can't use timeouts
    E.g., can't assume if machine doesn't respond in 2 minutes it must be dead
  [Note:  Do not confuse with async. event-driven programming vs. threads]

Famous Fischer-Lynch-Patterson (FLP) result for asynchronous systems:
  For a non-trivial consensus problem (agreed upon value depends on input)
  No protocol is guaranteed to reach consensus if even one machine fails

We were hoping to achieve fault-tolerance through consensus among replicas
       ... is that the end of lecture?  Go home?  No:
  1. Can always make your system not asynchronous after a failure:
     - Send in the human operator
     - Operator sees black smoke coming from disk, unplugs fried machine
     - Tells system that machine really failed and isn't just slow
  2. Not *guaranteed* to reach consensus doesn't mean *won't* reach consensus
     - A system might sometimes reaches consensus, sometimes not terminate
     - Real systems are asynchronous but longer delays less likely
     - Let's try for a protocol very likely to reach consensus in practice

First attempt:  Simple protocol like two-phase commit
  Designate one server the primary, and the others backups
  Client:
    - sends CHARGE RPC to primary
  Primary:
    - assigns the RPC a consecutive sequence number
    - broadcast PREPARE message containing: { sequence number, CHARGE args }
  Backups:
    - if sequence number not greater than all others seen, reply ABORT-VOTE
    - otherwise, log PREPARE message to disk and send COMMIT-VOTE
  Primary:
    - Transaction commits if all backups vote COMMIT
    - Log transaction to disk either way
    - Broadcast transaction result to backups
    - If transaction ABORTed, conservatively send back denied to client
    - If transaction COMMITted, execute and send result to client
  Backups:
    - Log outcome and reply to primary
    - If don't hear from primary, ask it
  Primary:
    - Once all backups ack outcome, log END record in background

What happens if one of the backups fails?
  System grinds to a halt
  But N-1 replicas of the data still exist
  So human operator can just configure system to have only N-1 replicas
  Now can continue operation with no lost data

What happens if primary fails?
  Customer either walked out with item or not, want charge to match
  If any replica didn't vote COMMIT, customer doesn't have item, so ABORT
      E.g., replica might never even have received the PREPARE RPC
  If all replicas voted COMMIT, then customer may have gotten item
      or may still be waiting for approval in store
    So put charge through; client can re-transmit CHARGE RPC if necessary
  Once you determine transaction status, make another node the new primary

What happens if primary and another replica both fail?
  If all replicas voted COMMIT, we are in trouble
    Customer may have left with item, may have left w/o item, may be waiting
  So 2PC was probably the wrong protocol to use

Second attempt:  Use three-phase commit
  Problem was "prepared" state where replicas can go either way
  So now after PREPARE, if everyone votes COMMIT, broadcast PREPARE-COMMIT
  Once all participants ACK PREPARE-COMMIT, broadcast COMMIT
  After failure, COMMIT if and only if any survivor saw PREPARE-COMMIT
  Why is this better?
    2PC: execute transaction when everyone is willing to COMMIT it
    3PC: execute transaction when everyone *knows* it will COMMIT
      Customer only gets item when everyone knows that charge committed
But still two problems:
  3PC is expensive to do for each operation
  Need human operator to set primary and #replicas each time

Can we fix 3PC to recover from failure automatically?  No.
  3PC is guaranteed to reach consensus if no machine fails
  FLP tells us can't always reach consensus if a machine fails
    But in an asynchronous system, failed machine looks like slow machine
      ... and consequently slow machine looks like failed machine
    E.g., might time out and initiate recovery for machine that didn't fail
  Strangely, we actually need a protocol that's not guaranteed to terminate
      even without failure!  This is what Paxos gives us

How does Paxos protocol work?
  No primary (since primary might fail).  Anyone can initiate protocol.
  Give each machine a unique numeric machine ID
  Each replica is an agent, maintaining the following state:
      mid:  This machine's unique ID
    round:  This is a number in the form (num, machine-ID)
	    num is most significant bits, machine-ID is least significant bits
    value:  This is a candidate value to agree on, or NULL
  To initiate protocol, an agent becomes a leader and:
    Sets its round value to (n, mid), where
            n = (1 + max (num in all rounds seen))
    Broadcasts PREPARE (round)
  When an agent receives PREPARE (r):
    If r < round, reply with REJECT (round)  [where round is receiver's value]
    Else, set round := r, reply with ACCEPT (round, value)
  If leader gets ACCEPT (r, v) from a majority of agents:
    If round != r, ignore message; leader has become underling to other leader
    Else, sets value to v from accept with highest r
    If value is NULL, leader chooses any value it wants nodes to agree on
    Broadcast DECIDE (round, value)
  When agent receives DECIDE (r, v)
    If r < round, reply with REJECT (round)
    Else, set round := r [if not already], and set value := v
    Reply with ACCEPT (round, value)
  Once majority of nodes has same value, group has reached consensus

But Paxos is expensive.  How to use it to implement a distributed system?
  Optimize for common case:
    - No new nodes have failed (since last time system repaired itself)
    - Network requests take less than some time-out threshold
  Idea:  As long as these properties hold, do something efficient
    Along the lines of 2PC protocol above
    Except arrange system so backups never need to vote ABORT
  When a timeout occurs, then invoke Paxos-like protocol
    Reach consensus on a new primary and set of non-faulty backups
    Make sure everyone knows about all operations
    Re-execute any operations that didn't make it through (since no ABORT)

Now the big picture:  Make any system more reliable and available
       reliable = won't permanently your data
      available = will be functioning when you need it
  If it runs on one machine, only as available as hardware
    If machine crashes, needs new power supply, etc., will have downtime
  Create another *implementation* that is replicated on multiple machines
    E.g., Have 3 replicas; system available even if one is down
    What about 2 replicas?
      Unfortunately, can't reach consensus w/o a majority.  1/2 not a majority
      One trick:  Have 1 primary, 1 backup, 1 witness
        In normal case, witness does nothing
	Want to avoid primary thinking backup failed & vice versa
	  Otherwise will create two divergent copies of the system
	  E.g., approve charges that total above credit limit
        If network partitions, use witness to decide who continues running
	Witness can also log operations while primary or backup down
	  E.g., If primary dies, then backup dies, then primary comes back
                primary can get missing operations from witness
  So replicated system will be more reliable and available
  But Paxos is not even guaranteed to terminate
     ... how can replicated system then correctly implement the same system?