How to Build a Highly Available System Using Consensus ====================================================== The short term plan: We've learned about distributed programming models, including RPC Now we are talking about consensus We are interested in consensus for two reasons: 1. Consistency in a distributed system This is what we talked about last lecture E.g., do consistent bank transfer between two databases 2. Fault-tolerance through replication I.e., have N machines run replicas of your system and agree on state If you lose a replica, no big deal as long as other copies survive Today's lecture: Consistency in an Asynchronous replicated system Next lecture: Consistency in a Synchronous replicated system Following two lectures: Consistency in the presence of malicious attackers One way to replicate a server is to build it as a deterministic state machine: - System starts in some initial state - Clients send requests to the server - Server deterministically computes replies and updates its state - If N servers reach consensus on list of requests, will have identical state Is it hard to implement a server as a state machine? How would you do this? No. A deterministic RPC server is already a state machine. Requests = procedure number and marshaled arguments Reply = marshaled results State = memory of your program and files it reads/writes (This is why RPC fits into the plan...) One catch: You must process RPCs deterministically Is NFS deterministic (if you've seen NFS protocol)? No. When you write file, server sticks modification time in inode Multiple replicas would probably use slightly different times How might you fix this? Include time in write request Or have replicas reach consensus on time value for request as well Let's construct an example of a useful replicated system: Credit card payment server implemented as RPC-driven state machine: * CHARGE (transaction-id, CC #, merchant-to-pay) -> {approved, denied} - Merchant issues RPC when customer pays for item - If system says approved, customer can leave with merchandise Obviously don't want to forget about transactions if disk fails! So replicating in case of server/disk failure is a good idea Note also that we care about order of transactions If CC# gets two large charges, only first executed one might succeed Our server will be *asynchronous*. What's an asynchronous system? Makes no assumptions about relative speeds of machines Makes no assumptions about delay in message delivery Basically means you can't use timeouts E.g., can't assume if machine doesn't respond in 2 minutes it must be dead [Note: Do not confuse with async. event-driven programming vs. threads] Famous Fischer-Lynch-Patterson (FLP) result for asynchronous systems: For a non-trivial consensus problem (agreed upon value depends on input) No protocol is guaranteed to reach consensus if even one machine fails We were hoping to achieve fault-tolerance through consensus among replicas ... is that the end of lecture? Go home? No: 1. Can always make your system not asynchronous after a failure: - Send in the human operator - Operator sees black smoke coming from disk, unplugs fried machine - Tells system that machine really failed and isn't just slow 2. Not *guaranteed* to reach consensus doesn't mean *won't* reach consensus - A system might sometimes reaches consensus, sometimes not terminate - Real systems are asynchronous but longer delays less likely - Let's try for a protocol very likely to reach consensus in practice First attempt: Simple protocol like two-phase commit Designate one server the primary, and the others backups Client: - sends CHARGE RPC to primary Primary: - assigns the RPC a consecutive sequence number - broadcast PREPARE message containing: { sequence number, CHARGE args } Backups: - if sequence number not greater than all others seen, reply ABORT-VOTE - otherwise, log PREPARE message to disk and send COMMIT-VOTE Primary: - Transaction commits if all backups vote COMMIT - Log transaction to disk either way - Broadcast transaction result to backups - If transaction ABORTed, conservatively send back denied to client - If transaction COMMITted, execute and send result to client Backups: - Log outcome and reply to primary - If don't hear from primary, ask it Primary: - Once all backups ack outcome, log END record in background What happens if one of the backups fails? System grinds to a halt But N-1 replicas of the data still exist So human operator can just configure system to have only N-1 replicas Now can continue operation with no lost data What happens if primary fails? Customer either walked out with item or not, want charge to match If any replica didn't vote COMMIT, customer doesn't have item, so ABORT E.g., replica might never even have received the PREPARE RPC If all replicas voted COMMIT, then customer may have gotten item or may still be waiting for approval in store So put charge through; client can re-transmit CHARGE RPC if necessary Once you determine transaction status, make another node the new primary What happens if primary and another replica both fail? If all replicas voted COMMIT, we are in trouble Customer may have left with item, may have left w/o item, may be waiting So 2PC was probably the wrong protocol to use Second attempt: Use three-phase commit Problem was "prepared" state where replicas can go either way So now after PREPARE, if everyone votes COMMIT, broadcast PREPARE-COMMIT Once all participants ACK PREPARE-COMMIT, broadcast COMMIT After failure, COMMIT if and only if any survivor saw PREPARE-COMMIT Why is this better? 2PC: execute transaction when everyone is willing to COMMIT it 3PC: execute transaction when everyone *knows* it will COMMIT Customer only gets item when everyone knows that charge committed But still two problems: 3PC is expensive to do for each operation Need human operator to set primary and #replicas each time Can we fix 3PC to recover from failure automatically? No. 3PC is guaranteed to reach consensus if no machine fails FLP tells us can't always reach consensus if a machine fails But in an asynchronous system, failed machine looks like slow machine ... and consequently slow machine looks like failed machine E.g., might time out and initiate recovery for machine that didn't fail Strangely, we actually need a protocol that's not guaranteed to terminate even without failure! This is what Paxos gives us How does Paxos protocol work? No primary (since primary might fail). Anyone can initiate protocol. Give each machine a unique numeric machine ID Each replica is an agent, maintaining the following state: mid: This machine's unique ID round: This is a number in the form (num, machine-ID) num is most significant bits, machine-ID is least significant bits value: This is a candidate value to agree on, or NULL To initiate protocol, an agent becomes a leader and: Sets its round value to (n, mid), where n = (1 + max (num in all rounds seen)) Broadcasts PREPARE (round) When an agent receives PREPARE (r): If r < round, reply with REJECT (round) [where round is receiver's value] Else, set round := r, reply with ACCEPT (round, value) If leader gets ACCEPT (r, v) from a majority of agents: If round != r, ignore message; leader has become underling to other leader Else, sets value to v from accept with highest r If value is NULL, leader chooses any value it wants nodes to agree on Broadcast DECIDE (round, value) When agent receives DECIDE (r, v) If r < round, reply with REJECT (round) Else, set round := r [if not already], and set value := v Reply with ACCEPT (round, value) Once majority of nodes has same value, group has reached consensus But Paxos is expensive. How to use it to implement a distributed system? Optimize for common case: - No new nodes have failed (since last time system repaired itself) - Network requests take less than some time-out threshold Idea: As long as these properties hold, do something efficient Along the lines of 2PC protocol above Except arrange system so backups never need to vote ABORT When a timeout occurs, then invoke Paxos-like protocol Reach consensus on a new primary and set of non-faulty backups Make sure everyone knows about all operations Re-execute any operations that didn't make it through (since no ABORT) Now the big picture: Make any system more reliable and available reliable = won't permanently your data available = will be functioning when you need it If it runs on one machine, only as available as hardware If machine crashes, needs new power supply, etc., will have downtime Create another *implementation* that is replicated on multiple machines E.g., Have 3 replicas; system available even if one is down What about 2 replicas? Unfortunately, can't reach consensus w/o a majority. 1/2 not a majority One trick: Have 1 primary, 1 backup, 1 witness In normal case, witness does nothing Want to avoid primary thinking backup failed & vice versa Otherwise will create two divergent copies of the system E.g., approve charges that total above credit limit If network partitions, use witness to decide who continues running Witness can also log operations while primary or backup down E.g., If primary dies, then backup dies, then primary comes back primary can get missing operations from witness So replicated system will be more reliable and available But Paxos is not even guaranteed to terminate ... how can replicated system then correctly implement the same system?