Administrivia: Note sign-in code--either QR code or 5-character code for link on web site Sign up for project meeting as early as tomorrow (see edstem) Still need a partner? Join team-finding OH tomorrow 6pm (see edstem) Now we are talking about distributed systems. Why do we care about them? - Scale: systems that wouldn't fit on a single machine - Reliability: systems that survive failure of a machine, network, etc. - Federated control: no single entity can serve all DNS names - Network latency: may need to exist near clients around the world Why does it matter? More failure cases, harder to order things A New Presumed Commit Optimization for Two Phase Commit ======================================================= Note on serializability vs. linearizability Serializability: as if transactions executed one at a time Strict serializability: ... and non-overlapping txns in temporal order Linearizability = s.s. for transactions limited to one op. on one object Purely local property defined on history of single objects Unlike serializability, admits non-blocking implementations Linearizable response always exists regardless of pending ops Say you want to move $100 from account A to account B at a bank Such a transaction could fail depending on state of ledger E.g., Account A has insufficient funds, or either account deleted Want serializability - otherwise could violate invariants Concurrent transactions on A and B must not interfere Read B, write B+100 must be equivalent to atomic B += 100 Everyone must agree on exactly what transactions preceded this transfer Otherwise, say A only has $100 but attempts two payments to B and C If neither transaction sees the other, both will succeed Or say A and B each move $1M (they don't have) to the other Transfers could succeed if each transaction thinks other happened first Want recoverability - transaction is atomic (all or nothing) Debit A iff credit B, even with power failures, crashes, etc. Say bank ledger is in an RDBMS--how do we achieve these properties? Serializability - take out locks on both accounts before moving money Recoverability - generally use a write-ahead log Write atomic, idempotent description to log before changing B-trees Post-crash, log replay has same effect regardless of B-tree state Alternatives: keep undo log (SQLite), DO-UNDO-REDO logging, etc. Note: locks can fail or be revoked, in which case RDBMS aborts transaction Aborted transaction has no effect on database contents Must be reported to client, which tries again or gives up Say ledger is large, must be sharded, with A and B in separate RDBMSes Naive approach, two transactions on two RDBMSes--what goes wrong? Serializability - say concurrent payments T1:A->B($10) and T2:C->A($100) A in DB1, B+C in DB2: DBs could see two payments in different orders T2: lock C T1: lock B, lock A, commit, release locks T2: lock A, commit, release locks DB1 orders T1 before T2 (because of lock on A) But DB2 can order T2 before concurrent T1 (DB2 still serializable) Bad when operations do not commute E.g., A assessed overdraft fee even though C debited before B credited Recoverability - What if T1 commits on DB2 but aborts on DB1? Solution? Use 2PC to ensure *all* DBs either commit or abort Acquire a bunch of locks, finish your transaction, decide to commit Phase 1: Preparing Coordinator broadcasts PREPARE to all cohorts (DBs in transaction) Up to this point, each cohort can abort (e.g., if it revoked lock) If a cohort aborted, responds with ABORT-VOTE Otherwise, respond COMMIT-VOTE: now can't abort or touch locks! Phase 2: Committing/aborting If any cohort voted ABORT-VOTE, coordinator broadcasts ABORT message Otherwise, if all voted COMMIT-VOTE, then coordinator broadcasts COMMIT If COMMIT, cohorts *must* commit transaction Only now can cohorts release any locks associated with transaction Cohorts reply with ACK so coordinator knows they received outcome Does this solve recoverability? Yes if everything logged Does it solve serializability? Possibly DB2 must commit T1 before learning whether or not T2 committed But T2 could still have a lower timestamp than T1 on DB2 Okay if you assign timestamp in Phase2 (e.g., on receipt of PREPARE) At that point, all associated locks held on all cohorts What must the coordinator store in the in-memory protocol database? (Fig. 1) Tid: transaction ID Stable: Yes if existence of transaction has been persistently logged State: Initiated|Preparing|Aborted|Committed Per-cohort: Cohort-id, Vote (None|Abort|Read-only|Commit), Ack Can delete per-cohort state on receiving ACK Can delete Tid entirely when all ACKs received When must a cohort force-write something to disk? Before sending COMMIT-VOTE? Always COMMIT-VOTE is a promise not to abort transaction or release locks If cohort reboots with no record of promise, can't possibly keep it E.g., before hearing from coordinator of transaction it forgot about, might agree to commit conflicting transaction with other coordinator Before sending ACK? Depends on variant Naively yes, because it should permanently commit/abort transaction But if it crashes, will know transaction existed from COMMIT-VOTE record If coordinator remembers what happened, cohort can just ask it Whether an ACK message is even required (and associated write) depends on - What information coordinator retains - Whether transaction committed or aborted How much do we care about these forced writes? COMMIT-VOTE is on the critical path for transaction latency ACK is not--coordinator already knows transaction committed So disk latency won't affect transaction latency Maybe delay, piggyback on another log write for better throughput Possibly unfair of paper to lump these together in single "n" value When must coordinator force-write in paper's PrN variant? Before sending PREPARE? No (though yes if pedantically presuming nothing) Until it sends COMMIT, coordinator can unilaterally ABORT a transaction Cohort inquires about unknown transaction? coordinator "presumes" abort Before sending COMMIT? Yes At this point transaction happened, so need to record it durably Otherwise, couldn't properly respond to cohort inquiries after crash Before sending ABORT? Yes This is the part about presume nothing Upon receiving ACK? No (non-forced write) Fill in a table during lecture PrN PrA PrC NPrC log before PREPARE N* N Y N log before COMMIT Y Y Y Y log before ABORT Y* N N N ACK after COMMIT Y Y N N ACK after ABORT Y N Y Y (Note: cohorts log before COMMIT-VOTE in all schemes) *depends how pedantic we want to be about presuming nothing How does presumed abort (PrA) differ from PrN? Don't write to disk before sending ABORT Coordinator never writes aborted transactions--no garbage to collect No matter when coordinator crashes, nothing on disk So cohorts inquiring about transaction will always get same answer Hence, cohorts don't even need to send ACK message to ABORT And hence obviously no force-write before (non-existent) ACK message But COMMITs (common case) are exactly same cost as PrN What are costs for traditional presumed commit (PrC)? Coordinator writes before PREPARE? Yes Otherwise, after crash would presume committed if cohort inquired Why the difference from PrA? Asymmetry: Without all votes, coordinator can unilaterally ABORT but not COMMIT Coordinator writes before COMMIT? Yes Have to log to "undo" effect of previously written PREPARE log record Otherwise, would see transaction after crash and abort it But cohort doesn't have to ACK COMMIT Coordinator writes before sending ABORT? No But cohorts must ACK ABORT And coordinator has one more non-forced cleanup write when all ACKs in What's a read-only transaction? E.g., want to know if A.balance + B.balance >= $2M, but won't change balances Read-only at both DB1 and DB2, but still must lock balances Otherwise could get confused if A concurrently paying B $1M Might be read-only at some cohorts, read-write at others What's the read-only optimization? Read-only cohort doesn't care whether transaction commits or aborts Only effect is to hold locks for duration of transaction Cohort replies to PREPARE with READ-ONLY-VOTE Also releases all locks when it sends READ-ONLY-VOTE any locked data unmodified since transaction is read-only at cohort Coordinator doesn't send COMMIT/ABORT message to read-only cohort If all cohorts reply READ-ONLY-VOTE, then whole transaction read-only If coordinator didn't log PREPARE message, doesn't need to log anything Cohorts don't care whether committed or aborted With PrC, must write (non-forced) record to delete logged PREPARE message What's the idea behind the NPrC optimization? Trade garbage collection after crash to reduce messages+writes Window of recent transaction ids REC=(tid_l,...,tid_h) presumed aborted All other transactions presumed committed Cohorts act exactly like normal presumed committed Must stably log the window parameters (tid_l,tid_h) tid_h can be implicit based on highest stable transaction or can be explicitly logged but amortized over many transactions must guarantee no tids >= tid_h ever used tid_l piggybacked onto other log writes points to oldest "undocumented" transaction i.e., oldest that is Initiated or Preparing or Aborted with missing ACKs Now no need to log before sending PREPARE when in window (common case) Can always log PREPARE of slow ("recalcitrant") transaction later Logging PREPARE allows tid_l to advance despite a few stragglers But also no need for cohorts to ACK COMMIT messages! Because tid_l suffices for coordinator to "remember" an arbitrary number of committed transactions cohorts might ask about Only common-case forced log write is before sending COMMIT What happens if NPrC coordinator crashes while committing tid? If tid >= tid_l, then disk will contain COMMIT log record If tid < tid_l, then will presume committed, so no need for record What happens if NPrC coordinator crashes while aborting tid? Won't advance tid_l > tid until all cohorts ACK the abort So either tid > tid_l and presumed aborted, or no cohorts will inquire (Or if tid_l > recalcitrant tid, will have explicit PREPARE log record) What is garbage collection issue? Never logged PREPARE? won't know what cohorts involved in transaction Can't collect ACKs for ABORT, have to be prepared for inquiries No time-bound on unknown cohorts inquiring about unknown transactions Solution is keep permanent record of presumed abort ranges after each crash What if you really don't want "forever garbage" in your system? Could alternatively keep track of all possibly active cohorts Inform all cohorts of crashes with CRASH messages But what if old cohort is permanently dead? Manually remove dead cohorts With simple transactions, tx done and all locks held when PREPARE sent Consider the following scenario: Database DB1 has a constraint SUM(colX) < 1000 Want to defer constraint checking to when cohort receives PREPARE Checking constraint requires additional read locks on colX What can go wrong with constraints if a cohort has READ-ONLY-VOTE? RO cohort DB2 never gets COMMIT/ABORT from coordinator So releases locks immediately on sending READ-ONLY-VOTE But DB1 might not yet have locked colX for contraint check Transactions no longer guaranteed serializable! E.g.: tx1 sets row1.colX -= 100 on DB1, changes Y from 1 to 2 on DB2 tx2 sets row2.colX += 200 (would fail before tx1), reads Y (RO) on DB2 Want: tx2 sees Y == 2 or ABORTs, does NOT see Y == 1 With naive RO: tx2 locking row2.colX and Y == 1, sends PREPARE DB2 receives tx2 PREPARE, responds READ-ONLY-VOTE, releases lock tx1 fully executes and COMMITs no need to check invariant because reducing colX DB1 receives tx2 PREPARE (delayed in network), locks colX invariant okay, sends COMMIT-VOTE even though saw Y == 1 One solution: timestamp transactions Order by timestamps to ensure same order on all cohorts COMMIT-VOTE and READ-ONLY-VOTE includes timestamps range ok for cohort E.g., on DB2, if tx2 sees Y == 1, tx2 timestamp must be before tx1 Coordinator picks time that is acceptable to all cohorts ABORT tx if no timestamp satisfies on cohorts' ranges Why is NPrC bad for transactions with user-visible timestamps? Coordinator chooses a timestamp, sends it to cohorts in COMMIT message A cohort inquires after missing COMMIT must be told the timestamp Since coordinator must force-write timestamp, might as well use PrA But timestamps may be just for RO optimization, not user-visible (see [5]) Is this compatible with NPrC optimization? Yes. Cohort's COMMIT-VOTE specifies range of permissible timestamps guarantees no overlapping ranges for conflicting transactions So if commit, cohort knows timestamp in valid range, which is good enough