Administrivia:
  - Feel free to turn off video if I forget to take you off screen
  - I'm trying screen sharing because in tests it looked better
    - If frame rate too low or A/V sync issues, just pin my camera
    - Please email me if this is worse than before
  - I'm working on downloadable lectures
  - Sign up for your group to meet with me (refresh class web page)

A New Presumed Commit Optimization for Two Phase Commit
=======================================================

Note on serializability vs. linearizability
  Serializability: as if transactions executed one at a time
  Strict serializability: ... and non-overlapping txns in temporal order
  Linearizability = s.s. for transactions limited to one op. on one object
    Purely local property defined on history of single objects
    Unlike serializability, admits non-blocking implementations
      Linearizable response always exists regardless of pending ops

Say you want to move $100 from account A to account B at a bank
  Such a transaction could fail depending on state of ledger
  E.g., Account A has insufficient funds, or either account deleted
Want serializability - otherwise could violate invariants
  Concurrent transactions on A and B must not interfere
    Read B, write B+100 must be equivalent to atomic B += 100
  Everyone must agree on exactly what transactions preceded this transfer
    Otherwise, say A only has $100 but attempts two payments to B and C
      If neither transaction sees the other, both will succeed
    Or say A and B each move $1M (they don't have) to the other
      Transfers could succeed if each transaction thinks other happened first
Want recoverability - transaction is atomic (all or nothing)
  Debit A iff credit B, even with power failures, crashes, etc.

Say bank ledger is in an RDBMS--how do we achieve these properties?
  Serializability - take out locks on both accounts before moving money
  Recoverability - generally use a write-ahead log
    Write atomic, idempotent description to log before changing B-trees
    Post-crash, log replay has same effect regardless of B-tree state
    Alternatives: keep undo log (SQLite), DO-UNDO-REDO logging, etc.
  Note: locks can fail or be revoked, in which case RDBMS aborts transaction
    Aborted transaction has no effect on database contents
    Must be reported to client, which tries again or gives up

Say ledger is large, must be sharded, with A and B in separate RDBMSes
  Naive approach, two transactions on two RDBMSes--what goes wrong?
  Serializability - say concurrent payments T1:A->B($10) and T2:C->A($100)
    A in DB1, B+C in DB2: DBs could see two payments in different orders
      T2: lock C
      T1: lock B, lock A, commit, release locks
      T2: lock A, commit, release locks
    Even if linearizable, DB2 can order T2 before concurrent T1
  Recoverability - What if T1 commits on DB2 but aborts on DB1?

Solution?  Use 2PC to ensure *all* DBs either commit or abort
  Acquire a bunch of locks, finish your transaction, decide to commit
  Phase 1: Preparing
    Coordinator broadcasts PREPARE to all cohorts (DBs in transaction)
    Up to this point, each cohort can abort (e.g., if it revoked lock)
    If a cohort aborted, responds with ABORT-VOTE
    Otherwise, respond COMMIT-VOTE: now can't abort or touch locks!
  Phase 2: Committing/aborting
    If any cohort voted ABORT-VOTE, coordinator broadcasts ABORT message
    Otherwise, if all voted COMMIT-VOTE, then coordinator broadcasts COMMIT
    If COMMIT, cohorts *must* commit transaction
    Only now can cohorts release any locks associated with transaction
    Cohorts reply with ACK so coordinator knows they received outcome
  Does this solve recoverability?  Yes if everything logged
  Does it solve serializability?  Possibly
    DB2 must commit T1 before learning whether or not T2 committed
    But T2 could still have a lower timestamp than T1 on DB2
    Okay if you assign timestamp in Phase2 (e.g., on receipt of PREPARE)
      At that point, all associated locks held on all cohorts

When must a cohort force-write something to disk?
  Before sending COMMIT-VOTE?  Always
    COMMIT-VOTE is a promise not to abort transaction or release locks
    If cohort reboots with no record of promise, can't possibly keep it
    E.g., before hearing from coordinator of transaction it forgot about,
          might agree to commit conflicting transaction with other coordinator
  Before sending ACK?  Depends on variant
    Naively yes, because it should permanently commit/abort transaction
    But if it crashes, will know transaction existed from COMMIT-VOTE record
    If coordinator remembers what happened, cohort can just ask it
    Whether an ACK message is even required (and associated write) depends on
      - What information coordinator retains
      - Whether transaction committed or aborted
  How much do we care about these forced writes?
    COMMIT-VOTE is on the critical path for transaction latency
    ACK is not--coordinator already knows transaction committed
      So disk latency won't affect transaction latency
      Maybe delay, piggyback on another log write for better throughput
      Possibly unfair of paper to lump these together in single "n" value

When must coordinator force-write in paper's PrN variant?
  Before sending PREPARE?  No (only if pedantically presuming nothing)
    Until it sends COMMIT, coordinator can unilaterally ABORT a transaction
    Cohort inquires about unknown transaction?  coordinator "presumes" abort
  Before sending COMMIT?  Yes
    At this point transaction happened, so need to record it durably
    Otherwise, couldn't properly respond to cohort inquiries after crash
  Before sending ABORT?  Yes
    This is the part about presume nothing
  Upon receiving ACK?  No (non-forced write)

Fill in a table during lecture
                            PrN    PrA   PrC   NPrC
       log before PREPARE   N*     N     Y     N
       log before COMMIT    Y      Y     Y     Y
       log before ABORT     N*     N     N     N
       ACK after COMMIT     Y      Y     N     N
       ACK after ABORT      Y      N     Y     Y
  (Note: cohorts log before COMMIT-VOTE in all schemes)
  *depends how pedantic we want to be about presuming nothing

How does presumed abort (PrA) differ from PrN?
  Don't write to disk before sending ABORT
    Coordinator never writes aborted transactions--no garbage to collect
      No matter when coordinator crashes, nothing on disk
      So cohorts inquiring about transaction will always get same answer
    Hence, cohorts don't even need to send ACK message to ABORT
      And hence obviously no force-write before (non-existent) ACK message
  But COMMITs (common case) are exactly same cost as PrN

What are costs for traditional presumed commit (PrC)?
  Coordinator writes before PREPARE?  Yes
    Otherwise, after crash would presume committed if cohort inquired
    Without all votes, coordinator can unilaterally ABORT but not COMMIT
  Coordinator writes before COMMIT?  Yes
    Have to log to "undo" effect of previously written PREPARE log record
    Otherwise, would see transaction after crash and abort it
    But cohort doesn't have to ACK COMMIT
  Coordinator writes before sending ABORT?  No
    But cohorts must ACK ABORT
    And coordinator has one more non-forced cleanup write when all ACKs in

What's the read-only optimization?
  Transaction might be read-only at one cohort
    Only effect is to hold locks for duration of transaction
    So cohort doesn't care whether transaction commits or aborts
  Cohort replies to PREPARE with READ-ONLY-VOTE
    Also releases all locks when it sends READ-ONLY-VOTE
      any locked data unmodified since transaction read-only at cohort
    Coordinator doesn't send COMMIT/ABORT message to read-only cohort
  If all cohorts reply READ-ONLY-VOTE, then whole transaction read-only
    If coordinator didn't log PREPARE message, doesn't need to log anything
      Cohorts don't care whether committed or aborted
    With PrC, must write (non-forced) record to delete logged PREPARE message

What's the idea behind the NPrC optimization?
  Trade garbage collection after crash for messages+writes
  Window of recent transaction ids (tid_l,...,tid_h) presumed aborted
    All other transactions presumed committed
  Must stably log the window parameters (tid_l,tid_h)
    tid_h can be implicit based on highest stable transaction
      or can be explicitly logged but amortized over many transactions
      must guarantee no tids >= tid_h ever used
    tid_l piggybacked onto other log writes
      points to oldest "undocumented" transaction
  Now no need to log before sending PREPARE when in window (common case)
    Can always log PREPARE of slow ("recalcitrant") transaction later
    Logging PREPARE allows tid_l to advance despite a few stragglers
  But also no need for cohorts to ACK COMMIT messages!
    Because tid_l suffices for coordinator to "remember" an arbitrary
    number of committed transactions cohorts might ask about
  Only common-case forced log write is before sending COMMIT

What happens if NPrC coordinator crashes while committing tid?
  If tid >= tid_l, then disk will contain COMMIT log record
  If tid < tid_l, then will presume committed, so no need for record
What happens if NPrC committed crashes while aborting tid?
  Won't advance tid_l > tid until all cohorts ACK the abort
  So either tid > tid_l and presumed aborted, or no cohorts will inquire

What is garbage collection issue?
  Never logged PREPARE? won't know what cohorts involved in transaction
    Can't collect ACKs for ABORT, have to be prepared for inquiries
    No time bound on unknown cohorts inquiring about unknown transactions
  Solution is keep permanent record of presumed abort ranges after each crash
What if you really don't want "forever garbage" in your system?
  Could alternatively keep track of all possibly active cohorts
  Inform all cohorts of crashes with CRASH messages
  But what if old cohort is permanently dead?  Manually remove dead cohorts

Why is NPrC bad for transactions with user-visible timestamps?
  Coordinator chooses a timestamp, sends it to cohorts in COMMIT message
    A cohort inquires after missing COMMIT must be told the timestamp
    Since coordinator must force-write timestamp, might as well use PrA
  What about internal timestamps (order all conflicting transactions)?
    Cohort's ABORT-VOTE specifies range of permissible timestamps
      guarantees no overlapping ranges for conflicting transactions
    So if it commits, cohort knows timestamp in range, which is good enough