Bayou ===== Let's build a calendar system to help understand ordering and conflicts. Each entry has a time and a set of participants. We want everyone to end up seeing the same set of entries. Traditional approach: one calendar server. Ordering: only one copy, server picks the order. Conflict resolution: server checks for conflicts before accepting update. Returns error to user, who can hand in any way desired. Why aren't we satisfied with central server? I want my calendar on my Palm Pilot. I.e. database replicated in every node. No master copy. Periodic connectivity to net. Periodic infrared contact with other calendar users. Straw man 1: swap complete DBs. Similar to Palm Pilot sync. Might require lots of network b/w. What if there's a conflict? IE two meetings at same time. Palm just schedules them both! But we want automatic conflict resolution. We can't just view DB items as bits. Since then we can't resolve conflicts. We want intelligent DB items that know how to resolve conflicts. They are more like updates: read DB, think, make a change. But we have to make sure that all nodes resolve conflicts the same way. How? Insight: Maintain an ordered list of updates at each node. Make sure every node has the same updates. Make sure every node applies the updates in the same order. Make sure the updates are deterministic functions of DB contents. Then a "sync" is really a merge of two ordered lists: easy. Not a DB merge at all. Is is OK if lists are merged in any order? Suppose I schedule an event, then want to delete it? Or change attendee list? The create must precede the delete in the eventual order! And in every node's view of uncommitted part of log too. So if things can switch around what guarantees to we have? Read your writes - In same session, reads always reflect writes Change your password, don't want to get "invalid password error" Don't want deleted emails to reappear Monotonic reads - if R1->R2 in session, R2 reflects all writes seen by R1 Don't want meetings to come and go in calendar Mailbox - if you see a new message, request to read it should succeed Writes follow reads - Say R1 sees W (from other session) and makes W1 Everyone must see W1 as happening after W Bulletin board - want replies to appear after messages Monotonic writes - If you write W1 then W2, then everyone sees W2 after W1 In text editor, if you save, edit, save, want to see second save second! What are the write log entries? Why not just "10am meeting with dm"? Because we can't resolve conflicts. Instead, "1-hour meeting w. dm at 9, otherwise 10, otherwise 11." So a write is: (could do with just merge-procedure, but other two are optimization) Along with a unique ID: This is really instructions for doing a write, not the written data. So the write log contains instructions in the distributed calendar program. We want all nodes to run same instructions in same order. Eventually. Each server applies updates it gets to its state Order by Lamport clock total order How do writes actually propagate? Unidirectional, peer-to-peer synchronization By what mechanism? Any way to want Plug laptop into network IR link between to palm pilots Transfer floppy between two laptops on airplane (can't use wireless) Example: <701,A>: Node A asks for meeting M1 to occur at 10am, otherwise 11am. <770,B>: Node B asks for meeting M2 to occur at 10am, otherwise 11am. Let's agree to sort by write ID (e.g. <701,A>). As "writes" spread, nodes may initially apply updates in different order. Each newly seen write is merged into the log, May have to undo updates to insert older ones (e.g., B gets <701,A>) Then replay the log May cause user's view of calendar to change! But when everybody has seen all the writes, everybody will agree. Is this good enough? Lamport clocks preserve causality, but you never know if there's some write from the past you haven't seen. So all entries must be tentative forever And you have to keep the log around forever How can we allow a notion of committing a tentative entry? For an entry X to be committed, everyone must agree on: The total order of all previous committed entries. The fact that X is the next in the total order. The fact that all uncommitted entries are "after" X. (p.4) "Any mechanism that stabilizes the position of a write in the log can be used." How does Bayou actually agree on total order of committed writes? One designated "primary replica". It marks each write it receives with a permanent CSN. Commit Sequence Number. That write is committed. So a complete time stamp is CSN notifications propagates with updates using the anti-entropy algorithm The CSNs define a total order for committed writes All nodes will eventually agree on it Uncommitted writes come after all committed writes (infinite CSN) Total order must preserve order of writes originated at each node. But not necessarily order among different nodes' writes. How can system be sure it commits each node's writes in order? accept-order (p. 2) vs. causal-accept-order (p. 3) 1) Nodes actually use Lamport logical time-stamps for local TS. 2) Everybody sends updates in order. So primary sees updates in causal order, and commits them in that order. Does CSN always preserve parial "accept order" No: Sync, resolve conflict, have meeting, while primary changed order Now DB entries can be shown to user as committed. Which means everyone does (or will) agree on them. And a slow or disconnected node cannot prevent commits. Do we know know enough to show user "committed" flag with entries? Committed write isn't stable unless we've seen all prior committed writes. Bayou update propagation protocol maintains this. Propagates writes in order. When can you discard log entries (truncate log)? Can never discard uncommitted You might need to undo that far when learning of other updates Can truncate any prefix of committed operations But others might not know it has been committed, or even know the writes So better to keep a longer history if you have the space How do I propagate if I've discarded part of my log? Suppose I've discarded *all* writes with CSNs. I keep a copy of the stable DB reflecting just discarded entries. First, I know I cannot receive writes that conflict. Could only happen if write has a CSN < one discarded. But I already saw it, in the right order, so can ignore. When I propagate to node X: If node X's highest CSN is less than mine, I can send him my stable DB reflecting just committed writes. Node X can use my DB as starting point. And X can discard all CSN log entries. Then play his tentative writes into that DB. If node X's highest CSN is greater than mine, X can ignore my DB. What gets transfered in anti-entropy algorithm? Each server maintains a version vector V logical clock of last update in log of each user Creating/deleting servers Log truncation Each server maintains "omitted" vector O logical clock of most recent update from each server omitted from the log Each server keeps OSN, the sequence number of latest omitted request Evaluation Seems much more functional than Palm Pilot calendar. *Not* transparent to applications! Requires very strange programming practices. Every "write" is actually a bunch of code, not the new bits. Check conflicts (meeting already scheduled for any attendee?). Resolve conflicts (choose a different unused meeting time?). Not every application has appropriate semantics. Might actually work for a bank account! But conflicts can't always be automatically resolved. For example, changes to source code from multiple programmers.