Replication and Consistency
===========================

Big picture:
  We want to replicate data for reliability.
  How do we manage replicas consistently?
  Can we replicate transparently to applications?
  Can we replicate and maintain high performance?

Example application: storing mailboxes.
  Each mailbox is a list of numbered messages.
  Operations by clients:
    M = Read(Num)
    Num = Append(M)
    Delete(Num)

Can anything go wrong in a simple single-server implementation?
  Consistency/serializability: what if two clients call Append() at the same time?
    Implement with locks.
  Atomicity: what if the system crashes in the middle of an Append() or Delete()?
    Implement with transactions and logs.
  Durability: what if the system crashes just after Append() returns?
    Implement by flushing data and logs carefully.

Why do programmers like this model?
  Hides the existence of concurrency.
  Hides failure recovery details.
  General tool, the same for many applications.

Why isn't a single-server DB good enough?
  Single point of failure.
  Want to replicate data on multiple machines.
  May help read (or even write) performance as well as availability.

Straw man replicated database:
  Imagine 2 servers, every mailbox replicated on both.
  Client sends read operations to either server.
  Client sends update operations to both; waits for one to reply.
  What can go wrong?
  1. Suppose a client does Num = Append(M); Delete(Num).
     The updates may occur in different orders on the two servers.
     Can maybe fix this for a single client by waiting for all ACKs.
     But what about two clients and concurrent Append()s?
     The servers end up with differing copies.
  2. Network partitions.
     Different updates proceed in the two halves.
     Again, the "replicas" no longer replicate each other.
  How do we know these behaviors are wrong?
    They could not have happened in a single-copy system.
    Our task is to emulate a single copy with replicas,
    so that simple applications work correctly.

Can fix the partition problem with voting:
  Use 2n+1 replicas if you want to survive n failures.
  Only allow operations in a partition with >= n+1 members.
  There can be at most one such partition.

Can fix the order problem with a primary copy:
  One primary; clients send operations to it.
  The primary imposes an order on concurrent operations.
  The primary tells the slaves about operations, with a sequence number.
  The slaves perform the operation, then send an ACK to the primary.

Second straw man: is this enough?
  Can the primary just wait for ACKs, then respond to the client?
  No! What if it turns out there are fewer than n slaves?
    Then the primary should abort the operation and send an error to the client.
    But some of the slaves have already performed it.
    So the slaves need to defer the actual update until the primary confirms.
  The fix requires a two-phase commit protocol, as we've already seen:
    Phase 1: The primary sends updates to the slaves.
      The slaves append the update information to a log, but don't yet perform it.
      The slaves ACK the first phase to the primary.
      The primary waits for n replies (so n+1 replicas, counting itself),
      then replies "YES" to the client.
    Phase 2: In the background, the primary tells the slaves that the commit happened.
      The slaves update the real DB from the log entry.
  What if the primary fails before ACKing the client?
    I.e. while sending phase 1 messages or collecting slave ACKs.
    If some slave got the phase 1 message, it can re-start the operation.
    But it's OK to just abort the operation -- the client hasn't seen an ACK.
  What if the primary fails after ACKing the client,
    but before sending phase 2 commit messages to all slaves?
    The operation *must* complete, because the client has seen an ACK.
    The new primary can ask the remaining replicas.
    If n+1 replicas saw (and ACKed) the phase 1 message, the new primary can safely commit.
  What if a slave fails?
    Doesn't matter -- we can just keep going as long as more than half survive.
  What about concurrent operations?
    The primary numbers them and allows them to proceed.
    The primary and slaves keep a log of all operations.
    A slave only ACKs a phase 1 message if it has seen all prior phase 1 messages.
    The primary only sends out a commit if it has committed all previous operations.
    Otherwise a reboot could lose a later op but commit an earlier one.
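The two-phase commit described above can be sketched in a few lines of in-memory Python. This is a hypothetical illustration only: the `Primary` and `Slave` classes and their method names are made up, and a real system would need RPC, stable on-disk logs, and failure handling.

```python
# Sketch of primary-copy two-phase commit (names are hypothetical).
# Phase 1: slaves log but do not apply; they ACK as a promise to commit.
# Phase 2: primary advances the commit point; everyone applies in log order.

class Slave:
    def __init__(self):
        self.log = []    # phase-1 entries, not yet applied
        self.db = {}     # the "real" database

    def prepare(self, seq, op):
        # Only ACK a phase-1 message if we've seen all prior ones.
        if seq != len(self.log):
            return False
        self.log.append(op)
        return True      # the ACK: a promise to commit when asked

    def commit(self, cp):
        # Apply log entries up to the commit point, in log order.
        for key, value in self.log[:cp]:
            self.db[key] = value

class Primary(Slave):
    def __init__(self, slaves, n):
        super().__init__()
        self.slaves = slaves   # 2n slaves + this primary = 2n+1 replicas
        self.n = n

    def update(self, key, value):
        seq, op = len(self.log), (key, value)
        self.log.append(op)
        # Phase 1: wait for n slave ACKs (n+1 copies counting our own).
        acks = sum(s.prepare(seq, op) for s in self.slaves)
        if acks < self.n:
            self.log.pop()     # abort; the client never saw an ACK
            return "ERROR"
        cp = seq + 1
        # Phase 2 (done in the background in a real system).
        for s in self.slaves:
            s.commit(cp)
        self.commit(cp)
        return "YES"
```

With two slaves and n=1, an update returns "YES" once one slave has ACKed, and every replica ends up with the same DB contents.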
  The primary's log looks like:
    Old committed operations
    <-- Commit Point (CP)
    Uncommitted operations

Reads:
  Can't just send a read to any replica.
    Must make sure it's in the majority partition;
    otherwise the read may miss a committed write in the other half.
  A read must reflect all committed updates,
    so clients have to send reads to the primary.
  A read does not wait for prior uncommitted writes to commit,
    since those calls have not yet returned to the application.

Slave log:
  If a replica ACKs phase 1:
    It can't yet write the DB,
    but it has promised to do so when asked,
    since the primary may already have ACKed the client.
  A slave should not ACK, reboot, and change its mind.
    So it must have a stable log, in the same order as the primary's log.
  When is the real on-disk DB written (from the log)?
    The primary sends the current CP along with every message.
    Slaves remember the largest CP seen.
    Real DB entries are written only when the commit point (CP) passes the log entry.
    Updates happen in the background, in log order.

As-yet unresolved issues:
  Transactions that involve multiple DB operations:
    how do we preserve serializability?
  What happens if a slave fails?
    Need to bring it back up to date.
  What happens if the primary fails?
    How to agree on the choice of a new primary?
    How to reconstruct the primary's state?

What kind of performance are we likely to get?
  Every operation involves all the servers,
  so it's likely to be *slower* than a single server.

Porcupine
=========

- Problem: an email service for many users.
  Many users receive lots of email per second
  (a billion messages a day, ~10,000 per second) --
  more than one machine can handle.
- Email operations:
  - deliver message: update the user's index of messages and store the message
  - retrieve message: look up the message and return it;
    optionally remove the message and update the user's index
- Idea: use a cluster of machines to construct a high-performance email server.
  - PCs are inexpensive.
- Alternative 1: simple partition of users over machines,
  i.e. each user's mail is stored on a particular machine.
  Problems:
  - manual configuration (doesn't scale)
  - one machine fails, many users have no email (low availability)
  - load imbalance (low performance)
  - rebalancing is hard, requires manual intervention
  - adding new servers doesn't automatically improve performance
  Why is load balance so important?
    Some users may be stuck on overloaded machines.
    Some machines are overloaded while others sit idle: wasted resources.
    It means we could have handled a higher load.
- Alternative 1 plus replication, for availability and performance:
  - replicate each user's mailbox on a few machines (availability)
  - don't need to choose the machines in advance (management)
  - read operations can happen at any replica (good performance)
  - write operations must touch every replica (bad performance)
  - writes must be serialized to maintain replication (bad performance),
    e.g. using a primary-copy scheme
- Problem: an email service with serializability is limited by the performance
  of a single machine (e.g., the primary); it cannot do a billion messages a day.
  - Delivering and retrieving a message may require modifications to the user's
    index, and therefore must be executed by the primary to get total ordering
    and therefore consistency (i.e., serializability).
- Does an email service have to mimic single-copy semantics for mailboxes? No:
  1. It's OK to re-order messages: updates commute.
  2. It's OK to deliver a message twice.
  3. It's OK to un-delete a message.
  4. It's OK to present only a subset of messages during a partial failure.
  5. It's OK to change the order from one session to another.
  6. Updates are individually atomic, so no locking.
  I.e. a "mailbox" does not need to be a coherent object;
  it's enough to have just a set of messages.
  The real operations are add/delete message.
  The only real consistency rule is "don't lose a message"; all else is optional.

Plan 3:
  When a new message arrives, copy it to 2 random servers, for availability.
  When a user reads mail, ask all reachable servers for the messages they store.
  What's good about Plan 3?
    Highly available.
    Probably load-balanced.
    Automatically starts using new servers.
    No management required.
  What's bad about Plan 3?
    It's expensive to contact every server for every user.
    It doesn't even attempt to preserve message order.

How does Porcupine fix the problems with Plan 3?
  1. Maintain affinity between each user and a few preferred servers:
     the "fragment list".
     So reading mail only needs to contact those servers.
     Need to remember the correct per-user server set despite failures --
     but it can be soft state!
     If all else fails, we can ask each server whose mail it holds.
  2. But mail delivery needs to find the fragment list.
     Fragment lists are distributed across servers, using the "user map".
     Every node has a complete copy of the (identical) user map,
     which is again soft state.
  3. What about preserving order?
     Incoming mail is stamped with the current wall-clock time.
     The clocks of different servers are roughly synchronized.
     Messages can be sorted by timestamp when the user reads mail.
  4. What about preserving consistency?
     "Total updates" eliminate the need for locks:
     the last update carries all the state you need.
     If you see two updates to the same object, just take the later one by time.
     E.g., clocks are synced to within +/- 10 ms, and users delete mailboxes
     much less often than that.

Porcupine overview -- basic ideas:
  - reduce hard state to mail fragments and some user profile state
  - approach serializability using synchronized clocks
    (clocks can also order events that are not logically related --
    e.g., deletes of the same email message on different nodes,
    because the first delete did not complete from the client's perspective)
  - updates are total (no locking)
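The "total updates" rule -- each update carries the object's whole new state plus a timestamp, and whoever sees two versions keeps the later one -- can be sketched as a last-writer-wins merge. This is a minimal illustration under the notes' assumption of roughly synchronized clocks; the function names and record shapes are invented.

```python
# Sketch of "total updates" / last-writer-wins reconciliation.
# Each stored value is (state, timestamp); no locks are needed because
# merging replicas in any order converges on the latest update.

def apply_update(store, key, state, ts):
    """Install (state, ts) for key only if ts is newer than what we hold."""
    old = store.get(key)
    if old is None or ts > old[1]:
        store[key] = (state, ts)

def merge(a, b):
    """Reconcile two replicas: for each key, keep the later update."""
    out = dict(a)
    for key, (state, ts) in b.items():
        apply_update(out, key, state, ts)
    return out
```

Note that `merge(a, b) == merge(b, a)`: the order in which replicas exchange state doesn't matter, which is exactly why no serialization is required.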
  - update operations (delivery and delete) can be performed at any node
    - if the receiving node fails, restart the update on some other node
  - updates are ordered by timestamp (this orders the delivery and delete of a message)
  - log operations with a timestamp; apply the update later
    - replicate the log to some nodes
    - each node that has the log can recover (in parallel)
    - will eventually reach consistency
      (clients can see all behavior as consistent with SMTP)
  - garbage-collect a log entry when:
    - the log is replicated successfully in n places and the update has been
      performed locally, or
    - the entry is a week old
  - retrieve email at any node; conceptually, scan all disks for fragments
    - detect replicas by observing the same UID
    - messages on unavailable disks are unavailable
  - Improve the performance of the retrieve operation:
    - keep soft state for:
      - user map: maps a user name to a node
      - fragment list: the node named in the user map keeps a list of
        the nodes that hold that user's mail fragments
      - membership list: records which nodes are up; enables quick recovery, etc.
    - recovery: reconstruct the soft state, if necessary
    - maintaining the soft state reduces the performance of update operations!
  - How widely should the mail fragments be replicated?
    - on enough nodes to deal with reasonable failure scenarios
    - on enough nodes to deal with load imbalance
      (load metrics: free space on disk, number of outstanding disk ops)
    - but not too widely -- bad for performance

What happens when mail is delivered?
  The mail arrives at a randomly chosen server.
  The user map finds the server holding the user's fragment list.
  The fragment list tells us where to store/replicate.
  We can store it somewhere else
    if the preferred servers are down, out of disk, or overloaded --
    just add the new server to the fragment list.
  How does Porcupine make sure data is replicated properly,
    i.e. if some replica servers can't be contacted (or are slow)?
    Log the updates and keep trying.

What happens when a user reads mail?
  Again, the user contacts any server.
  Find the fragment list; gather the fragments,
    ordinarily only from a few preferred nodes.
  Sort by timestamp; eliminate duplicates.
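The read path just described -- gather fragments from the reachable servers, drop replicas, sort by delivery timestamp -- might look like the following sketch. The fragment format (`(uid, ts, body)` tuples) and the idea of representing each server as a callable are invented for illustration.

```python
# Sketch of reading a mailbox in a Porcupine-like design: collect message
# fragments from the servers on the user's fragment list, eliminate
# replicas (messages with the same UID), and sort by wall-clock timestamp.

def read_mailbox(fragment_servers):
    seen = {}
    for server in fragment_servers:        # only the reachable, preferred nodes
        for uid, ts, body in server():     # server() returns its fragment
            if uid not in seen:            # same UID => a replica; keep one copy
                seen[uid] = (ts, uid, body)
    # Roughly synchronized timestamps give the presentation order.
    return [body for ts, uid, body in sorted(seen.values())]
```

If a fragment server is unreachable, it is simply omitted from the list: per the relaxed semantics above, presenting a subset of messages during a partial failure is acceptable.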
What happens during recovery?
  The soft state has to be correct --
    it's not just a hint, since it's not cheap to verify.
    If the user maps don't agree, a user may have multiple (different) fragment lists.
    If a fragment list isn't correct, a fragment may exist but never be read.
  The nodes must agree on the new set of nodes:
    First, choose a server to coordinate the recovery.
    The coordinator decides which servers are live,
    and computes a new user map with minimum change from the old one.
    It tells every node the new user map.
  Then the servers help each other reconstruct the fragment lists:
    Entries in the user map are timestamped, to know which ones are new.
    Hard state is stored in directories corresponding to user-map hash buckets.
    Scan the relevant directories, and forward state to the new nodes
    in the user-map table.
    For a replicated user profile, only the live node with the highest
    IP address sends.

Are these techniques widely applicable?
  I.e. could you use them for, e.g., a replicated NFS server?

How fast *should* such a system be?
  I.e. messages per second per server.
  Limited by disk seeks? Messages must be permanently stored,
    so at least one seek per message delivered, retrieved, or deleted.
  But it sounds like the DB might be written as well as the log,
    so three disk I/Os per message.
  Assume 15 ms per disk seek: 66 I/Os per second,
    so 22 messages per second per server.
  What would we expect if every message were replicated (i.e. two copies)?
    Delivery and deletion are two seeks on two servers,
    but the slower of the two seeks will dominate.
    Retrieval is still one seek -- we just need one copy.
    So expect a bit more than half as fast, roughly.
    Though surely we could use logs for the replicas...
  How would you build Porcupine to reduce the number of seeks?
    LFS would eliminate seeks on writes -- but not on retrieval.
    Some variation of LFS? Maybe the cleaner could sort by user ID.
    Or rewrite all of a user's mail messages when they total less than
    a segment size? But then there's an extra seek to read them.

How fast is Porcupine?
  (Does the synthetic workload balance delivery and retrieval? Probably.)
  No replication: 26 messages / server / second.
  Replication: 10 m/s/s.
  Sendmail: 10 m/s/s.
  What do they claim the bottleneck is? Disk seeks.
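The back-of-the-envelope seek budget above checks out arithmetically (assuming, as the notes do, 15 ms per random seek and three disk I/Os per message):

```python
# Seek-limited throughput estimate from the notes (assumed numbers).
SEEK_MS = 15                       # assumed time per random disk seek
ios_per_sec = 1000 / SEEK_MS       # ~66 random I/Os per second per disk
ios_per_msg = 3                    # log write + DB write + retrieval seek
msgs_per_sec = ios_per_sec / ios_per_msg

print(int(ios_per_sec))    # 66
print(int(msgs_per_sec))   # 22
```

The measured 26 messages/server/second without replication is close to this 22/second estimate, which supports the claim that disk seeks are the bottleneck.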