Implementing a mail server
==========================

Let's say we wanted to implement a scalable mail server.
Each mailbox is a list of numbered messages.
Operations by clients:
  M = Read(Num)
  Num = Append(M)
  Delete(Num)

Can anything go wrong in a simple single-server implementation?
  Consistency/Serializability: What if two clients call Append() at the
    same time?  Implement with locks.
  Atomicity: What if the system crashes in the middle of an Append() or
    Delete()?  Implement with transactions and logs.
  Durability: What if the system crashes just after Append() returns?
    Implement by flushing data and logs carefully.

Why isn't a single-server DB good enough?
  Performance bottleneck.
  Single point of failure.
  Want to replicate data on multiple machines.
    May help read (or even write) performance as well as availability.

We've looked at many scalable network file systems... just use one.
One possible implementation:
  Each user has a mail directory: .../mail/HASH(user)/user/
  Store user profile, password, etc., in .../user/.profile
  Store messages in files named .../user/msg.time.server.pid
    time is # of seconds since the epoch, server is the name of the
    delivering server, pid is its process ID.
  Now have 100 SMTP/POP servers using the same network file system.
  To deliver a message, server S does the following (sketched in code
  below):
    Create file tmp.time.S.pid
    Write the new mail message to tmp.time.S.pid, fsync(tmp.time.S.pid)
    Rename(tmp.time.S.pid -> msg.time.S.pid)
    Periodically garbage collect old tmp.* files
  To retrieve/delete messages:
    Just return a sorted list of msg.time.server.pid files.
    Delete a message by deleting its file.
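A minimal sketch of the delivery path in Python, assuming the directory
layout above (MAIL_ROOT and the exact hash function are made up for
illustration).  The key points: fsync the data before the rename, and
rename is atomic within a directory, so readers see a whole message or
none of it.

    # Hypothetical sketch of deliver-via-rename; paths/names assumed.
    import hashlib, os, socket, time

    MAIL_ROOT = "/mail"   # assumed mount point of the shared file system

    def user_dir(user):
        # .../mail/HASH(user)/user/ -- hashing spreads users across dirs
        h = hashlib.md5(user.encode()).hexdigest()[:2]
        return os.path.join(MAIL_ROOT, h, user)

    def deliver(user, message_bytes):
        d = user_dir(user)
        suffix = "%d.%s.%d" % (time.time(), socket.gethostname(), os.getpid())
        tmp = os.path.join(d, "tmp." + suffix)
        msg = os.path.join(d, "msg." + suffix)
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, message_bytes)
            os.fsync(fd)      # data durable before the name appears
        finally:
            os.close(fd)
        os.rename(tmp, msg)   # atomic: all-or-nothing from readers' view
        # (As the Echo discussion below notes, some file systems also
        # need the directory fsynced for the rename itself to be durable.)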
How well will this work with the following network file systems?

AFS
  + Volume abstraction very suitable for users' mail directories
    (even supports quotas, etc.)
  + Supports migration, with cached hints, so you always find the
    volume, and the common case is fast
  - No automatic load balancing -- need to assign volumes to servers
    manually
  - No replication

Echo
  + Replication, should survive server crashes
  + Two servers for the same disk offer a CPU/hardware redundancy option
  + Metadata write-behind will save synchronous writes (absorb the file
    create and write into the rename operation)
  - Contention for tokens: need to acquire the write token on the
    directory twice for each delivery, so almost every delivery will
    require a token to be revoked from some other server
  - Servers themselves not scalable
    Can have many servers, but no AFS-like volume mechanism for migration
  o Need to modify the mail server to call fsync after the rename
    Also call fsync before -- could use forder, but recall the Vesta
    benchmark

Zebra
  + Any single storage server can crash w/o affecting functionality
  + Good storage efficiency (far more efficient than mirroring)
  - Centralized file manager is a single point of failure + bottleneck
  - All directory/metadata ops implemented by the file manager --
    remember how Zebra performed on small writes?
    For an average mail message of 4.7KB, it won't make sense to scale
    to more than 2 storage servers

Frangipani
  + Scalable to large amounts of storage
  + Can replicate everything, survive the failure of any Petal
    server/disk
  + Can simultaneously deliver mail to different users with no lock
    conflicts
  o Can recover from any Frangipani server crash in ~30 seconds
  - Contention for the lock on a directory if multiple servers
    deliver/read the same user's mail, but this probably isn't too
    serious a problem
  - When you delete messages, may have to lock many other servers' free
    maps

What about using Thor?
  + OO DB lets you implement exactly the desired interface
    (Read/Append/Delete)
    How would you do this?  User objects, message objects, ...
  + Transactional semantics very easy to program with
  + Replication survives failures
  - Somehow need to partition users among servers -- won't be automated
    Might run out of OIDs on a given server!
  - Thor is optimized for small objects -- 4.7KB messages are ill-suited
    to 8KB pages -- may have a lot of fragmentation
  - May sometimes reject a transaction that would be okay for mail
    delivery
    E.g., RAW conflicts are no big deal -- will get the message next
    time, maybe
    WAR conflicts are no big deal when just appending a message

In general, many of the replicated systems we've seen offer heavy-duty
consistency:
  Everyone will agree that any two messages A & B were delivered in the
  same order.
But we don't care how operations on different mailboxes are ordered.
We can even be sloppy with one mailbox:
  Have to support concurrent delivery.
  But POP/IMAP standards don't say you have to support concurrent
  readers.

Porcupine
=========

- Problem: email service for many users, who receive lots of email per
  second (a billion messages a day, ~10,000 per second).  More than any
  one machine can handle.
- Email operations, again:
  - deliver msg
    requires updating the user's index of messages and storing the
    message
  - retrieve message
    look up the message and return it
    optionally remove the message and update the user's index
- Idea: use a cluster of machines to construct a high-performance email
  server.
  - PCs are inexpensive
- Alternative 1: simple partition of users over machines, i.e., each
  user's mail stored on a particular machine.
  Problems:
  - manual configuration (doesn't scale)
  - one machine fails, many users get no email (low availability)
  - load imbalance (low performance)
  - rebalancing is hard, requires manual intervention
  - adding new servers doesn't automatically improve performance
  Why is load balance so important?
    Some machines may be slower than others.
    Some users may be stuck on overloaded machines.
    Some machines overloaded while others are idle: wasted resources.
    Means we could have handled a higher load.
- Alternative 2: Alternative 1 plus replication, for availability and
  performance.
  - replicate each user's mailbox on a few machines (availability)
  - don't need to choose in advance (management)
  - read operations can happen at any replica (good performance)
  - write operations must touch each replica (bad performance)
  - must serialize writes to maintain replication (bad performance),
    e.g., using the primary-copy scheme from Harp
- Problem: an email service with serializability is limited in
  performance (e.g., by the primary).  Cannot do a billion messages.
  - Delivering and retrieving a message may require modifications to
    the user's index, and therefore must be executed by the primary to
    get total ordering and therefore consistency (i.e.,
    serializability).
- Does an email service have to mimic single-copy semantics for
  mailboxes?  No:
  1. It's OK to re-order messages: updates commute.
  2. It's OK to deliver a message twice.
  3. It's OK to un-delete a message.
  4. It's OK to present only a subset of messages during a partial
     failure.
  5. It's OK to change the order from one session to another.
  6. Updates are individually atomic, so no locking.
  I.e., "mailbox" does not need to be a coherent idea.
  Enough to have just a set of messages.
  The real operations are add/delete message.
  The only real consistency rule is "don't lose a message".
  All else is optional.

Plan 3 (sketched in code below):
  When a new message arrives, copy it to 2 random servers.
    For availability.
  When a user reads mail,
    ask all reachable servers for the messages they store.
What's good about Plan 3?
  Highly available.
  Probably load-balanced.
  Automatically starts using new servers.
  No management required.
What's bad about Plan 3?
  It's expensive to contact every server for every user.
  Doesn't even attempt to preserve message order.
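A toy model of Plan 3's semantics in Python.  Everything here (the
"servers" dicts, the UID scheme) is made up for illustration; the point
is that delivery touches 2 random servers, and a read must take the
union over every reachable server, deduplicating by UID.

    # Toy Plan 3: each "server" is a dict from (user, uid) to message.
    import random, uuid

    servers = [dict() for _ in range(100)]   # 100 servers' message stores

    def deliver(user, body):
        uid = str(uuid.uuid4())              # same UID on both replicas
        for s in random.sample(servers, 2):  # 2 random servers (availability)
            s[(user, uid)] = body

    def read(user, reachable=servers):
        # Must ask *every* reachable server -- the expensive part.
        msgs = {}
        for s in reachable:
            for (u, uid), body in s.items():
                if u == user:
                    msgs[uid] = body         # replicas collapse on UID
        return msgs                          # an unordered set of messages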
How does Porcupine fix the problems with Plan 3?
  1. Maintain affinity between each user and a few preferred servers:
     the "fragment list".
     So reading mail only needs to contact those servers.
     Need to remember the correct per-user server set despite failures.
       But it can be soft state!  If all else fails, can ask each
       server for whom it holds mail.
  2. But mail delivery needs to find the fragment list.
     Fragment lists are distributed across servers, using the "user
     map".
     Every node has a complete copy of the (identical) user map,
     which is again soft state.
  3. What about preserving order?
     Incoming mail is stamped with the current wall-clock time.
     Clocks of different servers are roughly synchronized.
     Messages can be sorted by timestamp when the user reads mail.
  4. What about preserving consistency?
     "Total updates" -- eliminate the need for locks.
     Means the last update has all the state you need:
     if you see two updates of an object, just take the latest one by
     timestamp.
     E.g., clocks are synced to +/- 10ms, and users delete mailboxes
     much less frequently than that.

- basic ideas:
  - reduce hard state
    mail fragments and some user profile state
  - approach serializability using synchronized clocks (can also order
    events that are not logically related -- e.g., deletes of the same
    email message on different nodes, because the first delete didn't
    complete from the client's perspective).
  - updates are complete (no locking).
  - update operations (delivery and delete) can be performed on any node
    - if the receiving node fails, restart the update on some other node
    - updates are ordered by timestamp (can order the delete and the
      delivery of a message)
  - log operations with timestamps, apply the updates later
    - replicate the log to some nodes
    - each node that has the log can recover (in parallel)
    - will eventually reach consistency (clients can already see all
      such behavior with plain SMTP)
    - garbage collect a log entry when:
      - the log is replicated successfully in n places and the update
        has been performed locally
      - the entry is a week old
  - retrieve email on any node
    conceptually: scan all disks for fragments
    - detect replicas by observing the same UID
    - messages on unavailable disks are unavailable
  - Improve performance of the retrieve operation:
    - keep soft state for:
      - user map: maps a user name to a node
      - fragment list: the node named in the user map keeps the list of
        nodes that have the user's mail fragments
      - membership list: records which nodes are up; quick recovery, etc.
    - recovery: reconstruct soft state, if necessary
    - maintaining soft state reduces performance of update operations!
  - How widely should the mail fragments be replicated?
    - on enough nodes to handle reasonable failure scenarios
    - on enough nodes to deal with load imbalance
      load metrics: free space on disk, number of outstanding disk ops
    - not too widely -- bad for performance

What happens when mail is delivered?
  Mail arrives on a randomly chosen server.
  The user map finds the server holding the user's fragment list.
  The fragment list tells us where to store/replicate.
  Can store somewhere else
    if the preferred servers are down or out of disk or overloaded;
    just add the new server to the fragment list.

How does Porcupine make sure data is replicated properly?
  I.e., what if some replica servers can't be contacted (or are slow)?
  Log the updates, keep trying.

What happens when a user reads mail?  (Sketched in code below.)
  Again, the user contacts any server.
  Find the fragment list.
  Gather fragments.
    Ordinarily only from a few preferred nodes.
  Sort by timestamp, eliminate duplicates.
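A sketch of how the soft state fits together on the read path.  The
data structures here (Node, fragment_lists, the bucket arithmetic) are
assumptions for illustration, not the paper's exact representation.

    # Assumed shapes for Porcupine's soft state and the read path.
    import hashlib

    class Node:
        def __init__(self):
            self.fragment_lists = {}  # user -> set of node ids (soft state)
            self.fragments = {}       # user -> list of (ts, uid, body)

    NODES = {i: Node() for i in range(30)}
    USER_MAP_BUCKETS = 256            # identical user map on every node

    def user_map(user):
        # user map: hash bucket -> node id managing this user's soft state
        b = int(hashlib.md5(user.encode()).hexdigest(), 16) % USER_MAP_BUCKETS
        return b % len(NODES)         # assumed bucket-to-node assignment

    def read_mail(user):
        manager = NODES[user_map(user)]
        frag_nodes = manager.fragment_lists.get(user, set())
        msgs = {}
        for nid in frag_nodes:        # only the few preferred nodes
            for ts, uid, body in NODES[nid].fragments.get(user, []):
                msgs[uid] = (ts, body)  # duplicates collapse on UID
        # order by (loosely synchronized) wall-clock timestamp
        return [body for ts, body in sorted(msgs.values())]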
What happens during recovery?
  Soft state has to be correct.
    Not just a hint, since it's not cheap to verify.
    If user maps don't agree, a user may have multiple (different)
    fragment lists.
    If a fragment list isn't correct, a fragment may exist but never be
    read.
  Must agree on a new set of nodes:
    First, choose a server to coordinate the recovery.
    The coordinator decides which servers are live,
    and computes a new user map -- with minimum change from the old one.
    Tells every node the new user map.
  Then servers help each other reconstruct the fragment lists.
    Timestamp the entries in the user map to know which ones are new.
    Hard state is stored in directories corresponding to user-map hash
    buckets.
    Scan the relevant directories, and forward state to the new nodes
    in the user-map table.
    For a replicated user profile, only the live node with the highest
    IP address sends.

Are these techniques widely applicable?
  I.e., could you use them for, e.g., a replicated NFS server?

How fast *should* such a system be?  (Worked numbers at the end of
these notes.)
  I.e., messages per second per server.
  Limited by disk seeks?  Because messages must be permanently stored.
  At least one seek per message delivered, retrieved, or deleted.
  But it sounds like the DB might be written as well as the log,
  so three disk I/Os per message.
  Assume 15ms per disk seek: 66 I/Os per second,
  so 22 messages per second per server.
  What would we expect if every message were replicated (i.e., two
  copies)?
    Delivery and deletion each take a seek on each of two servers,
    and the slower of the two seeks will dominate.
    Retrieval is still one seek -- just need one copy.
    Expect a bit more than half as fast, roughly.
    Though surely we could use logs for the replicas...
  How would you build Porcupine to reduce the number of seeks?
    LFS would eliminate seeks on writes -- but not on retrieval.
    Some variation of LFS?  Maybe the cleaner could sort by user ID.
    Or rewrite all of a user's mail messages when they total less than
    a segment?  But then an extra seek to read them.

- Evaluation
  - synthetic workload
  - 30 nodes, 5,000,000 users
  - NVRAM for the log
  - performance (400 messages per second) far from the maximum possible
- Different solution:
  - many round-robin SMTP and POP frontends
  - a set of user mailboxes per machine
  - store mailboxes on RAID devices, each accessible from two machines
  - if the primary fails, fail over to the backup
  - allow slow migration of data to new disks

How fast is Porcupine?
  (Does the synthetic workload balance delivery and retrieval?
  Probably.)
  No replication: 26 messages / server / second.
  Replication: 10 m/s/s.
  Sendmail: 10 m/s/s.
  What do they claim the bottleneck is?  Disk seeks.
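The back-of-the-envelope estimate above, worked out.  The 15ms seek
time and 3 I/Os per message come from the notes; the rest is
arithmetic.

    # Seek-limited throughput estimate.
    seek_time = 0.015                # 15 ms per disk seek
    ios_per_sec = 1 / seek_time      # ~66 random I/Os per second per disk
    ios_per_message = 3              # message file + DB/index + log
    msgs_per_server = ios_per_sec / ios_per_message   # ~22 msgs/sec/server
    print("predicted: ~%.0f msgs/sec/server" % msgs_per_server)

    # With 2-way replication, delivery/deletion cost a seek on each of
    # two servers, so write throughput roughly halves; retrieval still
    # needs only one copy -- hence "a bit more than half as fast".
    # Measured (from the notes): 26 m/s/s unreplicated, 10 replicated.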