Exokernel
=========

What's the overall point?
  OS abstractions get in the way of aggressive applications
  Not about performance of individual operations (e.g. system call or IPC)
  The problem is application structure
    You often just can't do what you want in an ordinary OS
  How will we be able to tell if the Exokernel approach is viable?
    They need to demonstrate an app with structure impossible in UNIX
    And they need to show the app is desirable
      I.e. *much* higher performance, or *much* more functionality

What's an abstraction?
  Typically a virtualization of some hardware resource
  Examples?
    Disk blocks vs file systems
    Phys mem vs address space / process
    CPU vs time slicing or scheduler activations
    TLB entries vs address spaces
    Frame buffer vs windows
    Ethernet frames vs TCP/IP
  Why would you want them?
    More convenient API
    Allows sharing (files, TCP ports)
    Helps w/ composable, re-usable, general-purpose applications
      e.g. "standard output" and UNIX pipes
    Helps make applications portable to different hardware
    Mediate/protect shared resources
      Apps don't get direct hardware access
      OS mediates all accesses to enforce protection
      E.g. files have owners; apps can't directly read the disk
      This is the only deep reason!

Why is it hard to get rid of abstractions?
  I.e. why can't we move most of the OS into libraries?
  Because kernel abstractions help w/ the tension between sharing and protection
  Example: TLB access rather than address spaces
    OS would need to check each insertion
    So OS needs to know who owns phys mem
    Seems easy
  Example: disk access rather than a file system
    How to enforce file protections?
    How to ensure meta-data integrity?
    Seems hard

OK, let's design a high-performance application
  See whether we run into trouble with UNIX abstractions
  Let's DMA data directly from the disk buffer cache to the net
  Or stream data at full speed from disk to net

What actually happens on UNIX in a simple web server
  SYN, SYN/ACK, ACK -- now tell the process
  Request arrives, is ACKed
  Copy request data -> process
  open() may block..
    UNIX directory structure has some O(N) problems
  read() from the file
    disk -> buffer cache (maybe)
    buffer cache -> application (always)
  write() to the TCP connection
    application -> mbufs in the TCP retransmit queue
    TCP must keep a copy for possible re-transmission
    Packetization may be different from disk block-ization
  TCP segments, computes checksum, sends -> net
  As ACKs arrive, TCP sends more
    May decide to retransmit, re-computes checksum
  Figure 3 shows about 700 requests per second (from cache)

What Cheetah does (Section 7.3)
  Avoid copies
    Just disk -> cache -> net
    If a packet is lost, retransmit out of the disk cache
  Store a TCP checksum per block in the file
    I.e., the file format is a bit like the packet format
    Avoids checksum costs
    Avoids *re*-computing the checksum on retransmit
  Intelligent ACK merging
    ACK for the request goes out with the first data packet
  Intelligent clustering on disk
    GIFs with the pages that use them; inodes near data
  Cheetah pre-fetches intelligently

How do we know if Cheetah is a good idea?
  Performance data in Figure 3
  Result: 8000 requests per second -- a factor of 10 faster
  From cache or disk? (must be cache)
  Same document over and over, or a distribution?
  Why such a speedup for 0-byte docs?
    Not due to HTML-based file grouping
    Probably not due to copy avoidance
    Probably not due to checksum avoidance (no data...)
    Maybe due to eliminating one ACK..
  Can we explain the performance increase for 100-kbyte docs?
    In terms of memory copies avoided? (see the sketches below)
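To make the copy tally concrete, here is a minimal sketch (not from the paper) of the inner loop of a conventional UNIX server; the file and socket descriptors are assumed to already be open, and BUFSZ is an invented name. Every iteration crosses the user/kernel boundary twice, and the data is copied at each crossing no matter how the application is written:

```c
/* Minimal sketch, not from the paper: the data path of a plain
 * UNIX web server sending one file down one TCP connection.
 * fd is an open file, sock a connected socket. */
#include <unistd.h>

#define BUFSZ 8192                    /* hypothetical buffer size */

static int serve_file(int fd, int sock)
{
    char buf[BUFSZ];
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* copy 1: disk -> kernel buffer cache (on a miss)
         * copy 2: buffer cache -> buf (always, inside read()) */
        ssize_t off = 0;
        while (off < n) {
            ssize_t m = write(sock, buf + off, n - off);
            /* copy 3: buf -> kernel mbufs on the TCP retransmit
             * queue; TCP also checksums every byte here, and
             * re-checksums on any retransmission */
            if (m < 0)
                return -1;
            off += m;
        }
    }
    return n < 0 ? -1 : 0;
}
```

The point is structural: no rearrangement of read() and write() can eliminate copies 2 and 3, because UNIX's file and socket abstractions hide the buffer cache and the mbufs from the application.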
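And a sketch of the per-block checksum trick, assuming the standard Internet one's-complement checksum and an invented on-disk layout where each block carries its precomputed sum; Cheetah's actual format is not spelled out in the notes, so this is illustrative only:

```c
/* Illustrative sketch, not Cheetah's actual on-disk format:
 * precompute the Internet checksum of each file block when the
 * file is written, so sending -- and retransmitting -- never has
 * to scan the data again. */
#include <stdint.h>
#include <stddef.h>

struct file_block {
    uint32_t csum;                /* precomputed partial checksum */
    uint8_t  data[4096];          /* block sized to fit TCP segments */
};

/* Standard one's-complement partial sum over a buffer. */
static uint32_t partial_csum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    return sum;
}

/* At file-write time: store the block's checksum with the block. */
static void block_written(struct file_block *b)
{
    b->csum = partial_csum(b->data, sizeof b->data);
}

/* At send (or retransmit) time: fold the stored sum into the sum
 * already computed over the TCP header and pseudo-header, instead
 * of re-reading 4 KB of payload.  Valid because the one's-
 * complement sum is additive, provided the block lands at an even
 * byte offset in the segment. */
static uint16_t tcp_csum(uint32_t hdr_and_pseudo_sum,
                         const struct file_block *b)
{
    uint32_t sum = hdr_and_pseudo_sum + b->csum;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This only pays off if packetization lines up with block-ization, which is exactly why the notes say the file format ends up looking a bit like the packet format.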
What facilities does Cheetah need from the OS?
  User-level TCP, and thus low-level access to packet I/O
  Control over memory
    At least to avoid copies: disk -> kbuf -> user -> mbuf -> net
  Direct access to the disk cache
  Async read, including of meta-data
  Needs to control disk layout

Why are these facilities hard to provide in UNIX?
  User-level TCP: protection/sharing
    No raw access to incoming packets
  Disk layout: can't let apps have direct disk access
  Typically the problem is protection of shared resources

What's the exokernel's general approach?
  Move as much as possible to OS libraries
    Libraries are easy to customize, and may be faster than system calls
  Separate protection and management
    Kernel just protects, lets apps manage
    Expose allocation, physical names, revocation, information

Let's design an exokernel network system
  Goal: support user-level TCP
  Can we just hand all incoming packets to any program that wants them?
    I.e. just expose the raw hardware
    No: I might see your packets
      (Actually this is probably OK; any secure protocol encrypts...)
    Also bad for performance
  I tell the kernel what dst port I want
    Kernel accepts if no other app wants that port
    Rejects if some other app does
    So the kernel implements just the port abstraction, not TCP &c
    This gets us first-come-first-served port access
  Can generalize to patterns, not just ports: the exokernel's DPF does this
    By downloading pattern-matching code into the kernel (see the sketch below)
    DPF dynamically compiles the pattern-matching code for fast demultiplexing
  Where to put incoming packet data?
    Don't know which process will get it until it has arrived
    So must expose kernel network buffers to applications
  Have we separated protection from management?
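A minimal sketch of the filter idea: the application hands the kernel a small predicate over raw packet bytes, and the kernel checks it against existing filters before installing it. The function, struct-free byte offsets, and the address 18.26.4.9 are all invented for illustration; DPF's real filter language is a sequence of declarative compare/shift operations that the kernel compiles, not C:

```c
/* Illustrative sketch of demultiplexing with a downloaded filter:
 * a predicate over raw packet bytes claiming "TCP packets to dst
 * port 80 at my IP address".  Offsets assume Ethernet + IPv4 with
 * no IP options. */
#include <stdint.h>
#include <stddef.h>

static int my_http_filter(const uint8_t *pkt, size_t len)
{
    if (len < 14 + 20 + 20)
        return 0;                            /* too short */
    if (!(pkt[12] == 0x08 && pkt[13] == 0x00))
        return 0;                            /* ethertype != IPv4 */
    if (pkt[23] != 6)
        return 0;                            /* IP proto != TCP */
    /* dst IP == 18.26.4.9 (hypothetical address) */
    if (!(pkt[30] == 18 && pkt[31] == 26 && pkt[32] == 4 && pkt[33] == 9))
        return 0;
    /* TCP dst port == 80 */
    return pkt[36] == 0 && pkt[37] == 80;
}
```

Because the predicate is declarative, the kernel can compare two filters and reject or order overlapping ones; that is the protection half, while TCP itself stays in the library OS.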
How does multiplexing the disk work?
  Why is this a hard problem?
    Need to meet the ordering constraints from Soft Updates
    But also want to maximize flexibility -- e.g., let the app read data
      & metadata simultaneously, then interpret the data
  How does XN work?
    Download code into the kernel to interpret metadata
      E.g., owns(inode) -> list of file blocks, indirect blocks
            owns(indirect-block) -> list of file blocks
      Note a block in list is really
    As long as the code is deterministic, just use it to verify metadata
      Also output ACLs, etc
  What owns function would you use to implement a directory?
    owns(directory block) -> list of inodes (specifying inode type)
  What things might be hard to do with XN?
    LFS or journaling might require extending the owns function a bit
    Not obvious how to enforce quotas
    Managing files with B-trees, doing crash-consistent split/join ops
    Copy-on-write snapshotting would require XN to keep reference counts

Does the exokernel provide protection/fault isolation comparable to UNIX?
  Does UNIX even provide hard fault isolation between a user's processes?
    Not really -- ptrace (the debugger syscall) lets procs trash each
      other's mem
    So UNIX's hard isolation is often painful and useless
    So the exokernel doesn't really enforce it in this case either
  But what about, say, pipes between processes owned by different users?
    The libOS must use a more defensive implementation of pipes
    E.g., don't get confused by negative offset values / other weirdness
    "Unidirectional trust" sometimes makes things easier

Do you believe the exokernel story?
  I.e. should we bag current OSes and use exokernels and libOSes?
  Are exokernels easy to program?
  Are exokernel programs likely to be portable?
  Chaos if every program does its own abstractions?
  Are we likely to always be able to separate management from protection?
    Look at the XN file system; pretty complex

What are the lessons learned from this paper?
  Exposing kernel data structures is a big win (e.g., for wake predicates)
  Exokernel interface design is hard
    Even before the exokernel, things like scheduler activations were not
      obvious
    DPF, the buffer cache, XN, wake predicates: all non-trivial
  Information loss can put libOSes at a disadvantage
    E.g., UNIX can implement LRU paging across applications
    Solution: the exokernel can keep statistics, but leave interpretation
      to apps
    Provide space for application data in kernel structures
  Fast applications don't require good microbenchmark numbers
  Cheap critical sections are useful -- how did this work?
    Didn't actually disable interrupts
    Other kernel code could run, but not other processes
    Basically gave the proc a bit more time to run an "epilogue" before
      preempting it
  User-level page tables were very hard
  ASHes (application-specific handlers) could process packets w/ low latency
    e.g., used to get a TCP ACK packet out quickly before the process is
      scheduled
    When an ASH accesses VM, it might need an app-level fault handler
    Even w/ kernel page tables, self-paging is complicated
    ASHes might not have been necessary
      Yes, upcalls are expensive, but maybe not that expensive
  Downloaded code is powerful (see the sketch below)
    But not so much for performance reasons, like fewer upcalls
    Rather, because you can control and reason about the execution
      Check packet filters for conflicts, merge packet filters
      XN (the file system) needs to know the code is deterministic
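As a closing illustration of that last point, here is a sketch of what an XN-style owns function might look like for a classic UNIX inode; the types, names, and fixed 12-pointer layout are assumptions for illustration, not XN's actual interface. What matters is that the function is a pure, deterministic map from a metadata block to the blocks it claims, so the kernel can re-run it to verify any proposed metadata update:

```c
/* Illustrative sketch of an XN-style "owns" function, not XN's
 * real interface: deterministic code, downloaded into the kernel,
 * that maps a metadata block to the disk blocks it owns.  The
 * inode layout is a classic UNIX-style one, assumed here. */
#include <stdint.h>
#include <stddef.h>

enum blk_type { BLK_DATA, BLK_INDIRECT };

struct owned {                 /* one entry in the owns() output */
    uint32_t blockno;
    enum blk_type type;        /* tells the kernel which owns() applies next */
};

struct inode {
    uint32_t size;
    uint32_t direct[12];       /* direct data block numbers, 0 if unused */
    uint32_t indirect;         /* one indirect block, 0 if none */
};

/* owns(inode): a pure function of the inode's contents -- no I/O,
 * no global state -- so the kernel can treat its answer as the
 * truth about which blocks this inode owns. */
static size_t owns_inode(const struct inode *ip, struct owned out[13])
{
    size_t n = 0;
    for (int i = 0; i < 12; i++)
        if (ip->direct[i] != 0)
            out[n++] = (struct owned){ ip->direct[i], BLK_DATA };
    if (ip->indirect != 0)
        out[n++] = (struct owned){ ip->indirect, BLK_INDIRECT };
    return n;                  /* number of owned blocks reported */
}
```

A directory's owns function would similarly return the inodes its entries name, typed as inodes; determinism is what makes such downloaded code something the kernel can reason about rather than merely run.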