Exokernel
=========

What's the overall point?
  O/S abstractions get in the way of aggressive applications.
  Not about performance of individual operations (e.g. system call or IPC).
  The problem is application structure:
    you often just can't do what you want in an ordinary OS.
  Examples:
    User/kernel threads -> scheduler activations.
    Flash - requires gratuitous context switches (jump through hoops).
    Appel & Li - need fast VM primitives, not available everywhere.
    Receive livelock - can't control how the kernel is allocating resources;
      also can't tailor the implementation to the application (e.g., PCBs).
    Afterburner - kernel was imposing two copies on all TCP data.
    UBM - still no best buffer cache scheme in 2000;
      even UBM is not as good as 2Q for some applications.
    XFS - changed on-disk data structures, needed a raw I/O interface.
    Resource containers - effort to allocate resources better.
  Goal: build an OS in which apps can solve these problems for themselves!

How will we be able to tell if they are right?
  We need to demonstrate an app with structure impossible in UNIX.
  And they need to show the app is desirable,
    i.e. *much* higher performance, or *much* more functionality.

What are typical OS abstractions?
  Typically a virtualization of some hardware resource.
  Examples:
    Disk blocks vs file systems.
    Phys mem vs address space / process.
    CPU vs time slicing or scheduler activations.
    TLB entries vs address spaces.
    Frame buffer vs windows.
    Ethernet frames vs TCP/IP.

Why would you want them?
  More convenient API.
  Allows sharing (files, TCP ports).
  Helps w/ composable, re-usable, general-purpose applications,
    e.g. "standard output" and UNIX pipes.
  Helps make applications portable to different hardware.
  Mediate/protect shared resources.
    Apps don't get direct hardware access.
    OS mediates all accesses to enforce protection.
    E.g. files have owners; apps can't directly read the disk.
    This is the only deep reason!

OK, let's design a high-performance application.
  See whether we run into trouble with UNIX abstractions.
  Let's DMA data directly from the disk buffer cache to the net.
  Or stream data at full speed from disk to net.

What actually happens on UNIX on a simple server?
  SYN, SYN/ACK, ACK -- now tell the process.
  Request arrives, ACKed.
  Copy request data -> process.
  open() may block...
    UNIX directory structure has some O(N) problems.
  read() from file:
    disk -> buffer cache (maybe)
    buffer cache -> application (always)
  write() to TCP connection:
    application -> mbufs in TCP retransmit queue.
    TCP must keep a copy for possible re-transmission.
    Packetization may be different from disk block-ization.
  TCP segments, computes checksum, sends -> net.
  As ACKs arrive, TCP sends more.
  May decide to retransmit, re-computes checksum.
  Figure 3 shows about 700 requests per second (from cache).

What Cheetah does. Section 7.3.
  Avoid copies. Just disk -> cache -> net. Rxmt out of disk cache.
  Store TCP checksum per block in the file.
    I.e. file format a bit like packet format.
    Avoids checksum costs.
    Avoids *re*-computing the checksum on retransmit.
  Intelligent ACK merging.
    ACK for request goes with first data packet.
  Intelligent clustering on disk.
    GIFs with pages. Inodes near data.
    Cheetah pre-fetches intelligently.
  Note: also keeps compact, special-purpose protocol control blocks (PCBs),
    since the local address and TCP port are always the same.

How do we know if Cheetah is a good idea?
  Performance data in Figure 3.
  Result: 8000 requests per second -- a factor of 10 faster.
  From cache or disk? (must be cache)
  Same document over and over, or a distribution?
  Why such a speedup for 0-byte docs?
    Not due to HTML-based file grouping.
    Probably not due to copy avoidance.
    Probably not due to checksum avoidance (no data...).
    Maybe due to eliminating one ACK...
  Can we explain the performance increase for 100-kbyte docs?
    In terms of memory copies avoided?
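  As a reminder of where those copies come from, here is a minimal C sketch
  (invented for these notes, not code from the paper) of the conventional
  UNIX serving loop; the copies listed above are marked in comments.
  The names serve_file_unix, fd, and s are hypothetical.

    #include <stdio.h>
    #include <unistd.h>

    /* Conventional server inner loop: "fd" is the open requested file,
     * "s" is the connected TCP socket. */
    void serve_file_unix(int fd, int s)
    {
        char buf[8192];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {  /* buffer cache -> buf */
            ssize_t off = 0;
            while (off < n) {
                /* buf -> mbufs on the TCP retransmit queue; TCP later
                 * checksums and copies each segment out to the NIC, and
                 * re-checksums from the mbuf copy on retransmit. */
                ssize_t m = write(s, buf + off, n - off);
                if (m < 0) { perror("write"); return; }
                off += m;
            }
        }
    }

  Cheetah's path, by contrast, is disk -> cache -> net, with the per-block
  checksums stored alongside the file.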
What facilities does Cheetah need from the OS?
  User-level TCP, and thus low-level access to packet I/O.
  Control over memory.
    At least to avoid copies disk->kbuf->user->mbuf->net.
    Direct access to the disk cache.
  Async read, including of meta-data.
  Needs to control disk layout.

Why are these facilities hard in UNIX?
  User-level TCP: protection/sharing; no raw access to incoming packets.
  Disk layout: can't let apps have direct disk access.
  Typically the problem is protection of shared resources.

What's the exokernel's general approach?
  Move as much as possible to OS libraries.
    Libraries are easy to customize, and may be faster than system calls.
  Separate protection and management.
    Kernel just protects, lets apps manage.
  Expose allocation, physical names, revocation, information.
    Collect information applications can use to implement policy,
      e.g., buffer cache LRU.
    But don't impose any policy in the kernel.

Let's design an exokernel network system.
  Goal: support user-level TCP.
  Can we just hand all incoming packets to any program that wants them?
    I.e. just expose raw hardware.
    No: I might see your packets.
    (Actually this is probably OK; any secure protocol encrypts...)
  I tell the kernel what dst port I want.
    Kernel accepts if no other app wants that port.
    Rejects if some other app does.
    So the kernel implements just the port abstraction, not TCP &c.
    This gets us first-come-first-served port access.
  Can generalize to patterns, not just ports: exokernel DPF does this,
    by downloading pattern-matching code into the kernel.
  Where to put incoming packet data?
    Don't know which process will get it until it has arrived.
    So must expose kernel network buffers to applications.
  Have we separated protection from management?

How does disk multiplexing work?
  Why is this a hard problem?
    Need to track ownership of blocks w/o dictating on-disk metadata structures.
    Need to continue to track properly even after a crash and reboot!
      Otherwise, protection might be violated after a power outage.
  What is it that we need to guarantee?
    1. Never re-use an on-disk data structure before nullifying all pointers to it.
       Otherwise, might get cross-allocated blocks after a reboot;
       won't be clear who owns a block that appears in two different files.
    2. Never write a pointer to uninitialized data.
       Otherwise, after a crash I might see a block of your deleted file.
       (Note, ordinary file systems actually have this problem.)
       Otherwise, after a crash might interpret garbage as metadata.
    3. When moving a resource, don't reset the old pointer before writing the new one.
  So how does XN guarantee 1-3 w/o imposing on-disk data structures?
    Idea: download code into the kernel to interpret data structures.
    Each on-disk data structure has a template, with three functions
      (a plain-C sketch of the owns-udf idea appears after this section):
      owns-udf - transforms metadata into the list of extents it "owns"
      acl-uf - outputs the list of principals allowed to access the data
      size-uf - just the size of the metadata
    Store templates on disk, for a persistent file system, plus a set of root blocks.
    Why UDFs? (untrusted deterministic functions)
  Is this useful? Won't everyone just use the same templates?
    Specialized applications use specialized FSes (e.g., Cheetah).
    Even if all apps use the same templates, they can use different FS implementations.
    E.g., look at XCP.
      Compatible with C-FFS, but much faster.
    Or imagine untarring a large directory.
      Traditionally requires lots of synchronous writes to preserve order.
      With XN, can delay all writes if the dir is not reachable from the persistent root,
        then sort and flush buffers before connecting it.
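  To make the template idea concrete, here is a plain-C sketch of what an
  owns-udf computes, for an invented inode layout with direct block pointers.
  XN actually downloads untrusted deterministic functions into the kernel
  rather than trusting native C code; this only illustrates the input and
  output of such a function, and the struct layout is hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical on-disk inode: 12 direct block pointers (0 = unused).
     * This layout is invented for illustration; it is not XN's format. */
    struct simple_inode {
        uint32_t size;
        uint32_t direct[12];
    };

    struct extent { uint32_t start; uint32_t len; };  /* in disk blocks */

    /* owns-udf style function: interpret the metadata bytes and report
     * which disk extents this metadata claims.  The kernel runs such a
     * deterministic, sandboxed function to check allocations without
     * understanding the file system's layout itself. */
    size_t owns_udf(const void *meta, struct extent *out, size_t max)
    {
        const struct simple_inode *ip = meta;
        size_t n = 0;
        for (int i = 0; i < 12 && n < max; i++)
            if (ip->direct[i] != 0)
                out[n++] = (struct extent){ ip->direct[i], 1 };
        return n;   /* number of extents owned by this inode */
    }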
How does the buffer cache work?
  Problem: don't want the kernel managing memory,
    but applications should be able to share the buffer cache.
    Otherwise, low performance (must fetch from disk what another app already has).
    Otherwise, inconsistency when apps have different copies of a dirty block.
  Idea: buffer cache registry, tracks which application pages hold which blocks.
    Each block is in 1 of 4 states: dirty, out-of-core, uninitialized, locked.

Do we believe this story?
  I.e. should we bag current OS's and use exokernels and lib OS's?
  Are exokernels easy to program?
  Are exokernel programs likely to be portable?
  Chaos if every program does its own abstractions?
  Are we likely to always be able to separate management from protection?
    E.g., could you implement stride scheduling on Xok?
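  For context on that last question: stride scheduling (Waldspurger's
  proportional-share algorithm) is itself simple; the exokernel question is
  whether Xok's CPU interface (allocation, revocation, exposed information)
  lets a library implement it without the kernel dictating policy.  A minimal
  sketch of the algorithm itself, in C and not tied to Xok:

    #include <stdint.h>
    #include <stddef.h>

    /* Minimal stride scheduling sketch (not Xok code).
     * Each client gets CPU time in proportion to its tickets. */
    #define STRIDE1 (1 << 20)            /* large constant for precision */

    struct client {
        uint64_t pass;                   /* virtual time of next quantum */
        uint32_t stride;                 /* STRIDE1 / tickets */
    };

    void client_init(struct client *c, uint32_t tickets)  /* tickets >= 1 */
    {
        c->stride = STRIDE1 / tickets;
        c->pass = c->stride;
    }

    /* Pick the runnable client with the smallest pass, then advance it. */
    struct client *stride_pick(struct client *clients, size_t n)  /* n >= 1 */
    {
        struct client *best = &clients[0];
        for (size_t i = 1; i < n; i++)
            if (clients[i].pass < best->pass)
                best = &clients[i];
        best->pass += best->stride;      /* charge it for the quantum it gets */
        return best;
    }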