Exokernel
=========

What's the overall point?
  OS abstractions get in the way of aggressive applications
  Not about performance of individual operations (e.g. system call or IPC)
  The problem is application structure
    You often just can't do what you want in an ordinary OS
  How will we be able to tell if the Exokernel approach is viable?
    They need to demonstrate an app with structure impossible in UNIX
    And they need to show the app is desirable
      I.e. *much* higher performance, or *much* more functionality

What's an abstraction?
  Typically a virtualization of some hardware resource
  Examples?
    Disk blocks vs file systems
    Phys mem vs address space / process
    CPU vs time slicing or scheduler activations
    TLB entries vs address spaces
    Frame buffer vs windows
    Ethernet frames vs TCP/IP
  Why would you want them?
    More convenient API
    Allows sharing (files, TCP ports)
    Helps w/ composable, re-usable, general-purpose applications
      e.g. "standard output" and UNIX pipes
    Helps make applications portable to different hardware
    Mediate/protect shared resources
      Apps don't get direct hardware access
      OS mediates all accesses to enforce protection
      E.g. files have owners; apps can't directly read the disk
      This is the only deep reason!

Why is it hard to get rid of abstractions?
  I.e. why can't we move most of the OS into libraries?
  Because kernel abstractions help w/ the tension between sharing and protection
  Example: TLB access rather than address spaces
    OS would need to check each insertion
    So OS needs to know who owns phys mem
    Seems easy
  Example: disk access rather than a file system
    How to enforce file protections?
    How to ensure meta-data integrity?
    Seems hard

OK, let's design a high-performance application
  See whether we run into trouble with UNIX abstractions
  Let's DMA data directly from the disk buffer cache to the net
  Or stream data at full speed from disk to net

What actually happens on UNIX in a simple web server
  SYN, SYN/ACK, ACK -- now tell the process
  Request arrives, is ACKed
  Copy request data -> process
  open() may block..
    UNIX directory structure has some O(N) problems
  read() from the file
    disk -> buffer cache (maybe)
    buffer cache -> application (always)
  write() to the TCP connection
    application -> mbufs in the TCP retransmit queue
    TCP must keep a copy for possible re-transmission
    Packetization may be different from disk block-ization
  TCP segments, computes checksum, sends -> net
  As ACKs arrive, TCP sends more
    May decide to retransmit, re-computes checksum
  Figure 3 shows about 700 requests per second (from cache)

What Cheetah does (Section 7.3)
  Avoid copies
    Just disk -> cache -> net
    If a packet is lost, retransmit out of the disk cache
  Store a TCP checksum per block in the file
    I.e., the file format is a bit like the packet format
    Avoids checksum costs
    Avoids *re*-computing the checksum on retransmit
  Intelligent ACK merging
    ACK for the request goes out with the first data packet
  Intelligent clustering on disk
    GIFs with the pages that use them; inodes near data
  Cheetah pre-fetches intelligently

How do we know if Cheetah is a good idea?
  Performance data in Figure 3
  Result: 8000 requests per second -- a factor of 10 faster
  From cache or disk? (must be cache)
  Same document over and over, or a distribution?
  Why such a speedup for 0-byte docs?
    Not due to HTML-based file grouping
    Probably not due to copy avoidance
    Probably not due to checksum avoidance (no data...)
    Maybe due to eliminating one ACK..
  Can we explain the performance increase for 100-kbyte docs?
    In terms of memory copies avoided? (see the sketches below)
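To make the copy tally concrete, here is a minimal sketch (not from the paper) of the inner loop of a conventional UNIX server; the file and socket descriptors are assumed to already be open, and BUFSZ is an invented name. Every iteration crosses the user/kernel boundary twice, and the data is copied at each crossing no matter how the application is written:

```c
/* Minimal sketch, not from the paper: the data path of a plain
 * UNIX web server sending one file down one TCP connection.
 * fd is an open file, sock a connected socket. */
#include <unistd.h>

#define BUFSZ 8192                    /* hypothetical buffer size */

static int serve_file(int fd, int sock)
{
    char buf[BUFSZ];
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* copy 1: disk -> kernel buffer cache (on a miss)
         * copy 2: buffer cache -> buf (always, inside read()) */
        ssize_t off = 0;
        while (off < n) {
            ssize_t m = write(sock, buf + off, n - off);
            /* copy 3: buf -> kernel mbufs on the TCP retransmit
             * queue; TCP also checksums every byte here, and
             * re-checksums on any retransmission */
            if (m < 0)
                return -1;
            off += m;
        }
    }
    return n < 0 ? -1 : 0;
}
```

The point is structural: no rearrangement of read() and write() can eliminate copies 2 and 3, because UNIX's file and socket abstractions hide the buffer cache and the mbufs from the application.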
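And a sketch of the per-block checksum trick, assuming the standard Internet one's-complement checksum and an invented on-disk layout where each block carries its precomputed sum; Cheetah's actual format is not spelled out in the notes, so this is illustrative only:

```c
/* Illustrative sketch, not Cheetah's actual on-disk format:
 * precompute the Internet checksum of each file block when the
 * file is written, so sending -- and retransmitting -- never has
 * to scan the data again. */
#include <stdint.h>
#include <stddef.h>

struct file_block {
    uint32_t csum;                /* precomputed partial checksum */
    uint8_t  data[4096];          /* block sized to fit TCP segments */
};

/* Standard one's-complement partial sum over a buffer. */
static uint32_t partial_csum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    return sum;
}

/* At file-write time: store the block's checksum with the block. */
static void block_written(struct file_block *b)
{
    b->csum = partial_csum(b->data, sizeof b->data);
}

/* At send (or retransmit) time: fold the stored sum into the sum
 * already computed over the TCP header and pseudo-header, instead
 * of re-reading 4 KB of payload.  Valid because the one's-
 * complement sum is additive, provided the block lands at an even
 * byte offset in the segment. */
static uint16_t tcp_csum(uint32_t hdr_and_pseudo_sum,
                         const struct file_block *b)
{
    uint32_t sum = hdr_and_pseudo_sum + b->csum;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This only pays off if packetization lines up with block-ization, which is exactly why the notes say the file format ends up looking a bit like the packet format.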
What facilities does Cheetah need from the OS?
  User-level TCP, and thus low-level access to packet I/O
  Control over memory
    At least to avoid copies: disk -> kbuf -> user -> mbuf -> net
  Direct access to the disk cache
  Async read, including of meta-data
  Needs to control disk layout

Why are these facilities hard to provide in UNIX?
  User-level TCP: protection/sharing
    No raw access to incoming packets
  Disk layout: can't let apps have direct disk access
  Typically the problem is protection of shared resources

What's the exokernel's general approach?
  Move as much as possible to OS libraries
    Libraries are easy to customize, and may be faster than system calls
  Separate protection and management
    Kernel just protects, lets apps manage
    Expose allocation, physical names, revocation, information

Let's design an exokernel network system
  Goal: support user-level TCP
  Can we just hand all incoming packets to any program that wants them?
    I.e. just expose the raw hardware
    No: I might see your packets
      (Actually this is probably OK; any secure protocol encrypts...)
    Also bad for performance
  I tell the kernel what dst port I want
    Kernel accepts if no other app wants that port
    Rejects if some other app does
    So the kernel implements just the port abstraction, not TCP &c
    This gets us first-come-first-served port access
  Can generalize to patterns, not just ports: the exokernel's DPF does this
    By downloading pattern-matching code into the kernel (see the sketch below)
    DPF dynamically compiles the pattern-matching code for fast demultiplexing
  Where to put incoming packet data?
    Don't know which process will get it until it has arrived
    So must expose kernel network buffers to applications
  Have we separated protection from management?
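A minimal sketch of the filter idea: the application hands the kernel a small predicate over raw packet bytes, and the kernel checks it against existing filters before installing it. The function, struct-free byte offsets, and the address 18.26.4.9 are all invented for illustration; DPF's real filter language is a sequence of declarative compare/shift operations that the kernel compiles, not C:

```c
/* Illustrative sketch of demultiplexing with a downloaded filter:
 * a predicate over raw packet bytes claiming "TCP packets to dst
 * port 80 at my IP address".  Offsets assume Ethernet + IPv4 with
 * no IP options. */
#include <stdint.h>
#include <stddef.h>

static int my_http_filter(const uint8_t *pkt, size_t len)
{
    if (len < 14 + 20 + 20)
        return 0;                            /* too short */
    if (!(pkt[12] == 0x08 && pkt[13] == 0x00))
        return 0;                            /* ethertype != IPv4 */
    if (pkt[23] != 6)
        return 0;                            /* IP proto != TCP */
    /* dst IP == 18.26.4.9 (hypothetical address) */
    if (!(pkt[30] == 18 && pkt[31] == 26 && pkt[32] == 4 && pkt[33] == 9))
        return 0;
    /* TCP dst port == 80 */
    return pkt[36] == 0 && pkt[37] == 80;
}
```

Because the predicate is declarative, the kernel can compare two filters and reject or order overlapping ones; that is the protection half, while TCP itself stays in the library OS.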
How does multiplexing the disk work?
  Why is this a hard problem?
    Need to meet the ordering constraints from Soft Updates
    But also want to maximize flexibility -- e.g., let the app read data
      & metadata simultaneously, then interpret the data
  How does XN work?
    Download code into the kernel to interpret metadata
      E.g., owns(inode) -> list of file blocks, indirect blocks
            owns(indirect-block) -> list of file blocks
      Note a block in list is really
    As long as the code is deterministic, just use it to verify metadata
      Also output ACLs, etc
  What owns function would you use to implement a directory?
    owns(directory block) -> list of inodes (specifying inode type)
  What things might be hard to do with XN?
    LFS or journaling might require extending the owns function a bit
    Not obvious how to enforce quotas
    Managing files with B-trees, doing crash-consistent split/join ops
    Copy-on-write snapshotting would require XN to keep reference counts

Does the exokernel provide protection/fault isolation comparable to UNIX?
  Does UNIX even provide hard fault isolation between a user's processes?
    Not really -- ptrace (the debugger syscall) lets procs trash each
      other's mem
    So UNIX's hard isolation is often painful and useless
    So the exokernel doesn't really enforce it in this case either
  But what about, say, pipes between processes owned by different users?
    The libOS must use a more defensive implementation of pipes
    E.g., don't get confused by negative offset values / other weirdness
    "Unidirectional trust" sometimes makes things easier

Do you believe the exokernel story?
  I.e. should we bag current OSes and use exokernels and libOSes?
  Are exokernels easy to program?
  Are exokernel programs likely to be portable?
  Chaos if every program does its own abstractions?
  Are we likely to always be able to separate management from protection?
    Look at the XN file system; pretty complex

What are the lessons learned from this paper?
  Exposing kernel data structures is a big win (e.g., for wake predicates)
  Exokernel interface design is hard
    Even before the exokernel, things like scheduler activations were not
      obvious
    DPF, the buffer cache, XN, wake predicates: all non-trivial
  Information loss can put libOSes at a disadvantage
    E.g., UNIX can implement LRU paging across applications
    Solution: the exokernel can keep statistics, but leave interpretation
      to apps
    Provide space for application data in kernel structures
  Fast applications don't require good microbenchmark numbers
  Cheap critical sections are useful -- how did this work?
    Didn't actually disable interrupts
    Other kernel code could run, but not other processes
    Basically gave the proc a bit more time to run an "epilogue" before
      preempting it
  User-level page tables were very hard
  ASHes (application-specific handlers) could process packets w/ low latency
    e.g., used to get a TCP ACK packet out quickly before the process is
      scheduled
    When an ASH accesses VM, it might need an app-level fault handler
    Even w/ kernel page tables, self-paging is complicated
    ASHes might not have been necessary
      Yes, upcalls are expensive, but maybe not that expensive
  Downloaded code is powerful (see the sketch below)
    But not so much for performance reasons, like fewer upcalls
    Rather, because you can control and reason about the execution
      Check packet filters for conflicts, merge packet filters
      XN (the file system) needs to know the code is deterministic
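As a closing illustration of that last point, here is a sketch of what an XN-style owns function might look like for a classic UNIX inode; the types, names, and fixed 12-pointer layout are assumptions for illustration, not XN's actual interface. What matters is that the function is a pure, deterministic map from a metadata block to the blocks it claims, so the kernel can re-run it to verify any proposed metadata update:

```c
/* Illustrative sketch of an XN-style "owns" function, not XN's
 * real interface: deterministic code, downloaded into the kernel,
 * that maps a metadata block to the disk blocks it owns.  The
 * inode layout is a classic UNIX-style one, assumed here. */
#include <stdint.h>
#include <stddef.h>

enum blk_type { BLK_DATA, BLK_INDIRECT };

struct owned {                 /* one entry in the owns() output */
    uint32_t blockno;
    enum blk_type type;        /* tells the kernel which owns() applies next */
};

struct inode {
    uint32_t size;
    uint32_t direct[12];       /* direct data block numbers, 0 if unused */
    uint32_t indirect;         /* one indirect block, 0 if none */
};

/* owns(inode): a pure function of the inode's contents -- no I/O,
 * no global state -- so the kernel can treat its answer as the
 * truth about which blocks this inode owns. */
static size_t owns_inode(const struct inode *ip, struct owned out[13])
{
    size_t n = 0;
    for (int i = 0; i < 12; i++)
        if (ip->direct[i] != 0)
            out[n++] = (struct owned){ ip->direct[i], BLK_DATA };
    if (ip->indirect != 0)
        out[n++] = (struct owned){ ip->indirect, BLK_INDIRECT };
    return n;                  /* number of owned blocks reported */
}
```

A directory's owns function would similarly return the inodes its entries name, typed as inodes; determinism is what makes such downloaded code something the kernel can reason about rather than merely run.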