Exokernel
=========

What's the overall point?
  O/S abstractions get in the way of aggressive applications.
  Not about performance of individual operations (e.g. system call or IPC).
  The problem is application structure:
    you often just can't do what you want in an ordinary OS.
  Examples:
    User/kernel threads -> scheduler activations.
    Flash - requires gratuitous context switches (jump through hoops).
    Appel & Li - need fast VM primitives, not available everywhere.
    Receive livelock - can't control how the kernel is allocating resources;
      also can't tailor the implementation to the application (e.g., PCBs).
    Afterburner - kernel was imposing two copies on all TCP data.
    UBM - still no best buffer cache scheme in 2000;
      even UBM is not as good as 2Q for some applications.
    XFS - changed on-disk data structures, needed a raw I/O interface.
    Resource containers - effort to allocate resources better.
  Goal: build an OS in which apps can solve these problems for themselves!

How will we be able to tell if they are right?
  We need to demonstrate an app with structure impossible in UNIX.
  And they need to show the app is desirable,
    i.e. *much* higher performance, or *much* more functionality.

What are typical OS abstractions?
  Typically a virtualization of some hardware resource.
  Examples:
    Disk blocks vs file systems.
    Phys mem vs address space / process.
    CPU vs time slicing or scheduler activations.
    TLB entries vs address spaces.
    Frame buffer vs windows.
    Ethernet frames vs TCP/IP.

Why would you want them?
  More convenient API.
  Allows sharing (files, TCP ports).
  Helps w/ composable, re-usable, general-purpose applications,
    e.g. "standard output" and UNIX pipes.
  Helps make applications portable to different hardware.
  Mediate/protect shared resources.
    Apps don't get direct hardware access.
    OS mediates all accesses to enforce protection.
    E.g. files have owners; apps can't directly read the disk.
    This is the only deep reason!

OK, let's design a high-performance application.
  See whether we run into trouble with UNIX abstractions.
  Let's DMA data directly from the disk buffer cache to the net.
  Or stream data at full speed from disk to net.

What actually happens on UNIX on a simple server?
  SYN, SYN/ACK, ACK -- now tell the process.
  Request arrives, ACKed.
  Copy request data -> process.
  open() may block...
    UNIX directory structure has some O(N) problems.
  read() from file:
    disk -> buffer cache (maybe)
    buffer cache -> application (always)
  write() to TCP connection:
    application -> mbufs in TCP retransmit queue.
    TCP must keep a copy for possible re-transmission.
    Packetization may be different from disk block-ization.
  TCP segments, computes checksum, sends -> net.
  As ACKs arrive, TCP sends more.
  May decide to retransmit, re-computes checksum.
  Figure 3 shows about 700 requests per second (from cache).

What Cheetah does. Section 7.3.
  Avoid copies. Just disk -> cache -> net. Rxmt out of disk cache.
  Store TCP checksum per block in the file.
    I.e. file format a bit like packet format.
    Avoids checksum costs.
    Avoids *re*-computing the checksum on retransmit.
  Intelligent ACK merging.
    ACK for request goes with first data packet.
  Intelligent clustering on disk.
    GIFs with pages. Inodes near data.
    Cheetah pre-fetches intelligently.
  Note: also keeps compact, special-purpose protocol control blocks (PCBs),
    since the local address and TCP port are always the same.

How do we know if Cheetah is a good idea?
  Performance data in Figure 3.
  Result: 8000 requests per second -- a factor of 10 faster.
  From cache or disk? (must be cache)
  Same document over and over, or a distribution?
  Why such a speedup for 0-byte docs?
    Not due to HTML-based file grouping.
    Probably not due to copy avoidance.
    Probably not due to checksum avoidance (no data...).
    Maybe due to eliminating one ACK...
  Can we explain the performance increase for 100-kbyte docs?
    In terms of memory copies avoided?
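  As a reminder of where those copies come from, here is a minimal C sketch
  (invented for these notes, not code from the paper) of the conventional
  UNIX serving loop; the copies listed above are marked in comments.
  The names serve_file_unix, fd, and s are hypothetical.

    #include <stdio.h>
    #include <unistd.h>

    /* Conventional server inner loop: "fd" is the open requested file,
     * "s" is the connected TCP socket. */
    void serve_file_unix(int fd, int s)
    {
        char buf[8192];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {  /* buffer cache -> buf */
            ssize_t off = 0;
            while (off < n) {
                /* buf -> mbufs on the TCP retransmit queue; TCP later
                 * checksums and copies each segment out to the NIC, and
                 * re-checksums from the mbuf copy on retransmit. */
                ssize_t m = write(s, buf + off, n - off);
                if (m < 0) { perror("write"); return; }
                off += m;
            }
        }
    }

  Cheetah's path, by contrast, is disk -> cache -> net, with the per-block
  checksums stored alongside the file.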
What facilities does Cheetah need from the OS?
  User-level TCP, and thus low-level access to packet I/O.
  Control over memory.
    At least to avoid copies disk->kbuf->user->mbuf->net.
    Direct access to the disk cache.
  Async read, including of meta-data.
  Needs to control disk layout.

Why are these facilities hard in UNIX?
  User-level TCP: protection/sharing; no raw access to incoming packets.
  Disk layout: can't let apps have direct disk access.
  Typically the problem is protection of shared resources.

What's the exokernel's general approach?
  Move as much as possible to OS libraries.
    Libraries are easy to customize, and may be faster than system calls.
  Separate protection and management.
    Kernel just protects, lets apps manage.
  Expose allocation, physical names, revocation, information.
    Collect information applications can use to implement policy,
      e.g., buffer cache LRU.
    But don't impose any policy in the kernel.

Let's design an exokernel network system.
  Goal: support user-level TCP.
  Can we just hand all incoming packets to any program that wants them?
    I.e. just expose raw hardware.
    No: I might see your packets.
    (Actually this is probably OK; any secure protocol encrypts...)
  I tell the kernel what dst port I want.
    Kernel accepts if no other app wants that port.
    Rejects if some other app does.
    So the kernel implements just the port abstraction, not TCP &c.
    This gets us first-come-first-served port access.
  Can generalize to patterns, not just ports: exokernel DPF does this,
    by downloading pattern-matching code into the kernel.
  Where to put incoming packet data?
    Don't know which process will get it until it has arrived.
    So must expose kernel network buffers to applications.
  Have we separated protection from management?

How does disk multiplexing work?
  Why is this a hard problem?
    Need to track ownership of blocks w/o dictating on-disk metadata structures.
    Need to continue to track properly even after a crash and reboot!
      Otherwise, protection might be violated after a power outage.
  What is it that we need to guarantee?
    1. Never re-use an on-disk data structure before nullifying all pointers to it.
       Otherwise, might get cross-allocated blocks after a reboot;
       won't be clear who owns a block that appears in two different files.
    2. Never write a pointer to uninitialized data.
       Otherwise, after a crash I might see a block of your deleted file.
       (Note, ordinary file systems actually have this problem.)
       Otherwise, after a crash might interpret garbage as metadata.
    3. When moving a resource, don't reset the old pointer before writing the new one.
  So how does XN guarantee 1-3 w/o imposing on-disk data structures?
    Idea: download code into the kernel to interpret data structures.
    Each on-disk data structure has a template, with three functions
      (a plain-C sketch of the owns-udf idea appears after this section):
      owns-udf - transforms metadata into the list of extents it "owns"
      acl-uf - outputs the list of principals allowed to access the data
      size-uf - just the size of the metadata
    Store templates on disk, for a persistent file system, plus a set of root blocks.
    Why UDFs? (untrusted deterministic functions)
  Is this useful? Won't everyone just use the same templates?
    Specialized applications use specialized FSes (e.g., Cheetah).
    Even if all apps use the same templates, they can use different FS implementations.
    E.g., look at XCP.
      Compatible with C-FFS, but much faster.
    Or imagine untarring a large directory.
      Traditionally requires lots of synchronous writes to preserve order.
      With XN, can delay all writes if the dir is not reachable from the persistent root,
        then sort and flush buffers before connecting it.
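  To make the template idea concrete, here is a plain-C sketch of what an
  owns-udf computes, for an invented inode layout with direct block pointers.
  XN actually downloads untrusted deterministic functions into the kernel
  rather than trusting native C code; this only illustrates the input and
  output of such a function, and the struct layout is hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical on-disk inode: 12 direct block pointers (0 = unused).
     * This layout is invented for illustration; it is not XN's format. */
    struct simple_inode {
        uint32_t size;
        uint32_t direct[12];
    };

    struct extent { uint32_t start; uint32_t len; };  /* in disk blocks */

    /* owns-udf style function: interpret the metadata bytes and report
     * which disk extents this metadata claims.  The kernel runs such a
     * deterministic, sandboxed function to check allocations without
     * understanding the file system's layout itself. */
    size_t owns_udf(const void *meta, struct extent *out, size_t max)
    {
        const struct simple_inode *ip = meta;
        size_t n = 0;
        for (int i = 0; i < 12 && n < max; i++)
            if (ip->direct[i] != 0)
                out[n++] = (struct extent){ ip->direct[i], 1 };
        return n;   /* number of extents owned by this inode */
    }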
How does the buffer cache work?
  Problem: don't want the kernel managing memory,
    but applications should be able to share the buffer cache.
    Otherwise, low performance (must fetch from disk what another app already has).
    Otherwise, inconsistency when apps have different copies of a dirty block.
  Idea: buffer cache registry, tracks which application pages hold which blocks.
    Each block is in 1 of 4 states: dirty, out-of-core, uninitialized, locked.

Do we believe this story?
  I.e. should we bag current OS's and use exokernels and lib OS's?
  Are exokernels easy to program?
  Are exokernel programs likely to be portable?
  Chaos if every program does its own abstractions?
  Are we likely to always be able to separate management from protection?
    E.g., could you implement stride scheduling on Xok?
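  For context on that last question: stride scheduling (Waldspurger's
  proportional-share algorithm) is itself simple; the exokernel question is
  whether Xok's CPU interface (allocation, revocation, exposed information)
  lets a library implement it without the kernel dictating policy.  A minimal
  sketch of the algorithm itself, in C and not tied to Xok:

    #include <stdint.h>
    #include <stddef.h>

    /* Minimal stride scheduling sketch (not Xok code).
     * Each client gets CPU time in proportion to its tickets. */
    #define STRIDE1 (1 << 20)            /* large constant for precision */

    struct client {
        uint64_t pass;                   /* virtual time of next quantum */
        uint32_t stride;                 /* STRIDE1 / tickets */
    };

    void client_init(struct client *c, uint32_t tickets)  /* tickets >= 1 */
    {
        c->stride = STRIDE1 / tickets;
        c->pass = c->stride;
    }

    /* Pick the runnable client with the smallest pass, then advance it. */
    struct client *stride_pick(struct client *clients, size_t n)  /* n >= 1 */
    {
        struct client *best = &clients[0];
        for (size_t i = 1; i < n; i++)
            if (clients[i].pass < best->pass)
                best = &clients[i];
        best->pass += best->stride;      /* charge it for the quantum it gets */
        return best;
    }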