Question: how can we reduce the amount of highly-privileged code (to reduce the
number of vulnerabilities and/or their effect)?

Why does any code run with kernel privilege in HiStar?

[ only counting .c files, some debugging code not included ]

(1) Device drivers: require access to raw hardware, so fully trusted, but
    we can move some of them (e.g. network, console?) into user-space once
    IOMMU hardware comes out and we can give them scoped hardware access.

    ~1900 lines of .c files for things we can move to user-space.

(2) Persistent storage (btree, write-ahead logging, disk block allocation):
    needs to be trusted to return the latest version of the right objects;
    could use signing and hash-trees to ensure correctness but not liveness.

    2432 lines of .c files for btree
     575 lines of .c files for write-ahead logging and disk allocation
     461 lines of .c files for disk driver

    Total: 3468 lines

(3) Hardware multiplexing: PCI bridges, page tables, preemption interrupts,
    physical memory allocation: fundamentally fully-trusted?

     350 lines of .c files for PCI, IRQ, clock
    874+ lines of .c files for page tables, trap handling, memory alloc

    Total: 1224+ lines

(4) Kernel object management, labels: fundamentally trusted to multiplex
    labeled objects, correctly snapshot them, etc.

    2229 lines of .c files

(5) Kernel object semantics / system call interface: provides a MAC-safe
    interface to persistent objects.

    936 lines for everything (syscall.c -- dispatch)
    642- lines for address spaces
    499- lines for threads
    88 lines for segments
    260 lines for containers

    Total: 2425-- lines

===

How can we make things smaller?
(1) 1900 LOC, move to user-space, introduce protected HW access code into kernel
(2) 3468 LOC, ???
(3) 1224+ LOC, fully trusted
(4) 2229 LOC, fully trusted
(5) 2425- LOC, can potentially split up?

So (1) drivers, (2) btrees and (5) syscall API are code that might not need to
be fully trusted.

(1) is almost obvious, given the hardware support.

(2) is a lot of code, but no real ideas for what to do with it yet.

(5) we can potentially partially move into an intermediate ring.

===

Moving device-drivers (1) into user-space:

 - Add a device kernel object, logically associated with a PCI device.

 - Each device object has an IOMMU page table.  The kernel object points
   to an address space object that should be used to fill out the page
   table.  Because the IOMMU doesn't generate page faults, the page table
   must always be fully populated, something that our current code
   doesn't implement.  The page table is filled out as if on behalf of a
   thread running with the label of the device object.

 - Each device object has an interrupt; a thread can issue a syscall to
   wait for the next interrupt on a device, with a generation number
   (a la netdev) to avoid lost wakeups (must be able to read/write the
   device object).

 - Each device object has some IO space associated with it.  System calls
   to read/write IO ports associated with a device -- simpler than doing
   IOPB (must be able to read/write device object).  Potential problem if
   we have timing-sensitive devices (then would need direct IOPB, and some
   scheduling support), but that seems unlikely.

 - Interrupts: can mask individual interrupts on the old i8259 PIC.  The
   APIC allows masking only contiguous groups of 16 interrupts, so using
   an APIC we can only have one un-ACKed interrupt at a time.  Minix3
   does individual interrupt masking using the i8259.
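
The generation-number idea for interrupt waits can be sketched as follows.
This is only an illustration of how a counter avoids lost wakeups; the
struct and function names (device_obj, device_irq_arrived, device_irq_poll)
are hypothetical and not taken from the HiStar source, and the actual
blocking of the thread is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device state; illustrative only. */
struct device_obj {
    uint64_t irq_gen;   /* incremented once per interrupt received */
};

/* Would be called from the kernel's interrupt path for this device. */
void device_irq_arrived(struct device_obj *dev)
{
    assert(dev != NULL);
    dev->irq_gen++;
}

/*
 * Non-blocking core of a "wait for interrupt" syscall.  The caller passes
 * in the last generation it has handled.  Returns true (and updates *seen)
 * if any interrupt arrived since then; returns false if nothing is new, in
 * which case a real kernel would put the thread to sleep.
 */
bool device_irq_poll(const struct device_obj *dev, uint64_t *seen)
{
    assert(dev != NULL && seen != NULL);
    if (dev->irq_gen == *seen)
        return false;
    *seen = dev->irq_gen;
    return true;
}
```

Because the driver compares against the generation it last saw, an
interrupt that fires between two wait calls is still reported on the next
call instead of being lost.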

This will allow us to get rid of the kernel network-device API, along
with network drivers.  We could also move the console driver into
user-space, but that would make things difficult to debug.

One problem: how does this interact with persistence?  Presumably all
device objects are invalidated on bootup, and a monitoring process is
needed to create a base PCI object, find all newly-present devices,
create device objects for them, kill old drivers and start new drivers.

Open issue: resource exhaustion with IO device page tables, which cannot
be GCed.

===

Splitting up (5) system call API / object semantics:

 - See mac-on-dac.txt

===

Btrees (2): it seems like even if we don't trust the btree code, we need
a way to determine whether an object given to us by the untrusted btree
code is consistent and fresh.  Consistency can be solved by signatures,
but freshness, for practical purposes, requires a trusted mapping from
object ID and its signature to one of { fresh, stale }.  ("Practical"
here excludes schemes which would modify the signature of every object
every time a snapshot is taken.)  Such a mapping would be about as large
as the mapping maintained by the untrusted btree (object ID space mapping
to disk offset + length), and as a result doesn't reduce the complexity
by much.

One possible difference is that the trusted mapping only needs to be a
set, with the ability to query, add and remove members, whereas the
untrusted mapping needs to look up values.  A bad way of implementing
the trusted index, which would reduce complexity, would be a Bloom filter:
a plain Bloom filter admits false positives and does not support removal,
whereas the trusted mapping has to be precise.
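
To make the precision problem concrete, here is a deliberately tiny Bloom
filter over object IDs.  The size, the two hash functions and all names are
made up for illustration; the only point is that a query can return "maybe
present" for an ID that was never added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS 16   /* deliberately tiny so collisions are easy to hit */

struct bloom {
    uint8_t bits[BLOOM_BITS / 8];
};

/* Two simple, illustrative hash functions over object IDs. */
static unsigned bloom_h1(uint64_t id) { return (unsigned)(id % BLOOM_BITS); }
static unsigned bloom_h2(uint64_t id) { return (unsigned)((id / BLOOM_BITS) % BLOOM_BITS); }

static void bloom_set(struct bloom *f, unsigned i)
{
    f->bits[i / 8] |= (uint8_t)(1u << (i % 8));
}

static bool bloom_get(const struct bloom *f, unsigned i)
{
    return ((f->bits[i / 8] >> (i % 8)) & 1u) != 0;
}

/* Add an object ID to the set. */
void bloom_add(struct bloom *f, uint64_t id)
{
    assert(f != NULL);
    bloom_set(f, bloom_h1(id));
    bloom_set(f, bloom_h2(id));
}

/* false means "definitely absent"; true only means "possibly present". */
bool bloom_maybe_contains(const struct bloom *f, uint64_t id)
{
    assert(f != NULL);
    return bloom_get(f, bloom_h1(id)) && bloom_get(f, bloom_h2(id));
}
```

With these hash functions, adding IDs 1 and 18 sets bits 0, 1 and 2, so a
query for ID 33 (which hashes to bits 1 and 2) reports it as possibly
present even though it was never added.  For a freshness check that would
mean accepting a stale object.  There is also no safe way to remove a
member, since clearing its bits could erase other members.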

Possible solution: a fully-trusted btree reader (reasonably simple, going
along the btree chain to the leaf node) and untrusted btree mutation ops
("nilpotent" rebalancing, plus insert, change and delete), with the
mutations checked by the btree reader.  The reader would then need to be
more complicated: it must be able to declare two btrees identical based
on observing just the modified intermediate nodes (for "nilpotent" ops),
and to verify that only one value was added / modified / deleted (for
actual change ops).
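
As a rough idea of how small the read-only part could be, here is a sketch
of a root-to-leaf lookup.  It assumes an in-memory node layout with child
pointers and a fixed fan-out; the struct, field names and BT_MAX_KEYS are
hypothetical and do not reflect the actual on-disk btree format, and the
mutation-checking logic discussed above is not included.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BT_MAX_KEYS 4   /* assumed fan-out, for illustration only */

struct bt_node {
    bool leaf;
    int nkeys;
    uint64_t keys[BT_MAX_KEYS];
    uint64_t vals[BT_MAX_KEYS];                    /* leaf only: e.g. disk offset */
    const struct bt_node *child[BT_MAX_KEYS + 1];  /* interior only */
};

/*
 * Walk from the root down to a leaf and look up `key`.  In an interior
 * node, keys[i] is taken to be the smallest key stored under child[i+1].
 * Returns true and stores the value in *val if the key is found.
 */
bool bt_lookup(const struct bt_node *n, uint64_t key, uint64_t *val)
{
    assert(val != NULL);
    while (n != NULL) {
        /* Nodes may come from untrusted code: reject malformed counts. */
        if (n->nkeys < 0 || n->nkeys > BT_MAX_KEYS)
            return false;

        if (n->leaf) {
            for (int i = 0; i < n->nkeys; i++) {
                if (n->keys[i] == key) {
                    *val = n->vals[i];
                    return true;
                }
            }
            return false;
        }

        /* Interior node: pick the child whose key range covers `key`. */
        int i = 0;
        while (i < n->nkeys && key >= n->keys[i])
            i++;
        n = n->child[i];
    }
    return false;
}
```

A real trusted reader would follow disk block numbers rather than pointers
and would also have to check key ordering within each node, but the control
flow stays a single loop down the tree.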