Question: how can we reduce the amount of highly-privileged code (to
reduce the number of vulnerabilities and/or their effect)?

Why does any code run with kernel privilege in HiStar?
[ only counting .c files, some debugging code not included ]

(1) Device drivers: requires access to raw hardware, so fully trusted,
    but can move some of them (e.g. network, console?) into user-space
    when IOMMU hardware comes out and we can give them scoped hardware
    access.

        ~1900 lines of .c files for things we can move to user-space.

(2) Persistent storage (btree, write-ahead logging, disk block
    allocation): needs to be trusted to return the latest version of the
    right objects; could use signing and hash-trees to ensure correctness
    but not liveness.

        2432 lines of .c files for btree
         575 lines of .c files for write-ahead logging and disk allocation
         461 lines of .c files for disk driver
        Total: 3468 lines

(3) Hardware multiplexing: PCI bridges, page tables, preemption
    interrupts, physical memory allocation: fundamentally fully-trusted?

         350 lines of .c files for PCI, IRQ, clock
         874+ lines of .c files for page tables, trap handling, memory alloc
        Total: 1224+ lines

(4) Kernel object management, labels: fundamentally trusted to multiplex
    labeled objects, correctly snapshot them, etc.

        2229 lines of .c files

(5) Kernel object semantics / system call interface: provides a MAC-safe
    interface to persistent objects.

         936 lines for everything (syscall.c -- dispatch)
         642- lines for address spaces
         499- lines for threads
          88 lines for segments
         260 lines for containers
        Total: 2425-- lines

===

How can we make things smaller?

    (1) 1900 LOC, move to user-space, introduce protected HW access code
        into kernel
    (2) 3468 LOC, ???
    (3) 1224+ LOC, fully trusted
    (4) 2229 LOC, fully trusted
    (5) 2425- LOC, can potentially split up?

So (1) drivers, (2) btrees and (5) syscall API are code that might not
need to be fully trusted.  (1) is almost obvious, given the hardware support.
(2) is a lot of code, but no real ideas for what to do with it yet.
(5) we can potentially partially move into an intermediate ring.

===

Moving device-drivers (1) into user-space:

- Add a device kernel object, logically associated with a PCI device.

- Each device object has an IOMMU page table.  The kernel object points
  to an address space object that should be used to fill out the page
  table.  Because the IOMMU doesn't generate page faults, the page table
  must always be fully populated, which our current code doesn't
  implement.  The page table is filled out as if on behalf of a thread
  running with the label of the device object.

- Each device object has an interrupt; a driver can issue a syscall to
  wait for the next interrupt on a device, with a generation number
  a la netdev to avoid lost wakeups (must be able to read/write device
  object).

- Each device object has some IO space associated with it.  System calls
  to read/write IO ports associated with a device -- simpler than doing
  IOPB (must be able to read/write device object).  Potential problem if
  we have timing-sensitive devices (then would need direct IOPB, and some
  scheduling support), but that seems unlikely.

- Interrupts: can mask individual interrupts on old PIC.  APIC allows
  masking contiguous groups of 16 interrupts, so using an APIC we can
  only have one un-ACKed interrupt at a time.  Minix3 does individual
  interrupt masking using the old i8259 PIC.

This will allow us to get rid of the kernel network-device API, along
with network drivers.  Could move console driver into user-space, but
that makes things difficult to debug.

One problem: how does this interact with persistence?  Presumably all
device objects are invalidated on bootup, and a monitoring process of
some sort is needed to create some sort of base PCI object, find all
newly-present devices, create device objects for them, kill old drivers
and start new drivers.

Resource exhaustion issue with IO device page tables (cannot be GCed).
===

Splitting up (5) system call API / object semantics:

- See mac-on-dac.txt

===

Btrees (2): it seems like even if we don't trust the btree code, we need
a way to determine whether an object given to us by the untrusted btree
code is consistent and fresh.  Consistency can be solved by signatures,
but freshness, for practical purposes, requires a trusted mapping from
object ID and its signature to one of { fresh, stale }.  ("Practical"
here excludes schemes which would modify the signature of every object
every time a snapshot is taken.)

Such a mapping would be about as large as the mapping maintained by the
untrusted btree (object ID space mapping to disk offset + length), and as
a result doesn't reduce the complexity by much.  One possible difference
is that the trusted mapping only needs to be a set, with the ability to
query, add and remove members, whereas the untrusted mapping needs to
look up values.

A bad way of implementing the trusted index, which would reduce
complexity, would be a Bloom filter.  However, the trusted mapping should
be precise, and a Bloom filter admits false positives.

Possible solution: fully-trusted btree reader (reasonably simple, going
along the btree chain to the leaf node), untrusted btree mutation ops
("nilpotent" rebalancing, plus insert, change and delete), with the btree
mutations checked by the btree reader.  The reader would need to be more
complicated, in order to declare two btrees identical based on observing
just the modified intermediate nodes (for "nilpotent" ops), and to check
that only one value was added / modified / deleted (for actual change
ops).
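
Sketch of the "only one value was added / modified / deleted" check.
This is a deliberate simplification: it compares the flattened, sorted
leaf contents before and after an untrusted mutation, whereas the reader
described above would have to reach the same verdict from just the
modified btree nodes.  All names here (struct kv, classify_mutation, the
diff_kind values) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct kv { uint64_t key; uint64_t val; };

enum diff_kind { DIFF_NONE, DIFF_INSERT, DIFF_DELETE, DIFF_CHANGE, DIFF_INVALID };

static int kv_eq(const struct kv *a, const struct kv *b)
{
    return a->key == b->key && a->val == b->val;
}

/* Keys must be strictly increasing, as in a btree's leaf order. */
static int strictly_sorted(const struct kv *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (a[i - 1].key >= a[i].key)
            return 0;
    return 1;
}

/* True if `longer` (n + 1 entries) is `shorter` (n entries) plus one entry. */
static int one_extra(const struct kv *shorter, size_t n, const struct kv *longer)
{
    size_t i = 0;
    while (i < n && kv_eq(&shorter[i], &longer[i]))
        i++;
    /* longer[i] is the extra entry; the tail must match shifted by one. */
    for (; i < n; i++)
        if (!kv_eq(&shorter[i], &longer[i + 1]))
            return 0;
    return 1;
}

/* Classify the difference between the leaf contents before and after. */
enum diff_kind classify_mutation(const struct kv *before, size_t nbefore,
                                 const struct kv *after, size_t nafter)
{
    if (!strictly_sorted(before, nbefore) || !strictly_sorted(after, nafter))
        return DIFF_INVALID;
    if (nafter == nbefore + 1)
        return one_extra(before, nbefore, after) ? DIFF_INSERT : DIFF_INVALID;
    if (nbefore == nafter + 1)
        return one_extra(after, nafter, before) ? DIFF_DELETE : DIFF_INVALID;
    if (nbefore != nafter)
        return DIFF_INVALID;

    size_t changed = 0;
    for (size_t i = 0; i < nbefore; i++) {
        if (before[i].key != after[i].key)
            return DIFF_INVALID;        /* key set changed without a size change */
        if (before[i].val != after[i].val)
            changed++;
    }
    if (changed == 0)
        return DIFF_NONE;               /* what a "nilpotent" op should yield */
    return changed == 1 ? DIFF_CHANGE : DIFF_INVALID;
}
```

A rebalancing op would be accepted only if it classifies as DIFF_NONE,
and an insert, change or delete only if it classifies as the matching
single-entry difference.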