Nooks
=====

What is the problem this paper is addressing?
  Why drivers?
  Does this seem like a viable approach?

How does it compare to virtual machines?
  Challenge is virtualizing the kernel/driver interface, not hardware

What alternative approaches might one take?
  Code checking tools -- static analysis, or run-time tools (like Eraser)
  Safe languages
  Software fault isolation (SFI) -- kind of like safe languages

How might you combine Nooks with previous approaches?
  E.g., SFI saves you the context switches & TLB misses,
  but you still have to track object usage

Nooks is trying to find a new design point between unprotected and safe:
  - fault resistance, not fault isolation
  - design for mistakes, not abuse

What are the benefits and drawbacks of this approach?
  + works with today's code
  + performance impact possibly less than, say, a full microkernel
  - not 100% effective
  - only applies to extensions, not the core kernel (unlike automated checking)

What are the three goals of the system?
  Isolation - don't let a fault in one extension infect the rest of the system
  Recovery - support automatic recovery after a failure
  Backwards compatibility - e.g., work with Linux

How does vanilla Linux deal with bugs/assertion failures in the kernel?
  Makes a distinction between running in process vs. interrupt context
  Process context - kill the process
    Note this isn't quite "fair", because the bug is a kernel bug, not a user bug
    But allows some degree of recoverability
  Interrupt context - crash & reboot the machine. Why?

How does Nooks achieve isolation?
  Use paging hardware to protect kernel & extensions against bad extensions
    See Fig. 3: kernel can write everything, extensions can only write themselves
  Use Extension Procedure Call (XPC)

What do you have to do to call into an extension? (Fig. 4)
  Copy any argument data structures to where the extension can write them
    Might need to follow/adjust any pointers in data structures
  Adjust stack pointer
  Load %cr3 with address space of extension
  === run extension
  Switch %cr3 and stack back
  Copy results back; synchronize any modified structures

What about modifications to non-argument kernel data structures?
  Fortunately, these are often done through macros and inline functions
  Can change these into XPCs

Where do page tables come from when loading %cr3?
  Nooks has to maintain a set of "shadow" page tables
  Just change code where Linux touches page tables

Have to modify page fault handler... how?
  Current task (Linux equivalent of proc) structure is on the kernel stack?

Could you optimize this process on the x86?
  If extensions are in different 4 MB regions... maybe re-use page tables
    (Just clear PTE_W in page directory entry)
    Or at least do this for some regions (might not work for buffer cache)
  Also, maybe a targeted TLB flush instead of a %cr3 load?

What is the deferred XPC mechanism? Where/why does this come in?

What are wrappers? How do they work?
  Three purposes:
    Check parameters for validity
    Implement call-by-value-result? (What's this vs. call by reference?)
    Perform XPC
  Basically works through the linker
  Who writes a wrapper?
    Tool auto-generates skeleton from header
    Fill in by hand
      Need to know properties
  How specific to each extension is the wrapper? See Fig. 5

What is object tracking and why?
  Records address/type of all objects in use by an extension
    If used for call, just attach to stack
    If held, keep in per-extension hash table
    If ext. might write object, keep association between kernel & ext. versions
  How do you know the lifetime of objects?
    By hand inspection -- classify each object type: passed in for a call,
    allocated/deallocated by the extension, special (e.g., timers), ...
  Do you always copy objects?
    No... more efficient just to re-map network & disk buffers

How do you detect a fault?
  Easy cases...
    page fault or other exception in extension
  What about harder cases... e.g., no network packets received
    User can detect and initiate recovery

How do you recover from a fault?
  - Disable any interrupts vectored to the extension, if it's a driver
    (what if you didn't do this... could get livelock or worse)
  - Invoke user-mode recovery agent
      Perform extension-specific recovery, notify sysadmin,
      change configuration, disable after repeated failures, ...
      By default, unloads and re-loads module
  What's this about interruptible vs. non-interruptible state?
  What about allocated memory? (This is why we need object tracking)
  What about things like network buffers w. pending DMA?
    Only free buffers after re-loading the driver, once it has re-initialized the device

How could an extension bypass Nooks to corrupt the system?
  Set %esp to something bad and take an exception (what happens on x86?)
  DMA to physical memory you can't write
  Move something to %cr3 (after all, the XPC mechanism does this)
  Disable interrupts and loop forever
  Logic bugs that don't involve trashing memory

How do you evaluate something like this?
  Care whether it achieved backwards compatibility
  Care whether it improves reliability, and what it costs in performance

How backwards-compatible is Nooks?
  One-time costs
    Basic kernel changes (e.g., to update shadow page tables)
    Need to implement base Nooks functionality (object tracking, XPC, etc.)
    Need to write wrappers for various types of extensions
    See Table 2 for an idea; non-wrapper code is only about 8,000 lines
  Per-extension costs
    Need some driver-specific wrappers
    Need to re-compile extensions
    Sometimes need to modify extensions -- when?
      If it directly modifies a kernel data structure w/o using a function/macro
      kHTTPd was the only one of their extensions that required this (in 13 places)

How to measure reliability?
  Fault injection. What did they do (see journal paper)?
    Automatically changes single instructions
    Emulate programming errors:
      - source & destination faults emulate assignment errors
      - pointer faults emulate bad pointer calculations to corrupt memory
      - interface faults emulate bad parameters
      - branch faults remove branch conditions
      - loop faults change termination condition of loops
    Other random changes
      - text fault: flip a random bit in some instruction
      - NOP fault: delete a random instruction (change it to a nop)
  Is this realistic?
  How do results look? Too optimistic or pessimistic?

Performance... Let's look at Table 4:
  Play-mp3 looks good
  Why does send-stream have more XPCs than receive-stream? (batching)
    Why does this not matter for performance? (cost overlaps w. xmit)
  Why does compile take a bigger hit than send-stream (which has more XPCs)?
    Compile is CPU-bound
  How did they produce the graph in Figure 8?
    What is statistical profiling?
    What does this tell us?
    Why don't they show user-mode execution time?
  Where is CPU time going?
    Extra code -- e.g., XPC, object tracking
    Existing code running more slowly? Why? TLB misses
    What are "Pentium 4 performance counters"?
  Why does kHTTPd do so much worse under Nooks? (60% worse, ouch)
    CPU problem, like compile
    Also, transactional, not buffered... how does this affect things?
    Do we care?
      kHTTPd does sound like a bogus project
      Maybe use a special OS if you care so much (more next lecture... exokernel)

Would the same ideas apply to other OSes?
  Authors claim Linux is the worst-case scenario. Why? Do we believe this?
    In terms of lots of ill-defined extension interfaces, probably true
    That Linux doesn't reboot on a process-context panic might help, though
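The first sketch models the Fig. 3 protection rule and the "shadow" page tables, which are simulated in Python with dictionaries. The names `KernelPageTables`, `Domain`, `update_pte`, and `ext_write` are hypothetical and do not correspond to Nooks or Linux code. The sketch shows three things:
  - The kernel's own table can write every page.
  - Each extension's shadow maps the same pages but marks only the extension's own pages writable.
  - Every place the kernel changes its page table must also update every shadow.

```python
from dataclasses import dataclass, field


class ProtectionFault(Exception):
    """Raised when a domain stores to a page it may not write (a page fault)."""


@dataclass
class Domain:
    """One extension's protection domain (hypothetical model, not Nooks code)."""
    name: str
    owned_pages: set = field(default_factory=set)  # pages the extension may write
    shadow: dict = field(default_factory=dict)     # vpn -> (pfn, writable)


class KernelPageTables:
    def __init__(self):
        self.real = {}     # kernel's own table: vpn -> pfn, all writable by the kernel
        self.domains = []

    def add_domain(self, dom):
        # A new extension gets a shadow of every existing kernel mapping.
        self.domains.append(dom)
        for vpn, pfn in self.real.items():
            self._mirror(dom, vpn, pfn)

    def update_pte(self, vpn, pfn):
        # Wherever the kernel changes its table, every shadow is updated too.
        self.real[vpn] = pfn
        for dom in self.domains:
            self._mirror(dom, vpn, pfn)

    @staticmethod
    def _mirror(dom, vpn, pfn):
        # Extensions can read all kernel memory but write only their own pages.
        dom.shadow[vpn] = (pfn, vpn in dom.owned_pages)


def ext_write(dom, vpn):
    """Simulate a store by the extension; return the physical frame written."""
    pfn, writable = dom.shadow[vpn]
    if not writable:
        raise ProtectionFault(f"{dom.name}: write to read-only page {vpn:#x}")
    return pfn
```

In a real system the writable bit lives in the PTE or page-directory entry and the check is done by the MMU; here `ext_write` plays that role.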
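The XPC steps listed under "What do you have to do to call into an extension? (Fig. 4)" can be simulated in a few lines. The simulation assumes a toy CPU whose %cr3 and stack are plain strings and whose kernel objects are Python dicts. `CPU` and `xpc` are hypothetical illustrations. A real XPC manipulates hardware registers and page tables, and copies only what the wrapper says the extension needs.

```python
import copy


class CPU:
    """Toy CPU state: the active address space (%cr3) and stack."""
    def __init__(self):
        self.cr3 = "kernel"
        self.stack = "kernel-stack"


def xpc(cpu, ext, func, *kernel_args):
    """Hypothetical XPC: run `func` in extension domain `ext` on copies of args."""
    # Copy argument structures to where the extension can write them;
    # deepcopy follows nested references, like chasing pointers in a struct.
    ext_args = [copy.deepcopy(arg) for arg in kernel_args]

    # Adjust the stack pointer and load %cr3 with the extension's address space.
    saved = (cpu.cr3, cpu.stack)
    cpu.cr3, cpu.stack = ext["cr3"], ext["stack"]
    try:
        result = func(*ext_args)        # === run extension
    finally:
        cpu.cr3, cpu.stack = saved      # switch %cr3 and stack back, even on a fault

    # Copy results back: synchronize modified structures into the kernel's copies.
    for kobj, eobj in zip(kernel_args, ext_args):
        if isinstance(kobj, dict):
            kobj.clear()
            kobj.update(eobj)
    return result
```

The copy-back runs only after a normal return. An exception inside the extension therefore leaves the kernel's copies untouched, while the `finally` clause still restores %cr3 and the stack.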
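The wrapper question "call-by-value-result vs. call by reference?" can be made concrete with a hypothetical wrapper. It validates its parameter, then passes the object to the extension in one of the two ways. The driver entry point `buggy_set_mtu` and every other name here are made up for illustration. With value-result, a fault halfway through an update leaves the kernel's object unchanged. With call by reference, the half-finished write is already visible to the kernel.

```python
import copy


class ExtensionFault(Exception):
    """Stands in for a page fault or other exception inside the extension."""


def by_reference(func, obj):
    # The extension works directly on the kernel's object.
    return func(obj)


def by_value_result(func, obj):
    # Copy in, let the extension work on the copy, copy out on normal return.
    local = copy.deepcopy(obj)
    result = func(local)          # a fault here propagates before the copy-out
    obj.clear()
    obj.update(local)
    return result


def make_wrapper(func, check, passing=by_value_result):
    """Hypothetical wrapper: validate the parameter, then make the call."""
    def wrapper(obj):
        if not check(obj):
            raise ValueError("wrapper rejected invalid parameter")
        return passing(func, obj)  # a real wrapper would perform an XPC here
    return wrapper


def buggy_set_mtu(dev):
    """Hypothetical driver entry point that faults halfway through an update."""
    dev["mtu"] = 9000
    raise ExtensionFault("bad pointer dereference")
```

The cost of value-result is the copy in each direction, which is why the notes mention re-mapping network and disk buffers instead of copying them.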
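For "What is object tracking and why?", here is a minimal sketch of the per-extension hash table for held objects. It assumes objects are identified by an integer address plus a type name. It omits the call-scoped case, where objects are attached to the stack for one call. `ObjectTracker` and its methods are hypothetical, and the type names in the example are only labels.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectTracker:
    """Hypothetical per-extension table of kernel objects the extension holds."""
    objects: dict = field(default_factory=dict)  # kernel address -> object type
    shadows: dict = field(default_factory=dict)  # kernel address -> extension copy address

    def track(self, addr, typ, ext_copy=None):
        # Called (e.g., from a wrapper) when the extension acquires an object.
        self.objects[addr] = typ
        if ext_copy is not None:     # extension may write it: remember its copy
            self.shadows[addr] = ext_copy

    def untrack(self, addr):
        # Called when the extension frees the object or hands it back.
        self.objects.pop(addr, None)
        self.shadows.pop(addr, None)

    def release_all(self, free_fn):
        """During recovery, free everything the failed extension still holds."""
        for addr, typ in list(self.objects.items()):
            free_fn(addr, typ)       # type tells recovery how to release it
        self.objects.clear()
        self.shadows.clear()
```

Recording the type, not just the address, is what lets recovery pick the right release routine for each leftover object (e.g., cancel a timer vs. free a buffer).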
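The steps under "How do you recover from a fault?" can be sketched as one function. The sketch assumes object tracking has already split a failed driver's resources into ordinary objects and buffers with pending DMA. `Driver`, `recover`, and the callback parameters are hypothetical. The callbacks stand in for the user-mode recovery agent's default unload/reload policy and for the kernel allocator. The ordering is what matters:
  - Mask the driver's IRQs first.
  - Free ordinary objects at unload.
  - Free DMA buffers only after the reloaded driver has re-initialized the device.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    """Hypothetical bookkeeping for one driver; not Nooks' data structures."""
    name: str
    irqs: list
    held: list = field(default_factory=list)         # tracked objects (no DMA)
    dma_buffers: list = field(default_factory=list)  # buffers the device may DMA into


def recover(drv, masked_irqs, reload_fn, free_fn, log):
    """Default-policy sketch: mask IRQs, unload, reload, then free DMA buffers."""
    # 1. Stop interrupts vectored to the failed driver; otherwise the device
    #    could keep interrupting into broken code (livelock or worse).
    masked_irqs.update(drv.irqs)
    log.append(f"mask irqs {drv.irqs}")

    # 2. Unload: free objects the tracker says the driver still holds.
    for obj in drv.held:
        free_fn(obj)
    drv.held.clear()

    # Buffers with pending DMA must outlive the old driver: the device may
    # still write into them until it is re-initialized.
    pending, drv.dma_buffers = drv.dma_buffers, []

    # 3. Reload; the new driver instance re-initializes the device.
    new_drv = reload_fn(drv.name)

    # 4. Only now is it safe to free the old DMA buffers.
    for buf in pending:
        free_fn(buf)
    return new_drv
```

Unmasking the IRQ for the new instance is left to the reloaded driver's own initialization, which this sketch does not model.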
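The two "other random changes" from the fault-injection list can be shown on a toy instruction stream. This sketch assumes fixed-width 32-bit instruction words and a made-up NOP encoding of 0. Real x86 instructions are variable-length and the one-byte NOP is 0x90. It does not model the programming-error fault types, which need knowledge of instruction semantics. Seeding the random generator makes each injection run reproducible.

```python
import random

NOP = 0x00000000  # hypothetical NOP encoding for this toy fixed-width ISA


def text_fault(code, rng):
    """Text fault: flip one random bit in one random instruction word."""
    mutated = list(code)
    i = rng.randrange(len(mutated))
    mutated[i] ^= 1 << rng.randrange(32)
    return mutated, i


def nop_fault(code, rng):
    """NOP fault: 'delete' one random instruction by overwriting it with a NOP."""
    mutated = list(code)
    i = rng.randrange(len(mutated))
    mutated[i] = NOP
    return mutated, i


def inject(code, kind, seed):
    """Apply one fault of `kind` ("text" or "nop"); return (new_code, index)."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    faults = {"text": text_fault, "nop": nop_fault}
    return faults[kind](code, rng)
```

Each call changes exactly one instruction and leaves the original list untouched, so many independent trials can start from the same clean code.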
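Statistical profiling, asked about for Figure 8, periodically samples which function is running instead of instrumenting every call. This sketch assumes a made-up execution trace of (function, cycles) segments. On real hardware, each sample would come from a timer or performance-counter interrupt recording the interrupted PC. `sample_profile` is a hypothetical name.

```python
from collections import Counter


def sample_profile(trace, interval):
    """
    `trace` is a hypothetical list of (function_name, cycles) segments in
    execution order. Every `interval` cycles, record which function is
    running, as a periodic interrupt handler would record the PC.
    Returns each function's share of the samples.
    """
    counts = Counter()
    now = 0
    next_sample = 0
    for func, cycles in trace:
        end = now + cycles
        while next_sample < end:   # samples that land in [now, end)
            counts[func] += 1
            next_sample += interval
        now = end
    total = sum(counts.values())
    return {f: n / total for f, n in counts.items()} if total else {}
```

Sampling only approximates time spent. A function that runs entirely between two sample points is never seen, so profiles like this need many samples to be trustworthy.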