L3

== What is a microkernel, and why would anyone want one?

Idea: implement many traditional OS abstractions in servers
  Paging, file system, possibly even interrupt handlers (like L3)
Advantages?  Modularity, extensibility, fault isolation
Disadvantages?  Performance
  What would be simple calls into the kernel are now IPCs

How bad is performance?  Look at Table 2 (p. 12)

What's the minimum you need to do for an IPC?
  See Table 3, p. 15: 127 cycles
  What's expensive here?
    int, iret.  Why?  They flush the pipeline
    TLB misses.  Why are 5 TLB misses necessary?
      B's thread control block
      loading %cr3 flushes the TLB, so kernel text causes a miss
      iret accesses both the stack and user text -- two pages
      B's user code looks at the message

How do you think this trend has progressed in the 10 years since the paper?
  Worse now.  Faster processors are optimized for straight-line code
  Traps/exceptions flush deeper pipelines, and cache misses cost more cycles

What is the actual IPC time of optimized L3?  5 usec

How does that compare to the costs of things you might do in servers?
  Access a disk during a page fault?  Milliseconds, so no problem
  How about some of the tricks from the Appel & Li paper?
    Typically at least 100s of microseconds, so 5 usec of overhead is probably OK
  How about network interrupts (from the livelock paper)?
    We saw numbers like 5,000 pkts per second
    2 IPCs per packet at 5 usec each is 10 usec/packet, about 5% of the CPU
      at that rate (more at higher rates); with polling, maybe OK

What are the principles behind L3?
  Basically, IPC performance is the master
  Plus a bunch of other things that emphasize IPC performance

What is the basic approach?
  Design the µ-kernel for a specific CPU

What is the IPC interface?
  call (threadID, send-message, receive-message, timeout);
  reply_and_receive (reply-message, receive-message, timeout);

* What are the optimizations?

New system call: reply_and_receive
  Effect: 2 system calls per RPC instead of 4

Complex messages: direct string, indirect strings, and memory objects

Direct data transfer by temporary mapping.  How does this work?
  (see the sketch below)
  Copy two PDEs from B's address space into the kernel range of A's pgdir
  Why two PDEs?  Maximum message size is 4 MB, so two 4 MB mappings are
    guaranteed to cover the message
  Why not just copy PTEs?  Would be much more expensive

What does it mean for the TLB to be "window clean"?  Why do we care?
  Means the TLB contains no mappings within the communication window
  We care because mapping is cheap (copy a PDE), but invalidation is not
    x86 only lets you invalidate one page at a time, or the whole TLB
  Does TLB invalidation of the communication window turn out to be a problem?
    Not usually, because the kernel has to load %cr3 during IPC anyway

Why not just use user-level shared pages between A and B for data?
  Multi-level security (makes it hard to reason about information flow)
  Receiver can't check message legality (might change after the check)
  When a server has many clients, it could run out of virtual address space
  Requires the shared memory region to be established ahead of time
  Not application friendly, since data may already be at another address
    Meaning applications would have to copy anyway -- possibly more copies
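A minimal sketch of the temporary-mapping trick, assuming classic 32-bit x86
paging (1024 page-directory entries covering 4 MB each); pde_t, WINDOW_SLOT,
and map_comm_window are made-up names for illustration, not the paper's.

  #include <stdint.h>

  typedef uint32_t pde_t;

  #define PDE_COUNT   1024u
  #define PDE_SHIFT   22          /* each PDE covers 1 << 22 = 4 MB */
  #define WINDOW_SLOT 0x300u      /* hypothetical window index in A's pgdir */

  /* Map B's message into A's communication window by copying the two PDEs
   * that cover it.  Two PDEs suffice because the maximum message is 4 MB,
   * so it spans at most two 4 MB regions.  No PTEs are touched: the cost
   * is two word copies instead of a page-table walk. */
  static void map_comm_window(pde_t *a_pgdir,        /* receiver A's pgdir */
                              const pde_t *b_pgdir,  /* sender B's pgdir */
                              uintptr_t b_msg_va)    /* message addr in B */
  {
      uint32_t first = (uint32_t)(b_msg_va >> PDE_SHIFT);

      a_pgdir[WINDOW_SLOT]     = b_pgdir[first];
      a_pgdir[WINDOW_SLOT + 1] = b_pgdir[(first + 1) % PDE_COUNT];

      /* The TLB must be "window clean" before A touches the window; since
       * x86 can only invalidate one page or the whole TLB, the kernel
       * relies on the %cr3 reload it does during IPC anyway. */
  }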
What is a thread control block (tcb), and how is it optimized?
  tcb contains basic info about a thread:
    registers, links for various doubly-linked lists, pgdir, uid, ...
  Commonly accessed fields are packed together on the same cache line
  Kernel stack is on the same page as the tcb.  Why?
    Minimizes TLB misses (accessing the kernel stack brings in the tcb)
    Allows very efficient access to the tcb -- just mask off the low 12 bits
      of %esp (see the sketch at the end of these notes)
    With VM, can use the lower 32 bits of the thread id to locate the tcb
    Using one page per tcb means no need to check if a thread is swapped out
      Can simply leave that tcb unmapped if it shouldn't be accessed

How does the system handle thread queues?
  There's a ready queue, 8 timeout queues, and maybe a long-term timeout queue
  Uses doubly-linked lists
  Invariant: no page faults while traversing queues
    so remove a tcb from all queues before unmapping it

How does the timeout queue work, and why?
  Uses 8 queues, selected by wakeup time mod 8
  At each clock tick, examine all threads in the appropriate queue
  How much does this cost?  2 msec for 16K entries
    So probably OK, given the calculation at the bottom of page 7, first column
  Also uses a long-term queue, so the kernel doesn't keep rechecking
    long-sleeping tcbs

How is time represented?
  Don't want to use 64-bit numbers; too expensive
  So use base + offset, in msec
    Bump the base & recompute all offsets every ~4 hours
    Maximum timeout is ~24 days (2^31 msec)

What is the problem addressed by lazy scheduling?
  Conventional approach to scheduling, when A sends a message to B:
    Move A from the ready queue to the waiting queue
    Move B from the waiting queue to the ready queue
  This requires 58 cycles, including 4 TLB misses.  What are the TLB misses?
    One each for the heads of the ready and waiting queues
    One each for the previous queue element during the removes

Lazy scheduling:
  Ready queue must contain all ready threads except the current one
    Might also contain threads that aren't actually ready, though
  Each wakeup queue contains all threads waiting in that queue
    Again, might contain other threads, too
  Scheduler removes inappropriate entries when it scans a queue
  Why does this help performance?
    Only three situations in which a thread gives up the CPU but stays ready:
      send syscall (as opposed to call), preemption, and hardware interrupts
    So very often the kernel can IPC into a thread without touching the ready queue

What is the optimization for short messages?
  Use two registers for short message bodies (see Table 6, p. 15)
  Also allows other optimizations, for a total win of 2.4 usec

What is the optimization for avoiding unnecessary copies?
  Can send and receive messages with the same vector
  Makes forwarding efficient, which is important for the Clans & Chiefs model

Segment register optimization
  Loading segment registers is slow -- have to access the GDT, etc.
  But the common case is that users don't change their segment registers
  Observation: it's faster to check a segment register than to load it
    So just check that the segment registers are okay
    Only load them if user code has changed them

Minimizing TLB misses
  Try to cram as many things as possible onto the same page:
    IPC kernel code, GDT, IDT, TSS, all on one page
    Actually maybe can't fit the whole tables, but put the important parts
      on the same page (maybe the beginning of the TSS, IDT, or GDT only?)

Other coding tricks:
  short offsets
  avoid jumps in the common case
  avoid checks in the common case
    (e.g., sending a message to yourself is illegal, but this is only
     checked if the receiver is not ready)
  pack often-used data on the same cache lines
  lazily save/restore other CPU state, like the debug and FPU registers

What do we think of the results?  Figures 7 and 8 look good

What about the general theme of the paper?
  Was Liedtke fighting a losing battle against CPU makers?
  Should fast IPC be a hardware issue, or just an OS issue?
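Referring back to the tcb layout above: a minimal sketch of the %esp-masking
trick, assuming each tcb sits at the bottom of a 4 KB page shared with that
thread's kernel stack; the struct fields and names are illustrative, not the
actual L3 layout.

  #include <stdint.h>

  #define PAGE_SIZE 4096u

  /* Illustrative tcb: it occupies the bottom of a 4 KB page, and the
   * thread's kernel stack grows down from the top of the same page, so
   * one TLB entry covers both. */
  struct tcb {
      uint32_t    regs[8];      /* saved registers, packed on one cache line */
      struct tcb *next, *prev;  /* doubly-linked ready/wakeup queue links */
      uint32_t    pgdir_phys;   /* value loaded into %cr3 for this thread */
      uint32_t    thread_id;
      /* ... rest of the page is the kernel stack, growing downward ... */
  };

  /* Because the tcb and kernel stack share one page, the current tcb is
   * found by masking off the low 12 bits of the kernel stack pointer --
   * no memory reference, no table lookup. */
  static inline struct tcb *current_tcb(uintptr_t kernel_esp)
  {
      return (struct tcb *)(kernel_esp & ~((uintptr_t)PAGE_SIZE - 1));
  }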