VM primitives for user programs
===============================

What's the point?
  The O/S should provide better support for user-controlled VM:
    faster, more correct, more complete.
  Would make programs faster.
  Would allow neat tricks that are otherwise too painful.
  They provide a laundry list of example uses.
  They analyze O/S VM efficiency, and argue there's plenty of room for improvement.
  Do they define a new VM interface or design or implementation? (DIRTY?)

What are the primitives?
  TRAP, PROT1, PROTN, UNPROT, DIRTY, MAP2
  How do these map to system calls?
  Are any of these hard to implement? (MAP2 w/ a virtually addressed cache)

What does PROTx actually do? And how to implement it on the x86?
  Mark the PTE "protected", and/or mark the O/S VM structures "protected".
  At least invalidate the hardware PTE/TLB entry.
  x86: clear the P(resent) bit in the PTE, then execute INVLPG.

What does TRAP actually mean?
  A PTE (or TLB entry) is marked "protected".
  The CPU saves user state, jumps into the kernel.
  The kernel asks the VM system what to do:
    page in from disk? core dump?
    generate a signal -- an upcall into the user process.
  The kernel writes a sigcontext to the user's stack, returns to the user's handler.
    Now executing lower on the user stack (or on a separate signal stack).
  The user handler runs, and can do anything.
    Probably it must call UNPROT on the referenced page,
      i.e. it must avoid a repeated fault.
    Maybe we can change the faulting address/register? Maybe not.
  The user handler returns to the kernel -- why?
    Mostly to reset the signal mask (maybe also the stack, if SA_ONSTACK).
  The kernel returns to the user program:
    continue, or re-start the instruction that trapped.

Were the primitives available in 1991 O/S's?
  Yes.

Were the primitives fast?
  What would "fast" mean?
    Perhaps relative to compiler-generated checking code?
    Perhaps relative to the work we were going to do to handle the fault?
  Are they faster today?
    12 microseconds for trap + unprot + prot on a 1.2 GHz Athlon, FreeBSD 4.3.
    So they are getting relatively slower -- why?
      (especially given wide-issue architectures)

Do we really need VM hardware for these primitives?
  It's not a security issue, so it can be user controlled.
  Why not do it in the assembler/compiler if you can?
    gcc could generate code to simulate VM.
    More flexible... e.g., you can choose the optimal page size.
    Performance... e.g., the software overhead in generational GC is only 5-10%.
    But it's a pain to modify the compiler.
    And you might not have anticipated the need,
      e.g., you didn't know you wanted checkpoints when you compiled the program.
  CPUs already have VM h/w, so why not use it?
    Because then the OS has to be involved, and it's slow and rigid.
    Also, cheap embedded CPUs used not to have VM, though that's less true now.

Let's look at concurrent GC.

1. How does two-space compacting (copying) GC work?
   You need forwarding pointers in old space (and a "copied" flag).
   Why is this attractive?
     Alloc is cheap; it compacts, so there's no free list.
   Why isn't it perfect?

2. How does Baker's incremental GC work?
   The "mutator" thread (i.e., the program) only accesses the scanned area of to-space.
   Every load from the non-scanned part of to-space must be checked:
     scan the loaded object to ensure it doesn't point back to from-space.
   Must leave forwarding pointers in from-space for copied objects.
   Incremental: every allocation scans a little.

3. How does VM help?
   It avoids explicit checks for pointers back to from-space,
     by read-protecting the unscanned area.
   Why can't we just read-protect from-space?
   It also allows a concurrent collector on another CPU.
     Why no conflict? The collector only reads from-space and the
       protected unscanned part of to-space.
     Need synchronization when the mutator thread traps.

Are existing VM primitives good enough for concurrent GC?
  MAP2 is the only functionality issue -- but not really:
    we never have to make the same page accessible twice!
  Are traps &c fast enough?
    They say no: 500 us to scan a page, 1200 us to take the trap (p. 102).
    Why not scan 3 pages per trap? (latency)
    How much slower is it to run Baker's actual algorithm, w/ explicit checks?
      The VM version might be faster, even w/ slow traps.
    What about the time saved by the 2nd CPU scanning? They don't count this.

Is it an issue how often faults occur for concurrent GC?
  Not really -- more faults just means more scanning.
  I.e., we'll get at most one fault per page.

Other example applications from the paper's laundry list:
  Shared virtual memory.
  Process snapshots.
  Generational GC.
  Persistent stores.
  Persistent stores for DBs bigger than the VM space.
  Data-compression paging.

What about heap overflow detection?
  It eliminates the check when allocating memory.
  Note: you don't necessarily unprotect the faulting page,
    which means you can't just restart the faulting instruction --
    you might have to change some of the registers in the sigcontext structure.

How to simulate missing primitives?
  What if you don't have DIRTY?
    Periodically read-protect valid pages; flag a page DIRTY upon TRAP.
  What about MAP2?
    Why do you want MAP2 anyway?
      So one thread can manipulate memory that another thread will trap on.
    Could have the kernel copy data to/from the protected page.
      If that works, you could put the garbage collector in the kernel
        (probably a bad idea).
    Could use two different address spaces with different protections,
      but then you pay the overhead of a context switch.

How might you change the kernel interface to support applications better?
  Perhaps combine the signal-handler return with UNPROT?
    Right now that requires an extra entry into the kernel.
  Does the return from a signal even need to go through the kernel?
    Could maybe keep SIGSEGV unmasked and skip the return
      (but danger: a bug in the handler -> infinite loop).
  DIRTY might not require a system call:
    could maybe give a process read-only access to its own page tables.

What to conclude from this paper?
  OS designers didn't care about user VM primitives in '91,
    e.g., the broken mprotect in SunOS 4.0.3c.
  A few VM primitives seem to have a lot of applications.
  Many algorithms use PROTN and UNPROT, but not PROT1.
    Why is this good? What's a TLB shootdown?
  Most algorithms just unprotect a page on a fault. Why is this good?
  Invitation to hell... returned to sender.
    In fact, are precise exceptions an issue today? (hardware is usually precise)