VM Background
=============

Brief history of virtual machines
  Old idea from the 60s and 70s
    Allowed people to share hardware before multiprogrammed OSes
    Then fell out of favor with newer OSes and cheaper hardware
  By the mid 90s, the perception was that software was trailing hardware
    People were working on large ccNUMA machines
    "Large" means the chances of a node failing are higher
      So need new OS techniques for fault containment
    NUMA (non-uniform memory access) requires scheduling, VMem support
    Very little hope that 3rd-party vendors like Microsoft would support this
  In 1997, Stanford's Disco proposed addressing the problem with virtual machines
    Idea: Run many instances of a commodity OS on a many-CPU machine
      Takes care of NUMA, and maybe fault containment
    For aggressive scientific applications, run a specialized, small OS
  Disco led to VMware, which was very successful, probably for different reasons
    - Running many applications on Windows NT caused instability
      With VMware, just run multiple instances of NT on the same physical machine
    - Security isolation if you don't fully trust the app/OS in one VM
    - Statistical multiplexing across OSes saved hardware
      For policy reasons you might have different superusers in different OSes
  VMware led to other VM projects
    So successful that CPU manufacturers now support virtualization
    (more on this when we discuss today's paper)

Stanford Disco project details
  Directly executed both kernel and user code in the VM
  Required changing the kernel, an approach often called "paravirtualization"
  Virtualized the MIPS architecture, which required some incompatibility
    MIPS reserves the upper VA space for pseudo-physical memory
      Modify the kernel to run out of the lower segment, in mapped virtual addresses
    Reading/writing certain machine registers is privileged
      Trapping on every such instruction would be expensive
      Instead, change the kernel to read/write special virtual addresses
    New virtual disk/network devices require new drivers
  Made several other modifications to the OS (IRIX):
    VM call to get a pre-zeroed page (since the VMM must zero it for privacy anyway)
    VM call to say a page is on the free list
    Change mbuf management to avoid a linked list of free pages
      Thus some physical page contains only packet data, not OS-specific links
    Change bcopy to make a VM call that just re-maps the physical page
  Several key optimizations gave good performance:
    Copy-on-write disks - if VMs run the same OS, the same programs are loaded in multiple VMs
      Until a page is written, allows one physical page to back multiple VMs
      (a sketch of this kind of CoW bookkeeping follows this section)
    Combined buffer cache - allows one VM to use clean buffers from another VM
    Virtual network device that allows arbitrarily large packets
      Also multiply-maps aligned pages CoW rather than copying data
    Detect when a VM's OS puts the CPU in low-power mode (idle loop) and deschedule it
  Example: Efficient use of NFS across VMs
    Read from disk uses the global buffer cache
    Copy to the network buffer uses bcopy, which makes the VM remap call
    Sending the message to the other VM over the virtual network just remaps the page again
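The copy-on-write sharing idea above boils down to reference-counted bookkeeping:
map one machine page into several VMs read-only, and copy only on the first write.
Below is a minimal sketch of that bookkeeping in C.  Nothing here is Disco's actual
code; share_page, cow_fault, and the map_readonly/map_writable primitives are
invented names for whatever a VMM would use to install mappings.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGESZ 4096

    struct machpage {
        uint8_t data[PAGESZ];
        int     refcnt;            /* how many VMs currently map this machine page */
    };

    /* Assumed VMM primitives: map guest "physical" page ppn of VM vm to machine
     * page mp, with the given protection. */
    extern void map_readonly(int vm, unsigned long ppn, struct machpage *mp);
    extern void map_writable(int vm, unsigned long ppn, struct machpage *mp);

    /* Share an existing machine page into another VM, read-only. */
    void share_page(int vm, unsigned long ppn, struct machpage *mp)
    {
        mp->refcnt++;
        map_readonly(vm, ppn, mp);     /* a later write will fault into the VMM */
    }

    /* Handle a write fault on a shared page: break sharing by copying.
     * (Error handling omitted for brevity.) */
    struct machpage *cow_fault(int vm, unsigned long ppn, struct machpage *mp)
    {
        if (mp->refcnt == 1) {         /* last user: just make it writable again */
            map_writable(vm, ppn, mp);
            return mp;
        }
        struct machpage *copy = malloc(sizeof(*copy));
        memcpy(copy->data, mp->data, PAGESZ);
        copy->refcnt = 1;
        mp->refcnt--;
        map_writable(vm, ppn, copy);   /* this VM now writes to its private copy */
        return copy;
    }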
VMware virtualizes x86
  But works with unmodified OSes, so no paravirtualization

Comparison of Software and Hardware Techniques
==============================================

Popek and Goldberg define a Virtual Machine Monitor as having 3 properties:
  Fidelity - the VMM looks to guests just like real hardware (except for timing)
  Performance - most instructions are directly executed w/o VMM intervention
  Safety - the VMM manages hardware resources

What does it mean for an architecture to be "classically virtualizable"?
  Means you can implement a VMM using trap-and-emulate.  What's this?
    VMM runs the guest kernel in a CPU mode less privileged than it normally uses
    *De-privileges* all instructions that read or write privileged state
      E.g., I/O ports, page table base register (PTBR), etc.
    De-privileged instructions trap; emulate them to keep the VMM transparent
  Keep primary and shadow structures
    *primary* structure is what the guest would show hardware w/o the VMM
      E.g., page tables, PT base register (%cr3 on x86), flags register, etc.
    *shadow* structure is what the VMM actually shows hardware
      E.g., the actual value of %cr3
  Example: page tables
    Primary page tables translate Virtual -> guest "Physical" Page Nos.
      (i.e., would be the physical address if there were no VMM, but is fictitious with a VMM)
    Shadow page tables translate Virtual -> Machine physical Page Nos.
      Combines the guest's VPN -> "P"PN and the VMM's "P"PN -> MPN mappings
      (a sketch of this composition follows this section)
  What are the complications of guest OS access to PTEs?
    Make a previously invalid PTE a valid mapping
      Might take a page fault on a valid page
      But the VMM can detect and fix this up transparently.  Called a "hidden page fault".
    Make a previously valid PTE an invalid mapping
      Problem if the shadow page table still has the mapping
        An access that should fault will instead execute using the stale mapping
      But: a correct OS should use INVLPG to flush the TLB entry, so de-privilege that
    Change a PTE to point from one "physical" page number to a different one
      Similarly, a correct OS should call INVLPG, which the VMM can intercept to detect the change
    Read and inspect a PTE
      Problem: Accessed & dirty bits in the primary PTE won't reflect the shadow PTE
      Would have to fix with more hidden page faults
        (e.g., never make a shadow PTE writable until the dirty bit is set in the primary)
  It turns out many OSes are buggy
    In particular, many OSes wouldn't work properly with an infinite TLB
    So relying on intercepting INVLPG might not be good enough
    And depending on the workload, the performance penalty of hidden faults might be high

What is tracing?
  As an alternative to the above, intercept PTE updates by write-protecting the page tables
  Similarly, intercept access to memory-mapped devices by protecting their pages
  Any access to these pages causes a "tracing fault"
    VMM emulates the write (e.g., updates primary & shadow PTEs, emulates the device)
    Restarts the guest OS, so the guest OS doesn't know there was a page fault
  Note the distinction from "true page faults" - what are these?
    A memory access that is also invalid according to the primary guest PTE
    The VMM must vector these to the guest OS's page fault handler
  For correctness, all page faults go to the VMM first, which decides true vs. hidden
    Page faults that are not true page faults are hidden (the guest never sees them)
  Note VMware treats shadow page tables as a cache
    So potentially any guest memory access can cause a hidden page fault

Section 2.4 (p. 2): "striking a favorable balance in this three-way trade-off
  among trace costs, hidden page faults, and context switch costs is surprising
  both in its difficulty and its criticality to VMM performance."
  Where does this trade-off come from?
    Idea: the guest OS can cheaply access primary PTEs or the corresponding VAs, but not both
    Always trace access to page tables?  Lots of tracing faults
      Every PTE update causes a tracing fault, but the shadow PTEs are always correct
    No tracing - allow the guest kernel to write most page tables most of the time?
      1) lots of hidden faults (shadow PTE must stay invalid while the page table is writable), or
      2) pre-generate primary PTEs from the shadow when switching into the guest kernel
         (otherwise the kernel will not see correct accessed & dirty bits)
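To make the primary/shadow split concrete, here is a minimal sketch of how a VMM
might fill in a shadow PTE by composing the guest's VPN -> "P"PN mapping with its
own "P"PN -> MPN table, and how it classifies a fault as hidden vs. true.  All
names here (struct pte, pmap, guest_walk, handle_fault) are invented for
illustration; this is not VMware's code.

    #include <stdbool.h>
    #include <stdint.h>

    struct pte {                   /* simplified PTE: frame number + a few bits */
        uint64_t pfn;
        bool     present, writable, dirty;
    };

    extern uint64_t pmap[];        /* VMM's "physical" page -> machine page table */
    extern bool guest_walk(uint64_t vpn, struct pte **out);  /* find primary PTE */

    /* Called on a page fault taken while the guest runs.  Returns true if the
     * fault was "hidden" (a shadow-cache miss, fixed up transparently), false
     * if it is a "true" fault that must be forwarded to the guest's handler. */
    bool handle_fault(uint64_t vpn, bool is_write, struct pte *shadow)
    {
        struct pte *g;
        if (!guest_walk(vpn, &g) || !g->present || (is_write && !g->writable))
            return false;          /* invalid per the primary PTE: true fault */

        shadow->pfn     = pmap[g->pfn];          /* "P"PN -> MPN */
        shadow->present = true;
        if (is_write)
            g->dirty = true;       /* write through to the primary PTE */
        /* Keep the shadow read-only until the primary's dirty bit is set, so
         * the guest's first write faults here and the dirty bit stays honest. */
        shadow->writable = g->writable && g->dirty;
        return true;               /* hidden fault: just restart the guest */
    }

Note the read-only-until-dirty trick in the sketch is exactly the extra hidden
page fault mentioned above for keeping accessed & dirty bits accurate.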
Is x86 classically virtualizable?  No.  Why not?
  Can't prevent some shadow state from being visible
    E.g., run the kernel at privilege level 1 instead of its normal kernel mode (0)
    But the lower 2 bits of %cs reflect the Current Privilege Level (CPL)
    "push %cs; popl %eax" will not trap, and will put the wrong value in %eax
  Can't de-privilege all instructions
    Some instructions just behave differently depending on the CPL
    Example: popfl - pops the flags register off the stack
      When CPL > 0, it does not change certain privileged bits in the flags register
        (e.g., the IF bit is the interrupt enable flag - user code shouldn't clear it)
      But it is still a useful instruction that you can't make trap

But you can virtualize x86 with an interpreter.  What's this?
  Basically a software implementation of the x86 hardware

    while (inst = get_next_instruction ()) {
      switch (decode_instruction_type (inst)) {
        ...
      }
    }

  Emulates the screen (e.g., in an X window or on a terminal)
  bochs is like this, if you've ever used it
  Is bochs a Popek & Goldberg VMM?
    Fidelity - yes
    Performance - no, much too slow
    Safety - yes

Refinement #1: Directly execute user code
  Observation: Only kernel code really needs to be interpreted
  So within the guest OS, directly execute all user-mode code
    Slight annoyance: the VMM resides in some range of virtual memory
      If the OS gave user code the VMM's virtual addresses, must relocate
      (Maybe use segmentation to help the VMM relocate)
  Upon a trap to the kernel, fire up the slow interpreter
  Will be faster, but still not good enough for VMware

VMware uses binary translation of the kernel.  What's this?
  Translate guest kernel code into privileged code that runs in kernel mode
    But dangerous or non-virtualizable instructions are not identically translated
    E.g., a load of %cr3 must put the shadow PTBR into %cr3, not the guest's value
  Note: translated code uses CPL1, not CPL0, mostly because of x86 particularities
    Hardware scribbles stuff onto the stack on traps and exceptions
    For fidelity, don't want to clobber memory below the guest's stack pointer
    By running at CPL1, the trap frame is written to the VMM's private CPL0 stack
  Works one Translation Unit (TU) at a time
    A TU is basically a basic block, translated into a compiled code fragment (CCF)
      Except capped at 12 instructions, to have a maximum size for convenience
    Dumps translated TUs into a memory region used as the Translation Cache (TC)

Look at the isPrime example in section 3.1 (p. 3)
  Most code is identically translated (IDENT) - the exact same instructions run
  But "jge prime" -> "jge [takenAddr]; jmp [fallthrAddr]"
  Addresses in brackets are *continuations* - what are these?
    Code is translated on demand
    When translating code, the translator will see jumps to code not yet translated
      These are translated into jumps to a continuation
      The first time through, this invokes the translator
      It then patches the jump, so the next time control jumps right to translated code
    Note: can elide [fallthrAddr] if that code is the next translation emitted
    (a sketch of this translate-on-demand machinery follows this section)
  What can't be IDENT?
    PC-relative addressing - needs compensation code to compute addresses
      relative to the original (untranslated) PC
    Direct control flow - map to the corresponding address in the TC at translation time
    Indirect control flow - must compute the target dynamically
      The target is in a register or on the stack (e.g., a return from a call)
        Note: for fidelity, don't assume a ret has a corresponding call!
      Unlike direct control flow, the target isn't known when translating the TU
      Emit code to look up the target in a hash table each time
    Privileged instructions - work on shadow state
      E.g., update IF in the shadow eflags - possibly faster than the untranslated code!
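A rough sketch of the translate-on-demand machinery the continuation mechanism
implies: look up the target in the translation cache, translate it if it isn't
there, and back-patch the jump so later executions go straight to translated
code.  The helper names (tc_lookup, emit_ccf, patch_jump) are invented, and a
real translator like VMware's does far more.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t guest_pc;     /* address in the guest kernel's code */

    extern void *tc_lookup(guest_pc pc);             /* hash: guest PC -> CCF or NULL */
    extern void *emit_ccf(guest_pc pc);              /* translate one TU (<= 12 insns) into the TC */
    extern void  tc_insert(guest_pc pc, void *ccf);
    extern void  patch_jump(void *jump_site, void *target);   /* back-patch a branch */

    /* A continuation: reached the first time a translated branch targets code
     * that has not been translated yet.  Translate it, patch the branch, resume. */
    void *continuation(guest_pc target, void *jump_site)
    {
        void *ccf = tc_lookup(target);
        if (ccf == NULL) {
            ccf = emit_ccf(target);      /* IDENT where possible, non-IDENT otherwise */
            tc_insert(target, ccf);
        }
        if (jump_site != NULL)
            patch_jump(jump_site, ccf);  /* next time, jump directly into the TC */
        return ccf;                      /* caller resumes execution in this CCF */
    }

    /* Indirect control flow (ret, jmp *%eax) can't be patched at translation
     * time, so the emitted code calls something like this on every execution. */
    void *indirect_target(guest_pc target)
    {
        void *ccf = tc_lookup(target);
        return ccf ? ccf : continuation(target, NULL);
    }

The hash-table lookup in indirect_target is the per-execution cost that makes
indirect control flow more expensive than direct control flow.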
What does "innocent until proven guilty" mean?
  Some translated instructions are still very expensive ("guilty")
    - writing to a page table and causing a tracing fault
    - reading/writing device registers
    - trying to use the region of virtual memory reserved for the VMM itself
  Keep track of the cost - how many hidden faults each instruction generates
  Assume innocence, but after many faults take steps to make the instruction cheaper
    Option 1: Use a "callout" (see ccf5 in Figure 1)
      Patch the start of the CCF to jump to a new region
      There, do something cheaper (e.g., update the shadow PTE and avoid the trap)
      Then jump back
    Option 2: Re-translate the CCF containing the "guilty" instruction
      More expensive, as all CCFs that branch to the original CCF must be updated
  Note non-IDENT translated code may need to access VMM data structures
    Use segmentation to avoid accidental access to VMM data by the guest OS
      Use the %gs: override prefix to access VMM data
      Means any use of %gs by the guest (rare) cannot be identically translated

How does hardware virtualization change things?
  New CPU mode, guest mode, less privileged than host mode (where the VMM runs)
    In guest mode, some sensitive instructions trap
    But the hardware also keeps shadow state for many things (e.g., eflags)
  Enter guest mode using the VMRUN instruction
    Loads state from the VMCB
      data structure used to communicate guest OS state between the H/W and the VMM
    Various events cause an EXIT back into host mode
      saves state to the VMCB
  The VMCB contains:
  * Control bits
    Intercept vector:
      - one bit for each of %cr0-%cr15 to say whether to trap on a read of that register
      - one bit for each of %cr0-%cr15 to say whether to trap on a write of that register
      - 32 analogous bits for the debug registers (%dr0-%dr15)
      - 32 bits for whether to intercept exception vectors 0-31
      - bits for various other events (e.g., NMI, SMI, ...)
      - a bit to intercept writes to sensitive bits of %cr0 (other than TS or MP)
      - 8 bits to intercept reads and writes of IDTR, GDTR, LDTR, TR
      - bits to intercept RDTSC, RDPMC, PUSHF, POPF, VMRUN, HLT, INVLPG,
        INT, IRET, IN/OUT (to selected ports), ...
    Exit code and reason (e.g., which instruction/event caused the exit)
    Other control values:
      - Pending virtual interrupt
      - Event injection of various exceptions
  * Saved guest state
    - Full segment registers (i.e., base, limit, attributes, not just selectors)
    - Full GDTR, LDTR, IDTR, TR
    - Guest %cr3, %cr2 (and other cr/dr registers)
    - Guest eip and eflags (really rip & rflags for 64-bit processors)
    - Guest %rax register
  Entering/exiting the VMM is a bit more expensive than a traditional trap to the kernel
    Saving/loading the VMCB is expensive - the structure is 1024 bytes (664 currently used)
  Big benefit: makes writing a VMM (with fidelity) much easier!
  (a simplified sketch of the resulting VMM run loop follows this section)
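For concreteness, here is a stripped-down sketch of the kind of run loop the
VMCB implies: the VMM fills in intercept bits and guest state, executes VMRUN,
and dispatches on the exit code when control comes back.  The struct layout,
field names, and exit codes below are simplifications invented for illustration,
not AMD's actual encoding.

    #include <stdint.h>

    enum exitcode {                /* made-up symbolic exit reasons */
        EXIT_CR_WRITE, EXIT_EXCEPTION, EXIT_IO, EXIT_HLT, EXIT_INVLPG
    };

    struct vmcb {                  /* greatly simplified; the real VMCB is 1024 bytes */
        /* control area */
        uint16_t cr_read_intercept, cr_write_intercept;   /* one bit per %crN */
        uint32_t exception_intercept;                     /* vectors 0-31 */
        uint64_t exitcode, exitinfo;                      /* filled in on EXIT */
        /* saved guest state (a small subset) */
        uint64_t rip, rflags, rax, cr0, cr2, cr3;
    };

    extern void vmrun(struct vmcb *v);            /* enter guest mode; returns on EXIT */
    extern void emulate_cr_write(struct vmcb *v); /* e.g., install the shadow PTBR, not the guest's value */
    extern void handle_exception(struct vmcb *v); /* hidden fix-up, or inject into the guest if true */

    void run_guest(struct vmcb *v)
    {
        v->cr_write_intercept |= 1 << 3;          /* e.g., intercept writes to %cr3 */
        for (;;) {
            vmrun(v);                             /* hardware loads guest state and runs it */
            switch (v->exitcode) {                /* hardware saved guest state + exit reason */
            case EXIT_CR_WRITE:  emulate_cr_write(v);  break;
            case EXIT_EXCEPTION: handle_exception(v);  break;
            default:             /* I/O, HLT, INVLPG, interrupts, ... */  break;
            }
        }
    }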
What about performance?  Is hardware virtualization faster?
  Sometimes yes, usually not.  When is it faster?
    Better at entering/exiting the kernel
    E.g., Apache on Windows: one address space, lots of syscalls, so H/W is better
    Apache on Linux: multiple address spaces mean context switches, tracing
      faults, etc., so software is better

Section 6.2 (p. 7): Why is fork/wait so hard for VMMs?  (6.0 sec -> 36.9/106.4)
  Section 4.3 (p. 5): What happens during fork?
  - The fork system call vectors to the kernel (CPL3 -> 0); no EXIT required
  - The OS implements the copy-on-write optimization
    Must write-protect the parent
      Loops through the page tables clearing the W bits in PTEs
      Each PTE write causes a tracing fault, which causes an EXIT
        An EXIT is more expensive than the simple fault that BT takes
        And BT also adapts "guilty" instructions to avoid any fault at all
  - The OS switches to the child process and writes %cr3, which requires an EXIT
  - The child touches pages not mapped in the shadow page tables
    Causes hidden page faults, each of which requires an EXIT
  - Parent/child write to CoW pages, causing true page faults
    Each requires an EXIT (for the VMM to decide it's a true fault)

Who is writing this paper?  Might we expect any bias?
  Both authors work for VMware, which makes money from software virtualization
    H/W schemes reduce the barrier to entry, so they enable more competition
    So good news for VMware if BT is still faster than H/W virtualization!
  Authors are honest and would report either way
  But VMware prohibits publishing benchmark numbers without permission
    Nobody else could have legally published this paper
    The company might have barred publication if it made them look bad

How do we expect the numbers to look 5 years from now?
  Nested page tables should be a big deal - they eliminate tracing faults
    But they will dramatically increase the cost of TLB misses (rough arithmetic below)
  VMware sources suggest their BT is still faster on many benchmarks
    E.g., specjbb has a large number of TLB misses, so nested PTs are worse
      than both BT and the existing H/W-based VMM
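Back-of-the-envelope on the nested-paging TLB-miss cost (my arithmetic, assuming
4-level guest page tables and 4-level nested page tables):
  Every guest page-table reference is a guest-"physical" address, so it needs its own nested walk
  Worst-case TLB miss: 4 guest PTE reads
    + 4x4 nested reads to translate those guest page-table addresses
    + 4 nested reads to translate the final guest-"physical" address
    = 24 memory references, vs. 4 on bare hardware
  Which is why TLB-miss-heavy workloads like specjbb can lose even with tracing faults gone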