Comparison of Software and Hardware Techniques
==============================================

Popek and Goldberg define a Virtual Machine Monitor as having 3 properties:
  Fidelity - VMM looks to guests just like real hardware (except timing)
  Performance - Most instructions directly executed w/o VMM intervention
  Safety - The VMM manages hardware resources

What does it mean for an architecture to be "classically virtualizable"?
  Means you can implement a VMM using trap-and-emulate.  What's this?
    VMM runs guest kernel in CPU mode less privileged than it normally does
      *De-privileges* all instructions that read or write privileged state
        E.g., I/O ports, page table base register (PTBR), etc.
      De-privileged instructions trap; emulate them to make VMM transparent
  Keep primary and shadow structures
    *primary* structure is what guest would show hardware w/o the VMM
      E.g., page tables, PT base register (%cr3 on x86), flags register, etc.
    *shadow* structure is what VMM actually shows hardware
      E.g., actual value of %cr3
  Example: page tables
    primary page table translates Virtual -> guest "Physical" Page Nos.
      (i.e., would be physical address if no VMM, but is fictitious w. VMM)
    Shadow page tables translate Virtual -> Machine physical Page Nos.
      combines guest VPN -> "P"PN and VMM's "P"PN -> MPN mappings

What's the distinction between true and hidden page faults?
  All true page faults must vector to VMM for correctness
    Memory access that is also invalid according to the primary guest PTE
    VMM vectors to guest OS fault handler exactly as hardware would w/o VMM
  Page faults that are not true page faults are *hidden page faults*
    Note VMware treats shadow page tables as a cache
    So potentially any guest memory access can cause a hidden page fault
    (see the hidden-fault sketch below)

What are complications of guest OS access to PTEs?
  Make previously invalid PTE a valid mapping?
    Might take page fault on valid page
    But VMM can detect and fix up transparently.  Called "hidden page fault".
  Make previously valid PTE an invalid mapping?
    Problem if shadow page table still has mapping
      Access that should fault into VMM will execute and use old mapping
    Correct OS should flush TLB entry.  Can happen one of two ways:
      1. Guest OS can use INVLPG to flush TLB entry
         By de-privileging that instruction, VMM can keep shadow PT in sync
      2. Guest OS can load %cr3
         VMM must scan every possibly modified primary PTE to sync with shadow
         ...or VMM can invalidate most shadow PTEs (causing hidden page faults)
  Change PTE to point from one to a different "physical" page number
    Similarly, OS should call INVLPG or re-load %cr3
  Read and inspect PTE
    Problem: Accessed & dirty bits in primary PTE won't reflect shadow PTE
    Would have to fix with more hidden page faults
      (e.g., delay making shadow PTE writable until set dirty bit in primary)
  It turns out many OSes are buggy
    Supposedly many OSes won't work properly with an infinite TLB
    So relying on intercepting INVLPG might be iffy
    And depending on workload, performance penalty of hidden faults might be high
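Aside: a minimal sketch (not VMware's code) of a hidden-page-fault handler
that classifies true vs. hidden faults and, on a hidden fault, composes the
guest's primary PTE with the "P"PN -> MPN map to fill in a shadow PTE.  All
helper names here are made up:

    typedef unsigned long pte_t;
    #define PTE_P 0x1                         /* present bit */

    /* Assumed helpers, standing in for real VMM internals: */
    extern pte_t guest_walk(pte_t guest_cr3, unsigned long va); /* primary PT walk */
    extern unsigned long ppn_to_mpn(unsigned long ppn);         /* "phys" -> machine */
    extern void shadow_install(unsigned long va, pte_t spte);   /* fill shadow PT */
    extern pte_t guest_cr3_value(void);                         /* guest's %cr3 */

    /* Decide true vs. hidden fault; a real VMM would also check the access
     * type (read/write/exec) against the guest PTE's permission bits. */
    int handle_page_fault(unsigned long fault_va)
    {
        pte_t gpte = guest_walk(guest_cr3_value(), fault_va);
        if (!(gpte & PTE_P))
            return -1;       /* true fault: forward to guest OS handler */
        unsigned long mpn = ppn_to_mpn(gpte >> 12);
        shadow_install(fault_va, (mpn << 12) | (gpte & 0xfff));
        return 0;            /* hidden fault: just retry the access */
    }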
What is tracing?
  As alternative to above, intercept PTE updates by protecting page tables
    Similarly intercept access to memory-mapped devices by protecting pages
  Any accesses to these pages cause "tracing faults"
    VMM emulates write (e.g., updates primary & shadow PTE, emulates device)
    Restarts guest OS at next instruction
      ...so guest OS doesn't know there was a page fault
  Note distinction between tracing and hidden page faults
    After hidden: VMM fixes shadow PTE and re-executes faulting instruction
    After tracing: VMM decodes & emulates faulting instruction, then skips it,
      resuming guest execution at the *subsequent* instruction

Section 2.4 (p. 2): "striking a favorable balance in this three-way trade-off
among trace costs, hidden page faults, and context switch costs is surprising
both in its difficulty and its criticality to VMM performance."
Where does this trade-off come from?
  Idea: Should guest OS have direct access to:
    1) Primary PTEs, 2) corresponding VAs, or 3) both?
  1) Means most shadow PTEs must be invalid
     Most memory references will cause hidden page faults
     Use hidden page faults to sync shadow PTE with primary and vice versa
       Compute shadow PTE address mapping based on current primary PTE
       Set accessed and dirty bits appropriately in the primary PTE
  2) Means always trace access to page tables
     Lots of tracing faults (every PTE access causes one)
     But can ensure shadow PTEs are always correct
  3) Rely on INVLPG to keep shadow PTEs up to date
     High context switch overhead
       Whenever %cr3 loaded, must recompute entire shadow page table
     Still need some hidden faults for proper accessed/dirty in primary PTE

Is x86 classically virtualizable?  No.  Why not?
  Can't prevent some shadow state from being visible
    E.g., run kernel at privilege level 1 instead of normal kernel mode (0)
    But lower 2 bits of %cs register reflect Current Privilege Level (CPL)
    "push %cs; popl %eax" will not trap, and will put wrong value in %eax
  Can't de-privilege all instructions
    Some instructions just behave differently depending on CPL
    Example: popfl - pops flags off the stack
      When CPL > 0, does not change certain privileged bits in flags register
        (e.g., IF bit is interrupt enable flag - user code shouldn't clear it)
      But still useful instruction that you can't make trap

But can virtualize x86 with an interpreter.  What's this?
  Basically a software implementation of x86 hardware
      while (inst = get_next_instruction ()) {
        switch (decode_instruction_type (inst)) { ... }
      }
    (see the fuller sketch below)
  Emulates screen (e.g., in an X window or on terminal)
  bochs is like this if you've ever used it
Is bochs a Popek & Goldberg VMM?
  Fidelity - yes
  Performance - No, much too slow
  Safety - yes
Refinement #1: Directly execute user code
  Observation: Only kernel code really needs to be interpreted
  So within guest OS, directly execute all user-mode code
    Slight annoyance: VMM resides in some range of virtual memory
      If OS gave user code VMM virtual addresses, must relocate
  Upon trap to kernel, fire up slow interpreter
  Will be faster, but still not good enough for VMware
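Aside: a toy version of the interpreter loop above, just to make the
fetch/decode/dispatch structure concrete; the struct and memory layout are
made up, and only two opcodes (nop, hlt) are actually handled:

    struct cpu { unsigned long regs[8]; unsigned long eip; };

    void interpret(struct cpu *c, unsigned char *mem)
    {
        for (;;) {
            unsigned char op = mem[c->eip++];   /* fetch */
            switch (op) {                       /* decode & dispatch */
            case 0x90:                          /* nop */
                break;
            case 0xf4:                          /* hlt: stop interpreting */
                return;
            default:
                /* A real interpreter decodes full x86 instructions and
                 * emulates their effect on registers, memory, and devices,
                 * applying the privilege checks hardware would apply. */
                break;
            }
        }
    }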
VMware uses binary translation of the kernel.  What's this?
  Translate guest kernel code into privileged code that runs in kernel mode
    But dangerous or non-virtualizable instructions not identically translated
      E.g., load of %cr3 must put shadow PTBR into %cr3, not guest's value
  Note, translated code uses CPL1, not CPL0, mostly for x86 particularities
    Hardware scribbles stuff onto stack on traps and exceptions
    For fidelity, don't want to clobber memory below guest's stack pointer
    By running at CPL1, trap frame written to VMM's private CPL0 stack
  Works one Translation Unit (TU) at a time
    TU is basically a basic block, translated into a compiled code fragment (CCF)
      Except capped at 12 instructions, so there's a max size for convenience
    Dumps translated TUs into memory region used as Translation Cache (TC)
  Look at isPrime example in section 3.1 (p. 3)
    Most code is identically translated (IDENT) - exact same instructions run
    But "jge prime" -> "jge [takenAddr]; jmp [fallthrAddr]"
  Addresses in brackets are *continuations* - what are these?
    Code is translated on demand
    When translating code, will see jumps to code not yet translated
    These are translated into jumps to continuation
      First time through, invokes translator
      Then patches jump so next time jump right to translated code
    Note: can elide [fallthrAddr] if that code is the next translation emitted
  What can't be IDENT?
    PC-relative addressing - compensation to compute relative to untranslated PC
    Direct control flow - map to corresponding address in TC at translation time
    Indirect control flow - must compute target dynamically
      Target is in register or on stack (e.g., return from call)
        Note, for fidelity don't assume ret has a corresponding call!
      Unlike direct control flow, don't know target when translating TU
      Emit code to look up target in hash table each time
        (see the lookup sketch below)
    Privileged instructions - work on shadow state
      E.g., update IF in shadow eflags - possibly faster than untranslated code!
  What does "innocent until proven guilty" mean?
    Some translated instructions still very expensive ("guilty")
      - writing to a page table and causing a tracing fault
      - reading/writing device registers
      - trying to use region of virtual memory reserved for VMM itself
    Keep track of cost - how many hidden faults generated by instruction
    Assume innocent, but after many faults take steps to make instruction
      cheaper (see the adaptation sketch below):
      Option 1: Use a "callout" (see ccf5 in Figure 1)
        Patch start of CCF to jump to new region
        There do something cheaper (e.g., update shadow PTE & avoid trap)
        Then jump back
      Option 2: Re-translate the CCF containing the "guilty" instruction
        More expensive, as all CCFs that branch to the original CCF must be updated
  Note non-IDENT translated code may need to access VMM data structures
    Use segmentation to avoid accidental access to VMM data by guest OS
    Use %gs: override prefix to access VMM data
    Means any use of %gs by guest (rare) cannot be identically translated
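Aside: a sketch of the hash-table lookup used for indirect control flow: map
a guest target address to its CCF in the TC, translating on a miss.  The
names, table size, and hash are illustrative, not VMware's implementation:

    #define TC_HASH_SIZE 4096

    struct tc_entry {                 /* guest address -> translated code */
        unsigned long guest_eip;
        void *ccf;                    /* start of compiled code fragment */
        struct tc_entry *next;
    };
    static struct tc_entry *tc_hash[TC_HASH_SIZE];

    /* Assumed: translates the TU at guest_eip, inserts it into the hash
     * table, and returns the new CCF. */
    extern void *translate_tu(unsigned long guest_eip);

    /* Evaluated at run time for every indirect jump/call/ret target. */
    void *tc_lookup(unsigned long guest_eip)
    {
        unsigned h = (guest_eip >> 2) & (TC_HASH_SIZE - 1);
        for (struct tc_entry *e = tc_hash[h]; e; e = e->next)
            if (e->guest_eip == guest_eip)
                return e->ccf;
        return translate_tu(guest_eip);   /* miss: translate on demand */
    }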
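Aside: a rough sketch of "innocent until proven guilty" adaptation: count
faults charged to a translated fragment and, past some threshold, patch it
to jump to a callout that avoids the trap.  The threshold and helpers are
assumptions, a schematic of the idea rather than the paper's mechanism:

    #define GUILTY_THRESHOLD 16

    struct ccf_info {
        void *ccf;              /* translated code fragment in the TC */
        unsigned fault_count;   /* hidden/tracing faults blamed on it */
    };

    /* Assumed helpers: build a callout that emulates the access directly
     * (e.g., updates the shadow PTE) and rewrite the CCF to jump to it. */
    extern void *make_callout(void *ccf);
    extern void patch_jump(void *ccf, void *callout);

    /* Called from the VMM's fault handler after attributing a fault. */
    void note_fault(struct ccf_info *ci)
    {
        if (++ci->fault_count < GUILTY_THRESHOLD)
            return;             /* still "innocent": leave the code alone */
        patch_jump(ci->ccf, make_callout(ci->ccf));   /* now "guilty" */
    }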
How does hardware virtualization change things?
  New CPU mode, guest mode, less privileged than host mode (where VMM runs)
    In guest mode, some sensitive instructions trap
    But hardware also keeps shadow state for many things (e.g., eflags)
  AMD: Enter guest mode using VMRUN instruction
    Loads state from VMCB data structure
      used to communicate guest OS state between H/W and VMM
    Various events cause EXIT back into host mode
      saves state to VMCB
    (see the run-loop sketch below)
    VMCB contains:
    * Control bits
      Intercept vector:
      - one bit for each of %cr0-%cr15 to say if trap on read of register
      - one bit for each of %cr0-%cr15 to say if trap on write of register
      - 32 analogous bits for the debug registers (%dr0-%dr15)
      - 32 bits for whether to intercept exception vectors 0-31
      - bits for various other events (e.g., NMI, SMI, ...)
      - bit to intercept writes to sensitive bits of %cr0 (not TS or MP)
      - 8 bits to intercept reads and writes of IDTR, GDTR, LDTR, TR
      - bits to intercept RDTSC, RDPMC, PUSHF, POPF, VMRUN, HLT,
        INVLPG, INT, IRET, IN/OUT (to selected ports), ...
      - Exit code and reason (e.g., which instruction/event caused exit)
      Other control values:
      - Pending virtual interrupt
      - Event injection of various exceptions
    * Saved guest state
      - Full segment registers (i.e., base, lim, attr, not just selectors)
      - Full GDTR, LDTR, IDTR, TR
      - Guest %cr3, %cr2 (and other cr/dr registers)
      - Guest eip and eflags (really rip & rflags for 64-bit processors)
      - Guest %rax register
    Entering/exiting VMM is a bit more expensive than traditional trap to kernel
      Saving/loading VMCB expensive - structure is 1024 bytes (664 now used)
  Intel: Similar ideas but calls it Virtual-Machine Control Structure (VMCS)
    - VMPTRLD - loads machine address of VMCS
    - VMLAUNCH - enters guest mode (Intel's analogue of VMRUN)
    - Hardware is allowed to cache VMCS contents
        Reading or writing corresponding machine memory illegal
    - VMREAD/VMWRITE - read and write fields of the currently active VMCS
    - VMCLEAR - flush VMCS back to memory
        E.g., can swap to disk or migrate to different core
  Big benefit: makes writing a VMM (with fidelity) much easier!

What about performance?  Is hardware virtualization faster?
  Sometimes yes, usually not.  When is it faster?
    Better at entering/exiting kernel
    E.g., Apache on Windows: one address space, lots of syscalls, H/W better
    Apache on Linux: multiple address spaces means context switches,
      tracing faults, etc., so software better

Section 6.2 (p. 7): Why is fork/wait so hard for VMMs? (6.0 sec -> 36.9/106.4)
  Section 4.3 (p. 5): What happens during fork?
  - fork system call vectors to kernel (CPL3 -> 0), no EXIT required
  - OS implements copy-on-write optimization.  Must write-protect parent
      Loops through page tables clearing W bits in PTEs
      Each PTE write causes a tracing fault which causes an EXIT
        EXIT more expensive than simple fault, which is what BT does
        But BT also adapts "guilty" instructions to avoid any fault
  - OS switches to child process, writes to %cr3, requires EXIT
  - Child touches pages not mapped in shadow PTs
      Causes hidden page faults, each of which requires an EXIT
  - Parent/child write to CoW pages, causing true page faults
      Each requires an EXIT (for VMM to decide it's a true fault)
  Go over Figure 4 (p. 8)
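Aside: a schematic VMM run loop for the AMD-style interface above, just to
show the VMRUN/EXIT shape.  The exit codes, opaque struct, and svm_vmrun()
wrapper are placeholders, not real SVM definitions:

    struct vmcb;                          /* opaque here; real VMCB is 1 KB */

    /* Placeholder exit codes and entry wrapper; the real SVM encoding
     * differs, and VMRUN is an instruction rather than a C call. */
    enum exit_code { EXIT_CR_WRITE, EXIT_PAGE_FAULT, EXIT_IO, EXIT_OTHER };
    extern void svm_vmrun(struct vmcb *vmcb);
    extern enum exit_code vmcb_exit_code(struct vmcb *vmcb);

    void run_guest(struct vmcb *vmcb)
    {
        for (;;) {
            svm_vmrun(vmcb);              /* H/W loads guest state, runs it */
            switch (vmcb_exit_code(vmcb)) {   /* why did the guest EXIT? */
            case EXIT_CR_WRITE:           /* e.g., guest loaded %cr3 */
                /* switch shadow page tables, drop stale shadow PTEs */
                break;
            case EXIT_PAGE_FAULT:
                /* classify as true, hidden, or tracing fault and handle it */
                break;
            case EXIT_IO:
                /* emulate the intercepted device access */
                break;
            default:
                /* inject pending interrupts/exceptions via the VMCB, etc. */
                break;
            }
        }
    }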
Who is writing this paper?  Might we expect any bias?
  At the time, both authors at VMware, which sold software virtualization
  H/W schemes reduce barrier to entry, so enable more competition
    So good news for VMware if BT is still faster than H/W virtualization!
  Authors are honest and would report either way
  But VMware tries to prohibit publishing benchmarks without permission
    Company might have barred publication if it made them look bad

Paper from 2006.  How would we expect numbers to look today?
  Nested page tables should be a big deal - eliminate tracing faults
    But will dramatically increase the cost of TLB misses
      (see the rough count below)
  VMware sources suggested their BT still faster on many benchmarks
    E.g., specjbb has large number of TLB faults
      so nested PT worse than both BT and existing H/W-based VMM
  But now even VMware mostly uses hardware virtualization
    BT mostly used for boot code before processor enters 64-bit mode
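Aside: a back-of-the-envelope count (not from the paper) of why TLB misses
get more expensive with nested paging, assuming 4-level page tables on both
the guest and the nested (host) side:

    /* Each guest page-table reference, plus the final guest-physical data
     * address, must itself be translated through the nested tables, so a
     * worst-case miss touches
     *   (guest_levels + 1) * (nested_levels + 1) - 1
     * memory locations: 5 * 5 - 1 = 24 for 4-level/4-level paging,
     * versus 4 references for a native page walk. */
    int nested_walk_refs(int guest_levels, int nested_levels)
    {
        return (guest_levels + 1) * (nested_levels + 1) - 1;
    }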