Comparison of Software and Hardware Techniques
==============================================

Popek and Goldberg define a Virtual Machine Monitor as having 3 properties:
  Fidelity - VMM looks to guests just like real hardware (except timing)
  Performance - Most instructions directly executed w/o VMM intervention
  Safety - The VMM manages hardware resources

What does it mean for an architecture to be "classically virtualizable"?
  Means you can implement a VMM using trap-and-emulate.  What's this?
    VMM runs guest kernel in CPU mode less privileged than it normally does
      *De-privileges* all instructions that read or write privileged state
        E.g., I/O ports, page table base register (PTBR), etc.
      De-privileged instructions trap; emulate them to make VMM transparent
  Keep primary and shadow structures
    *primary* structure is what guest would show hardware w/o the VMM
      E.g., page tables, PT base register (%cr3 on x86), flags register, etc.
    *shadow* structure is what VMM actually shows hardware
      E.g., actual value of %cr3
  Example: page tables
    primary page table translates Virtual -> guest "Physical" Page Nos.
      (i.e., would be physical address if no VMM, but is fictitious w. VMM)
    Shadow page tables translate Virtual -> Machine physical Page Nos.
      combines guest VPN -> "P"PN and VMM's "P"PN -> MPN mappings

What's the distinction between true and hidden page faults?
  All true page faults must vector to VMM for correctness
    Memory access that is also invalid according to the primary guest PTE
    VMM vectors to guest OS fault handler exactly as hardware would w/o VMM
  Page faults that are not true page faults are *hidden page faults*
    Note VMware treats shadow page tables as a cache
    So potentially any guest memory access can cause a hidden page fault
    (see the hidden-fault sketch below)

What are complications of guest OS access to PTEs?
  Make previously invalid PTE a valid mapping?
    Might take page fault on valid page
    But VMM can detect and fix up transparently.  Called "hidden page fault".
  Make previously valid PTE an invalid mapping?
    Problem if shadow page table still has mapping
      Access that should fault into VMM will execute and use old mapping
    Correct OS should flush TLB entry.  Can happen one of two ways:
      1. Guest OS can use INVLPG to flush TLB entry
         By de-privileging that instruction, VMM can keep shadow PT in sync
      2. Guest OS can load %cr3
         VMM must scan every possibly modified primary PTE to sync with shadow
         ...or VMM can invalidate most shadow PTEs (causing hidden page faults)
  Change PTE to point from one to a different "physical" page number
    Similarly, OS should call INVLPG or re-load %cr3
  Read and inspect PTE
    Problem: Accessed & dirty bits in primary PTE won't reflect shadow PTE
    Would have to fix with more hidden page faults
      (e.g., delay making shadow PTE writable until set dirty bit in primary)
  It turns out many OSes are buggy
    Supposedly many OSes won't work properly with an infinite TLB
    So relying on intercepting INVLPG might be iffy
    And depending on workload, performance penalty of hidden faults might be high
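Aside: a minimal sketch (not VMware's code) of a hidden-page-fault handler
that classifies true vs. hidden faults and, on a hidden fault, composes the
guest's primary PTE with the "P"PN -> MPN map to fill in a shadow PTE.  All
helper names here are made up:

    typedef unsigned long pte_t;
    #define PTE_P 0x1                         /* present bit */

    /* Assumed helpers, standing in for real VMM internals: */
    extern pte_t guest_walk(pte_t guest_cr3, unsigned long va); /* primary PT walk */
    extern unsigned long ppn_to_mpn(unsigned long ppn);         /* "phys" -> machine */
    extern void shadow_install(unsigned long va, pte_t spte);   /* fill shadow PT */
    extern pte_t guest_cr3_value(void);                         /* guest's %cr3 */

    /* Decide true vs. hidden fault; a real VMM would also check the access
     * type (read/write/exec) against the guest PTE's permission bits. */
    int handle_page_fault(unsigned long fault_va)
    {
        pte_t gpte = guest_walk(guest_cr3_value(), fault_va);
        if (!(gpte & PTE_P))
            return -1;       /* true fault: forward to guest OS handler */
        unsigned long mpn = ppn_to_mpn(gpte >> 12);
        shadow_install(fault_va, (mpn << 12) | (gpte & 0xfff));
        return 0;            /* hidden fault: just retry the access */
    }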
What is tracing?
  As alternative to above, intercept PTE updates by protecting page tables
    Similarly intercept access to memory-mapped devices by protecting pages
  Any accesses to these pages cause "tracing faults"
    VMM emulates write (e.g., updates primary & shadow PTE, emulates device)
    Restarts guest OS at next instruction
      ...so guest OS doesn't know there was a page fault
  Note distinction between tracing and hidden page faults
    After hidden: VMM fixes shadow PTE and re-executes faulting instruction
    After tracing: VMM decodes & emulates faulting instruction, then skips it,
      resuming guest execution at the *subsequent* instruction

Section 2.4 (p. 2): "striking a favorable balance in this three-way trade-off
among trace costs, hidden page faults, and context switch costs is surprising
both in its difficulty and its criticality to VMM performance."
Where does this trade-off come from?
  Idea: Should guest OS have direct access to:
    1) Primary PTEs, 2) corresponding VAs, or 3) both?
  1) Means most shadow PTEs must be invalid
     Most memory references will cause hidden page faults
     Use hidden page faults to sync shadow PTE with primary and vice versa
       Compute shadow PTE address mapping based on current primary PTE
       Set accessed and dirty bits appropriately in the primary PTE
  2) Means always trace access to page tables
     Lots of tracing faults (every PTE access causes one)
     But can ensure shadow PTEs are always correct
  3) Rely on INVLPG to keep shadow PTEs up to date
     High context switch overhead
       Whenever %cr3 loaded, must recompute entire shadow page table
     Still need some hidden faults for proper accessed/dirty in primary PTE

Is x86 classically virtualizable?  No.  Why not?
  Can't prevent some shadow state from being visible
    E.g., run kernel at privilege level 1 instead of normal kernel mode (0)
    But lower 2 bits of %cs register reflect Current Privilege Level (CPL)
    "push %cs; popl %eax" will not trap, and will put wrong value in %eax
  Can't de-privilege all instructions
    Some instructions just behave differently depending on CPL
    Example: popfl - pops flags off the stack
      When CPL > 0, does not change certain privileged bits in flags register
        (e.g., IF bit is interrupt enable flag - user code shouldn't clear it)
      But still useful instruction that you can't make trap

But can virtualize x86 with an interpreter.  What's this?
  Basically a software implementation of x86 hardware
      while (inst = get_next_instruction ()) {
        switch (decode_instruction_type (inst)) { ... }
      }
    (see the fuller sketch below)
  Emulates screen (e.g., in an X window or on terminal)
  bochs is like this if you've ever used it
Is bochs a Popek & Goldberg VMM?
  Fidelity - yes
  Performance - No, much too slow
  Safety - yes
Refinement #1: Directly execute user code
  Observation: Only kernel code really needs to be interpreted
  So within guest OS, directly execute all user-mode code
    Slight annoyance: VMM resides in some range of virtual memory
      If OS gave user code VMM virtual addresses, must relocate
  Upon trap to kernel, fire up slow interpreter
  Will be faster, but still not good enough for VMware
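Aside: a toy version of the interpreter loop above, just to make the
fetch/decode/dispatch structure concrete; the struct and memory layout are
made up, and only two opcodes (nop, hlt) are actually handled:

    struct cpu { unsigned long regs[8]; unsigned long eip; };

    void interpret(struct cpu *c, unsigned char *mem)
    {
        for (;;) {
            unsigned char op = mem[c->eip++];   /* fetch */
            switch (op) {                       /* decode & dispatch */
            case 0x90:                          /* nop */
                break;
            case 0xf4:                          /* hlt: stop interpreting */
                return;
            default:
                /* A real interpreter decodes full x86 instructions and
                 * emulates their effect on registers, memory, and devices,
                 * applying the privilege checks hardware would apply. */
                break;
            }
        }
    }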
VMware uses binary translation of the kernel.  What's this?
  Translate guest kernel code into privileged code that runs in kernel mode
    But dangerous or non-virtualizable instructions not identically translated
      E.g., load of %cr3 must put shadow PTBR into %cr3, not guest's value
  Note, translated code uses CPL1, not CPL0, mostly for x86 particularities
    Hardware scribbles stuff onto stack on traps and exceptions
    For fidelity, don't want to clobber memory below guest's stack pointer
    By running at CPL1, trap frame written to VMM's private CPL0 stack
  Works one Translation Unit (TU) at a time
    TU is basically a basic block, translated into a compiled code fragment (CCF)
      Except capped at 12 instructions, so there's a max size for convenience
    Dumps translated TUs into memory region used as Translation Cache (TC)
  Look at isPrime example in section 3.1 (p. 3)
    Most code is identically translated (IDENT) - exact same instructions run
    But "jge prime" -> "jge [takenAddr]; jmp [fallthrAddr]"
  Addresses in brackets are *continuations* - what are these?
    Code is translated on demand
    When translating code, will see jumps to code not yet translated
    These are translated into jumps to continuation
      First time through, invokes translator
      Then patches jump so next time jump right to translated code
    Note: can elide [fallthrAddr] if that code is the next translation emitted
  What can't be IDENT?
    PC-relative addressing - compensation to compute relative to untranslated PC
    Direct control flow - map to corresponding address in TC at translation time
    Indirect control flow - must compute target dynamically
      Target is in register or on stack (e.g., return from call)
        Note, for fidelity don't assume ret has a corresponding call!
      Unlike direct control flow, don't know target when translating TU
      Emit code to look up target in hash table each time
        (see the lookup sketch below)
    Privileged instructions - work on shadow state
      E.g., update IF in shadow eflags - possibly faster than untranslated code!
  What does "innocent until proven guilty" mean?
    Some translated instructions still very expensive ("guilty")
      - writing to a page table and causing a tracing fault
      - reading/writing device registers
      - trying to use region of virtual memory reserved for VMM itself
    Keep track of cost - how many hidden faults generated by instruction
    Assume innocent, but after many faults take steps to make instruction
      cheaper (see the adaptation sketch below):
      Option 1: Use a "callout" (see ccf5 in Figure 1)
        Patch start of CCF to jump to new region
        There do something cheaper (e.g., update shadow PTE & avoid trap)
        Then jump back
      Option 2: Re-translate the CCF containing the "guilty" instruction
        More expensive, as all CCFs that branch to the original CCF must be updated
  Note non-IDENT translated code may need to access VMM data structures
    Use segmentation to avoid accidental access to VMM data by guest OS
    Use %gs: override prefix to access VMM data
    Means any use of %gs by guest (rare) cannot be identically translated
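Aside: a sketch of the hash-table lookup used for indirect control flow: map
a guest target address to its CCF in the TC, translating on a miss.  The
names, table size, and hash are illustrative, not VMware's implementation:

    #define TC_HASH_SIZE 4096

    struct tc_entry {                 /* guest address -> translated code */
        unsigned long guest_eip;
        void *ccf;                    /* start of compiled code fragment */
        struct tc_entry *next;
    };
    static struct tc_entry *tc_hash[TC_HASH_SIZE];

    /* Assumed: translates the TU at guest_eip, inserts it into the hash
     * table, and returns the new CCF. */
    extern void *translate_tu(unsigned long guest_eip);

    /* Evaluated at run time for every indirect jump/call/ret target. */
    void *tc_lookup(unsigned long guest_eip)
    {
        unsigned h = (guest_eip >> 2) & (TC_HASH_SIZE - 1);
        for (struct tc_entry *e = tc_hash[h]; e; e = e->next)
            if (e->guest_eip == guest_eip)
                return e->ccf;
        return translate_tu(guest_eip);   /* miss: translate on demand */
    }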
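Aside: a rough sketch of "innocent until proven guilty" adaptation: count
faults charged to a translated fragment and, past some threshold, patch it
to jump to a callout that avoids the trap.  The threshold and helpers are
assumptions, a schematic of the idea rather than the paper's mechanism:

    #define GUILTY_THRESHOLD 16

    struct ccf_info {
        void *ccf;              /* translated code fragment in the TC */
        unsigned fault_count;   /* hidden/tracing faults blamed on it */
    };

    /* Assumed helpers: build a callout that emulates the access directly
     * (e.g., updates the shadow PTE) and rewrite the CCF to jump to it. */
    extern void *make_callout(void *ccf);
    extern void patch_jump(void *ccf, void *callout);

    /* Called from the VMM's fault handler after attributing a fault. */
    void note_fault(struct ccf_info *ci)
    {
        if (++ci->fault_count < GUILTY_THRESHOLD)
            return;             /* still "innocent": leave the code alone */
        patch_jump(ci->ccf, make_callout(ci->ccf));   /* now "guilty" */
    }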
How does hardware virtualization change things?
  New CPU mode, guest mode, less privileged than host mode (where VMM runs)
    In guest mode, some sensitive instructions trap
    But hardware also keeps shadow state for many things (e.g., eflags)
  AMD: Enter guest mode using VMRUN instruction
    Loads state from VMCB data structure
      used to communicate guest OS state between H/W and VMM
    Various events cause EXIT back into host mode
      saves state to VMCB
    (see the run-loop sketch below)
    VMCB contains:
    * Control bits
      Intercept vector:
      - one bit for each of %cr0-%cr15 to say if trap on read of register
      - one bit for each of %cr0-%cr15 to say if trap on write of register
      - 32 analogous bits for the debug registers (%dr0-%dr15)
      - 32 bits for whether to intercept exception vectors 0-31
      - bits for various other events (e.g., NMI, SMI, ...)
      - bit to intercept writes to sensitive bits of %cr0 (not TS or MP)
      - 8 bits to intercept reads and writes of IDTR, GDTR, LDTR, TR
      - bits to intercept RDTSC, RDPMC, PUSHF, POPF, VMRUN, HLT,
        INVLPG, INT, IRET, IN/OUT (to selected ports), ...
      - Exit code and reason (e.g., which instruction/event caused exit)
      Other control values:
      - Pending virtual interrupt
      - Event injection of various exceptions
    * Saved guest state
      - Full segment registers (i.e., base, lim, attr, not just selectors)
      - Full GDTR, LDTR, IDTR, TR
      - Guest %cr3, %cr2 (and other cr/dr registers)
      - Guest eip and eflags (really rip & rflags for 64-bit processors)
      - Guest %rax register
    Entering/exiting VMM is a bit more expensive than traditional trap to kernel
      Saving/loading VMCB expensive - structure is 1024 bytes (664 now used)
  Intel: Similar ideas but calls it Virtual-Machine Control Structure (VMCS)
    - VMPTRLD - loads machine address of VMCS
    - VMLAUNCH - enters guest mode (Intel's analogue of VMRUN)
    - Hardware is allowed to cache VMCS contents
        Reading or writing corresponding machine memory illegal
    - VMREAD/VMWRITE - read and write fields of the currently active VMCS
    - VMCLEAR - flush VMCS back to memory
        E.g., can swap to disk or migrate to different core
  Big benefit: makes writing a VMM (with fidelity) much easier!

What about performance?  Is hardware virtualization faster?
  Sometimes yes, usually not.  When is it faster?
    Better at entering/exiting kernel
    E.g., Apache on Windows: one address space, lots of syscalls, H/W better
    Apache on Linux: multiple address spaces means context switches,
      tracing faults, etc., so software better

Section 6.2 (p. 7): Why is fork/wait so hard for VMMs? (6.0 sec -> 36.9/106.4)
  Section 4.3 (p. 5): What happens during fork?
  - fork system call vectors to kernel (CPL3 -> 0), no EXIT required
  - OS implements copy-on-write optimization.  Must write-protect parent
      Loops through page tables clearing W bits in PTEs
      Each PTE write causes a tracing fault which causes an EXIT
        EXIT more expensive than simple fault, which is what BT does
        But BT also adapts "guilty" instructions to avoid any fault
  - OS switches to child process, writes to %cr3, requires EXIT
  - Child touches pages not mapped in shadow PTs
      Causes hidden page faults, each of which requires an EXIT
  - Parent/child write to CoW pages, causing true page faults
      Each requires an EXIT (for VMM to decide it's a true fault)
  Go over Figure 4 (p. 8)
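Aside: a schematic VMM run loop for the AMD-style interface above, just to
show the VMRUN/EXIT shape.  The exit codes, opaque struct, and svm_vmrun()
wrapper are placeholders, not real SVM definitions:

    struct vmcb;                          /* opaque here; real VMCB is 1 KB */

    /* Placeholder exit codes and entry wrapper; the real SVM encoding
     * differs, and VMRUN is an instruction rather than a C call. */
    enum exit_code { EXIT_CR_WRITE, EXIT_PAGE_FAULT, EXIT_IO, EXIT_OTHER };
    extern void svm_vmrun(struct vmcb *vmcb);
    extern enum exit_code vmcb_exit_code(struct vmcb *vmcb);

    void run_guest(struct vmcb *vmcb)
    {
        for (;;) {
            svm_vmrun(vmcb);              /* H/W loads guest state, runs it */
            switch (vmcb_exit_code(vmcb)) {   /* why did the guest EXIT? */
            case EXIT_CR_WRITE:           /* e.g., guest loaded %cr3 */
                /* switch shadow page tables, drop stale shadow PTEs */
                break;
            case EXIT_PAGE_FAULT:
                /* classify as true, hidden, or tracing fault and handle it */
                break;
            case EXIT_IO:
                /* emulate the intercepted device access */
                break;
            default:
                /* inject pending interrupts/exceptions via the VMCB, etc. */
                break;
            }
        }
    }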
Who is writing this paper?  Might we expect any bias?
  At the time, both authors at VMware, which sold software virtualization
  H/W schemes reduce barrier to entry, so enable more competition
    So good news for VMware if BT is still faster than H/W virtualization!
  Authors are honest and would report either way
  But VMware tries to prohibit publishing benchmarks without permission
    Company might have barred publication if it made them look bad

Paper from 2006.  How would we expect numbers to look today?
  Nested page tables should be a big deal - eliminate tracing faults
    But will dramatically increase the cost of TLB misses
      (see the rough count below)
  VMware sources suggested their BT still faster on many benchmarks
    E.g., specjbb has large number of TLB faults
      so nested PT worse than both BT and existing H/W-based VMM
  But now even VMware mostly uses hardware virtualization
    BT mostly used for boot code before processor enters 64-bit mode
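Aside: a back-of-the-envelope count (not from the paper) of why TLB misses
get more expensive with nested paging, assuming 4-level page tables on both
the guest and the nested (host) side:

    /* Each guest page-table reference, plus the final guest-physical data
     * address, must itself be translated through the nested tables, so a
     * worst-case miss touches
     *   (guest_levels + 1) * (nested_levels + 1) - 1
     * memory locations: 5 * 5 - 1 = 24 for 4-level/4-level paging,
     * versus 4 references for a native page walk. */
    int nested_walk_refs(int guest_levels, int nested_levels)
    {
        return (guest_levels + 1) * (nested_levels + 1) - 1;
    }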