VM Background
=============

Brief history of virtual machines
  Old idea from the 60s and 70s
    Allowed people to share hardware before multiprogrammed OSes
    Then fell out of favor with newer OSes and cheaper hardware
  By the mid 90s, the perception was that software was trailing hardware
    People were working on large ccNUMA machines
    "Large" means the chances of a node failing are higher
      So need new OS techniques for fault containment
    NUMA (non-uniform memory access) requires scheduling, VMem support
    Very little hope that 3rd-party vendors like Microsoft would support this
  In 1997, Stanford's Disco proposed addressing the problem with virtual machines
    Idea: Run many instances of a commodity OS on a many-CPU machine
      Takes care of NUMA, and maybe fault containment
    For aggressive scientific applications, run a specialized, small OS
  Disco led to VMware, which was very successful, probably for different reasons
    - Running many applications on Windows NT caused instability
      With VMware, just run multiple instances of NT on the same physical machine
    - Security isolation if you don't fully trust the app/OS in one VM
    - Statistical multiplexing across OSes saved hardware
      For policy reasons you might have different superusers in different OSes
  VMware led to other VM projects
    So successful that CPU manufacturers now support virtualization
    (more on this when we discuss today's paper)

Stanford Disco project details
  Directly executed both kernel and user code in the VM
  Required changing the kernel, an approach often called "paravirtualization"
  Virtualized the MIPS architecture, which required some incompatibility
    MIPS reserves the upper VA space for pseudo-physical memory
      Modify the kernel to run out of the lower segment, in mapped virtual addresses
    Reading/writing certain machine registers is privileged
      Trapping on every such instruction would be expensive
      Instead, change the kernel to read/write special virtual addresses
    New virtual disk/network devices require new drivers
  Made several other modifications to the OS (IRIX):
    VM call to get a pre-zeroed page (since the VMM must zero it for privacy anyway)
    VM call to say a page is on the free list
    Change mbuf management to avoid a linked list of free pages
      Thus some physical page contains only packet data, not OS-specific links
    Change bcopy to make a VM call that just re-maps the physical page
  Several key optimizations gave good performance:
    Copy-on-write disks - if VMs run the same OS, the same programs are loaded in multiple VMs
      Until a page is written, allows one physical page to back multiple VMs
      (a sketch of this kind of CoW bookkeeping follows this section)
    Combined buffer cache - allows one VM to use clean buffers from another VM
    Virtual network device that allows arbitrarily large packets
      Also multiply-maps aligned pages CoW rather than copying data
    Detect when a VM's OS puts the CPU in low-power mode (idle loop) and deschedule it
  Example: Efficient use of NFS across VMs
    Read from disk uses the global buffer cache
    Copy to the network buffer uses bcopy, which makes the VM remap call
    Sending the message to the other VM over the virtual network just remaps the page again
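The copy-on-write sharing idea above boils down to reference-counted bookkeeping:
map one machine page into several VMs read-only, and copy only on the first write.
Below is a minimal sketch of that bookkeeping in C.  Nothing here is Disco's actual
code; share_page, cow_fault, and the map_readonly/map_writable primitives are
invented names for whatever a VMM would use to install mappings.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGESZ 4096

    struct machpage {
        uint8_t data[PAGESZ];
        int     refcnt;            /* how many VMs currently map this machine page */
    };

    /* Assumed VMM primitives: map guest "physical" page ppn of VM vm to machine
     * page mp, with the given protection. */
    extern void map_readonly(int vm, unsigned long ppn, struct machpage *mp);
    extern void map_writable(int vm, unsigned long ppn, struct machpage *mp);

    /* Share an existing machine page into another VM, read-only. */
    void share_page(int vm, unsigned long ppn, struct machpage *mp)
    {
        mp->refcnt++;
        map_readonly(vm, ppn, mp);     /* a later write will fault into the VMM */
    }

    /* Handle a write fault on a shared page: break sharing by copying.
     * (Error handling omitted for brevity.) */
    struct machpage *cow_fault(int vm, unsigned long ppn, struct machpage *mp)
    {
        if (mp->refcnt == 1) {         /* last user: just make it writable again */
            map_writable(vm, ppn, mp);
            return mp;
        }
        struct machpage *copy = malloc(sizeof(*copy));
        memcpy(copy->data, mp->data, PAGESZ);
        copy->refcnt = 1;
        mp->refcnt--;
        map_writable(vm, ppn, copy);   /* this VM now writes to its private copy */
        return copy;
    }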
VMware virtualizes x86
  But works with unmodified OSes, so no paravirtualization

Comparison of Software and Hardware Techniques
==============================================

Popek and Goldberg define a Virtual Machine Monitor as having 3 properties:
  Fidelity - the VMM looks to guests just like real hardware (except for timing)
  Performance - most instructions are directly executed w/o VMM intervention
  Safety - the VMM manages hardware resources

What does it mean for an architecture to be "classically virtualizable"?
  Means you can implement a VMM using trap-and-emulate.  What's this?
    VMM runs the guest kernel in a CPU mode less privileged than it normally uses
    *De-privileges* all instructions that read or write privileged state
      E.g., I/O ports, page table base register (PTBR), etc.
    De-privileged instructions trap; emulate them to keep the VMM transparent
  Keep primary and shadow structures
    *primary* structure is what the guest would show hardware w/o the VMM
      E.g., page tables, PT base register (%cr3 on x86), flags register, etc.
    *shadow* structure is what the VMM actually shows hardware
      E.g., the actual value of %cr3
  Example: page tables
    Primary page tables translate Virtual -> guest "Physical" Page Nos.
      (i.e., would be the physical address if there were no VMM, but is fictitious with a VMM)
    Shadow page tables translate Virtual -> Machine physical Page Nos.
      Combines the guest's VPN -> "P"PN and the VMM's "P"PN -> MPN mappings
      (a sketch of this composition follows this section)
  What are the complications of guest OS access to PTEs?
    Make a previously invalid PTE a valid mapping
      Might take a page fault on a valid page
      But the VMM can detect and fix this up transparently.  Called a "hidden page fault".
    Make a previously valid PTE an invalid mapping
      Problem if the shadow page table still has the mapping
        An access that should fault will instead execute using the stale mapping
      But: a correct OS should use INVLPG to flush the TLB entry, so de-privilege that
    Change a PTE to point from one "physical" page number to a different one
      Similarly, a correct OS should call INVLPG, which the VMM can intercept to detect the change
    Read and inspect a PTE
      Problem: Accessed & dirty bits in the primary PTE won't reflect the shadow PTE
      Would have to fix with more hidden page faults
        (e.g., never make a shadow PTE writable until the dirty bit is set in the primary)
  It turns out many OSes are buggy
    In particular, many OSes wouldn't work properly with an infinite TLB
    So relying on intercepting INVLPG might not be good enough
    And depending on the workload, the performance penalty of hidden faults might be high

What is tracing?
  As an alternative to the above, intercept PTE updates by write-protecting the page tables
  Similarly, intercept access to memory-mapped devices by protecting their pages
  Any access to these pages causes a "tracing fault"
    VMM emulates the write (e.g., updates primary & shadow PTEs, emulates the device)
    Restarts the guest OS, so the guest OS doesn't know there was a page fault
  Note the distinction from "true page faults" - what are these?
    A memory access that is also invalid according to the primary guest PTE
    The VMM must vector these to the guest OS's page fault handler
  For correctness, all page faults go to the VMM first, which decides true vs. hidden
    Page faults that are not true page faults are hidden (the guest never sees them)
  Note VMware treats shadow page tables as a cache
    So potentially any guest memory access can cause a hidden page fault

Section 2.4 (p. 2): "striking a favorable balance in this three-way trade-off
  among trace costs, hidden page faults, and context switch costs is surprising
  both in its difficulty and its criticality to VMM performance."
  Where does this trade-off come from?
    Idea: the guest OS can cheaply access primary PTEs or the corresponding VAs, but not both
    Always trace access to page tables?  Lots of tracing faults
      Every PTE update causes a tracing fault, but the shadow PTEs are always correct
    No tracing - allow the guest kernel to write most page tables most of the time?
      1) lots of hidden faults (shadow PTE must stay invalid while the page table is writable), or
      2) pre-generate primary PTEs from the shadow when switching into the guest kernel
         (otherwise the kernel will not see correct accessed & dirty bits)
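To make the primary/shadow split concrete, here is a minimal sketch of how a VMM
might fill in a shadow PTE by composing the guest's VPN -> "P"PN mapping with its
own "P"PN -> MPN table, and how it classifies a fault as hidden vs. true.  All
names here (struct pte, pmap, guest_walk, handle_fault) are invented for
illustration; this is not VMware's code.

    #include <stdbool.h>
    #include <stdint.h>

    struct pte {                   /* simplified PTE: frame number + a few bits */
        uint64_t pfn;
        bool     present, writable, dirty;
    };

    extern uint64_t pmap[];        /* VMM's "physical" page -> machine page table */
    extern bool guest_walk(uint64_t vpn, struct pte **out);  /* find primary PTE */

    /* Called on a page fault taken while the guest runs.  Returns true if the
     * fault was "hidden" (a shadow-cache miss, fixed up transparently), false
     * if it is a "true" fault that must be forwarded to the guest's handler. */
    bool handle_fault(uint64_t vpn, bool is_write, struct pte *shadow)
    {
        struct pte *g;
        if (!guest_walk(vpn, &g) || !g->present || (is_write && !g->writable))
            return false;          /* invalid per the primary PTE: true fault */

        shadow->pfn     = pmap[g->pfn];          /* "P"PN -> MPN */
        shadow->present = true;
        if (is_write)
            g->dirty = true;       /* write through to the primary PTE */
        /* Keep the shadow read-only until the primary's dirty bit is set, so
         * the guest's first write faults here and the dirty bit stays honest. */
        shadow->writable = g->writable && g->dirty;
        return true;               /* hidden fault: just restart the guest */
    }

Note the read-only-until-dirty trick in the sketch is exactly the extra hidden
page fault mentioned above for keeping accessed & dirty bits accurate.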
Is x86 classically virtualizable?  No.  Why not?
  Can't prevent some shadow state from being visible
    E.g., run the kernel at privilege level 1 instead of its normal kernel mode (0)
    But the lower 2 bits of %cs reflect the Current Privilege Level (CPL)
    "push %cs; popl %eax" will not trap, and will put the wrong value in %eax
  Can't de-privilege all instructions
    Some instructions just behave differently depending on the CPL
    Example: popfl - pops the flags register off the stack
      When CPL > 0, it does not change certain privileged bits in the flags register
        (e.g., the IF bit is the interrupt enable flag - user code shouldn't clear it)
      But it is still a useful instruction that you can't make trap

But you can virtualize x86 with an interpreter.  What's this?
  Basically a software implementation of the x86 hardware

    while (inst = get_next_instruction ()) {
      switch (decode_instruction_type (inst)) {
        ...
      }
    }

  Emulates the screen (e.g., in an X window or on a terminal)
  bochs is like this, if you've ever used it
  Is bochs a Popek & Goldberg VMM?
    Fidelity - yes
    Performance - no, much too slow
    Safety - yes

Refinement #1: Directly execute user code
  Observation: Only kernel code really needs to be interpreted
  So within the guest OS, directly execute all user-mode code
    Slight annoyance: the VMM resides in some range of virtual memory
      If the OS gave user code the VMM's virtual addresses, must relocate
      (Maybe use segmentation to help the VMM relocate)
  Upon a trap to the kernel, fire up the slow interpreter
  Will be faster, but still not good enough for VMware

VMware uses binary translation of the kernel.  What's this?
  Translate guest kernel code into privileged code that runs in kernel mode
    But dangerous or non-virtualizable instructions are not identically translated
    E.g., a load of %cr3 must put the shadow PTBR into %cr3, not the guest's value
  Note: translated code uses CPL1, not CPL0, mostly because of x86 particularities
    Hardware scribbles stuff onto the stack on traps and exceptions
    For fidelity, don't want to clobber memory below the guest's stack pointer
    By running at CPL1, the trap frame is written to the VMM's private CPL0 stack
  Works one Translation Unit (TU) at a time
    A TU is basically a basic block, translated into a compiled code fragment (CCF)
      Except capped at 12 instructions, to have a maximum size for convenience
    Dumps translated TUs into a memory region used as the Translation Cache (TC)

Look at the isPrime example in section 3.1 (p. 3)
  Most code is identically translated (IDENT) - the exact same instructions run
  But "jge prime" -> "jge [takenAddr]; jmp [fallthrAddr]"
  Addresses in brackets are *continuations* - what are these?
    Code is translated on demand
    When translating code, the translator will see jumps to code not yet translated
      These are translated into jumps to a continuation
      The first time through, this invokes the translator
      It then patches the jump, so the next time control jumps right to translated code
    Note: can elide [fallthrAddr] if that code is the next translation emitted
    (a sketch of this translate-on-demand machinery follows this section)
  What can't be IDENT?
    PC-relative addressing - needs compensation code to compute addresses
      relative to the original (untranslated) PC
    Direct control flow - map to the corresponding address in the TC at translation time
    Indirect control flow - must compute the target dynamically
      The target is in a register or on the stack (e.g., a return from a call)
        Note: for fidelity, don't assume a ret has a corresponding call!
      Unlike direct control flow, the target isn't known when translating the TU
      Emit code to look up the target in a hash table each time
    Privileged instructions - work on shadow state
      E.g., update IF in the shadow eflags - possibly faster than the untranslated code!
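A rough sketch of the translate-on-demand machinery the continuation mechanism
implies: look up the target in the translation cache, translate it if it isn't
there, and back-patch the jump so later executions go straight to translated
code.  The helper names (tc_lookup, emit_ccf, patch_jump) are invented, and a
real translator like VMware's does far more.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t guest_pc;     /* address in the guest kernel's code */

    extern void *tc_lookup(guest_pc pc);             /* hash: guest PC -> CCF or NULL */
    extern void *emit_ccf(guest_pc pc);              /* translate one TU (<= 12 insns) into the TC */
    extern void  tc_insert(guest_pc pc, void *ccf);
    extern void  patch_jump(void *jump_site, void *target);   /* back-patch a branch */

    /* A continuation: reached the first time a translated branch targets code
     * that has not been translated yet.  Translate it, patch the branch, resume. */
    void *continuation(guest_pc target, void *jump_site)
    {
        void *ccf = tc_lookup(target);
        if (ccf == NULL) {
            ccf = emit_ccf(target);      /* IDENT where possible, non-IDENT otherwise */
            tc_insert(target, ccf);
        }
        if (jump_site != NULL)
            patch_jump(jump_site, ccf);  /* next time, jump directly into the TC */
        return ccf;                      /* caller resumes execution in this CCF */
    }

    /* Indirect control flow (ret, jmp *%eax) can't be patched at translation
     * time, so the emitted code calls something like this on every execution. */
    void *indirect_target(guest_pc target)
    {
        void *ccf = tc_lookup(target);
        return ccf ? ccf : continuation(target, NULL);
    }

The hash-table lookup in indirect_target is the per-execution cost that makes
indirect control flow more expensive than direct control flow.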
What does "innocent until proven guilty" mean?
  Some translated instructions are still very expensive ("guilty")
    - writing to a page table and causing a tracing fault
    - reading/writing device registers
    - trying to use the region of virtual memory reserved for the VMM itself
  Keep track of the cost - how many hidden faults each instruction generates
  Assume innocence, but after many faults take steps to make the instruction cheaper
    Option 1: Use a "callout" (see ccf5 in Figure 1)
      Patch the start of the CCF to jump to a new region
      There, do something cheaper (e.g., update the shadow PTE and avoid the trap)
      Then jump back
    Option 2: Re-translate the CCF containing the "guilty" instruction
      More expensive, as all CCFs that branch to the original CCF must be updated
  Note non-IDENT translated code may need to access VMM data structures
    Use segmentation to avoid accidental access to VMM data by the guest OS
      Use the %gs: override prefix to access VMM data
      Means any use of %gs by the guest (rare) cannot be identically translated

How does hardware virtualization change things?
  New CPU mode, guest mode, less privileged than host mode (where the VMM runs)
    In guest mode, some sensitive instructions trap
    But the hardware also keeps shadow state for many things (e.g., eflags)
  Enter guest mode using the VMRUN instruction
    Loads state from the VMCB
      data structure used to communicate guest OS state between the H/W and the VMM
    Various events cause an EXIT back into host mode
      saves state to the VMCB
  The VMCB contains:
  * Control bits
    Intercept vector:
      - one bit for each of %cr0-%cr15 to say whether to trap on a read of that register
      - one bit for each of %cr0-%cr15 to say whether to trap on a write of that register
      - 32 analogous bits for the debug registers (%dr0-%dr15)
      - 32 bits for whether to intercept exception vectors 0-31
      - bits for various other events (e.g., NMI, SMI, ...)
      - a bit to intercept writes to sensitive bits of %cr0 (other than TS or MP)
      - 8 bits to intercept reads and writes of IDTR, GDTR, LDTR, TR
      - bits to intercept RDTSC, RDPMC, PUSHF, POPF, VMRUN, HLT, INVLPG,
        INT, IRET, IN/OUT (to selected ports), ...
    Exit code and reason (e.g., which instruction/event caused the exit)
    Other control values:
      - Pending virtual interrupt
      - Event injection of various exceptions
  * Saved guest state
    - Full segment registers (i.e., base, limit, attributes, not just selectors)
    - Full GDTR, LDTR, IDTR, TR
    - Guest %cr3, %cr2 (and other cr/dr registers)
    - Guest eip and eflags (really rip & rflags for 64-bit processors)
    - Guest %rax register
  Entering/exiting the VMM is a bit more expensive than a traditional trap to the kernel
    Saving/loading the VMCB is expensive - the structure is 1024 bytes (664 currently used)
  Big benefit: makes writing a VMM (with fidelity) much easier!
  (a simplified sketch of the resulting VMM run loop follows this section)
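For concreteness, here is a stripped-down sketch of the kind of run loop the
VMCB implies: the VMM fills in intercept bits and guest state, executes VMRUN,
and dispatches on the exit code when control comes back.  The struct layout,
field names, and exit codes below are simplifications invented for illustration,
not AMD's actual encoding.

    #include <stdint.h>

    enum exitcode {                /* made-up symbolic exit reasons */
        EXIT_CR_WRITE, EXIT_EXCEPTION, EXIT_IO, EXIT_HLT, EXIT_INVLPG
    };

    struct vmcb {                  /* greatly simplified; the real VMCB is 1024 bytes */
        /* control area */
        uint16_t cr_read_intercept, cr_write_intercept;   /* one bit per %crN */
        uint32_t exception_intercept;                     /* vectors 0-31 */
        uint64_t exitcode, exitinfo;                      /* filled in on EXIT */
        /* saved guest state (a small subset) */
        uint64_t rip, rflags, rax, cr0, cr2, cr3;
    };

    extern void vmrun(struct vmcb *v);            /* enter guest mode; returns on EXIT */
    extern void emulate_cr_write(struct vmcb *v); /* e.g., install the shadow PTBR, not the guest's value */
    extern void handle_exception(struct vmcb *v); /* hidden fix-up, or inject into the guest if true */

    void run_guest(struct vmcb *v)
    {
        v->cr_write_intercept |= 1 << 3;          /* e.g., intercept writes to %cr3 */
        for (;;) {
            vmrun(v);                             /* hardware loads guest state and runs it */
            switch (v->exitcode) {                /* hardware saved guest state + exit reason */
            case EXIT_CR_WRITE:  emulate_cr_write(v);  break;
            case EXIT_EXCEPTION: handle_exception(v);  break;
            default:             /* I/O, HLT, INVLPG, interrupts, ... */  break;
            }
        }
    }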
What about performance?  Is hardware virtualization faster?
  Sometimes yes, usually not.  When is it faster?
    Better at entering/exiting the kernel
    E.g., Apache on Windows: one address space, lots of syscalls, so H/W is better
    Apache on Linux: multiple address spaces mean context switches, tracing
      faults, etc., so software is better

Section 6.2 (p. 7): Why is fork/wait so hard for VMMs?  (6.0 sec -> 36.9/106.4)
  Section 4.3 (p. 5): What happens during fork?
  - The fork system call vectors to the kernel (CPL3 -> 0); no EXIT required
  - The OS implements the copy-on-write optimization
    Must write-protect the parent
      Loops through the page tables clearing the W bits in PTEs
      Each PTE write causes a tracing fault, which causes an EXIT
        An EXIT is more expensive than the simple fault that BT takes
        And BT also adapts "guilty" instructions to avoid any fault at all
  - The OS switches to the child process and writes %cr3, which requires an EXIT
  - The child touches pages not mapped in the shadow page tables
    Causes hidden page faults, each of which requires an EXIT
  - Parent/child write to CoW pages, causing true page faults
    Each requires an EXIT (for the VMM to decide it's a true fault)

Who is writing this paper?  Might we expect any bias?
  Both authors work for VMware, which makes money from software virtualization
    H/W schemes reduce the barrier to entry, so they enable more competition
    So good news for VMware if BT is still faster than H/W virtualization!
  Authors are honest and would report either way
  But VMware prohibits publishing benchmark numbers without permission
    Nobody else could have legally published this paper
    The company might have barred publication if it made them look bad

How do we expect the numbers to look 5 years from now?
  Nested page tables should be a big deal - they eliminate tracing faults
    But they will dramatically increase the cost of TLB misses (rough arithmetic below)
  VMware sources suggest their BT is still faster on many benchmarks
    E.g., specjbb has a large number of TLB misses, so nested PTs are worse
      than both BT and the existing H/W-based VMM
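Back-of-the-envelope on the nested-paging TLB-miss cost (my arithmetic, assuming
4-level guest page tables and 4-level nested page tables):
  Every guest page-table reference is a guest-"physical" address, so it needs its own nested walk
  Worst-case TLB miss: 4 guest PTE reads
    + 4x4 nested reads to translate those guest page-table addresses
    + 4 nested reads to translate the final guest-"physical" address
    = 24 memory references, vs. 4 on bare hardware
  Which is why TLB-miss-heavy workloads like specjbb can lose even with tracing faults gone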