Intro to Lecture
================

1) Lab 3 due.
2) Let's talk about some fundamentals (CS140).
   - Privilege separation (rings)
   - Privileged instructions
   - Exceptions: faults and traps
   - Interaction with rings and the MMU via page table entries

Intro to Paper
==============

Q: What is emulation? What is virtualization?

The paper uses the definition from Popek and Goldberg's 1974 paper. A VMM must have:
  1) Fidelity. Software running in the VMM runs identically to software on a
     real machine (barring timing effects).
  2) Performance. The machine executes almost all of the instructions itself.
  3) Safety. The VMM manages hardware resources. Isolation between VMs.

Emulation doesn't satisfy 2).

Q: How do we virtualize?
A: "Classically": trap-and-emulate.
   VMware: binary translate the untrappable things.
   Future (present, past): hardware.

Trap-and-Emulate
================

Idea: Keep a shadow of the physical CPU structures. Trap on privileged
instructions that modify/read these CPU structures and emulate/replay their
effects.

Tracing:
Q: What about memory-mapped I/O devices?
A: Use the MMU to page-fault on all reads/writes.

Q: What about privileged structures that are memory-mapped?
A: Use the MMU to page-fault on all reads, all writes, or both.

Q: What's the difference between hidden vs. true page faults?
A: Hidden: faults that occur BECAUSE of virtualization. Don't forward.
A: True: faults that occur due to the guest OS's own setup. Forward.

x86: Trap-and-Emulate Not Possible?
===================================

Q: Why?
A: We don't get traps when some privileged actions are done!
A: The machine reveals state (CPL, etc.)!

Examples:
  * popf silently ignores changes to the interrupt flag (which the guest
    kernel expects to change); pushf reveals the *real* interrupt flag.
  * Can read %cs: no trap, and the CPL is in there.

What real x86 state do we have to hide (i.e. != virtual state)? (thanks, 6.828!)
  * CPL (low bits of %cs): it is 3, but the guest expects 0
  * GDT descriptors (DPL 3, not 0)
  * gdtr (pointing to the shadow GDT)
  * IDT descriptors (traps go to the VMM, not the guest kernel)
  * idtr
  * page table (doesn't map to the expected physical addresses)
  * %cr3 (points to the shadow page table)
  * IF in EFLAGS
  * %cr0 &c

Q: The paper proposes one way to get around this, dynamic binary translation,
   and says that in the future (present), x86 is classically virtualizable.
   But how else can we get around this?
A: Emulate, but then this isn't virtualization.
A: Statically replace privileged instructions with int 3 and emulate in the VMM.
   - Won't this mess up the code?
   - Nope: INT 3 is 1 byte!
   - What about code generation?......!

Dynamic Binary Translation
==========================

Idea: Translate the binary as it executes.

As Dawson likes to say, super simple:  fn :: x86 -> x86

Example: let's binary translate the following (example from the paper, in
AT&T syntax instead of Intel):

    isPrime:
 PC ->  mov  %edi, %ecx
        mov  $2, %esi
        cmp  %esi, %ecx
        jge  prime
    nexti:
        mov  %ecx, %eax
        ...
    prime:
        ...
        ret
    notPrime:
        ...
        ret

Note: %edi contains the first argument, a (from the calling convention).

So, the PC is at isPrime, and we binary translate:
- Read/parse some number of instructions into IR objects grouped into one
  translation unit (TU).

Q: How many?
A: Up to 12, or until the first "terminating instruction" (control flow).

Q: What is a terminating instruction?
A: Any that does control flow, e.g.: jmp, ret, call, (mov %eax, %rip), etc.

Q: Why this limit?
A: We want to keep the instructions being translated in static memory.

The translation pipeline (so far) is:

    ---------------
    | instruction |  ->  (decoder)  ->  (wrap in IR)  ->  (add to TU)
    ---------------

So now we have a TU.
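In C, the TU-building loop might look like the minimal sketch below. This is
an illustration only, assuming hypothetical types and a hypothetical
decode_one() helper; none of these names come from VMware.

    /* Sketch of "decode up to 12 instructions or until the first
     * terminating (control-flow) instruction". */
    #include <stdint.h>

    #define TU_MAX_INSTRS 12          /* limit from the paper */

    struct ir_instr {
        uint64_t guest_pc;            /* address the instruction came from */
        int      len;                 /* length in bytes */
        int      is_terminating;      /* control flow: jmp, ret, call, ... */
        /* ... decoded opcode/operands ... */
    };

    struct tu {
        struct ir_instr instrs[TU_MAX_INSTRS];
        int n;
    };

    /* Placeholder decoder: a real one would parse the bytes at guest_pc. */
    static void decode_one(uint64_t guest_pc, struct ir_instr *out)
    {
        out->guest_pc = guest_pc;
        out->len = 1;
        out->is_terminating = 0;
    }

    static void build_tu(uint64_t pc, struct tu *tu)
    {
        tu->n = 0;
        while (tu->n < TU_MAX_INSTRS) {
            struct ir_instr *ir = &tu->instrs[tu->n++];
            decode_one(pc, ir);       /* (decoder) -> (wrap in IR) -> (add to TU) */
            if (ir->is_terminating)   /* stop at the first control-flow instr. */
                break;
            pc += ir->len;            /* otherwise continue with the next instr. */
        }
    }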
For our example above, we have the following TU:

    isPrime:
        mov  %edi, %ecx
        mov  $2, %esi
        cmp  %esi, %ecx
        jge  prime
    nexti:
        ...
    prime:
        ...

In compiler basic-block notation, we have:

              -----------
              | isPrime |
              -----------
               /       \
           ge /         \ else
             /           \
       ---------      ---------
       | prime |      | nexti |
       ---------      ---------

Now, we actually translate each instruction. Two types of translations:
  1) IDENT: do nothing to the instruction.
  2) Non-IDENT: the instruction becomes something else, usually several
     instructions.

The translation pipeline is:

    for each instr. in TU:
     ------------------------------------
    |  ------------------                |
    |  | IR instruction | -> (translate) |  ->  CCF  ->  (insert in hash table)
    |  ------------------                |             ->  (insert into trans. cache)
     ------------------------------------

Hash table: real PC of basic block -> address of CCF (compiled code fragment).

The first three instructions are IDENT. The jge instruction is not, because
the instructions are no longer in the same place as they used to be. So for a
jump, we (may) have to figure out where to go dynamically. At the very least,
we may not yet have translated the TU at that address. So we need to insert a
call into the translator for each possible branch (we have two; see the
diagram above). The translated unit becomes:

    isPrime':
        mov  %edi, %ecx
        mov  $2, %esi
        cmp  %esi, %ecx
        jge  [takenAddr = prime]    - calls the translator with address `prime`
        jmp  [fallThrAddr = nexti]  - calls the translator with address `nexti`

And we're done! So now we execute the code and keep going and going until...?

Q: Until when?
A: Until something forces us to jump back into the translator! In the example
   above, the two branches at the bottom would do this.

Q: What exactly happens when we hit `jge [takenAddr = prime]`?
A: It calls into the translator to translate the TU beginning at `prime`.
A: Probably something like:
       mov $prime, %gs:0xff890ec9b
       jmp $translator
   -> which then jumps to the CCF address for `prime`.

Q: Do we have to do this jump into the translator every time?
A: After the TU being jumped to has been translated, we can patch the old CCF
   to refer directly to the address of the new TU's CCF. Then we jump from one
   CCF straight into another. This is "chaining" the CCFs.

Q: What if we translate 12 instructions without a "terminating" instruction?
A: Add a call into the translator at the end, with where to go next.

Q: What's this %gs?
A: A segment register. Segmented memory: we can partition memory, and the
   register selects which partition to refer to. The VMM keeps its structures
   in the segment pointed to by %gs, so it needs to translate each reference
   to %gs in the guest.

Q: With that, what's going on with the ret translation for notPrime?

       xor   %eax, %eax
       pop   %r11                       <-- stores the return address in %r11
       mov   %rcx, %gs:0xff39eb8(%rip)  <-- stores %rcx in VMM memory
       movzx %r11b, %ecx                <-- stores the low 8 bits of the return
                                            address in %ecx
                                        <-- this clobbers %rcx (ecx = low 32 of rcx)
                                        <-- used by the translator in some way
       jmp   %gs:0xfc7dde0(8*%rcx)      <-- hash table lookup for the real address,
                                            or jumps into the translator to do the
                                            lookup and restore %rcx

Q: What other things do we have to translate non-IDENT?
  * All control flow (the translation cache is not in the same place as the
    original binary).
  * PC-relative addressing, (in)direct control flow.
  * Privileged instructions (cli can be replaced with a fast version).
  * Accesses to %gs: it is used by the VMM for its own data.

Q: The paper says we don't have to translate user-mode code. Why not?
A: ...
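Here is a minimal sketch, in C, of the lookup-and-chain logic that the
translated branches rely on. The names (lookup, translate_tu, patch_branch)
and the hash table layout are assumptions for illustration, not VMware's
actual code.

    /* Sketch of the branch-target lookup plus CCF chaining. */
    #include <stdint.h>
    #include <stddef.h>

    #define HASH_SLOTS 65536

    struct ccf {
        uint64_t guest_pc;     /* guest address of the basic block */
        void    *code;         /* its translation in the translation cache */
    };

    static struct ccf *hash_tbl[HASH_SLOTS];   /* real PC -> CCF */

    static struct ccf *lookup(uint64_t guest_pc)
    {
        struct ccf *c = hash_tbl[guest_pc % HASH_SLOTS];
        return (c && c->guest_pc == guest_pc) ? c : NULL;
    }

    /* Placeholders for the real work of translating and patching. */
    static struct ccf *translate_tu(uint64_t guest_pc)
    {
        static struct ccf fragment;            /* a real translator emits new code */
        fragment.guest_pc = guest_pc;
        hash_tbl[guest_pc % HASH_SLOTS] = &fragment;
        return &fragment;
    }
    static void patch_branch(void *site, void *target) { (void)site; (void)target; }

    /* Reached from a translated branch (e.g. `jge [takenAddr = prime]`) whose
     * target has not been chained yet.  `patch_site` is that branch's location
     * in the translation cache. */
    void *translator_continue(uint64_t target_pc, void *patch_site)
    {
        struct ccf *c = lookup(target_pc);
        if (!c)
            c = translate_tu(target_pc);    /* build the TU, emit its CCF,
                                               insert it in the hash table */
        patch_branch(patch_site, c->code);  /* chaining: later executions jump
                                               CCF-to-CCF, skipping the translator */
        return c->code;                     /* where execution continues now */
    }

On the first execution the branch lands in translator_continue(); once
patch_branch() has run, subsequent executions jump from one CCF straight into
the next without entering the translator.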
Adaptive Binary Translation
===========================

Doing the above translation for privileged instructions is nice because we can
translate typically expensive things (like cli) into cheap things (like a
memory store). What about other traps? We still have a bunch of traps from
virtual memory stuff.

Idea: Make a non-IDENT translation for instructions that frequently cause
these traps. In short: take something that would trap (say, a write to a page
table) and translate it into the direct instructions necessary to update the
VMM's page table metadata.

How?
  1) Retranslate the CCF to be more efficient.
  2) Add a jmp at the beginning of the old CCF to the new CCF.
  3) That's it.

Q: What if the instruction would no longer trap? (i.e., the address stopped
   corresponding to a page table or something...)
A: Revert the changes: remove the jmp from the original CCF, and for any code
   that was chained to the adapted CCF, add a jump back to the original CCF.

Hardware Virtualization
=======================

Basically, Intel and AMD made it possible to trap on privileged stuff.

Setup:
  * New CPL (ring): -1 (host mode).
  * VMM adds a VM entry for each virtual machine CPU
    -> needs a VMCS (VMCB in the paper) for each one, which stores information
       about the virtual CPU.
  * VMM does vmlaunch (vmrun in the paper) / vmresume to start the VM.
  * CPU handles almost all privileged operations; it virtualizes itself.
  * vmexit returns control to the VMM on traps and other conditions,
    e.g., syscall, hypercall, (port) I/O, page faults, etc.
  * No EPT (MMU virtualization) in the VT-x (VMX) version of the paper.
  * Which conditions cause exits is programmable (via the VMCS).

Q: So, is the VMM doing anything anymore?
A: Yes!
  1. Programming the CPU to do VT-x.
  2. Still shadowing memory (no EPT; even with EPT, it needs control).
  3. Implementing virtual devices (I/O for emulated devices).
  4. Handling I/O (MMIO).
  5. Managing multiple VMs.

Fork example: a guest user-mode process calls fork(). What happens?
  1) Well, this is a syscall. The CPU handles this directly, invoking the
     guest kernel.
     * The CPU changes the virtual state by itself.
  2) The guest kernel needs to modify paging structures for the new process.
     * This traps into the VMM. Shadow page table updated, etc.
  3) The guest switches address spaces (%cr3).
     * This traps into the VMM. Virtual %cr3 updated.
  4) The child runs. Page faults may occur.
     * These trap into the VMM. Shadow page tables updated or faults forwarded.

Q: What of the child's syscalls?
A: Handled by the CPU directly!

Q: What about popf now?
A: The CPU handles it correctly, directly. The state is stored in the VMCS.
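To make "is the VMM doing anything anymore?" concrete, here is a minimal
sketch of a hardware-assisted run loop in C. The names (vmcs_load, vm_enter,
the exit reasons, and the handlers) are hypothetical stand-ins for the
architected VT-x mechanisms (VMPTRLD, VMLAUNCH/VMRESUME, the VMCS exit-reason
field), not a real VMM's API.

    /* Sketch of a per-vCPU run loop: enter the guest, handle vmexits. */

    struct vmcs;                           /* per-virtual-CPU control structure */
    struct vcpu { struct vmcs *vmcs; };

    enum exit_reason {
        EXIT_CR_ACCESS,                    /* e.g. guest wrote %cr3 */
        EXIT_PAGE_FAULT,                   /* no EPT: VMM maintains shadow page tables */
        EXIT_IO,                           /* port I/O to an emulated device */
        EXIT_HYPERCALL,
    };

    /* Placeholders for the pieces the VMM still implements itself. */
    static void vmcs_load(struct vmcs *v)            { (void)v; }
    static enum exit_reason vm_enter(struct vcpu *v) { (void)v; return EXIT_HYPERCALL; }
    static void emulate_cr_write(struct vcpu *v)     { (void)v; }
    static void fix_shadow_pt(struct vcpu *v)        { (void)v; }
    static void emulate_device(struct vcpu *v)       { (void)v; }
    static void handle_hypercall(struct vcpu *v)     { (void)v; }

    void vcpu_run(struct vcpu *v)
    {
        vmcs_load(v->vmcs);                    /* make this vCPU's VMCS current */
        for (;;) {
            enum exit_reason r = vm_enter(v);  /* vmlaunch/vmresume; returns on vmexit */
            switch (r) {
            case EXIT_CR_ACCESS:  emulate_cr_write(v); break;  /* update virtual %cr3 */
            case EXIT_PAGE_FAULT: fix_shadow_pt(v);    break;  /* hidden vs. true fault */
            case EXIT_IO:         emulate_device(v);   break;  /* virtual device model */
            case EXIT_HYPERCALL:  handle_hypercall(v); break;
            }
        }
    }

The fork example above walks this loop: the fork() syscall and the child's
syscalls never leave the guest, while page-table writes, %cr3 switches, and
page faults come back to the VMM as exits.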