Memory Resource Management in VMware ESX Server
===============================================

Terminology review: this paper talks about machine and "physical" memory
  An OS uses *page tables* to map from virtual pages -> physical pages
  On a raw machine, physical page = machine page (i.e., H/W page)
  On a virtual machine, "physical" page = *virtual* machine page (not H/W)
  VMM uses a *pmap* to map from virtual machine pages -> real H/W machine pages

So we have three kinds of pages now:
  - Virtual pages: addresses referenced by software
  - "Physical" pages: what on a physical machine corresponded to a H/W page
      on a virtual machine, the backing is determined by the VMM:
        could be some H/W page (at any address)
        could be paged out to disk and not in memory
  - Machine pages: actual H/W pages

And three kinds of mapping:
  - "Primary" page table: (per-process) virtual -> (per-VM) "physical" pages
      Mappings written by guest OS, Accessed/Dirty bits written by VMM
      Not used by the hardware MMU
  - VMM's pmap: (per-VM) "physical" -> (global) machine pages
      Only used by the VMM
  - Shadow page tables: (per-process) virtual -> (global) machine pages
      Mappings written by VMM, Accessed/Dirty bits written by hardware
      Never seen by the guest OS
  All translations are a function of the primary PT and the pmap
    Can be computed lazily on the fly, so use the shadow PT as a cache
    Its Accessed/Dirty bits are authoritative (VMM must copy them back to the primary PT)

In ESX, what are the three basic memory parameters for a VM?  min, max, shares
  min - VMM always guarantees this much machine memory, or won't run the VM
    Actually, need min + ~32MB of overhead
  max - the amount of "physical" memory the VM's OS thinks the machine has
    Obviously the VM can never consume more than this much memory
  shares - how much machine memory this VM should get relative to other VMs

Big question addressed by this paper: memory management when over-committed
  Straw man: just page "physical" memory to disk with LRU.  Why is this bad?
    OS probably already uses LRU, which leads to the "double paging" problem:
      the OS will free and re-use whatever "physical" page the VMM just paged out
    Also, performance concerns limit how much you can over-commit
    Plus can't modify the guest OS (e.g., its bcopy) to use many of Disco's tricks
  Goal: minimize memory usage and maximize performance with cool tricks

How to get pages back from the OS under memory pressure?
  System can be in one of four states: high, soft, hard, low
    high - (6% free) plenty of memory
    soft - (4% free) try to convince OSes to give back memory
    hard - (2% free) use random eviction to page stuff out to disk
    low  - (1% free) block execution of VMs above their memory usage targets

How to convince an OS to give back memory?  Ballooning
  Implement a special pseudo-device driver that allocates pinned "physical" memory
  VMM asks the balloon driver to allocate memory
  Balloon driver tells the VMM about pages that the guest OS will not touch

How well does this work (Figure 2)?
  Looks like only a small penalty compared to limiting physical memory at boot
  Why the penalty?  The OS sizes data structures based on physical memory size
    E.g., might have one "ppage" structure per physical page, which uses memory
  Would have been nice to have a third bar with just random eviction paging
    As it is, we don't know how much the cleverness of the technique is buying us

How to share pages across OSes?
  Use hashing to find pages with identical contents
  Big hash table maps hash values onto machine pages:
    If only mapped once, the page may be writable, so the hash is only a *hint*
      Hash table has a pointer back to the machine page
      Must do a full compare with the other page before combining into a shared CoW page
    If mapped multiple times, just keep a 16-bit reference count
      If the counter overflows (e.g., on the zero page), use an overflow table
  Scan OS pages randomly to stick hints in the hash table
  Note: always try sharing a page before paging it out to disk
  How well does this work? (Figure 4)
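A minimal sketch of that hint-then-compare flow, to make it concrete (not ESX code; the names machine_mem, pmap, frames, Hint, Shared, and try_share are made up for this sketch, and SHA-1 stands in for the paper's 64-bit hash function):

  import hashlib

  machine_mem = {}    # mpn -> bytes (page contents)
  pmap        = {}    # (vm, ppn) -> mpn: the VMM's "physical" -> machine mapping
  frames      = {}    # content hash -> Hint or Shared entry

  class Hint:                       # hashed once; page may have been written
      def __init__(self, mpn):      # since, so the hash is only a hint
          self.mpn = mpn

  class Shared:                     # read-only copy-on-write machine page
      def __init__(self, mpn):
          self.mpn = mpn
          self.refs = 1             # paper: 16-bit count plus an overflow table

  def try_share(vm, ppn):
      """Randomly scanned guest page: try to collapse it onto a shared copy."""
      mpn   = pmap[(vm, ppn)]
      data  = machine_mem[mpn]
      key   = hashlib.sha1(data).digest()
      entry = frames.get(key)
      if entry is None:
          frames[key] = Hint(mpn)               # first sighting: record a hint
          return False
      if entry.mpn == mpn:
          return isinstance(entry, Shared)      # this frame is already in the table
      if machine_mem[entry.mpn] != data:        # full compare: stale hint or collision
          if isinstance(entry, Hint):
              frames[key] = Hint(mpn)           # replace the stale hint
          return False
      if isinstance(entry, Hint):
          entry = frames[key] = Shared(entry.mpn)   # promote hint to a shared CoW page
      entry.refs += 1
      pmap[(vm, ppn)] = entry.mpn               # remap the guest page to the shared copy
      del machine_mem[mpn]                      # reclaim the duplicate frame
      return True

  # Tiny demo: two VMs with identical zero pages end up sharing one frame.
  machine_mem[10] = bytes(4096); pmap[("A", 0)] = 10
  machine_mem[11] = bytes(4096); pmap[("B", 0)] = 11
  try_share("A", 0)                             # records a hint
  try_share("B", 0)                             # full compare succeeds -> shared
  assert pmap[("A", 0)] == pmap[("B", 0)] == 10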
How does proportional share typically work (no idle tax)?
  Each VM has been assigned a number of "shares" S by the administrator
    and has been given some number of pages P by the VMM
  Reclaim a page from the OS with the lowest shares-to-pages ratio S/P
    E.g., if A and B both have S=1, reclaim from the larger of the two
    If A has twice B's shares, then A can use twice as much memory
  Can view S/P as the "price" a guest OS can afford to pay for a page
  Why is this not good enough?
    May have a high-priority VM wasting tons of memory
    (This is reasonable: a high-priority VM might sometimes not use its memory)

Digression: simple tax arithmetic  [Ignore payroll, state, and local taxes.]
  Suppose my income tax rate is T (e.g., might be 28% in the U.S.)
  For each $1 gross I earn, I pay T*$1 in taxes
    So $1 gross = $(1-T) take-home
    And $1 take-home requires $1/(1-T) gross pay
  Call k = 1/(1-T) the cost of a take-home dollar

Idea: idle memory tax.  Substituting memory pages for dollars...
  Any page you are using is fully "tax deductible" - it just costs one page
  If you aren't using a page, you must pay a fraction T of it back to the system
    So each idle page actually costs you k times the price of a non-idle page
  Now how much can a VM afford to pay for each "take-home" page?

    rho = S / ((# used pages) + k*(# idle pages))
        = S / (P*(f + k*(1-f)))

    where f is the fraction of active (non-idle) pages, and
    k is the "idle page cost", k = 1/(1-T) for tax rate 0 <= T < 1
  So reclaim from the VM with the lowest rho, instead of the lowest S/P
    (a small numeric sketch of this appears at the end of these notes)

How to determine how much idle memory?  Statistical sampling:
  Pick n pages at random, invalidate them, and see if they get accessed
  If t pages out of n were touched by the end of the period, estimate usage as t/n
  How expensive is this?  <= 100 page faults over 30 seconds -- negligible
  Actually keep three estimates:
    A slow exponentially weighted moving average of t/n over many samples
    A faster weighted average that adapts more quickly
    A version of the faster average that incorporates samples from the current period
  Use the max of the 3.  Why?
    Basically, when in doubt, want to respect priorities
    A spike in usage likely means the VM has "woken up"
    A small pause in usage doesn't necessarily mean it will last much longer
    Anecdote: behavior when the X server was paged out by a low-priority simulation

How well does this do?
  Fig. 7 (p. 9) looks good in terms of memory utilization
  Would be nice to see some end-to-end throughput numbers, too, though

What is the issue with I/O pages and bounce buffers?
  Better to re-locate pages that are often used for I/O
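To make the idle-tax arithmetic concrete, here is the small numeric sketch referred to from the idle-tax section above (the VM names and numbers are made up; T=0.75 is the paper's default tax rate, so k = 4):

  def rho(S, P, f, T=0.75):
      """S = shares, P = pages held, f = fraction of P that is actively used."""
      k = 1.0 / (1.0 - T)                 # cost of an idle page, in active-page units
      return S / (P * (f + k * (1.0 - f)))

  # Two VMs with equal shares and equal allocations, but A is mostly idle.
  vms = {
      "A": dict(S=1000, P=1000, f=0.2),   # 20% of its pages active
      "B": dict(S=1000, P=1000, f=0.9),   # 90% of its pages active
  }

  # rho(A) = 1000 / (1000*(0.2 + 4*0.8)) ~= 0.29
  # rho(B) = 1000 / (1000*(0.9 + 4*0.1)) ~= 0.77
  # so A, the mostly idle VM, is the one to reclaim from.
  # With no tax (T=0), k = 1 and both rho values collapse to S/P = 1.
  victim = min(vms, key=lambda name: rho(**vms[name]))
  assert victim == "A"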