Memory Resource Management in VMware ESX Server
===============================================

Terminology review: this paper talks about machine and "physical" memory
  An OS uses *page tables* to map from virtual pages -> physical pages
  On a raw machine, physical page = machine page (i.e., H/W page)
  On a virtual machine, "physical" page = *virtual* machine page (not H/W)
  VMM uses a *pmap* to map from virtual machine pages -> real H/W machine pages

So we have three kinds of pages now:
  - Virtual pages: addresses referenced by software
  - "Physical" pages: what on a physical machine corresponded to a H/W page
      on a virtual machine, the backing is determined by the VMM:
        could be some H/W page (at any address)
        could be paged out to disk and not in memory
  - Machine pages: actual H/W pages

And three kinds of mapping:
  - "Primary" page table: (per-process) virtual -> (per-VM) "physical" pages
      Mappings written by guest OS, Accessed/Dirty bits written by VMM
      Not used by the hardware MMU
  - VMM's pmap: (per-VM) "physical" -> (global) machine pages
      Only used by the VMM
  - Shadow page tables: (per-process) virtual -> (global) machine pages
      Mappings written by VMM, Accessed/Dirty bits written by hardware
      Never seen by the guest OS
  All translations are a function of the primary PT and the pmap
    Can be computed lazily on the fly, so use the shadow PT as a cache
    Its Accessed/Dirty bits are authoritative (VMM must copy them back to the primary PT)

In ESX, what are the three basic memory parameters for a VM?  min, max, shares
  min - VMM always guarantees this much machine memory, or won't run the VM
    Actually, need min + ~32MB of overhead
  max - the amount of "physical" memory the VM's OS thinks the machine has
    Obviously the VM can never consume more than this much memory
  shares - how much machine memory this VM should get relative to other VMs

Big question addressed by this paper: memory management when over-committed
  Straw man: just page "physical" memory to disk with LRU.  Why is this bad?
    OS probably already uses LRU, which leads to the "double paging" problem:
      the OS will free and re-use whatever "physical" page the VMM just paged out
    Also, performance concerns limit how much you can over-commit
    Plus can't modify the guest OS (e.g., its bcopy) to use many of Disco's tricks
  Goal: minimize memory usage and maximize performance with cool tricks

How to get pages back from the OS under memory pressure?
  System can be in one of four states: high, soft, hard, low
    high - (6% free) plenty of memory
    soft - (4% free) try to convince OSes to give back memory
    hard - (2% free) use random eviction to page stuff out to disk
    low  - (1% free) block execution of VMs above their memory usage targets

How to convince an OS to give back memory?  Ballooning
  Implement a special pseudo-device driver that allocates pinned "physical" memory
  VMM asks the balloon driver to allocate memory
  Balloon driver tells the VMM about pages that the guest OS will not touch

How well does this work (Figure 2)?
  Looks like only a small penalty compared to limiting physical memory at boot
  Why the penalty?  The OS sizes data structures based on physical memory size
    E.g., might have one "ppage" structure per physical page, which uses memory
  Would have been nice to have a third bar with just random eviction paging
    As it is, we don't know how much the cleverness of the technique is buying us

How to share pages across OSes?
  Use hashing to find pages with identical contents
  Big hash table maps hash values onto machine pages:
    If only mapped once, the page may be writable, so the hash is only a *hint*
      Hash table has a pointer back to the machine page
      Must do a full compare with the other page before combining into a shared CoW page
    If mapped multiple times, just keep a 16-bit reference count
      If the counter overflows (e.g., on the zero page), use an overflow table
  Scan OS pages randomly to stick hints in the hash table
  Note: always try sharing a page before paging it out to disk
  How well does this work? (Figure 4)
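A minimal sketch of that hint-then-compare flow, to make it concrete (not ESX code; the names machine_mem, pmap, frames, Hint, Shared, and try_share are made up for this sketch, and SHA-1 stands in for the paper's 64-bit hash function):

  import hashlib

  machine_mem = {}    # mpn -> bytes (page contents)
  pmap        = {}    # (vm, ppn) -> mpn: the VMM's "physical" -> machine mapping
  frames      = {}    # content hash -> Hint or Shared entry

  class Hint:                       # hashed once; page may have been written
      def __init__(self, mpn):      # since, so the hash is only a hint
          self.mpn = mpn

  class Shared:                     # read-only copy-on-write machine page
      def __init__(self, mpn):
          self.mpn = mpn
          self.refs = 1             # paper: 16-bit count plus an overflow table

  def try_share(vm, ppn):
      """Randomly scanned guest page: try to collapse it onto a shared copy."""
      mpn   = pmap[(vm, ppn)]
      data  = machine_mem[mpn]
      key   = hashlib.sha1(data).digest()
      entry = frames.get(key)
      if entry is None:
          frames[key] = Hint(mpn)               # first sighting: record a hint
          return False
      if entry.mpn == mpn:
          return isinstance(entry, Shared)      # this frame is already in the table
      if machine_mem[entry.mpn] != data:        # full compare: stale hint or collision
          if isinstance(entry, Hint):
              frames[key] = Hint(mpn)           # replace the stale hint
          return False
      if isinstance(entry, Hint):
          entry = frames[key] = Shared(entry.mpn)   # promote hint to a shared CoW page
      entry.refs += 1
      pmap[(vm, ppn)] = entry.mpn               # remap the guest page to the shared copy
      del machine_mem[mpn]                      # reclaim the duplicate frame
      return True

  # Tiny demo: two VMs with identical zero pages end up sharing one frame.
  machine_mem[10] = bytes(4096); pmap[("A", 0)] = 10
  machine_mem[11] = bytes(4096); pmap[("B", 0)] = 11
  try_share("A", 0)                             # records a hint
  try_share("B", 0)                             # full compare succeeds -> shared
  assert pmap[("A", 0)] == pmap[("B", 0)] == 10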
How does proportional share typically work (no idle tax)?
  Each VM has been assigned a number of "shares" S by the administrator
    and has been given some number of pages P by the VMM
  Reclaim a page from the OS with the lowest shares-to-pages ratio S/P
    E.g., if A and B both have S=1, reclaim from the larger of the two
    If A has twice B's shares, then A can use twice as much memory
  Can view S/P as the "price" a guest OS can afford to pay for a page
  Why is this not good enough?
    May have a high-priority VM wasting tons of memory
    (This is reasonable: a high-priority VM might sometimes not use its memory)

Digression: simple tax arithmetic  [Ignore payroll, state, and local taxes.]
  Suppose my income tax rate is T (e.g., might be 28% in the U.S.)
  For each $1 gross I earn, I pay T*$1 in taxes
    So $1 gross = $(1-T) take-home
    And $1 take-home requires $1/(1-T) gross pay
  Call k = 1/(1-T) the cost of a take-home dollar

Idea: idle memory tax.  Substituting memory pages for dollars...
  Any page you are using is fully "tax deductible" - it just costs one page
  If you aren't using a page, you must pay a fraction T of it back to the system
    So each idle page actually costs you k times the price of a non-idle page
  Now how much can a VM afford to pay for each "take-home" page?

    rho = S / ((# used pages) + k*(# idle pages))
        = S / (P*(f + k*(1-f)))

    where f is the fraction of active (non-idle) pages, and
    k is the "idle page cost", k = 1/(1-T) for tax rate 0 <= T < 1
  So reclaim from the VM with the lowest rho, instead of the lowest S/P
    (a small numeric sketch of this appears at the end of these notes)

How to determine how much idle memory?  Statistical sampling:
  Pick n pages at random, invalidate them, and see if they get accessed
  If t pages out of n were touched by the end of the period, estimate usage as t/n
  How expensive is this?  <= 100 page faults over 30 seconds -- negligible
  Actually keep three estimates:
    A slow exponentially weighted moving average of t/n over many samples
    A faster weighted average that adapts more quickly
    A version of the faster average that incorporates samples from the current period
  Use the max of the 3.  Why?
    Basically, when in doubt, want to respect priorities
    A spike in usage likely means the VM has "woken up"
    A small pause in usage doesn't necessarily mean it will last much longer
    Anecdote: behavior when the X server was paged out by a low-priority simulation

How well does this do?
  Fig. 7 (p. 9) looks good in terms of memory utilization
  Would be nice to see some end-to-end throughput numbers, too, though

What is the issue with I/O pages and bounce buffers?
  Better to re-locate pages that are often used for I/O
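To make the idle-tax arithmetic concrete, here is the small numeric sketch referred to from the idle-tax section above (the VM names and numbers are made up; T=0.75 is the paper's default tax rate, so k = 4):

  def rho(S, P, f, T=0.75):
      """S = shares, P = pages held, f = fraction of P that is actively used."""
      k = 1.0 / (1.0 - T)                 # cost of an idle page, in active-page units
      return S / (P * (f + k * (1.0 - f)))

  # Two VMs with equal shares and equal allocations, but A is mostly idle.
  vms = {
      "A": dict(S=1000, P=1000, f=0.2),   # 20% of its pages active
      "B": dict(S=1000, P=1000, f=0.9),   # 90% of its pages active
  }

  # rho(A) = 1000 / (1000*(0.2 + 4*0.8)) ~= 0.29
  # rho(B) = 1000 / (1000*(0.9 + 4*0.1)) ~= 0.77
  # so A, the mostly idle VM, is the one to reclaim from.
  # With no tax (T=0), k = 1 and both rho values collapse to S/P = 1.
  victim = min(vms, key=lambda name: rho(**vms[name]))
  assert victim == "A"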