Memory Resource Management in VMware ESX Server
===============================================

We discussed three kinds of memory addresses in last lecture
  * Guest virtual addresses
  * Guest physical addresses
      Would be plain physical addresses if not in VM
      Only have meaning in the context of a specific VM
      Can be paged out to disk and not exist in RAM
      Can be mapped to any host physical address by VMM
      Can even map multiple guest physical pages to same host physical page
  * Host physical addresses
      Global across all VMs
      Never seen by guest OSes
      Correspond to actual bits in DRAM chips

These are defined by several data structures
  A guest operating system running in a VM maintains a *primary* page table
    Maps from *guest virtual* addresses to *guest physical* addresses
  The VMM has a per-VM data structure called pmap in this paper
    Maps from *guest physical* to *host physical*
    (Not to be confused with pmap machine-dependent layer in the BSD kernel.)
  In addition, the VMM maintains a *shadow* page table
    Maps directly from *guest virtual* addresses to *host physical*
    Hence is a function of guest PT and pmap
    Shadow PT is the only PT seen by hardware (assuming no nested paging)
      Accessed/Dirty bits authoritative (VMM must copy back to primary PT)
    Can be computed lazily on the fly, so use shadow PT as a cache
    Lots of last lecture dealt with keeping shadow PT in sync w. primary

Unfortunately, today's paper uses an older terminology:
  Ed Bugnion, who coined older terms, says he now prefers newer terminology
  Let's translate the following terms as we read the paper:
    Virtual address    -> guest virtual address
    "Physical" address -> guest physical address
    Machine address    -> host physical address
  Similarly, we have VPN, PPN, MPN for virtual, physical, machine page number

Note that VMware Workstation (or kvm, virtualbox, etc.)
has 4th address type:
  *Host virtual memory* is memory in processes running in host OS
  That's because VMware Workstation runs as a process on an existing OS
    Re-uses the host OS's device drivers, networking stack, etc.
    E.g., can run emacs and vmplayer side-by-side, where emacs uses host VA
  VMware ESX is different in that it replaces host OS
    Directly accesses NICs and other hardware
    No host networking stack to worry about

In ESX, what are three basic memory parameters for a VM?  min, max, shares
  min - VMM always guarantees this much machine memory, or won't run VM
    Actually, need min + ~32MB overhead
  max - this is amount of guest physical memory VM OS thinks machine has
    Obviously VM can never consume more than this much memory
  shares - how much host phys. memory this VM should have relative to other VMs

Big question addressed by this paper:  Memory management when over-committed

Straw man:  Just page host physical mem to disk with LRU.  Why is this bad?
  OS probably already uses LRU, which leads to "double paging" problem
    OS will free and re-use whatever guest physical page VMM just paged out
  Also, performance concerns limit how much you can over-commit
  Plus can't modify OS bcopy to use many of Disco's tricks

Goal:  Minimize memory usage and maximize performance with cool tricks

What happens under memory pressure (Sec 6.3)?
  System can be in one of four states: high, soft, hard, low
    high - (6% free) plenty of memory
    soft - (4% free) try getting OSes to give back memory (page as last resort)
    hard - (2% free) use random eviction to page stuff out to disk
    low  - (1% free) block execution of VMs above their mem usage targets

How to convince an OS to give back memory?  Ballooning
  Implement special pseudo device driver that allocates pinned physical memory
  VMM asks balloon driver to allocate memory
  Balloon driver tells VMM about pages that guest OS will not touch

How does balloon driver communicate with VMM?
  Section 3.6 says polls once per second to get target balloon size
  Could use any IO mechanism to communicate
    Access special guest physical address (handle via tracing), inb/outb
    Today would use "hypercall" (e.g., vmcall) instruction
  Could also conceivably use interrupts instead of polling

What happens if balloon memory accessed?
  OS shouldn't touch private balloon driver memory, so VM probably rebooted
  Handle as hidden page fault that VMM satisfies with zero-filled page
  After this, assuming reboot, VMM needs to resynchronize balloon state

How well does ballooning work (Figure 2)?
  Looks like only small penalty compared to limiting physical memory at boot
  Why the penalty?  OS sizes data structures based on physical memory size
    E.g., might have one "ppage" structure per physical page, will use memory

Does Figure 2 show that ballooning is effective?
  Would be nice to have third bar with just random eviction paging
  As it is, don't know how much cleverness of technique is buying us

How to share pages across OSes?
  Use hashing to find pages with identical contents
  Big hash table maps hash values onto host physical pages:
    If only mapped once, page may be writable, so hash is only *hint*
      Hash table has pointer back to guest physical page
      Must do full compare w. other page before combining into shared CoW page
    If mapped multiple times, just keep 16-bit reference count
      If counter overflows (e.g., on zero page) use overflow table
  Scan OS pages randomly to stick hints in hash table
  Note: Always try sharing a page before paging it out to disk

How well does this work?
  Figures 4, 5 show significant memory savings

Does it matter for performance?
  p. 7 - tiny 0.5% average speed-up (when no memory contention)
  So this probably means sharing is rarely harmful despite extra hashing
  But would be nice to see actual speedups under limited memory conditions

How does proportional share typically work (no idle tax)?
  Each VM has been assigned number of "shares" S by administrator
    and has been given some number of pages P by VMM
  Reclaim a page from OS with lowest ratio of "shares-to-pages" S/P
    E.g., if A and B both have S=1, reclaim from larger of the two
          if A has twice B's shares, then A can use twice as much memory
  Can view S/P as "price" guest OS can afford to pay for a page

Why is this not good enough?
  May have high-priority VM wasting tons of memory
  (This is reasonable: high priority VM might sometimes not use memory)

Digression:  Simple tax arithmetic
  [Ignore payroll, state, and local taxes.]
  Suppose my income tax rate is T (e.g., might be 28% in U.S.)
  For each $1 gross I earn, I pay T*$1 in taxes
    So $1 gross = $(1-T) take home
    And $1 take home requires $1/(1-T) gross pay
  Call k = 1/(1-T) the cost of a take home dollar (e.g., ~$1.39 for 28% tax)

Idea: idle memory tax.  Substituting memory pages for dollars...
  Any page you are using is fully "tax deductible" - just costs one page
  If you aren't using a page, must pay fraction T of it back to the system
    So each idle page actually costs you k times the price of a non-idle page
  Now how much can a VM afford to pay for each "take home" page?

                           S
        rho = -----------------------------------
               (# used pages) + k*(# idle pages)

                      S
        rho = ------------------
               P * (f + k(1-f))

    where f is fraction of active (non-idle) pages, and
    k is "idle page cost", k = 1/(1-T) for tax rate 0 <= T < 1
    at 75% tax rate, k = 4
  So reclaim from VM with lowest rho, instead of lowest S/P

How to determine how much idle memory?  Statistical sampling:
  Pick n pages at random, invalidate, see if accessed
  If t pages touched out of n at end of period, estimate usage as t/n
  How expensive is this?  <= 100 page faults over 30 seconds is negligible
  Actually keep three estimates:
    Slow exponentially weighted moving average of t/n over many samples
    Faster weighted average that adapts more quickly
    Version of faster average that incorporates samples in current period
  Use max of 3.  Why?
    Basically, when in doubt, want to respect priorities
    Spike in usage likely means VM has "woken up"
    Small pause in usage doesn't necessarily mean it will last longer
    Anecdote:  Behavior when X server paged out by low-priority simulation

How well does this do?
  Fig. 7 (p. 9) looks good in terms of memory utilization
  Would be nice to see some end-to-end throughput numbers, too, though

What is issue with I/O pages and bounce buffers?
  Better to re-locate pages that are often used for I/O
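
Sketch: idle memory tax arithmetic
  The rho formula from the idle memory tax discussion earlier in these notes can be tried out in a few lines of Python. This is only an illustration, not ESX code: the VM names, share counts, page counts, and active fractions are made up, and the names idle_page_cost, adjusted_ratio, and pick_victim are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str       # hypothetical label
    shares: float   # S, assigned by the administrator
    pages: int      # P, pages currently given to the VM by the VMM
    active: float   # f, estimated fraction of pages in active use


def idle_page_cost(tax_rate: float) -> float:
    """k = 1/(1-T), defined for tax rate 0 <= T < 1."""
    if not 0.0 <= tax_rate < 1.0:
        raise ValueError("tax rate must satisfy 0 <= T < 1")
    return 1.0 / (1.0 - tax_rate)


def adjusted_ratio(vm: VM, tax_rate: float) -> float:
    """rho = S / (P * (f + k*(1-f)))."""
    k = idle_page_cost(tax_rate)
    return vm.shares / (vm.pages * (vm.active + k * (1.0 - vm.active)))


def pick_victim(vms, tax_rate):
    """Reclaim from the VM with the lowest rho."""
    return min(vms, key=lambda vm: adjusted_ratio(vm, tax_rate))


# Made-up example: A has more shares but is mostly idle; B is busy.
vms = [
    VM("A", shares=2, pages=1500, active=0.1),
    VM("B", shares=1, pages=1000, active=0.9),
]
print(pick_victim(vms, tax_rate=0.0).name)    # T=0 reduces to plain S/P
print(pick_victim(vms, tax_rate=0.75).name)   # k=4 penalizes A's idle pages
```

  With T = 0 the rule is plain S/P and B (1/1000 vs. 2/1500) loses a page. With T = 0.75, A's denominator grows to 1500 * 3.7 = 5550 while B's is 1000 * 1.3 = 1300, so the mostly idle A becomes the victim despite its larger share.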
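
Sketch: hash hints for page sharing
  The hint-then-compare logic from the page sharing discussion earlier in these notes can be sketched as below. Everything here is an assumption for illustration: memory is modeled as a Python dict from (vm, page number) to page bytes, SHA-1 stands in for whatever hash the VMM uses, and the Sharer and Entry names are hypothetical. The 16-bit counter and overflow table are not modeled.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    content: Optional[bytes] = None          # set once the page is shared (read-only CoW)
    refs: int = 0                            # reference count for a shared page
    hint: Optional[Tuple[str, int]] = None   # (vm, page number) of an unshared candidate


class Sharer:
    def __init__(self, memory: Dict[Tuple[str, int], bytes]):
        self.memory = memory                 # hypothetical stand-in for guest pages
        self.table: Dict[bytes, Entry] = {}

    def try_share(self, vm: str, ppn: int) -> bool:
        """Return True if the page was merged into a shared copy."""
        page = self.memory[(vm, ppn)]
        key = hashlib.sha1(page).digest()
        entry = self.table.get(key)
        if entry is None:
            # First sighting: record a hint only; the page stays writable.
            self.table[key] = Entry(hint=(vm, ppn))
            return False
        if entry.content is not None:
            # Already shared: full compare guards against hash collisions.
            if entry.content == page:
                entry.refs += 1
                return True
            return False
        if entry.hint == (vm, ppn):
            return False
        # Hint may be stale (page written since hashing), so compare contents.
        if self.memory.get(entry.hint) == page:
            entry.content, entry.refs, entry.hint = page, 2, None
            return True
        entry.hint = (vm, ppn)               # replace the stale hint
        return False


zero = bytes(4096)
mem = {("vm1", 7): zero, ("vm2", 3): zero, ("vm3", 9): b"\x01" * 4096}
sharer = Sharer(mem)
print(sharer.try_share("vm1", 7))   # only a hint is recorded
print(sharer.try_share("vm2", 3))   # matches the hinted page after full compare
print(sharer.try_share("vm3", 9))   # different contents, new hint
```

  The first call only leaves a hint, the second finds the hint, does the full compare, and shares; the third has different contents and just leaves its own hint.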