Memory Resource Management in VMware ESX Server
===============================================

Brief history of virtual machines
  Old idea from the 60s and 70s
    Allowed people to share hardware before multiprogrammed OSes existed
  Then fell out of favor with newer OSes and cheaper hardware
  By the mid 90s, the perception was that software was trailing hardware
    People were working on large ccNUMA machines
      "Large" means the chances of a node failing are higher
        So need new OS techniques for fault containment
      NUMA (non-uniform memory access) requires scheduling, VMem support
    Very little hope that 3rd-party vendors like Microsoft would support this
  In 1997, the Stanford Disco project proposed addressing the problem with virtual machines
    Idea: Run many instances of a commodity OS on a many-CPU machine
      Takes care of NUMA, and maybe fault containment
    For aggressive scientific applications, run a specialized, small OS
  Disco led to VMware, which was very successful, probably for different reasons
    - Running many applications on Windows NT causes instability
      With VMware, just run multiple instances of NT on the same physical machine
    - Security isolation if you don't fully trust the app/OS in one VM
    - Statistical multiplexing across OSes saved hardware
      For policy reasons, might have different superusers in different OSes
  VMware led to other VM projects
    So successful that CPU manufacturers are now supporting virtualization
    In the Nov 7 guest lecture you will hear more about this...

Stanford Disco project details
  Directly executed both kernel and user code in the VM
  Virtualized the MIPS architecture, which required some incompatibility
    Recall MIPS reserves the upper VA space for untranslated ("pseudo-physical") access to physical memory
      Modify kernel to run out of the lower segment, in mapped virtual addresses
    Reading/writing certain machine registers is privileged
      Trapping on every such instruction would be expensive
      Instead, change kernel to read/write special virtual addresses
    New virtual disk/network devices require new drivers
  Made several other modifications to the OS (IRIX):
    VM call to get a pre-zeroed page (since the VMM must zero it for privacy anyway)
    VM call to say a page is on the free list
    Change mbuf management to avoid a linked list of free pages
    Change bcopy to make a VM call that just re-maps the physical page
  Several key optimizations gave good performance:
    Copy-on-write disks - if same OS, the same programs get loaded in multiple VMs
      Until a page is written, allows one physical page to be used by multiple VMs
    Combined buffer cache - allows one VM to use clean buffers from another VM
    Virtual network device which allows arbitrarily large packets
      Also CoW-maps aligned pages rather than copying the data
    Detect when a VM's OS puts the CPU in low-power mode (idle loop) and deschedule it
  Example: Efficient use of NFS across VMs
    Read from disk uses the global buffer cache
    Copy to network buffer uses bcopy, which makes the VM remap call
    Send message to the other VM over the virtual network: just remaps the page again

VMware virtualized x86
  But had to work with unmodified OSes, so could not use direct execution
    Uses binary translation of kernel code (direct execution of user code)
  Early versions used the device drivers, file system, etc., of a host OS
  But ESX is essentially its own operating system, built specifically for running VMs

Terminology: This paper talks about machine and "physical" memory
  machine is hardware - what we usually think of as physical memory
  "physical" means what the VM's OS thinks of as physical memory, virtualized by the VMM

How would you implement the "physical" -> machine mapping on MIPS?
  Since the TLB is software-managed, the VMM can just implement a very large software TLB
  "Physical" memory is small, so can use one big lookup table indexed by PPN
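Because guest "physical" memory is small and contiguous, that lookup table can just be a flat array indexed by PPN.  Here is a minimal C sketch of such a pmap; the names (ppn_t, mpn_t, pmap_lookup) and details are illustrative, not taken from Disco or ESX.

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t ppn_t;   /* guest "physical" page number */
    typedef uint32_t mpn_t;   /* machine page number */

    #define MPN_INVALID ((mpn_t)-1)

    /* One flat table per VM: guest "physical" page number -> machine page
     * number.  On MIPS the VMM consults this when filling its software TLB:
     * a guest TLB miss is resolved by translating VA -> PPN with the guest's
     * own mappings, then PPN -> MPN with this table. */
    struct pmap {
        mpn_t  *ppn_to_mpn;   /* indexed directly by PPN */
        size_t  npages;       /* size of guest "physical" memory, in pages */
    };

    struct pmap *pmap_create(size_t npages)
    {
        struct pmap *pm = malloc(sizeof(*pm));
        if (pm == NULL)
            return NULL;
        pm->ppn_to_mpn = malloc(npages * sizeof(mpn_t));
        if (pm->ppn_to_mpn == NULL) {
            free(pm);
            return NULL;
        }
        pm->npages = npages;
        for (size_t i = 0; i < npages; i++)
            pm->ppn_to_mpn[i] = MPN_INVALID;  /* not yet backed by machine memory */
        return pm;
    }

    /* Returns MPN_INVALID if the guest page is currently unbacked
     * (e.g., reclaimed by ballooning or paged out by the VMM). */
    mpn_t pmap_lookup(const struct pmap *pm, ppn_t ppn)
    {
        if (ppn >= pm->npages)
            return MPN_INVALID;
        return pm->ppn_to_mpn[ppn];
    }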
How to virtualize memory on x86 (no software TLB)?
  VMM keeps a shadow page table, while the OS keeps its own page table
  Somewhat tricky to provide correct semantics of Accessed & Dirty bits
  First have to detect what might be a page table (non-trivial)
    But if you lose track, can always clear the valid bits in the PDIR and flush the TLB
  Option: set permissions on the OS page table memory so every access faults to the VMM
    Would be expensive
  Or might switch back and forth between OS page-table access & memory access:
    When the OS accesses a page table:
      Make the corresponding entry invalid in the shadow page directory (expensive)
      Synchronize the OS page table from the shadow (e.g., Accessed/Dirty bits)
      Make the OS page table readable and writable by the OS (no more faults)
    When user code accesses a page mapped by an accessible OS page table:
      Synchronize the shadow PT from the OS PT (through pmap PPN->MPN mappings)
      Disable OS access to the page table again
  Huge advantage of binary translation vs. direct execution:
    Can detect which stores are to page tables
    Re-write the code so it updates both the OS and shadow page tables

What are the three basic memory parameters for a VM?  min, max, shares
  min - VMM always guarantees this much machine memory, or won't run the VM
    Actually, need min + ~32MB of overhead
  max - the amount of "physical" memory the VM's OS thinks the machine has
    Obviously the VM can never consume more than this much memory
  shares - how much machine memory this VM should get relative to other VMs

Big question addressed by this paper: memory management when over-committed
  Straw man: Just page "physical" memory to disk with LRU.  Why is this bad?
    OS probably already uses LRU, which leads to the "double paging" problem
      OS will free and re-use whatever "physical" page the VMM just paged out
    Also, performance concerns limit how much you can over-commit
    Plus can't modify the OS's bcopy to use many of Disco's tricks
  Goal: Minimize memory usage and maximize performance with cool tricks

How to get pages back from the OS under memory pressure?
  System can be in one of four states: high, soft, hard, low
    high - (6% free) plenty of memory
    soft - (4% free) try to convince OSes to give back memory
    hard - (2% free) use random eviction to page stuff out to disk
    low  - (1% free) block execution of VMs above their memory usage targets

How to convince an OS to give back memory?  Ballooning
  Implement a special pseudo-device driver that allocates pinned "physical" memory
  VMM asks the balloon driver to allocate memory
  Balloon driver tells the VMM about pages that the guest OS will not touch
  How well does this work (Figure 2)?
    Looks like only a small penalty compared to limiting "physical" memory at boot
    But would have been nice to have a third bar: random eviction paging
      As it is, we don't know how much the cleverness of the technique is buying us

How to share pages across OSes?
  Use hashing to find pages with identical contents
  Big hash table maps hash values onto machine pages:
    If only mapped once, the page may still be writable, so the hash is only a *hint*
      Hash table entry has a pointer back to the machine page
      Must do a full compare with the other page before combining into a shared CoW page
    If mapped multiple times, just keep a 16-bit reference count
      If the counter overflows (e.g., on the zero page), use an overflow table
  Scan OS pages randomly to stick hints in the hash table
  Note: Always try sharing a page before paging it out to disk
  How well does this work?  (Figure 4)
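To make the hint-vs-shared distinction concrete, here is a small, self-contained C sketch of content-based page sharing under simplifying assumptions: a toy "machine memory" array, an FNV-style content hash, and an open-addressed table with one entry per bucket.  The structure and function names (frame, try_share, mpn_contents, mpn_mark_cow) are made up for illustration; ESX's real data structures differ.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE  4096
    #define NBUCKETS   (1u << 16)   /* buckets in the content-hash table */
    #define NMPN       64           /* size of the toy "machine memory" below */

    typedef uint32_t mpn_t;

    /* Toy stand-ins for VMM facilities: read a machine page's contents, and
     * write-protect (CoW) every mapping of a machine page. */
    static uint8_t machine_mem[NMPN][PAGE_SIZE];

    const void *mpn_contents(mpn_t mpn) { return machine_mem[mpn]; }
    void mpn_mark_cow(mpn_t mpn) { (void)mpn; /* would write-protect all mappings */ }

    struct frame {                  /* one entry per hashed machine page */
        uint64_t hash;              /* hash of the page's contents */
        mpn_t    mpn;               /* back-pointer to the machine page */
        uint16_t refcount;          /* 0 = hint (page may since have been written),
                                       >= 1 = shared, read-only CoW page */
        bool     used;
    };

    static struct frame table[NBUCKETS];

    /* Placeholder content hash (FNV-1a over the whole page). */
    static uint64_t page_hash(const void *data)
    {
        const uint8_t *p = data;
        uint64_t h = 14695981039346656037ull;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            h ^= p[i];
            h *= 1099511628211ull;
        }
        return h;
    }

    /* Try to share machine page `mpn`.  Returns the MPN the caller should map
     * (possibly an existing shared page, in which case the caller frees `mpn`);
     * otherwise records a hint so a future identical page can find this one. */
    mpn_t try_share(mpn_t mpn)
    {
        uint64_t h = page_hash(mpn_contents(mpn));
        struct frame *f = &table[h % NBUCKETS];

        if (f->used && f->hash == h && f->mpn != mpn) {
            /* A hint's page may have changed since it was hashed, so always
             * do a full comparison before actually sharing. */
            if (memcmp(mpn_contents(mpn), mpn_contents(f->mpn), PAGE_SIZE) == 0) {
                if (f->refcount == 0) {
                    mpn_mark_cow(f->mpn);   /* promote hint to shared CoW page */
                    f->refcount = 1;        /* the hint's original mapping... */
                }
                f->refcount++;              /* ...plus the caller's mapping
                                               (ESX spills overflows, e.g. the
                                               zero page, to a separate table) */
                return f->mpn;
            }
        }

        if (!f->used || f->refcount == 0) { /* install or replace a hint */
            f->used = true;
            f->hash = h;
            f->mpn  = mpn;
            f->refcount = 0;
        }
        return mpn;                         /* no sharing this time */
    }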
How does proportional share typically work (no idle tax)?
  Each VM has some number of "shares"
  Reclaim a page from the VM with the lowest ratio of shares to pages, S/P
    E.g., if A and B both have S = 1, reclaim from the larger of the two
    If A has twice B's shares, then A can use twice as much memory
  Why is this not good enough?
    May have a high-priority VM wasting tons of memory
    (This is reasonable: a high-priority VM might sometimes not be using its memory)

Idea: idle memory tax.  Instead of S/P, reclaim from the VM with the lowest:

                  S
      rho = ---------------
            P (f + k(1-f))

    where f is the fraction of active (non-idle) pages
          k is the "idle page cost", k = 1/(1-T) for tax rate 0 <= T < 1
    (a short code sketch of this victim-selection rule appears at the end of these notes)

How to determine how much memory is idle?  Statistical sampling:
  Pick n pages at random, invalidate them, see if they get accessed
  If t pages were touched out of n at the end of the sampling period,
      estimate the active fraction f as t/n
  Actually keep three estimates:
    Slow exponentially weighted moving average of t/n over many sample periods
    Faster weighted average that adapts more quickly
    Version of the faster average that incorporates samples from the current (incomplete) period
  Use the max of the 3.  Why?
    Basically, when in doubt, want to respect priorities
    A spike in usage likely means the VM has "woken up"
    A small pause in usage doesn't necessarily mean it will last much longer
  Anecdote: behavior when the X server was paged out by a low-priority simulation

How well does this do?
  Fig. 7 (p. 9) looks good in terms of memory utilization
  Would be nice to see some end-to-end throughput numbers, too, though

What is the issue with I/O pages and bounce buffers?
  Better to re-locate pages that are often used for I/O (so the device can DMA
      to them directly, without a bounce-buffer copy)
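As referenced above, here is a small C sketch of idle-memory-tax victim selection.  It computes the adjusted ratio rho = S / (P(f + k(1-f))) per VM, estimating f as the max of the three sampling averages, and reclaims from the VM with the lowest value.  The struct layout, field names, and example numbers are illustrative; the default tax rate T = 0.75 (so k = 4) is the paper's default.

    #include <stddef.h>
    #include <stdio.h>

    struct vm_mem {
        const char *name;
        double shares;      /* S */
        double pages;       /* P: machine pages currently allocated to the VM */
        double f_slow;      /* slow EWMA of t/n (fraction of sampled pages touched) */
        double f_fast;      /* faster EWMA */
        double f_current;   /* fast EWMA including the in-progress sample period */
    };

    /* Fraction of active pages: max of the three averages, so a VM that
     * "wakes up" is credited quickly, but a short lull is not taxed yet. */
    static double active_fraction(const struct vm_mem *vm)
    {
        double f = vm->f_slow;
        if (vm->f_fast > f)    f = vm->f_fast;
        if (vm->f_current > f) f = vm->f_current;
        return f;
    }

    /* Tax-adjusted shares-per-page ratio: rho = S / (P * (f + k*(1-f))),
     * with k = 1/(1-T) for tax rate T. */
    static double adjusted_ratio(const struct vm_mem *vm, double tax_rate)
    {
        double k = 1.0 / (1.0 - tax_rate);
        double f = active_fraction(vm);
        return vm->shares / (vm->pages * (f + k * (1.0 - f)));
    }

    /* Under memory pressure, reclaim from the VM with the lowest ratio. */
    static const struct vm_mem *pick_victim(const struct vm_mem *vms, size_t n,
                                            double tax_rate)
    {
        const struct vm_mem *victim = &vms[0];
        for (size_t i = 1; i < n; i++)
            if (adjusted_ratio(&vms[i], tax_rate) <
                adjusted_ratio(victim, tax_rate))
                victim = &vms[i];
        return victim;
    }

    int main(void)
    {
        /* Two equal-share VMs: A is mostly idle, B is fully active.  With
         * T = 0.75 (k = 4), the idle VM's ratio is lower, so pages are
         * reclaimed from it first despite the equal shares. */
        struct vm_mem vms[] = {
            { "A (idle)",   1.0, 200.0, 0.10, 0.10, 0.10 },
            { "B (active)", 1.0, 200.0, 0.95, 0.95, 0.95 },
        };
        const struct vm_mem *v = pick_victim(vms, 2, 0.75);
        printf("reclaim from %s\n", v->name);
        return 0;
    }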