Quick BSD VM implementation overview
====================================
Each process has a vmspace structure specifying its address space.  Contains:
- vm_map: machine-independent virtual address space
    This is where most of the "action" is
- vm_pmap: machine-dependent virtual address space
    Installs mappings, protects/unprotects pages, returns dirty/accessed bits
    Layer is lazy and is allowed to discard mappings (basically just a cache)
- statistics (e.g., for syscalls like getrusage)

The vm_map structure has a linked list of *vm_map_entry* structures
  Each describes contiguous virtual memory w. same protection+inheritance
  Can point to a memory object
    E.g., vnode of a memory-mapped file (including text segment),
    or swap-backed memory for anonymous mmap
      also used for program heap, and for program stack
  Can also point to a *shadow object*
    Used for copy-on-write after fork, or for a MAP_PRIVATE file

Objects point to lists of vm_page structures
  Each corresponds to a physical page (or a couple of pages if the software
    page size is larger)
  Links for various queues (like the free list)

What happens on a page fault?
  Traverse vm_map_entries in vm_map to get appropriate entry
    If nothing at this virtual address, send process a SIGSEGV
  Traverse list of shadow objects and final object
    For each object, traverse vm_page structures to find appropriate one
    If first vm_page found in first object in chain, map page
    Else if read fault, map page read-only
    Else, make copy of page and install in first (shadow) object read-write
    If vm_page not found, request page from object
      Will page in from file or swap, or zero-fill new page, etc.


Practical, transparent OS support for superpages
================================================
Look at Fig 1.  Why is TLB coverage decreasing over time?  Do we care?
  Physically addressed caches put the TLB on the CPU's critical path
  Memory latencies not improving nearly as quickly as CPU speeds
  TLB is often fully associative - harder to scale than on-chip cache
    (Note newer processors not always fully associative)
  We care because:
    Relative cost of TLB misses is increasing
    If TLB coverage smaller than cache, cache hits may cause TLB misses!

What do things look like now?  E.g., my machine (Xeon E5620):
  48 GiB RAM
  64 x 4-KiB D-TLB entries    [All TLBs are 4-way set associative]
  64 x 4-KiB I-TLB entries
  512 x 4-KiB L2 TLB entries
  So L1 D-TLB coverage (w/o large pages) = 0.0005%
  L2 TLB coverage = 0.0041%

Good news:  Hardware supports superpages
  OSes typically use them only for kernel memory & frame buffer,
    but could use them for much more
  E.g., on my Xeon, have 7 x 2-MiB I-TLB entries, 32 x 2-MiB D-TLB entries
    D-TLB coverage becomes 32 x 2 MiB / 48 GiB = 0.13%
  But other CPUs have even more flexibility
    (and it sounds like the same TLB entries can hold multiple page sizes)
    E.g., the alpha in the paper has 8KiB, 64KiB, 512KiB, and 4MiB translations

Should we just use 64KB pages instead of 8KB?  Instant 8x coverage gain
  Changes semantics (e.g., of guard pages and mprotect)
  Would waste a lot of memory and cause more paging
  Would waste a lot of I/O bandwidth writing out larger dirtied pages
  Would potentially also hurt hit rate in a low-associativity cache
  And should use even larger superpages if possible to further reduce miss rate

Hence, need more intelligent use of superpages.  Will need to:
  1. Track available superpages, so we can allocate them
  2. Make more superpages available under memory pressure
  3. Figure out when to promote pages into superpages, or demote back to pages

Authors track memory using a buddy allocator - what's this?
  Binary buddy:  keep one bitmap for each power of 2 from min to max chunk
      size (see the sketch below)
    When freeing a chunk, if its "binary buddy" is free, coalesce to larger size
    Can do same for each power of 8 --> if other 7 chunks free, coalesce
    Allocate chunk of appropriate size, or if none available, break larger chunk
  Alternative:  Keep two trees to index free chunks by location and size
    Often better than binary buddy if you don't care about alignment
    But here virtual and physical pages need same alignment, so buddy better
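A minimal sketch of the binary-buddy coalescing step in C, assuming one
free-bitmap per power-of-two order; MIN_ORDER/MAX_ORDER and the bitmap_*()
helpers are hypothetical names, not FreeBSD interfaces:

    #include <stdbool.h>
    #include <stddef.h>

    #define MIN_ORDER 13                /* 8-KiB base pages (2^13) */
    #define MAX_ORDER 22                /* 4-MiB largest chunk (2^22) */

    extern bool bitmap_test(int order, size_t idx);   /* chunk free? */
    extern void bitmap_set(int order, size_t idx);    /* mark free */
    extern void bitmap_clear(int order, size_t idx);  /* mark in use */

    /* Free the chunk of size 2^order containing physical address pa,
     * coalescing upward as long as the chunk's buddy is also free. */
    void
    buddy_free(size_t pa, int order)
    {
        while (order < MAX_ORDER) {
            size_t idx = pa >> order;
            if (!bitmap_test(order, idx ^ 1))   /* buddy = flip low bit */
                break;                          /* buddy in use: stop */
            bitmap_clear(order, idx ^ 1);       /* absorb the buddy... */
            pa &= ~(((size_t)1 << (order + 1)) - 1); /* align to parent */
            order++;                            /* ...free parent instead */
        }
        bitmap_set(order, pa >> order);
    }

Allocation runs the same logic in reverse: take a free chunk of the
requested order if one exists, else split the next larger free chunk in two
and recurse.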
To make more superpages available, could relocate existing pages.  Drawbacks?
  Copying pages is not free.  Uses CPU, potentially trashes your cache
  Past work used cost-benefit analysis to decide when to copy
    Cost estimated from TLB misses--so you have to take more misses in the
      first place just to gather the estimate
    Worse yet:  complicates the TLB miss handler and makes it even slower

What to do instead?  Reservation.  How does this work?
  If we might later want a superpage, reserve the physical memory
    surrounding the page

So when to promote pages?
  When all pages in a candidate superpage have been accessed and are clean
  Or when all pages in a candidate superpage have been dirtied

When to demote?
  Can speculatively demote to track usage (re-promote if all subpages used)
  Shatter a clean superpage whenever the application dirties a page.  Why?
    One dirty bit covers the whole superpage, so a single write would force
      writing all of it back; Sec 6.7 shows up to a factor of 20 penalty
      for not shattering
  What is the hash trick?  Why not use it?
    Hash each base page's contents to detect which subpages really changed
    SHA-1 is collision resistant, but too expensive to compute
    Note: Should have used Rabin fingerprints or some cheap message
      authentication code!

Promotion policy simple, but:  How much to reserve?  (Sec 4.2)
  Philosophy:  Best to err on the side of reserving too much
  For fixed-size object (e.g., mmapped file):
    Pick largest superpage that won't overlap another reservation
      or extend beyond the end of the object
  For dynamically growing object (e.g., stack or heap):
    Drop the restriction about extending beyond the end of the object
    But limit size of new superpage by current size of object
      (so don't reserve 4MB for an 8KB stack that may only grow to 16KB)

What to do if insufficient memory for a reservation?
  Preempt an old reservation that hasn't used its pages,
    because useful reservations are usually populated quickly
  When multiple preemption candidates?
    Choose the one that least recently allocated a new page

How does the superpage system keep track of reservations?
  Population map (see Fig. 3, p. 8; sketched in code after this list)
    Basically a trie that uses a 3-bit VA snippet to select children
    Each node counts children with at least one page ("somepop")
      and children with all pages present ("fullpop")
    Quick lookup to see if faulting page has already been reserved
    Quick check whether one reservation would overlap another
    Promotion decision is easy (promote when fullpop == 8)
    Reservation is preemptible when somepop == 0 at the right size
  System keeps reservations in lists
    One list for each size you could get by shattering a superpage
      (i.e., all but the largest available superpage size)
    Each list kept in LRA order (least recently allocated a page)
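To make the population map concrete, a sketch in C: the popmap struct, the
flag encoding, and popmap_populate() are invented for illustration (and child
nodes are assumed preallocated); the paper's Fig. 3 is the authoritative
layout:

    #include <stdint.h>

    #define PAGE_SHIFT 13   /* 8-KiB base pages */
    #define FANOUT     8    /* 8 children: page sizes grow 8x per level */
    #define NONEMPTY   1
    #define FULL       2

    struct popmap {
        struct popmap *child[FANOUT];
        int somepop;        /* # children with at least one populated page */
        int fullpop;        /* # children that are fully populated */
    };

    /* Mark the base page at va populated.  Returns flags saying whether
     * this subtree just became non-empty and/or fully populated; FULL at
     * some level means "promote to that superpage size". */
    int
    popmap_populate(struct popmap *n, uintptr_t va, int level)
    {
        if (level == 0)
            return NONEMPTY | FULL;      /* one base page: present, complete */

        int i = (va >> (PAGE_SHIFT + 3 * (level - 1))) & 7; /* 3-bit snippet */
        int r = popmap_populate(n->child[i], va, level - 1);
        int out = 0;

        if ((r & NONEMPTY) && ++n->somepop == 1)
            out |= NONEMPTY;             /* first populated page under n */
        if ((r & FULL) && ++n->fullpop == FANOUT)
            out |= FULL;                 /* all 8 children full: promote */
        return out;
    }

A reservation whose node still has somepop == 0 at the relevant level has
never been touched, which is what makes it the cheapest preemption victim.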
How does eviction work in unmodified FreeBSD?
  Pages live on four lists, the last three in approximate LRU order:
  - free:  useless pages (e.g., from a deleted file or an exited process)
      Note that some of these pages may be pre-zeroed in the idle loop
  - cache:  clean and unmapped (file data)
  - inactive:  mapped, but not recently referenced (dirty or clean)
  - active:  accessed recently, but may not have reference bit set
  Under memory pressure, the page daemon is invoked:
  - moves clean inactive pages -> cache
  - pages out (i.e., cleans) dirty inactive pages
  - moves unreferenced pages from the active list -> inactive

How does superpage-enabled FreeBSD handle memory or contiguity pressure?
  All cache pages are available for reservations
    Why not merge cache and free?  Cache pages might still be resurrected
  Page daemon invoked on memory *or* contiguity shortage
  All clean file-backed pages moved to the inactive list when the file is closed

What about wired pages?  (Sec 5.2)
What about multiple mappings?  (Sec 5.3)
  Good thing mmap lets the caller select the address

What questions should we ask of the evaluation?
  1. Does this really matter to the performance of real workloads?
  2. Are there alternative ways of addressing the problem?
  3. Will this work on other hardware, like the x86?
  4. Are there further improvements to the proposed technique?
  5. Does what they did hurt performance in some situations?

1. Does it matter?
   Matrix - 7.5x speedup.  Yes!
   Linker - still a good speedup, 32%
   Mesa - slowdown (but modest) - why?
     New system less likely to allocate pre-zeroed pages
2. Are there alternative ways of addressing the problem?
   Hardware - subblock TLBs (translation w. hole), or a second level of remapping
   Maybe should sacrifice associativity for TLB size???
3. Will this work on x86?
   x86 has 2MB pages
   Table 2 - looks like it helps, but not as well as on alpha
4. Further improvements?
   Hash dirty pages using a cheaper technique than SHA-1
   Re-write PAL code for more efficient page tables (don't replicate PTEs)
   Do a better job of pre-zeroing pages
   Coalesce related small objects such as files into superpages
5. How might this work make performance *worse*?
   Overhead of extra computation & data structures
   Worse page-out choices to get contiguity (e.g., flushing closed files' blocks)
   Worse page-out choices because only one accessed/dirty bit per superpage

Any lessons here for hardware designers?
  Multiple superpage sizes important (alpha good, x86 not as good)
  Would be useful to have 8 dirty (+ accessed) bits in superpage translations
  Replicating PTEs is a bad way to specify superpages
    (here x86 better than alpha; see the contrast sketched below)
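To illustrate that last point, a toy contrast of the two encodings; pte_t,
set_pte(), and the bit constants are hypothetical stand-ins, not real
kernel or hardware interfaces:

    #include <stdint.h>

    typedef uint64_t pte_t;
    extern void set_pte(int idx, pte_t pte);  /* hypothetical PT write */

    #define GH_4M  (3 << 5)   /* Alpha granularity hint: 512 x 8 KiB */
    #define PS_BIT (1 << 7)   /* x86 page-size bit in a directory entry */

    /* Alpha: flat array of 8-KiB PTEs; a 4-MiB superpage must be written
     * into all 512 of them, each with the same frame and hint bits. */
    void
    alpha_map_4m(int first_idx, pte_t frame_and_prot)
    {
        for (int i = 0; i < 512; i++)
            set_pte(first_idx + i, frame_and_prot | GH_4M);
    }

    /* x86: one page-directory entry with PS set covers the whole
     * superpage (2 MiB or 4 MiB, depending on paging mode). */
    void
    x86_map_super(int pde_idx, pte_t frame_and_prot)
    {
        set_pte(pde_idx, frame_and_prot | PS_BIT);
    }

Promoting or demoting a 4-MiB superpage on alpha means rewriting 512 PTEs
(which the PAL code must keep consistent), while x86 flips a single
directory entry - but x86 offers only the one superpage size.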