Practical, transparent OS support for superpages
================================================

Look at Fig 1.  Why is TLB coverage decreasing over time?  Do we care?
    Physically addressed caches put the TLB on the critical path for the CPU
    Memory latencies not improving nearly as quickly as CPU speeds
    TLB is often fully associative - harder to scale than on-chip cache
        (Note: newer processors are not always fully associative)
    We care because:
        Relative cost of TLB misses is increasing
        If TLB coverage is smaller than the cache, cache hits may cause TLB misses!
    Good news: hardware supports superpages
        OSes typically use them only for kernel memory & the frame buffer,
        but could use them for more

Should we just use 64KB pages instead of 8KB?  Instant 8x coverage gain
    Changes semantics (e.g., of guard pages and mprotect)
    Would waste a lot of memory and cause more paging
    Would waste a lot of I/O bandwidth writing out larger dirtied pages
    Would potentially also hurt the hit rate of a cache with low set associativity
    And we should use even larger superpages when possible, to further reduce
        the miss rate
    Hence, need more intelligent use of superpages.  Will need to:
        1. Track available superpages, so we can allocate them
        2. Make more superpages available under memory pressure
        3. Figure out when to promote pages into superpages, or demote them
           back to base pages

Authors track memory using a buddy allocator - what's this?
    Binary buddy: keep one bitmap for each power of 2 from the min to the max
        chunk size
    When freeing a chunk, if its "binary buddy" is free, coalesce into the
        larger size
    Can do the same for each power of 8 --> if the other 7 chunks are free,
        coalesce
    Allocate a chunk of the appropriate size, or, if none is available, break
        up a larger chunk
    Alternative: keep two trees that index free chunks by location and by size
        Often better than binary buddy if you don't care about alignment
        But here virtual and physical pages need the same alignment, so buddy
        is better
    (A buddy-coalescing sketch appears at the end of these notes)

To make more superpages available, could relocate existing pages.  Drawbacks?
    Copying pages is not free - uses CPU, potentially trashes your cache
    Past work used cost-benefit analysis to decide when to copy
        Cost estimated from TLB misses - so you have to take more misses in
        the first place
        Worse yet: complicates the TLB miss handler and makes it even slower

What to do instead?  Reservation.  How does this work?
    If we might later want a superpage, reserve the physical memory
    surrounding the page

So when to promote pages?
    When all pages in a candidate superpage have been accessed and are clean
    Or when all pages in a candidate superpage have been dirtied

When to demote?
    Can speculatively demote to track usage (re-promote if all subpages are used)
    Shatter a clean superpage whenever the application dirties a page.  Why?
        Sec 6.7 shows up to a factor-of-20 performance penalty for not doing this
    What is the hash trick?  Why not use it?
        SHA-1 is collision resistant, but too expensive to compute over page contents
        Note: should have used Rabin fingerprints or some cheap message
        authentication code!

Promotion policy is simple, but how much to reserve?  (Sec 4.2)
    Philosophy: best to err on the side of reserving too much
    For a fixed-size object (e.g., an mmapped file):
        Pick the largest superpage that won't overlap another reservation or
        extend beyond the end of the object
    For a dynamically growing object (e.g., stack or heap):
        Eliminate the restriction about extending beyond the end of the object
        But limit the size of the new reservation by the current size of the
        object (so we don't reserve 4MB for an 8KB stack that may only grow
        to 16KB)
    (See the sizing sketch below)
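A minimal C sketch of the sizing policy just described.  The obj_info
struct, the overlaps_reservation() helper, and the size table are
hypothetical stand-ins for the paper's FreeBSD bookkeeping; the sizes match
the Alpha's 8KB/64KB/512KB/4MB page sizes.

    #include <stddef.h>

    /* Alpha page sizes used in the paper: 8KB base, 64KB, 512KB, 4MB. */
    static const size_t sp_sizes[] = { 8u << 10, 64u << 10, 512u << 10, 4u << 20 };
    #define NSIZES ((int)(sizeof(sp_sizes) / sizeof(sp_sizes[0])))

    struct obj_info {                /* hypothetical per-object descriptor */
        size_t cur_size;             /* bytes of the object populated so far      */
        size_t fixed_size;           /* total size for fixed-size objects, else 0 */
    };

    /* Hypothetical helper: nonzero if [off, off + size) would collide with an
     * existing reservation.  Stubbed out; the real check consults the
     * reservation bookkeeping. */
    static int overlaps_reservation(size_t off, size_t size)
    {
        (void)off; (void)size;
        return 0;
    }

    /* Pick a reservation size for a fault at byte offset 'off' within 'obj',
     * trying the largest superpage first and falling back to the base page. */
    size_t pick_reservation_size(size_t off, const struct obj_info *obj)
    {
        for (int i = NSIZES - 1; i >= 0; i--) {
            size_t size = sp_sizes[i];
            size_t base = off & ~(size - 1);   /* superpage-aligned start */
            if (overlaps_reservation(base, size))
                continue;                      /* don't collide with another reservation */
            if (obj->fixed_size != 0 && base + size > obj->fixed_size)
                continue;                      /* fixed-size object: don't pass its end  */
            if (obj->fixed_size == 0 && size > obj->cur_size)
                continue;                      /* growing object: cap at current size    */
            return size;
        }
        return sp_sizes[0];
    }

For example, under these rules an 8KB fault in a freshly created heap
(cur_size = 8KB) gets only an 8KB reservation, while a fault at the start of
a 1MB mmapped file can get a 512KB reservation.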
What to do if there is insufficient memory for a reservation?
    Preempt an old reservation that hasn't used all of its pages
    Prefer the new reservation (i.e., preempt the old one), because useful
        reservations are usually populated quickly
    When there are multiple preemption candidates?
        Choose the one that least recently allocated a new page

How does the superpage system keep track of reservations?
    Population map (see Fig. 3, p. 8)
        Basically a trie that uses a 3-bit VA snippet to select children
        Each node counts children with at least one page populated ("somepop")
            and children with all pages populated ("fullpop")
        Quick lookup to see if a faulting page has already been reserved
        Quick check whether one reservation would overlap another
        Page-promotion determination is easy (promote when fullpop == 8)
        Reservation is a preemption candidate when somepop == 0 at the right size
        (A population-map sketch appears at the end of these notes)
    System keeps reservations in lists
        One list for each size you could get by shattering (i.e., all but the
            largest available superpage size)
        Each list kept in LRA (least-recently allocated a page) order

How does eviction work in unmodified FreeBSD?
    Pages are on four lists, the last three in approximate LRU order:
    - free: pages that correspond to nothing (e.g., an exited process's memory)
    - cache: clean and unmapped (file data)
    - inactive: mapped, but not recently referenced (dirty or clean)
    - active: accessed recently, but may not have the reference bit set
    Under memory pressure, the page daemon is invoked; it:
    - moves clean inactive pages -> cache
    - pages out (i.e., cleans) dirty inactive pages
    - deactivates unreferenced pages from the active list

How does superpage-enabled FreeBSD handle memory or contiguity pressure?
    All cache pages are available for reservations
    Page daemon is invoked on memory or contiguity shortage
    All clean pages backed by a file are moved to the inactive list when the
        file is closed

What about wired pages?  (Sec 5.2)
What about multiple mappings?  (Sec 5.3)
    Good thing mmap can select the address

What questions should we ask of the evaluation?
    1. Does this really matter to the performance of real workloads?
    2. Are there alternative ways of addressing the problem?
    3. Will this work on other hardware, like the x86?
    4. Are there further improvements to the proposed technique?
    5. Does what they did hurt performance in some situations?

1. Does it matter?
    Matrix - 7.5x speedup.  Yes!
    Linker - still a good speedup, 32%
    Mesa - a slowdown (but modest)
2. Are there alternative ways of addressing the problem?
    Hardware - sub-block TLBs, or a second level of remapping
    Maybe should sacrifice associativity for TLB size???
3. Will this work on x86?
    It has 2MB or 4MB pages
    Figure 2 - looks like it helps, but not as much as on the Alpha
4. Further improvements?
    Hash-based dirty-page detection using a cheaper technique than SHA-1
    Rewrite PAL code for more efficient page tables (don't replicate PTEs)
    Do a better job of pre-zeroing pages
    Coalesce related small objects, such as files, into superpages
5. How might this work make performance *worse*?
    Overhead of extra computation & data structures
    Worse page-out choices to get contiguity (e.g., flushing blocks of closed files)
    Worse page-out choices because there is only one accessed/dirty bit per superpage
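Appendix: buddy coalescing (referenced in the buddy-allocator discussion
above).  A minimal C sketch assuming one free-bitmap per power-of-two order
over a small frame pool; the data layout and pool size are illustrative, not
FreeBSD's actual allocator.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_ORDER 9                  /* 2^9 base pages = 4MB with 8KB pages */
    #define NFRAMES   (1u << 15)         /* arbitrary pool size for the sketch  */

    /* free_bitmap[o] has one bit per order-o chunk; bit set => chunk is free. */
    static uint8_t free_bitmap[MAX_ORDER + 1][NFRAMES / 8];

    static int  test_bit (const uint8_t *bm, size_t i) { return (bm[i / 8] >> (i % 8)) & 1; }
    static void set_bit  (uint8_t *bm, size_t i)       { bm[i / 8] |=  (uint8_t)(1u << (i % 8)); }
    static void clear_bit(uint8_t *bm, size_t i)       { bm[i / 8] &= (uint8_t)~(1u << (i % 8)); }

    /* Free the order-'order' chunk starting at page frame number 'pfn',
     * coalescing with its binary buddy as long as the buddy is also free. */
    void buddy_free(size_t pfn, int order)
    {
        while (order < MAX_ORDER) {
            size_t idx   = pfn >> order;    /* index of this chunk at 'order'    */
            size_t buddy = idx ^ 1;         /* buddy differs only in the low bit */
            if (!test_bit(free_bitmap[order], buddy))
                break;                      /* buddy still in use: stop merging  */
            clear_bit(free_bitmap[order], buddy);
            pfn &= ~(((size_t)1 << (order + 1)) - 1);   /* align to parent chunk */
            order++;
        }
        set_bit(free_bitmap[order], pfn >> order);
    }

    int main(void)
    {
        buddy_free(0, 0);                  /* free frame 0            */
        buddy_free(1, 0);                  /* free frame 1, its buddy */
        /* frames 0 and 1 should have coalesced into one order-1 chunk */
        printf("order-1 chunk at frame 0 free? %d\n", test_bit(free_bitmap[1], 0));
        return 0;
    }

Allocation would do the inverse: take a free chunk of the requested order,
or split a larger free chunk until one of the right size exists.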
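Appendix: population map (referenced in the reservation-tracking discussion
above).  A minimal C sketch of the somepop/fullpop counters in one trie node
and the checks built on them; the node layout and field names are
illustrative, not the authors' actual structures.

    /* One node of the population-map trie.  Each level consumes 3 bits of
     * virtual address, so a node has up to 8 children, one per next-smaller
     * candidate superpage. */
    struct popmap_node {
        struct popmap_node *child[8];
        int somepop;   /* # children with at least one populated base page */
        int fullpop;   /* # children whose base pages are all populated    */
    };

    /* All 8 children fully populated => every base page under this node has
     * been touched, so the region is a candidate for promotion. */
    int can_promote(const struct popmap_node *n)
    {
        return n->fullpop == 8;
    }

    /* No child has any populated page => the reservation at this node's size
     * has gone unused, so it is a candidate for preemption. */
    int can_preempt(const struct popmap_node *n)
    {
        return n->somepop == 0;
    }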