Practical, transparent OS support for superpages
================================================

Look at Fig 1.  Why is TLB coverage decreasing over time?  Do we care?
    Physically addressed caches put the TLB on the critical path for the CPU
    Memory latencies not improving nearly as quickly as CPU speeds
    TLB is often fully associative - harder to scale than on-chip cache
        (Note: newer processors are not always fully associative)
    We care because:
        Relative cost of TLB misses is increasing
        If TLB coverage is smaller than the cache, cache hits may cause TLB misses!
    Good news: hardware supports superpages
        OSes typically use them only for kernel memory & the frame buffer,
        but could use them for more

Should we just use 64KB pages instead of 8KB?  Instant 8x coverage gain
    Changes semantics (e.g., of guard pages and mprotect)
    Would waste a lot of memory and cause more paging
    Would waste a lot of I/O bandwidth writing out larger dirtied pages
    Would potentially also hurt the hit rate of a cache with low set associativity
    And we should use even larger superpages when possible, to further reduce
        the miss rate
    Hence, need more intelligent use of superpages.  Will need to:
        1. Track available superpages, so we can allocate them
        2. Make more superpages available under memory pressure
        3. Figure out when to promote pages into superpages, or demote them
           back to base pages

Authors track memory using a buddy allocator - what's this?
    Binary buddy: keep one bitmap for each power of 2 from the min to the max
        chunk size
    When freeing a chunk, if its "binary buddy" is free, coalesce into the
        larger size
    Can do the same for each power of 8 --> if the other 7 chunks are free,
        coalesce
    Allocate a chunk of the appropriate size, or, if none is available, break
        up a larger chunk
    Alternative: keep two trees that index free chunks by location and by size
        Often better than binary buddy if you don't care about alignment
        But here virtual and physical pages need the same alignment, so buddy
        is better
    (A buddy-coalescing sketch appears at the end of these notes)

To make more superpages available, could relocate existing pages.  Drawbacks?
    Copying pages is not free - uses CPU, potentially trashes your cache
    Past work used cost-benefit analysis to decide when to copy
        Cost estimated from TLB misses - so you have to take more misses in
        the first place
        Worse yet: complicates the TLB miss handler and makes it even slower

What to do instead?  Reservation.  How does this work?
    If we might later want a superpage, reserve the physical memory
    surrounding the page

So when to promote pages?
    When all pages in a candidate superpage have been accessed and are clean
    Or when all pages in a candidate superpage have been dirtied

When to demote?
    Can speculatively demote to track usage (re-promote if all subpages are used)
    Shatter a clean superpage whenever the application dirties a page.  Why?
        Sec 6.7 shows up to a factor-of-20 performance penalty for not doing this
    What is the hash trick?  Why not use it?
        SHA-1 is collision resistant, but too expensive to compute over page contents
        Note: should have used Rabin fingerprints or some cheap message
        authentication code!

Promotion policy is simple, but how much to reserve?  (Sec 4.2)
    Philosophy: best to err on the side of reserving too much
    For a fixed-size object (e.g., an mmapped file):
        Pick the largest superpage that won't overlap another reservation or
        extend beyond the end of the object
    For a dynamically growing object (e.g., stack or heap):
        Eliminate the restriction about extending beyond the end of the object
        But limit the size of the new reservation by the current size of the
        object (so we don't reserve 4MB for an 8KB stack that may only grow
        to 16KB)
    (See the sizing sketch below)
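A minimal C sketch of the sizing policy just described.  The obj_info
struct, the overlaps_reservation() helper, and the size table are
hypothetical stand-ins for the paper's FreeBSD bookkeeping; the sizes match
the Alpha's 8KB/64KB/512KB/4MB page sizes.

    #include <stddef.h>

    /* Alpha page sizes used in the paper: 8KB base, 64KB, 512KB, 4MB. */
    static const size_t sp_sizes[] = { 8u << 10, 64u << 10, 512u << 10, 4u << 20 };
    #define NSIZES ((int)(sizeof(sp_sizes) / sizeof(sp_sizes[0])))

    struct obj_info {                /* hypothetical per-object descriptor */
        size_t cur_size;             /* bytes of the object populated so far      */
        size_t fixed_size;           /* total size for fixed-size objects, else 0 */
    };

    /* Hypothetical helper: nonzero if [off, off + size) would collide with an
     * existing reservation.  Stubbed out; the real check consults the
     * reservation bookkeeping. */
    static int overlaps_reservation(size_t off, size_t size)
    {
        (void)off; (void)size;
        return 0;
    }

    /* Pick a reservation size for a fault at byte offset 'off' within 'obj',
     * trying the largest superpage first and falling back to the base page. */
    size_t pick_reservation_size(size_t off, const struct obj_info *obj)
    {
        for (int i = NSIZES - 1; i >= 0; i--) {
            size_t size = sp_sizes[i];
            size_t base = off & ~(size - 1);   /* superpage-aligned start */
            if (overlaps_reservation(base, size))
                continue;                      /* don't collide with another reservation */
            if (obj->fixed_size != 0 && base + size > obj->fixed_size)
                continue;                      /* fixed-size object: don't pass its end  */
            if (obj->fixed_size == 0 && size > obj->cur_size)
                continue;                      /* growing object: cap at current size    */
            return size;
        }
        return sp_sizes[0];
    }

For example, under these rules an 8KB fault in a freshly created heap
(cur_size = 8KB) gets only an 8KB reservation, while a fault at the start of
a 1MB mmapped file can get a 512KB reservation.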
What to do if there is insufficient memory for a reservation?
    Preempt an old reservation that hasn't used all of its pages
    Prefer the new reservation (i.e., preempt the old one), because useful
        reservations are usually populated quickly
    When there are multiple preemption candidates?
        Choose the one that least recently allocated a new page

How does the superpage system keep track of reservations?
    Population map (see Fig. 3, p. 8)
        Basically a trie that uses a 3-bit VA snippet to select children
        Each node counts children with at least one page populated ("somepop")
            and children with all pages populated ("fullpop")
        Quick lookup to see if a faulting page has already been reserved
        Quick check whether one reservation would overlap another
        Page-promotion determination is easy (promote when fullpop == 8)
        Reservation is a preemption candidate when somepop == 0 at the right size
        (A population-map sketch appears at the end of these notes)
    System keeps reservations in lists
        One list for each size you could get by shattering (i.e., all but the
            largest available superpage size)
        Each list kept in LRA (least-recently allocated a page) order

How does eviction work in unmodified FreeBSD?
    Pages are on four lists, the last three in approximate LRU order:
    - free: pages that correspond to nothing (e.g., an exited process's memory)
    - cache: clean and unmapped (file data)
    - inactive: mapped, but not recently referenced (dirty or clean)
    - active: accessed recently, but may not have the reference bit set
    Under memory pressure, the page daemon is invoked; it:
    - moves clean inactive pages -> cache
    - pages out (i.e., cleans) dirty inactive pages
    - deactivates unreferenced pages from the active list

How does superpage-enabled FreeBSD handle memory or contiguity pressure?
    All cache pages are available for reservations
    Page daemon is invoked on memory or contiguity shortage
    All clean pages backed by a file are moved to the inactive list when the
        file is closed

What about wired pages?  (Sec 5.2)
What about multiple mappings?  (Sec 5.3)
    Good thing mmap can select the address

What questions should we ask of the evaluation?
    1. Does this really matter to the performance of real workloads?
    2. Are there alternative ways of addressing the problem?
    3. Will this work on other hardware, like the x86?
    4. Are there further improvements to the proposed technique?
    5. Does what they did hurt performance in some situations?

1. Does it matter?
    Matrix - 7.5x speedup.  Yes!
    Linker - still a good speedup, 32%
    Mesa - a slowdown (but modest)
2. Are there alternative ways of addressing the problem?
    Hardware - sub-block TLBs, or a second level of remapping
    Maybe should sacrifice associativity for TLB size???
3. Will this work on x86?
    It has 2MB or 4MB pages
    Figure 2 - looks like it helps, but not as much as on the Alpha
4. Further improvements?
    Hash-based dirty-page detection using a cheaper technique than SHA-1
    Rewrite PAL code for more efficient page tables (don't replicate PTEs)
    Do a better job of pre-zeroing pages
    Coalesce related small objects, such as files, into superpages
5. How might this work make performance *worse*?
    Overhead of extra computation & data structures
    Worse page-out choices to get contiguity (e.g., flushing blocks of closed files)
    Worse page-out choices because there is only one accessed/dirty bit per superpage
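Appendix: buddy coalescing (referenced in the buddy-allocator discussion
above).  A minimal C sketch assuming one free-bitmap per power-of-two order
over a small frame pool; the data layout and pool size are illustrative, not
FreeBSD's actual allocator.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_ORDER 9                  /* 2^9 base pages = 4MB with 8KB pages */
    #define NFRAMES   (1u << 15)         /* arbitrary pool size for the sketch  */

    /* free_bitmap[o] has one bit per order-o chunk; bit set => chunk is free. */
    static uint8_t free_bitmap[MAX_ORDER + 1][NFRAMES / 8];

    static int  test_bit (const uint8_t *bm, size_t i) { return (bm[i / 8] >> (i % 8)) & 1; }
    static void set_bit  (uint8_t *bm, size_t i)       { bm[i / 8] |=  (uint8_t)(1u << (i % 8)); }
    static void clear_bit(uint8_t *bm, size_t i)       { bm[i / 8] &= (uint8_t)~(1u << (i % 8)); }

    /* Free the order-'order' chunk starting at page frame number 'pfn',
     * coalescing with its binary buddy as long as the buddy is also free. */
    void buddy_free(size_t pfn, int order)
    {
        while (order < MAX_ORDER) {
            size_t idx   = pfn >> order;    /* index of this chunk at 'order'    */
            size_t buddy = idx ^ 1;         /* buddy differs only in the low bit */
            if (!test_bit(free_bitmap[order], buddy))
                break;                      /* buddy still in use: stop merging  */
            clear_bit(free_bitmap[order], buddy);
            pfn &= ~(((size_t)1 << (order + 1)) - 1);   /* align to parent chunk */
            order++;
        }
        set_bit(free_bitmap[order], pfn >> order);
    }

    int main(void)
    {
        buddy_free(0, 0);                  /* free frame 0            */
        buddy_free(1, 0);                  /* free frame 1, its buddy */
        /* frames 0 and 1 should have coalesced into one order-1 chunk */
        printf("order-1 chunk at frame 0 free? %d\n", test_bit(free_bitmap[1], 0));
        return 0;
    }

Allocation would do the inverse: take a free chunk of the requested order,
or split a larger free chunk until one of the right size exists.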
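Appendix: population map (referenced in the reservation-tracking discussion
above).  A minimal C sketch of the somepop/fullpop counters in one trie node
and the checks built on them; the node layout and field names are
illustrative, not the authors' actual structures.

    /* One node of the population-map trie.  Each level consumes 3 bits of
     * virtual address, so a node has up to 8 children, one per next-smaller
     * candidate superpage. */
    struct popmap_node {
        struct popmap_node *child[8];
        int somepop;   /* # children with at least one populated base page */
        int fullpop;   /* # children whose base pages are all populated    */
    };

    /* All 8 children fully populated => every base page under this node has
     * been touched, so the region is a candidate for promotion. */
    int can_promote(const struct popmap_node *n)
    {
        return n->fullpop == 8;
    }

    /* No child has any populated page => the reservation at this node's size
     * has gone unused, so it is a candidate for preemption. */
    int can_preempt(const struct popmap_node *n)
    {
        return n->somepop == 0;
    }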