Quick BSD VM implementation overview
====================================
Each process has a vmspace structure specifying its address space.  Contains:
- vm_map: machine-independent virtual address space
    This is where most of the "action" is
- vm_pmap: machine-dependent virtual address space
    Installs mappings, protects/unprotects pages, returns dirty/accessed bits
    Layer is lazy and is allowed to discard mappings (basically just a cache)
- statistics (e.g., for syscalls like getrusage)

The vm_map structure has a linked list of *vm_map_entry* structures
  Each describes contiguous virtual memory w. same protection+inheritance
  Can point to a memory object
    E.g., vnode of a memory-mapped file (including text segment),
    or swap-backed memory for anonymous mmap
      also used for program heap, and for program stack
  Can also point to a *shadow object*
    Used for copy-on-write after fork, or for a MAP_PRIVATE file

Objects point to lists of vm_page structures
  Each corresponds to a physical page (or a couple of pages if the software
    page size is larger)
  Links for various queues (like the free list)

What happens on a page fault?
  Traverse vm_map_entries in vm_map to get appropriate entry
    If nothing at this virtual address, send process a SIGSEGV
  Traverse list of shadow objects and final object
    For each object, traverse vm_page structures to find appropriate one
    If first vm_page found in first object in chain, map page
    Else if read fault, map page read-only
    Else, make copy of page and install in first (shadow) object read-write
    If vm_page not found, request page from object
      Will page in from file or swap, or zero-fill new page, etc.


Practical, transparent OS support for superpages
================================================
Look at Fig 1.  Why is TLB coverage decreasing over time?  Do we care?
  Physically addressed caches put the TLB on the CPU's critical path
  Memory latencies not improving nearly as quickly as CPU speeds
  TLB is often fully associative - harder to scale than on-chip cache
    (Note newer processors not always fully associative)
  We care because:
    Relative cost of TLB misses is increasing
    If TLB coverage smaller than cache, cache hits may cause TLB misses!

What do things look like now?  E.g., my machine (Xeon E5620):
  48 GiB RAM
  64 x 4-KiB D-TLB entries    [All TLBs are 4-way set associative]
  64 x 4-KiB I-TLB entries
  512 x 4-KiB L2 TLB entries
  So L1 D-TLB coverage (w/o large pages) = 0.0005%
  L2 TLB coverage = 0.0041%

Good news:  Hardware supports superpages
  OSes typically use them only for kernel memory & frame buffer,
    but could use them for much more
  E.g., on my Xeon, have 7 x 2-MiB I-TLB entries, 32 x 2-MiB D-TLB entries
    D-TLB coverage becomes 32 x 2 MiB / 48 GiB = 0.13%
  But other CPUs have even more flexibility
    (and it sounds like the same TLB entries can hold multiple page sizes)
    E.g., the alpha in the paper has 8KiB, 64KiB, 512KiB, and 4MiB translations

Should we just use 64KB pages instead of 8KB?  Instant 8x coverage gain
  Changes semantics (e.g., of guard pages and mprotect)
  Would waste a lot of memory and cause more paging
  Would waste a lot of I/O bandwidth writing out larger dirtied pages
  Would potentially also hurt hit rate in a low-associativity cache
  And should use even larger superpages if possible to further reduce miss rate

Hence, need more intelligent use of superpages.  Will need to:
  1. Track available superpages, so we can allocate them
  2. Make more superpages available under memory pressure
  3. Figure out when to promote pages into superpages, or demote back to pages

Authors track memory using a buddy allocator - what's this?
  Binary buddy:  keep one bitmap for each power of 2 from min to max chunk
      size (see the sketch below)
    When freeing a chunk, if its "binary buddy" is free, coalesce to larger size
    Can do same for each power of 8 --> if other 7 chunks free, coalesce
    Allocate chunk of appropriate size, or if none available, break larger chunk
  Alternative:  Keep two trees to index free chunks by location and size
    Often better than binary buddy if you don't care about alignment
    But here virtual and physical pages need same alignment, so buddy better
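A minimal sketch of the binary-buddy coalescing step in C, assuming one
free-bitmap per power-of-two order; MIN_ORDER/MAX_ORDER and the bitmap_*()
helpers are hypothetical names, not FreeBSD interfaces:

    #include <stdbool.h>
    #include <stddef.h>

    #define MIN_ORDER 13                /* 8-KiB base pages (2^13) */
    #define MAX_ORDER 22                /* 4-MiB largest chunk (2^22) */

    extern bool bitmap_test(int order, size_t idx);   /* chunk free? */
    extern void bitmap_set(int order, size_t idx);    /* mark free */
    extern void bitmap_clear(int order, size_t idx);  /* mark in use */

    /* Free the chunk of size 2^order containing physical address pa,
     * coalescing upward as long as the chunk's buddy is also free. */
    void
    buddy_free(size_t pa, int order)
    {
        while (order < MAX_ORDER) {
            size_t idx = pa >> order;
            if (!bitmap_test(order, idx ^ 1))   /* buddy = flip low bit */
                break;                          /* buddy in use: stop */
            bitmap_clear(order, idx ^ 1);       /* absorb the buddy... */
            pa &= ~(((size_t)1 << (order + 1)) - 1); /* align to parent */
            order++;                            /* ...free parent instead */
        }
        bitmap_set(order, pa >> order);
    }

Allocation runs the same logic in reverse: take a free chunk of the
requested order if one exists, else split the next larger free chunk in two
and recurse.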
To make more superpages available, could relocate existing pages.  Drawbacks?
  Copying pages is not free.  Uses CPU, potentially trashes your cache
  Past work used cost-benefit analysis to decide when to copy
    Cost estimated from TLB misses--so you have to take more misses in the
      first place just to gather the estimate
    Worse yet:  complicates the TLB miss handler and makes it even slower

What to do instead?  Reservation.  How does this work?
  If we might later want a superpage, reserve the physical memory
    surrounding the page

So when to promote pages?
  When all pages in a candidate superpage have been accessed and are clean
  Or when all pages in a candidate superpage have been dirtied

When to demote?
  Can speculatively demote to track usage (re-promote if all subpages used)
  Shatter a clean superpage whenever the application dirties a page.  Why?
    One dirty bit covers the whole superpage, so a single write would force
      writing all of it back; Sec 6.7 shows up to a factor of 20 penalty
      for not shattering
  What is the hash trick?  Why not use it?
    Hash each base page's contents to detect which subpages really changed
    SHA-1 is collision resistant, but too expensive to compute
    Note: Should have used Rabin fingerprints or some cheap message
      authentication code!

Promotion policy simple, but:  How much to reserve?  (Sec 4.2)
  Philosophy:  Best to err on the side of reserving too much
  For fixed-size object (e.g., mmapped file):
    Pick largest superpage that won't overlap another reservation
      or extend beyond the end of the object
  For dynamically growing object (e.g., stack or heap):
    Drop the restriction about extending beyond the end of the object
    But limit size of new superpage by current size of object
      (so don't reserve 4MB for an 8KB stack that may only grow to 16KB)

What to do if insufficient memory for a reservation?
  Preempt an old reservation that hasn't used its pages,
    because useful reservations are usually populated quickly
  When multiple preemption candidates?
    Choose the one that least recently allocated a new page

How does the superpage system keep track of reservations?
  Population map (see Fig. 3, p. 8; sketched in code after this list)
    Basically a trie that uses a 3-bit VA snippet to select children
    Each node counts children with at least one page ("somepop")
      and children with all pages present ("fullpop")
    Quick lookup to see if faulting page has already been reserved
    Quick check whether one reservation would overlap another
    Promotion decision is easy (promote when fullpop == 8)
    Reservation is preemptible when somepop == 0 at the right size
  System keeps reservations in lists
    One list for each size you could get by shattering a superpage
      (i.e., all but the largest available superpage size)
    Each list kept in LRA order (least recently allocated a page)
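To make the population map concrete, a sketch in C: the popmap struct, the
flag encoding, and popmap_populate() are invented for illustration (and child
nodes are assumed preallocated); the paper's Fig. 3 is the authoritative
layout:

    #include <stdint.h>

    #define PAGE_SHIFT 13   /* 8-KiB base pages */
    #define FANOUT     8    /* 8 children: page sizes grow 8x per level */
    #define NONEMPTY   1
    #define FULL       2

    struct popmap {
        struct popmap *child[FANOUT];
        int somepop;        /* # children with at least one populated page */
        int fullpop;        /* # children that are fully populated */
    };

    /* Mark the base page at va populated.  Returns flags saying whether
     * this subtree just became non-empty and/or fully populated; FULL at
     * some level means "promote to that superpage size". */
    int
    popmap_populate(struct popmap *n, uintptr_t va, int level)
    {
        if (level == 0)
            return NONEMPTY | FULL;      /* one base page: present, complete */

        int i = (va >> (PAGE_SHIFT + 3 * (level - 1))) & 7; /* 3-bit snippet */
        int r = popmap_populate(n->child[i], va, level - 1);
        int out = 0;

        if ((r & NONEMPTY) && ++n->somepop == 1)
            out |= NONEMPTY;             /* first populated page under n */
        if ((r & FULL) && ++n->fullpop == FANOUT)
            out |= FULL;                 /* all 8 children full: promote */
        return out;
    }

A reservation whose node still has somepop == 0 at the relevant level has
never been touched, which is what makes it the cheapest preemption victim.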
How does eviction work in unmodified FreeBSD?
  Pages live on four lists, the last three in approximate LRU order:
  - free:  useless pages (e.g., from a deleted file or an exited process)
      Note that some of these pages may be pre-zeroed in the idle loop
  - cache:  clean and unmapped (file data)
  - inactive:  mapped, but not recently referenced (dirty or clean)
  - active:  accessed recently, but may not have reference bit set
  Under memory pressure, the page daemon is invoked:
  - moves clean inactive pages -> cache
  - pages out (i.e., cleans) dirty inactive pages
  - moves unreferenced pages from the active list -> inactive

How does superpage-enabled FreeBSD handle memory or contiguity pressure?
  All cache pages are available for reservations
    Why not merge cache and free?  Cache pages might still be resurrected
  Page daemon invoked on memory *or* contiguity shortage
  All clean file-backed pages moved to the inactive list when the file is closed

What about wired pages?  (Sec 5.2)
What about multiple mappings?  (Sec 5.3)
  Good thing mmap lets the caller select the address

What questions should we ask of the evaluation?
  1. Does this really matter to the performance of real workloads?
  2. Are there alternative ways of addressing the problem?
  3. Will this work on other hardware, like the x86?
  4. Are there further improvements to the proposed technique?
  5. Does what they did hurt performance in some situations?

1. Does it matter?
   Matrix - 7.5x speedup.  Yes!
   Linker - still a good speedup, 32%
   Mesa - slowdown (but modest) - why?
     New system less likely to allocate pre-zeroed pages
2. Are there alternative ways of addressing the problem?
   Hardware - subblock TLBs (translation w. hole), or a second level of remapping
   Maybe should sacrifice associativity for TLB size???
3. Will this work on x86?
   x86 has 2MB pages
   Table 2 - looks like it helps, but not as well as on alpha
4. Further improvements?
   Hash dirty pages using a cheaper technique than SHA-1
   Re-write PAL code for more efficient page tables (don't replicate PTEs)
   Do a better job of pre-zeroing pages
   Coalesce related small objects such as files into superpages
5. How might this work make performance *worse*?
   Overhead of extra computation & data structures
   Worse page-out choices to get contiguity (e.g., flushing closed files' blocks)
   Worse page-out choices because only one accessed/dirty bit per superpage

Any lessons here for hardware designers?
  Multiple superpage sizes important (alpha good, x86 not as good)
  Would be useful to have 8 dirty (+ accessed) bits in superpage translations
  Replicating PTEs is a bad way to specify superpages
    (here x86 better than alpha; see the contrast sketched below)
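To illustrate that last point, a toy contrast of the two encodings; pte_t,
set_pte(), and the bit constants are hypothetical stand-ins, not real
kernel or hardware interfaces:

    #include <stdint.h>

    typedef uint64_t pte_t;
    extern void set_pte(int idx, pte_t pte);  /* hypothetical PT write */

    #define GH_4M  (3 << 5)   /* Alpha granularity hint: 512 x 8 KiB */
    #define PS_BIT (1 << 7)   /* x86 page-size bit in a directory entry */

    /* Alpha: flat array of 8-KiB PTEs; a 4-MiB superpage must be written
     * into all 512 of them, each with the same frame and hint bits. */
    void
    alpha_map_4m(int first_idx, pte_t frame_and_prot)
    {
        for (int i = 0; i < 512; i++)
            set_pte(first_idx + i, frame_and_prot | GH_4M);
    }

    /* x86: one page-directory entry with PS set covers the whole
     * superpage (2 MiB or 4 MiB, depending on paging mode). */
    void
    x86_map_super(int pde_idx, pte_t frame_and_prot)
    {
        set_pte(pde_idx, frame_and_prot | PS_BIT);
    }

Promoting or demoting a 4-MiB superpage on alpha means rewriting 512 PTEs
(which the PAL code must keep consistent), while x86 flips a single
directory entry - but x86 offers only the one superpage size.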