An Analysis of Linux Scalability to Many Cores
==============================================

What's the motivation for this work?
  Lots of research on "many-core" operating system architectures
  Generally try to rely much less on shared memory than traditional OSes
    E.g., factor the operating system into separate services exchanging messages
  Worth stepping back and asking... is this necessary?

What guided the selection of the Mosbench programs?
  Programs with scalability issues
  Programs with lots of kernel time even on uniprocessors

What are reasons not to expect linear speedup?
  Single-core Linux optimizes away many locks
  Synchronization adds costs (more wait time with more cores)
  Cache coherence adds cache misses
  Other parts of the hardware may be bottlenecks
  Not enough parallelism in the task (e.g., stragglers, synchronization points)

How does memory work on the AMD Opterons in this paper?
  Memory itself is controlled by individual CPUs in a NUMA architecture
  Cache coherence: each cache line is in one of four states - MESI
    M = modified, E = exclusive, S = shared, I = invalid
    Multiple CPUs can cache the same memory in S state
    But if any CPU has the data in M or E state, others must be I
  CPUs communicate over the point-to-point HyperTransport protocol
    Bad spinlocks (e.g., xchgs on the same line) can saturate the inter-core network!
  1MiB (of the 6MiB L3 cache) is used as a directory to track who owns what memory
    ("HT Assist" probe filter)

What is false sharing?
  Have read-mostly data on the same cache line as written data
  So the writer invalidates the cache line in other cores, causing gratuitous misses
  This happens with the device and net_device structures
  How to fix?  Add padding or reorganize the structure around the cache line size

How does the 10Gb NIC card work?
  Multiple transmit queues (avoids coordination on send)
    Stock kernel puts all skbuffs in memory close to the PCIe card
  Multiple receive queues--but how to route traffic?
    Card samples 1 in 20 outgoing TCP packets, updates rules to match incoming packets
    That's bad for short flows & incoming connections
  What do we need for scalability?  Usually, process each packet on one core

How did the authors change NIC packet routing to meet this goal?
  Distribute incoming packets to cores based on a hash of flow-defining fields
  Also modify accept to accept SYN packets delivered to the local core
    Assumes processes will use the socket on the core that accepted it
  Hence, no need to rely on modifying NIC rules through sampling
    Already just accept connections whose packets will be delivered locally

What's a sloppy counter?
  Idea: For many counters it's okay to overestimate or make reading expensive
    E.g., consider reference counts
    - If count goes to 0, resource may be garbage collected--bad if in use
    - But okay if the garbage collector does something slow to check if 0
      (garbage collector runs much less often than code to inc/dec the counter)
  Sloppy counter = base (backwards-compatible) counter + per-core counters
    Actual value is base - sum of per-core counters (like a borrowed count)
  Need to increment?  First try the local core's value, then get the spinlock for base
  Need to decrement?  Add to the local core's value
  Test if zero?  Now use an expensive bus-locked xchg of 0 w. each "local" value
  Note this is nicely backwards compatible
    OK if some old inc/dec code uses the base value--will be slower but correct
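  A minimal user-space sketch of the sloppy-counter idea, assuming a fixed core
  count and a batch size of 8 spare references; the names (sloppy_inc, etc.) are
  made up here, and the real kernel version differs (e.g., it reads the per-core
  values with bus-locked exchanges rather than the plain reads below):

```c
/* Sloppy counter sketch: a central "base" count plus per-core counts of
 * spare references.  True value = base - sum(per-core spares). */
#include <pthread.h>
#include <stdbool.h>

#define NCORE 48
#define BATCH 8

struct sloppy_counter {
    pthread_spinlock_t lock;                              /* protects base */
    long base;                                            /* central, backwards-compatible count */
    struct { long spare; char pad[56]; } percore[NCORE];  /* padded to avoid false sharing */
};

static void sloppy_init(struct sloppy_counter *sc)
{
    pthread_spin_init(&sc->lock, PTHREAD_PROCESS_PRIVATE);
    sc->base = 0;
    for (int c = 0; c < NCORE; c++)
        sc->percore[c].spare = 0;
}

/* Take a reference on core c: consume a locally held spare if we have one,
 * otherwise grab a batch of references from the central counter. */
static void sloppy_inc(struct sloppy_counter *sc, int c)
{
    if (sc->percore[c].spare > 0) {        /* common case: no shared write */
        sc->percore[c].spare--;
        return;
    }
    pthread_spin_lock(&sc->lock);
    sc->base += BATCH;
    pthread_spin_unlock(&sc->lock);
    sc->percore[c].spare += BATCH - 1;     /* keep the rest as local spares */
}

/* Drop a reference on core c: stash it as a local spare; return a batch
 * to the central counter if too many spares pile up locally. */
static void sloppy_dec(struct sloppy_counter *sc, int c)
{
    sc->percore[c].spare++;
    if (sc->percore[c].spare >= 2 * BATCH) {
        pthread_spin_lock(&sc->lock);
        sc->base -= BATCH;
        pthread_spin_unlock(&sc->lock);
        sc->percore[c].spare -= BATCH;
    }
}

/* Slow path, e.g. before freeing: is the true count zero?  Must look at
 * every core's spares, so it's expensive but runs rarely. */
static bool sloppy_is_zero(struct sloppy_counter *sc)
{
    long sum = 0, val;
    pthread_spin_lock(&sc->lock);
    for (int c = 0; c < NCORE; c++)
        sum += sc->percore[c].spare;
    val = sc->base - sum;
    pthread_spin_unlock(&sc->lock);
    return val == 0;
}
```

  Reading the true value stays expensive and rare (only when deciding whether a
  resource can be freed), which is exactly the trade-off described above.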
What is a dentry?
  Cache of a (directory-inode, filename) -> file-inode mapping
  Looking up "/usr/bin/ls"?  (usr-inode, "bin") is probably in the cache
    But need to compare the inode number and string to be sure
  Don't want the dentry changed under you while doing this comparison
    So put a spinlock on every dentry--hold it briefly while doing the compare
  Turns out to be a bottleneck even for read-mostly dentries like /usr/bin
  Solution?  "Lock-free comparison" (4.4)
    Add a generation counter to the dentry
    To compare: First read the counter.  Is it 0?  Get the spinlock and revert to the old way
      Then copy the interesting fields of the dentry to local memory
      Then read the counter again.  Changed?  Get the spinlock and revert to the old way
    When updating, hold the spinlock, set the counter to 0, then to old-value+1
    Note on some architectures this would need a memory barrier (maybe not x86)
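  A hedged user-space sketch of the generation-counter comparison; the names
  (dentry_match, etc.) and field layout are invented for illustration, and real
  code needs compiler/memory barriers (e.g., READ_ONCE/smp_rmb in the kernel)
  on weakly ordered architectures, which are elided here:

```c
/* Readers snapshot a dentry's fields and use a generation counter to
 * detect a concurrent update; gen == 0 means "update in progress". */
#include <pthread.h>
#include <string.h>
#include <stdbool.h>

struct dentry {
    pthread_spinlock_t lock;    /* init with pthread_spin_init() */
    unsigned long gen;          /* generation; 0 while being updated */
    unsigned long parent_ino;   /* directory inode number */
    char name[64];              /* file name within that directory */
    unsigned long ino;          /* inode this entry maps to */
};

/* Slow path: take the per-dentry spinlock and compare under it. */
static bool match_locked(struct dentry *d, unsigned long parent,
                         const char *name, unsigned long *ino_out)
{
    bool ok;
    pthread_spin_lock(&d->lock);
    ok = d->parent_ino == parent && strcmp(d->name, name) == 0;
    if (ok)
        *ino_out = d->ino;
    pthread_spin_unlock(&d->lock);
    return ok;
}

/* Fast path: copy the fields, then check the generation counter is
 * nonzero and unchanged -- otherwise fall back to the spinlock. */
static bool dentry_match(struct dentry *d, unsigned long parent,
                         const char *name, unsigned long *ino_out)
{
    unsigned long gen = d->gen;
    if (gen == 0)
        return match_locked(d, parent, name, ino_out);

    unsigned long p = d->parent_ino;
    char n[64];
    memcpy(n, d->name, sizeof n);
    unsigned long ino = d->ino;

    if (d->gen != gen)                      /* changed underneath us */
        return match_locked(d, parent, name, ino_out);

    if (p != parent || strcmp(n, name) != 0)
        return false;
    *ino_out = ino;
    return true;
}

/* Updater: hold the spinlock, set gen to 0 while modifying, then bump it. */
static void dentry_update(struct dentry *d, unsigned long new_ino)
{
    pthread_spin_lock(&d->lock);
    unsigned long old = d->gen;
    d->gen = 0;                             /* readers now take the slow path */
    d->ino = new_ino;
    d->gen = old + 1;
    pthread_spin_unlock(&d->lock);
}
```

  Setting the counter to 0 during an update forces any reader that races with
  it onto the spinlock path, so the fast path only ever acts on a consistent
  snapshot.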
What's wrong with...
  - Per-superblock list of open files (to track if a read-only remount is OK)
  - vfsmount table (used to determine mount points in path lookup; keeps a refcnt)
  - pool of free skbuffs?
  All have global locks, which is not scalable
    Even if the locks were made finer-grained, all are updated by multiple cores
  How to fix?  Partition across cores
    - Keep one superblock list on each core, usually use the local one
      (If a process migrates, need to lock another core's list to remove files)
    - Keep a core-local cache of the vfsmount table
    - Use a per-core freelist for skbuffs

Let's go over each of the benchmarks...

* exim (mail server)
  Mail server delivering mail (into tmpfs--why?)
  - Fix BerkeleyDB /proc/stat stupidity
  - Multiple spool directories
  - Avoid an exec (deliver_drop_privilege saves exim from re-invoking itself)
  - Spinlock on the vfsmount table kills stock
  - Potential future PK issue when parent & child are on different cores

* memcached
  Popular cache (e.g., used extensively by Facebook)
  Spends 80% of its time in the kernel on a single core
  - skbuff problem
  - net_device/device false sharing
  - dst_entry refcount spinlock (protect with a sloppy counter)

* apache
  Serve static files over HTTP to many concurrent clients on 25 machines
  Stock apache uses a mutex to call accept in one thread at a time
    So for stock, the authors run one instance on each core with a dedicated port
    PK can use a single port system-wide
  Disable logging (very important when using apache as a benchmark)
  - Same issues as memcached (skbuff, false sharing, dst_entry refcount)
  - dentry issues
  What's the scalability limitation for apache?
    Eventually, the network card can't keep up (CPUs 18% idle at 48 cores)

* PostgreSQL
  Database actually intended to scale to multiple cores
    Uses one process per connection, relying heavily on shared memory
    Stresses kernel interfaces by storing databases as files in the file system
  1.5% kernel time on 1 core, 82% with 48 cores!
  Benchmark runs two workloads, read-only and read-write
  - Cache freelist protected by a single lock (avoid with a huge cache)
  - Maps row- and table-level locks onto only 16 user-level mutexes!
    Visibly hurts read-write performance around 28 cores
    Fix by acquiring locks lock-free when there's no contention, and by using 1,024 mutexes
  - Both workloads call lseek a lot, which in Linux acquires a mutex (p. 12 sys time)
    Turns out the mutex was unnecessary, so PK just removes it

* gmake
  Build the kernel with gmake -j<2*cores>
  This is the #1 benchmark kernel developers care about!
  Lots of time spent in userland rather than the kernel
  - Slight dentry issue mentioned in the text (but barely shows in Fig. 9)
  - Certain build stages are serialized; also have straggling processes
  Basically there's no issue here

* pedsort (psearchy)
  Build an index for the kernel source tree
  Two stages: 1) index files in parallel, 2) merge phase 1 outputs
    Big files first (to reduce stragglers)
    Output indexes capped at 200,000 to behave independently of #cores
  Initially used many threads in one process
  - mmap/munmap serialized in the kernel, but glibc uses them for file reads!
    Had to break into multiple processes to avoid a 20x system time blowup
    Note: the fix speeds up single-core too (glibc fast paths w/o threads)
  - With multiple cores, round-robin allocation increases the total L3 cache size
    Basically the benchmark benefits from as much L3 cache as possible
  What is the remaining bottleneck?  Size of the L3 cache

* Metis (map/reduce)
  Generate an inverted index for a 2GiB file
  - Lots of zero-fill-on-demand page faults for memory allocated with mmap
    Faults happen concurrently, but handling them requires a lock on the memory region
    Switch to using 2MiB superpages, and to finer-grained locking
      (see the huge-page sketch at the end of these notes)
  What is the remaining bottleneck?  Memory bandwidth

Are the Mosbench applications realistic?
  No.  Do we care?  The point is to stress and isolate the kernel
    E.g., use a tmpfs file system so the disk isn't the bottleneck
    Use a single file with apache to avoid benchmarking the disk

What could go wrong in the future with further scaling?
  Process forking across cores (and shared memory)
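  As promised in the Metis section, a sketch of the kind of change involved:
  backing a big anonymous mapping with 2 MiB huge pages so the region faults
  in with roughly 512x fewer zero-fill faults (and far fewer acquisitions of
  the address-space lock).  MAP_HUGETLB is one Linux mechanism for this and
  requires huge pages to be reserved on the system; whether the paper used
  this flag or another superpage mechanism isn't covered in these notes.

```c
/* Allocate a large working area with 2 MiB huge pages if possible,
 * falling back to ordinary 4 KiB pages otherwise. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define REGION_SIZE (1UL << 30)   /* 1 GiB working area, for illustration */

static void *alloc_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;                  /* 2 MiB pages: ~512x fewer faults */

    /* No huge pages reserved: fall back to normal pages. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    return p;
}

int main(void)
{
    char *buf = alloc_region(REGION_SIZE);
    /* Touch every 4 KiB page once: with 2 MiB pages this needs ~512
     * zero-fill faults instead of ~262,144, so concurrent faulting
     * cores contend far less on the shared memory-region lock. */
    for (size_t off = 0; off < REGION_SIZE; off += 4096)
        buf[off] = 1;
    return 0;
}
```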