An Analysis of Linux Scalability to Many Cores
==============================================

What's the motivation for this work?
  Lots of research on "many-core" operating system architectures
  Generally try to rely much less on shared memory than traditional OSes
    E.g., factor the operating system into separate services exchanging messages
  Worth stepping back and asking... is this necessary?

What guided the selection of the Mosbench programs?
  Programs with scalability issues
  Programs with lots of kernel time even on uniprocessors

What are reasons not to expect linear speedup?
  Single-core Linux optimizes away many locks
  Synchronization adds costs (more wait time with more cores)
  Cache coherence adds cache misses
  Other parts of the hardware may be bottlenecks
  Not enough parallelism in the task (e.g., stragglers, synchronization points)

How does memory work on the AMD Opterons in this paper?
  Memory itself is controlled by individual CPUs in a NUMA architecture
  Cache coherence: each cache line is in one of four states - MESI
    M = modified, E = exclusive, S = shared, I = invalid
    Multiple CPUs can cache the same memory in S state
    But if any CPU has the data in M or E state, others must be I
  CPUs communicate over the point-to-point HyperTransport protocol
    Bad spinlocks (e.g., xchgs on the same line) can saturate the inter-core network!
  1MiB (of the 6MiB L3 cache) is used as a directory to track who owns what memory
    ("HT Assist" probe filter)

What is false sharing?
  Have read-mostly data on the same cache line as written data
  So the writer invalidates the cache line in other cores, causing gratuitous misses
  This happens with the device and net_device structures
  How to fix?  Add padding or reorganize the structure around the cache line size

How does the 10Gb NIC card work?
  Multiple transmit queues (avoids coordination on send)
    Stock kernel puts all skbuffs in memory close to the PCIe card
  Multiple receive queues--but how to route traffic?
    Card samples 1 in 20 outgoing TCP packets, updates rules to match incoming packets
    That's bad for short flows & incoming connections
  What do we need for scalability?  Usually, process each packet on one core

How did the authors change NIC packet routing to meet this goal?
  Distribute incoming packets to cores based on a hash of flow-defining fields
  Also modify accept to accept SYN packets delivered to the local core
    Assumes processes will use the socket on the core that accepted it
  Hence, no need to rely on modifying NIC rules through sampling
    Already just accept connections whose packets will be delivered locally

What's a sloppy counter?
  Idea: For many counters it's okay to overestimate or make reading expensive
    E.g., consider reference counts
    - If count goes to 0, resource may be garbage collected--bad if in use
    - But okay if the garbage collector does something slow to check if 0
      (garbage collector runs much less often than code to inc/dec the counter)
  Sloppy counter = base (backwards-compatible) counter + per-core counters
    Actual value is base - sum of per-core counters (like a borrowed count)
  Need to increment?  First try the local core's value, then get the spinlock for base
  Need to decrement?  Add to the local core's value
  Test if zero?  Now use an expensive bus-locked xchg of 0 w. each "local" value
  Note this is nicely backwards compatible
    OK if some old inc/dec code uses the base value--will be slower but correct
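  A minimal user-space sketch of the sloppy-counter idea, assuming a fixed core
  count and a batch size of 8 spare references; the names (sloppy_inc, etc.) are
  made up here, and the real kernel version differs (e.g., it reads the per-core
  values with bus-locked exchanges rather than the plain reads below):

```c
/* Sloppy counter sketch: a central "base" count plus per-core counts of
 * spare references.  True value = base - sum(per-core spares). */
#include <pthread.h>
#include <stdbool.h>

#define NCORE 48
#define BATCH 8

struct sloppy_counter {
    pthread_spinlock_t lock;                              /* protects base */
    long base;                                            /* central, backwards-compatible count */
    struct { long spare; char pad[56]; } percore[NCORE];  /* padded to avoid false sharing */
};

static void sloppy_init(struct sloppy_counter *sc)
{
    pthread_spin_init(&sc->lock, PTHREAD_PROCESS_PRIVATE);
    sc->base = 0;
    for (int c = 0; c < NCORE; c++)
        sc->percore[c].spare = 0;
}

/* Take a reference on core c: consume a locally held spare if we have one,
 * otherwise grab a batch of references from the central counter. */
static void sloppy_inc(struct sloppy_counter *sc, int c)
{
    if (sc->percore[c].spare > 0) {        /* common case: no shared write */
        sc->percore[c].spare--;
        return;
    }
    pthread_spin_lock(&sc->lock);
    sc->base += BATCH;
    pthread_spin_unlock(&sc->lock);
    sc->percore[c].spare += BATCH - 1;     /* keep the rest as local spares */
}

/* Drop a reference on core c: stash it as a local spare; return a batch
 * to the central counter if too many spares pile up locally. */
static void sloppy_dec(struct sloppy_counter *sc, int c)
{
    sc->percore[c].spare++;
    if (sc->percore[c].spare >= 2 * BATCH) {
        pthread_spin_lock(&sc->lock);
        sc->base -= BATCH;
        pthread_spin_unlock(&sc->lock);
        sc->percore[c].spare -= BATCH;
    }
}

/* Slow path, e.g. before freeing: is the true count zero?  Must look at
 * every core's spares, so it's expensive but runs rarely. */
static bool sloppy_is_zero(struct sloppy_counter *sc)
{
    long sum = 0, val;
    pthread_spin_lock(&sc->lock);
    for (int c = 0; c < NCORE; c++)
        sum += sc->percore[c].spare;
    val = sc->base - sum;
    pthread_spin_unlock(&sc->lock);
    return val == 0;
}
```

  Reading the true value stays expensive and rare (only when deciding whether a
  resource can be freed), which is exactly the trade-off described above.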
What is a dentry?
  Cache of a (directory-inode, filename) -> file-inode mapping
  Looking up "/usr/bin/ls"?  (usr-inode, "bin") is probably in the cache
    But need to compare the inode number and string to be sure
  Don't want the dentry changed under you while doing this comparison
    So put a spinlock on every dentry--hold it briefly while doing the compare
  Turns out to be a bottleneck even for read-mostly dentries like /usr/bin
  Solution?  "Lock-free comparison" (4.4)
    Add a generation counter to the dentry
    To compare: First read the counter.  Is it 0?  Get the spinlock and revert to the old way
      Then copy the interesting fields of the dentry to local memory
      Then read the counter again.  Changed?  Get the spinlock and revert to the old way
    When updating, hold the spinlock, set the counter to 0, then to old-value+1
    Note on some architectures this would need a memory barrier (maybe not x86)
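  A hedged user-space sketch of the generation-counter comparison; the names
  (dentry_match, etc.) and field layout are invented for illustration, and real
  code needs compiler/memory barriers (e.g., READ_ONCE/smp_rmb in the kernel)
  on weakly ordered architectures, which are elided here:

```c
/* Readers snapshot a dentry's fields and use a generation counter to
 * detect a concurrent update; gen == 0 means "update in progress". */
#include <pthread.h>
#include <string.h>
#include <stdbool.h>

struct dentry {
    pthread_spinlock_t lock;    /* init with pthread_spin_init() */
    unsigned long gen;          /* generation; 0 while being updated */
    unsigned long parent_ino;   /* directory inode number */
    char name[64];              /* file name within that directory */
    unsigned long ino;          /* inode this entry maps to */
};

/* Slow path: take the per-dentry spinlock and compare under it. */
static bool match_locked(struct dentry *d, unsigned long parent,
                         const char *name, unsigned long *ino_out)
{
    bool ok;
    pthread_spin_lock(&d->lock);
    ok = d->parent_ino == parent && strcmp(d->name, name) == 0;
    if (ok)
        *ino_out = d->ino;
    pthread_spin_unlock(&d->lock);
    return ok;
}

/* Fast path: copy the fields, then check the generation counter is
 * nonzero and unchanged -- otherwise fall back to the spinlock. */
static bool dentry_match(struct dentry *d, unsigned long parent,
                         const char *name, unsigned long *ino_out)
{
    unsigned long gen = d->gen;
    if (gen == 0)
        return match_locked(d, parent, name, ino_out);

    unsigned long p = d->parent_ino;
    char n[64];
    memcpy(n, d->name, sizeof n);
    unsigned long ino = d->ino;

    if (d->gen != gen)                      /* changed underneath us */
        return match_locked(d, parent, name, ino_out);

    if (p != parent || strcmp(n, name) != 0)
        return false;
    *ino_out = ino;
    return true;
}

/* Updater: hold the spinlock, set gen to 0 while modifying, then bump it. */
static void dentry_update(struct dentry *d, unsigned long new_ino)
{
    pthread_spin_lock(&d->lock);
    unsigned long old = d->gen;
    d->gen = 0;                             /* readers now take the slow path */
    d->ino = new_ino;
    d->gen = old + 1;
    pthread_spin_unlock(&d->lock);
}
```

  Setting the counter to 0 during an update forces any reader that races with
  it onto the spinlock path, so the fast path only ever acts on a consistent
  snapshot.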
What's wrong with...
  - Per-superblock list of open files (to track if a read-only remount is OK)
  - vfsmount table (used to determine mount points in path lookup; keeps a refcnt)
  - pool of free skbuffs?
  All have global locks, which is not scalable
    Even if the locks were made finer-grained, all are updated by multiple cores
  How to fix?  Partition across cores
    - Keep one superblock list on each core, usually use the local one
      (If a process migrates, need to lock another core's list to remove files)
    - Keep a core-local cache of the vfsmount table
    - Use a per-core freelist for skbuffs

Let's go over each of the benchmarks...

* exim (mail server)
  Mail server delivering mail (into tmpfs--why?)
  - Fix BerkeleyDB /proc/stat stupidity
  - Multiple spool directories
  - Avoid an exec (deliver_drop_privilege saves exim from re-invoking itself)
  - Spinlock on the vfsmount table kills stock
  - Potential future PK issue when parent & child are on different cores

* memcached
  Popular cache (e.g., used extensively by Facebook)
  Spends 80% of its time in the kernel on a single core
  - skbuff problem
  - net_device/device false sharing
  - dst_entry refcount spinlock (protect with a sloppy counter)

* apache
  Serve static files over HTTP to many concurrent clients on 25 machines
  Stock apache uses a mutex to call accept in one thread at a time
    So for stock, the authors run one instance on each core with a dedicated port
    PK can use a single port system-wide
  Disable logging (very important when using apache as a benchmark)
  - Same issues as memcached (skbuff, false sharing, dst_entry refcount)
  - dentry issues
  What's the scalability limitation for apache?
    Eventually, the network card can't keep up (CPUs 18% idle at 48 cores)

* PostgreSQL
  Database actually intended to scale to multiple cores
    Uses one process per connection, relying heavily on shared memory
    Stresses kernel interfaces by storing databases as files in the file system
  1.5% kernel time on 1 core, 82% with 48 cores!
  Benchmark runs two workloads, read-only and read-write
  - Cache freelist protected by a single lock (avoid with a huge cache)
  - Maps row- and table-level locks onto only 16 user-level mutexes!
    Visibly hurts read-write performance around 28 cores
    Fix by acquiring locks lock-free when there's no contention, and by using 1,024 mutexes
  - Both workloads call lseek a lot, which in Linux acquires a mutex (p. 12 sys time)
    Turns out the mutex was unnecessary, so PK just removes it

* gmake
  Build the kernel with gmake -j<2*cores>
  This is the #1 benchmark kernel developers care about!
  Lots of time spent in userland rather than the kernel
  - Slight dentry issue mentioned in the text (but barely shows in Fig. 9)
  - Certain build stages are serialized; also have straggling processes
  Basically there's no issue here

* pedsort (psearchy)
  Build an index for the kernel source tree
  Two stages: 1) index files in parallel, 2) merge phase 1 outputs
    Big files first (to reduce stragglers)
    Output indexes capped at 200,000 to behave independently of #cores
  Initially used many threads in one process
  - mmap/munmap serialized in the kernel, but glibc uses them for file reads!
    Had to break into multiple processes to avoid a 20x system time blowup
    Note: the fix speeds up single-core too (glibc fast paths w/o threads)
  - With multiple cores, round-robin allocation increases the total L3 cache size
    Basically the benchmark benefits from as much L3 cache as possible
  What is the remaining bottleneck?  Size of the L3 cache

* Metis (map/reduce)
  Generate an inverted index for a 2GiB file
  - Lots of zero-fill-on-demand page faults for memory allocated with mmap
    Faults happen concurrently, but handling them requires a lock on the memory region
    Switch to using 2MiB superpages, and to finer-grained locking
      (see the huge-page sketch at the end of these notes)
  What is the remaining bottleneck?  Memory bandwidth

Are the Mosbench applications realistic?
  No.  Do we care?  The point is to stress and isolate the kernel
    E.g., use a tmpfs file system so the disk isn't the bottleneck
    Use a single file with apache to avoid benchmarking the disk

What could go wrong in the future with further scaling?
  Process forking across cores (and shared memory)
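  As promised in the Metis section, a sketch of the kind of change involved:
  backing a big anonymous mapping with 2 MiB huge pages so the region faults
  in with roughly 512x fewer zero-fill faults (and far fewer acquisitions of
  the address-space lock).  MAP_HUGETLB is one Linux mechanism for this and
  requires huge pages to be reserved on the system; whether the paper used
  this flag or another superpage mechanism isn't covered in these notes.

```c
/* Allocate a large working area with 2 MiB huge pages if possible,
 * falling back to ordinary 4 KiB pages otherwise. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define REGION_SIZE (1UL << 30)   /* 1 GiB working area, for illustration */

static void *alloc_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;                  /* 2 MiB pages: ~512x fewer faults */

    /* No huge pages reserved: fall back to normal pages. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    return p;
}

int main(void)
{
    char *buf = alloc_region(REGION_SIZE);
    /* Touch every 4 KiB page once: with 2 MiB pages this needs ~512
     * zero-fill faults instead of ~262,144, so concurrent faulting
     * cores contend far less on the shared memory-region lock. */
    for (size_t off = 0; off < REGION_SIZE; off += 4096)
        buf[off] = 1;
    return 0;
}
```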