Read-Log-Update
===============

What is Read-Copy-Update (RCU)?
    An API for synchronizing access to read-mostly data structures
    Basic API for reader:
        rcu_read_lock() - start read-side critical section (can nest)
        rcu_dereference() - get RCU-protected global pointer
        rcu_read_unlock() - end read-side critical section
    Basic API for writer:
        rcu_assign_pointer() - set RCU-protected pointer to new value
        synchronize_rcu() - wait for all readers to be done with old ptr
        call_rcu() - register async. callback for when readers are done
    Example:
        Global:
            global_obj *gp;
            spinlock gp_lock;
        Reader:
            rcu_read_lock();
            use(rcu_dereference(gp));
            rcu_read_unlock();
        Writer:
            lock(&gp_lock);
            global_obj *old = gp;
            rcu_assign_pointer(gp, alloc_updated(old));
            unlock(&gp_lock);
            synchronize_rcu();
            obj_free(old);

What guarantees does RCU make?
    See http://www.efficios.com/pub/rcu/urcu-main.pdf (Desnoyers) fig. 2
    A read-side critical section can overlap at most two of the following
      writer phases:
        +----------------+--------------+-------------+
        | assign_pointer | grace period | free object |
        +----------------+--------------+-------------+

How would you implement RCU?  Multiple strategies:
    1. Disable preemption in readers, run on all CPUs in synchronize
        rcu_read_lock() disables preemption
        rcu_dereference() is a memory_order_consume load
        rcu_synchronize() {
            for (i = 0; i < ncpus; ++i)
                run_on_cpu(i);
        }
    2. Periodically call rcu_quiescent() in readers not in critical section
        _Atomic uint64_t rcu_ctr;
        mutex_t rcu_lock;
        thread_local _Atomic uint64_t local_ctr;
        void rcu_synchronize() {
            lock(&rcu_lock);
            uint64_t target = local_ctr = ++rcu_ctr;
            for_each_thread {
                while (thread's local_ctr < target)
                    yield();
            }
            unlock(&rcu_lock);
        }
        void rcu_quiescent() { local_ctr = rcu_ctr; }
    3.
       Use fences or synchronized atomics inside rcu_read_lock()/rcu_read_unlock()
        rcu_read_lock() looks like rcu_quiescent(), which no longer exists
        Last rcu_read_unlock() sets local_ctr = MAXINT

Note you don't have to copy the whole data structure with RCU
    Can you remove an item from a linked list without copying the list?
        Yes.  See RLU Figure 1 and Desnoyers Figure 3
    What happens if a reader traverses the linked list forwards & backwards?
        Writer can't atomically swap two pointers simultaneously
        Risk seeing invariant violations like item->next->prev != item -- oops
        So works for some structures (singly-linked list) & not others (doubly)
    What about concurrent updates?  Also quite dangerous
    The upshot: need to be very clever for each data structure

Goal of read-LOG-update (RLU) is to fix all of this
    Instead of swapping one pointer atomically, swap many items atomically
    Maybe even atomically make updates by multiple concurrent writers!

If every problem in CS can be fixed with a level of indirection...
  ...what's the magic level of indirection in RLU?
    We will actually update a bunch of data structures non-atomically
    But some GLOBAL CLOCK makes readers ignore all of these updates
    Then atomically update the clock and new readers will see new values
    In-progress readers will continue to use the old clock value!

What is the RLU API?
    rlu_reader_lock() / rlu_dereference(obj) - like RCU
    rlu_reader_unlock() - like RCU, but might also commit writes you made
    rlu_try_lock(ptr) - may return locked private object copy you can modify
    rlu_cmp_objs(ptr, ptr) - because unlike RCU "obj1 == obj2" may not work
    rlu_assign_pointer(lptr, ptr) - deals with multiple pointer values properly
    rlu_abort() - abort current transaction and undo any pending writes
    rlu_malloc()/rlu_free()?  (Reserve space for log_ptr)

What data structures do we need (Sec. 3.4)?
   * Per thread:
        - 2 write logs: one active, one left over from previous epoch
        - run_counter: odd when using RLU, even when outside critical section
        - local_clock: time at which your view of data should be frozen
        - write_clock: infinity until you want to commit
        - is_writer: flag is true if you have made modifications
    * Global:
        - global_clock: used to initialize local_clock each transaction
        - List of threads providing access to per-thread info for all threads
    * Per object in heap:
        - log_ptr: pointer to copy of object in a log (or NULL if none)
    * Per object in log:
        - Thread identifier of thread that owns the log
        - A pointer to the original object of which this is a copy
        - The size of the object (so you know where the next log entry is)
        - Special reserved value (where log_ptr would be in heap object)

How does RLU work with serial writers (assume external big write lock)?
    For reader:
        rlu_reader_lock()
            - bump run_counter to something odd
            - local_clock = global_clock--this determines version of data
              you see
            - is_writer = false
        rlu_reader_unlock()
            - bump run_counter to something even
        rlu_dereference(ptr)
            - Unlocked heap object?  Return as-is
            - Log object?  Return as-is (you had the pointer, must be okay)
            - Otherwise, we must have a locked heap object:
                - Did we lock it?  Then return corresponding log object
                - Is local_clock >= log owner's write_clock?  Then return
                  log object
                - Otherwise, locked after rlu_reader_lock(), so return heap
                  object
        rlu_assign_pointer(lptr, ptr) - always translate ptr to heap if log
    For writer:
        rlu_try_lock(ptr)
            - Initialize log header for object copy in our log
            - Set heap object's log_ptr to point to our header
            - Copy heap object into log
        rlu_synchronize()
            - Record run_counter of all active threads
            - For each thread T, spin while all of:
                1. T's run_counter is odd (so T using RLU), and
                2. T's run_counter is the same as we recorded before the
                   loop, and
                3.
                   T's local_clock < our write_clock (T ignoring our log)
        rlu_reader_unlock()
            - Set write_clock = global_clock + 1 (was infinity)
            - ++global_clock--now new readers will use our log
            - rlu_synchronize()
            - Set write_clock = infinity
            - Swap our two logs (readers might still use recent one - 3.6.2)

What happens if you add concurrent writers?
    We can use log_ptr as a per-object lock
    Just modify rlu_try_lock(ptr):
        - Already locked by us?  Return log_ptr
        - Already locked by another thread?  rlu_abort()
        - Initialize log header for object copy in our log
        - Try to lock with atomic compare-and-swap on heap header
        - Success?  Copy object from heap to log.  Failure?  rlu_abort()

Drawbacks of concurrent writer version?
    Aborts (don't happen in serial writer version)
    More complicated semantics with concurrent writers, e.g.:
        With serial writer RLU, the following code guarantees a == b:
            T1: reg = *rlu_dereference(a); *rlu_try_lock(b) = reg;
            T2: reg = *rlu_dereference(b); *rlu_try_lock(a) = reg;
        With concurrent writer version, might just swap a's and b's values

Why do you want RLU deferring (3.7) and what is it?
    rlu_synchronize() is expensive.  When do you really need it?
        After copying an object back into the heap (to ensure the old
          log_ptr is not used)
            Therefore needed if another thread wants to write the same object
        Also if you want another thread to see your changes--why?
            Say you just wrote back a log object w/o incrementing the counter,
              then set log_ptr to NULL to release the lock
            Now readers whose local_clock < your write_clock will see the
              new version
            Can't set all log_ptrs simultaneously - violates atomicity
            A reader may see both versions in one transaction - violates
              isolation
    How to reduce calls to rlu_synchronize()?
       Omit bumping global_clock and calling rlu_synchronize() in
          rlu_reader_unlock()
        Instead send a "sync request" to the other thread after a
          write-write conflict
        Anticipate fewer sync requests than is_writer transactions
        Bonus: reduces contention on global clock (to write-write conflicts)
        Bonus 2: fewer conflict cache misses in readers
            Yes, but because they are seeing stale data
    Drawbacks?
        Usage scenario must tolerate added update latency
        What if a sync request takes too long?  Maybe write back the other
          thread's log?
    Only a big win with one benchmark (Citrus at 80 cores)
        More modest win with only 16 cores

What evaluation questions should we be asking?
    1. How is performance compared to RCU?
        On average systems today
        When scaled to many cores
    2. How is correctness?
        Is the approach correct (these things are tricky)?
        Is it easy to implement correctly, or bug prone?
    3. Does it simplify/allow more data structures than RCU?

What RCU do they compare to--is this fair?
    "State-of-the-art" urcu library (Desnoyers) and Linux kernel RCU
    Include numbers for incorrect but faster Harris linked list w. mem. leak
    Authors even fixed a performance bug in Linux list_entry_rcu (4.6)
    But... authors don't support all features of more mature RCUs

* Fig. 4--What is the best linked list?
    RLU, unless you want memory leaks
    Note how concurrent updates kill RCU scalability!
    Don't show deferred RLU but claim it is the same

* Fig. 5 (hash table)
    Why does RCU dominate?
        RCU readers do less constant work than Harris readers (and RLU)
        Use RCU on individual buckets--means very few actual write conflicts
    Why does deferred update win here but not in linked list?
        Fewer write-write conflicts in bucketed hash table than in single list

* Fig. 6 (resizable hash table)
    How does a resizable hash table even work in RCU?
       Super complicated, because you can only update one pointer at a time
        So first create an expanded hash table where two buckets share
          each list
            Assumes you aren't completely rehashing, but just splitting
              buckets
            Not useful yet, but at least lookups in the new table are correct
        Then go through and "unzip" the buckets so they share tails
            Eventually tails will be NULL and you are done
    How does a resizable hash table work in RLU?
        Just like a regular hash table + some rlu_dereference/rlu_try_lock
    Which performs better, RCU or RLU, for a resizable hash table?
        Figure 6 appears to show a slight win for RCU.  Is this fair?
            Resizable hash table doesn't support concurrent insert/remove
            Benchmark is from another paper, so designed to make RCU look good
            Might well see a win for RLU under a different workload

* Fig. 7 (stress test)--Why is RLU twice as slow even for 1 CPU?
    100% update, one item per bucket is the best case for RCU
    Cost of copying objects in RLU causes 2x slowdown
    At least this likely bounds the worst case for RLU

* What do we learn from the Citrus search tree (Sec 4.5)?
    Complex structures are possible with RCU, but complicated
    Compare Listing 3 to Listing 4--RLU wins on simplicity
    Plus RLU performs better

Do the authors answer questions 2 and 3?
    2 (correctness): passes applicable subset of kernel torture test
    3 (functionality):
        Already saw how much simpler the Citrus code got
        Kyoto Cabinet Cache DB: RLU speeds up an application where RCU is
          intractable