Scalable Synchronization
========================

Shared memory machines - a bunch of CPUs sharing physical memory
  For memory consistency, want cache coherence.  How to achieve it?

  Bus-based schemes
    Any CPU can access ("dance with") any memory equally -> "dance hall" arch
    Use "snoopy" protocols - each CPU's cache listens to the memory bus
      With a write-through architecture, invalidate your copy when you see a write
      Or can have an "ownership" scheme with write-back caches
        E.g., Pentium cache lines have MESI bits (modified, exclusive, shared, invalid)
        If the E bit is set, the CPU caches the line exclusively and can use write-back
    But the bus places limits on scalability

  More scalability with NUMA schemes (non-uniform memory access)
    Each CPU has fast access to some "close" memory
    Slower to access memory that is farther away
    Use a directory to keep track of who is caching what

  COMA - cache-only memory architecture
    Each CPU has local RAM, treated as a cache
    Cache lines migrate around to different nodes based on access pattern
    Data lives only in caches; there is no permanent memory location
    (These machines aren't too popular any more.)

Synchronization techniques for parallel programs
  We have mostly been considering uniprocessors
  Today, we are dealing with machines that have multiple CPUs
  For parallel scientific apps, need direct synchronization (without the OS)
  But within the OS kernel, also need to worry about the same issues
  Two mechanisms today:
    Spin locks
    Barriers

Atomic instructions
  fetch_and_store (xchgl on x86)
  Also can have fetch_and_add, or fetch_and_Phi for an arbitrary Phi
  test_and_set

test_and_set spin lock (sketch below)
  + Uses a single memory location for arbitrarily many threads
  - Lots of write traffic over the memory bus
  - Not necessarily fair

ticket lock (sketch below)
  + Just two memory locations for arbitrarily (exponentially) many threads
  + Fair (FIFO)
  + Probes with read operations only (unlike test_and_set, which issues writes)
  - Still requires a lot of bandwidth: everyone reads now_serving all the time,
    and it is updated on each lock hand-off
  + Can maybe estimate how long the wait will be to reduce reads, but that is
    hard: high penalty if you guess wrong, because of the strict FIFO order
  - Requires fetch_and_add; if that is emulated with a test_and_set spin lock,
    it adds even more memory traffic
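For concreteness, a minimal test_and_set spin lock sketch (the variant that
probes with plain reads between attempts).  This is an illustration only, not
code from the notes or the paper; it assumes GCC/Clang __atomic builtins, and
the names (tas_lock, tas_acquire, tas_release) are made up here.

    #include <stdbool.h>

    typedef struct { bool held; } tas_lock;   /* initialize held = false */

    void tas_acquire(tas_lock *l)
    {
        /* __atomic_test_and_set returns the previous value: true means
           someone else already holds the lock. */
        while (__atomic_test_and_set(&l->held, __ATOMIC_ACQUIRE)) {
            /* Spin with reads only until the lock looks free, so the cache
               line stays shared instead of bouncing on every failed probe. */
            while (__atomic_load_n(&l->held, __ATOMIC_RELAXED))
                ;
        }
    }

    void tas_release(tas_lock *l)
    {
        __atomic_clear(&l->held, __ATOMIC_RELEASE);
    }

Exponential backoff could be layered inside the inner loop to cut read traffic
further; the pure test_and_set version (no inner read loop) is what generates
the write traffic complained about above.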
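Similarly, a minimal ticket lock sketch, again assuming GCC/Clang __atomic
builtins; the struct and field names (ticket_lock, next_ticket, now_serving)
are illustrative, not taken from the notes.

    #include <stdint.h>

    typedef struct {
        uint32_t next_ticket;   /* bumped with fetch_and_add by arrivals  */
        uint32_t now_serving;   /* read-spun on; bumped once per release  */
    } ticket_lock;              /* initialize both fields to 0 */

    void ticket_acquire(ticket_lock *l)
    {
        /* Take a ticket: one atomic fetch_and_add (one write) per acquire. */
        uint32_t my = __atomic_fetch_add(&l->next_ticket, 1, __ATOMIC_RELAXED);

        /* Probe now_serving with reads only; this is the "everyone reading
           now_serving all the time" bandwidth cost noted above.  A backoff
           proportional to (my - now_serving) could go in the loop body. */
        while (__atomic_load_n(&l->now_serving, __ATOMIC_ACQUIRE) != my)
            ;
    }

    void ticket_release(ticket_lock *l)
    {
        /* Only the holder writes now_serving, so no atomic RMW is needed. */
        __atomic_store_n(&l->now_serving, l->now_serving + 1, __ATOMIC_RELEASE);
    }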
array-based locks
  Look at Graunke and Thakkar's lock (because Anderson's lock needs fetch_and_add)
  Each proc has a bool (call it mybool[vpid]) on its own cache line
  Global: bool *who; bool locked_value;
  To lock:
    bool *prev = &mybool[vpid];
    bool val = mybool[vpid];
    atomic_swap: (prev, val) <-> (who, locked_value)
    while (*prev == val)
      ;
  To unlock:
    mybool[vpid] = !mybool[vpid];
  + Fair (FIFO)
  - Requires space per lock proportional to the number of threads
      can't easily recycle mybool; it must stay unchanged until the next
      holder releases
  + Less memory traffic than the ticket lock if you have a coherent cache
  - More memory traffic than the ticket lock if you don't have a cache,
      because you can't back off as accurately; there is no estimate of the wait
  + In a NUMA machine, could arrange for mybool[vpid] to be local to its proc

MCS list-based queuing lock
  Go over algorithm 5: each thread has a qnode

    typedef struct qnode {
      struct qnode *next;
      bool locked;
    } qnode, *lock;                      /* the lock itself is a tail pointer */

    acquire (lock *L, qnode *I) {
      I->next = NULL;
      qnode *predecessor = I;
      ATOMIC_SWAP (predecessor, *L);     /* I becomes the new tail */
      if (predecessor != NULL) {         /* queue was non-empty: wait in line */
        I->locked = true;
        predecessor->next = I;
        while (I->locked)                /* spin only on our own qnode */
          ;
      }
    }

    release (lock *L, qnode *I) {
      if (I->next)                       /* successor known: hand the lock off */
        I->next->locked = false;
      else {
        qnode *old_tail = NULL;
        ATOMIC_SWAP (*L, old_tail);
        if (old_tail == I)               /* we really were last; lock now free */
          return;
        /* Someone enqueued behind us, and we just set *L to NULL by mistake;
           whatever has been swapped in since then is the "usurper" queue. */
        qnode *usurper = old_tail;
        ATOMIC_SWAP (*L, usurper);       /* restore old tail; grab any usurper */
        while (I->next == NULL)          /* wait for our successor to link in */
          ;
        if (usurper != NULL)             /* splice our successors behind it */
          usurper->next = I->next;
        else
          I->next->locked = false;
      }
    }

  + Fair (almost FIFO; is FIFO if you have compare_and_swap)
  + Only spin on your own qnode I, which can be in local memory
  + Small amount of network traffic
  + Requires constant space per lock (plus each thread has a qnode)

Sense-reversing centralized barrier
  Go over algorithm 7 (a sketch appears at the end of these notes)
  + Constant space
  + Good on cache-coherent machines with broadcast invalidation
      everyone just caches sense until a final broadcast invalidation
  - Not so good if you don't have coherent caches
      the spinning loads of sense would flood the memory system with traffic
  - Can maybe delay between probes with exponential backoff, but that adds latency

New tree-based barrier
  Go over algorithm 11
  Construct two trees containing all nodes:
    one with 4 children per node, for a "fan-in" of 4
    one with 2 children per node, for a "fan-out" of 2
  Loop while one or more children in the fan-in tree are not yet ready
  Then notify the parent: clear our "not ready" flag in its fan-in node
  If not the root, wait for our sense flag in the fan-out tree to flip
  Then flip the sense flags of our children in the fan-out tree
  Why use different fan-in and fan-out?
    Can probably check four bools with a single load instruction
    When waking up, a fan-out of two gives maximum concurrency
      total wakeup time is (time at one node) * (height of tree)
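For reference, a minimal sketch of the sense-reversing centralized barrier
(algorithm 7) discussed above.  It is an illustration under assumptions, not
the paper's code verbatim: it assumes GCC/Clang __atomic builtins, a fixed
thread count N, and made-up names (count, sense, barrier, local_sense).

    #include <stdbool.h>

    #define N 8                     /* number of participating threads (assumed) */

    static int  count = N;          /* threads still to arrive this episode */
    static bool sense = true;       /* flipped once per barrier episode */

    /* Each thread passes its own local_sense, initialized to true. */
    void barrier(bool *local_sense)
    {
        *local_sense = !*local_sense;            /* sense for this episode */

        if (__atomic_sub_fetch(&count, 1, __ATOMIC_ACQ_REL) == 0) {
            /* Last arriver: reset the count, then release everyone by
               flipping the shared sense -- one write, which on a
               cache-coherent machine is one broadcast invalidation. */
            __atomic_store_n(&count, N, __ATOMIC_RELAXED);
            __atomic_store_n(&sense, *local_sense, __ATOMIC_RELEASE);
        } else {
            /* Everyone else spins reading the shared sense until it matches
               their local sense; these are the loads that flood a machine
               without coherent caches. */
            while (__atomic_load_n(&sense, __ATOMIC_ACQUIRE) != *local_sense)
                ;
        }
    }

Each thread would simply call barrier(&my_sense) at every synchronization
point; the sense reversal is what lets the same two shared words be reused for
consecutive barrier episodes without an extra reset phase.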