Scalable Synchronization
========================

Shared memory machines - a bunch of CPUs sharing physical memory
  For memory consistency, want cache coherence.  How to achieve it?

  Bus-based schemes
    Any CPU can access ("dance with") any memory equally -> "dance hall" arch
    Use "snoopy" protocols - each CPU's cache listens to the memory bus
      With a write-through architecture, invalidate your copy when you see a write
      Or can have an "ownership" scheme with write-back caches
        E.g., Pentium cache lines have MESI bits (modified, exclusive, shared, invalid)
        If the E bit is set, the CPU caches the line exclusively and can use write-back
    But the bus places limits on scalability

  More scalability with NUMA schemes (non-uniform memory access)
    Each CPU has fast access to some "close" memory
    Slower to access memory that is farther away
    Use a directory to keep track of who is caching what

  COMA - cache-only memory architecture
    Each CPU has local RAM, treated as a cache
    Cache lines migrate around to different nodes based on access pattern
    Data lives only in caches; there is no permanent memory location
    (These machines aren't too popular any more.)

Synchronization techniques for parallel programs
  We have mostly been considering uniprocessors
  Today, we are dealing with machines that have multiple CPUs
  For parallel scientific apps, need direct synchronization (without the OS)
  But within the OS kernel, also need to worry about the same issues
  Two mechanisms today:
    Spin locks
    Barriers

Atomic instructions
  fetch_and_store (xchgl on x86)
  Also can have fetch_and_add, or fetch_and_Phi for an arbitrary Phi
  test_and_set

test_and_set spin lock (sketch below)
  + Uses a single memory location for arbitrarily many threads
  - Lots of write traffic over the memory bus
  - Not necessarily fair

ticket lock (sketch below)
  + Just two memory locations for arbitrarily (exponentially) many threads
  + Fair (FIFO)
  + Probes with read operations only (unlike test_and_set, which issues writes)
  - Still requires a lot of bandwidth: everyone reads now_serving all the time,
    and it is updated on each lock hand-off
  + Can maybe estimate how long the wait will be to reduce reads, but that is
    hard: high penalty if you guess wrong, because of the strict FIFO order
  - Requires fetch_and_add; if that is emulated with a test_and_set spin lock,
    it adds even more memory traffic
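For concreteness, a minimal test_and_set spin lock sketch (the variant that
probes with plain reads between attempts).  This is an illustration only, not
code from the notes or the paper; it assumes GCC/Clang __atomic builtins, and
the names (tas_lock, tas_acquire, tas_release) are made up here.

    #include <stdbool.h>

    typedef struct { bool held; } tas_lock;   /* initialize held = false */

    void tas_acquire(tas_lock *l)
    {
        /* __atomic_test_and_set returns the previous value: true means
           someone else already holds the lock. */
        while (__atomic_test_and_set(&l->held, __ATOMIC_ACQUIRE)) {
            /* Spin with reads only until the lock looks free, so the cache
               line stays shared instead of bouncing on every failed probe. */
            while (__atomic_load_n(&l->held, __ATOMIC_RELAXED))
                ;
        }
    }

    void tas_release(tas_lock *l)
    {
        __atomic_clear(&l->held, __ATOMIC_RELEASE);
    }

Exponential backoff could be layered inside the inner loop to cut read traffic
further; the pure test_and_set version (no inner read loop) is what generates
the write traffic complained about above.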
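Similarly, a minimal ticket lock sketch, again assuming GCC/Clang __atomic
builtins; the struct and field names (ticket_lock, next_ticket, now_serving)
are illustrative, not taken from the notes.

    #include <stdint.h>

    typedef struct {
        uint32_t next_ticket;   /* bumped with fetch_and_add by arrivals  */
        uint32_t now_serving;   /* read-spun on; bumped once per release  */
    } ticket_lock;              /* initialize both fields to 0 */

    void ticket_acquire(ticket_lock *l)
    {
        /* Take a ticket: one atomic fetch_and_add (one write) per acquire. */
        uint32_t my = __atomic_fetch_add(&l->next_ticket, 1, __ATOMIC_RELAXED);

        /* Probe now_serving with reads only; this is the "everyone reading
           now_serving all the time" bandwidth cost noted above.  A backoff
           proportional to (my - now_serving) could go in the loop body. */
        while (__atomic_load_n(&l->now_serving, __ATOMIC_ACQUIRE) != my)
            ;
    }

    void ticket_release(ticket_lock *l)
    {
        /* Only the holder writes now_serving, so no atomic RMW is needed. */
        __atomic_store_n(&l->now_serving, l->now_serving + 1, __ATOMIC_RELEASE);
    }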
array-based locks
  Look at Graunke and Thakkar's lock (because Anderson's lock needs fetch_and_add)
  Each proc has a bool (call it mybool[vpid]) on its own cache line
  Global: bool *who; bool locked_value;
  To lock:
    bool *prev = &mybool[vpid];
    bool val = mybool[vpid];
    atomic_swap: (prev, val) <-> (who, locked_value)
    while (*prev == val)
      ;
  To unlock:
    mybool[vpid] = !mybool[vpid];
  + Fair (FIFO)
  - Requires space per lock proportional to the number of threads
      can't easily recycle mybool; it must stay unchanged until the next
      holder releases
  + Less memory traffic than the ticket lock if you have a coherent cache
  - More memory traffic than the ticket lock if you don't have a cache,
      because you can't back off as accurately; there is no estimate of the wait
  + In a NUMA machine, could arrange for mybool[vpid] to be local to its proc

MCS list-based queuing lock
  Go over algorithm 5: each thread has a qnode

    typedef struct qnode {
      struct qnode *next;
      bool locked;
    } qnode, *lock;                      /* the lock itself is a tail pointer */

    acquire (lock *L, qnode *I) {
      I->next = NULL;
      qnode *predecessor = I;
      ATOMIC_SWAP (predecessor, *L);     /* I becomes the new tail */
      if (predecessor != NULL) {         /* queue was non-empty: wait in line */
        I->locked = true;
        predecessor->next = I;
        while (I->locked)                /* spin only on our own qnode */
          ;
      }
    }

    release (lock *L, qnode *I) {
      if (I->next)                       /* successor known: hand the lock off */
        I->next->locked = false;
      else {
        qnode *old_tail = NULL;
        ATOMIC_SWAP (*L, old_tail);
        if (old_tail == I)               /* we really were last; lock now free */
          return;
        /* Someone enqueued behind us, and we just set *L to NULL by mistake;
           whatever has been swapped in since then is the "usurper" queue. */
        qnode *usurper = old_tail;
        ATOMIC_SWAP (*L, usurper);       /* restore old tail; grab any usurper */
        while (I->next == NULL)          /* wait for our successor to link in */
          ;
        if (usurper != NULL)             /* splice our successors behind it */
          usurper->next = I->next;
        else
          I->next->locked = false;
      }
    }

  + Fair (almost FIFO; is FIFO if you have compare_and_swap)
  + Only spin on your own qnode I, which can be in local memory
  + Small amount of network traffic
  + Requires constant space per lock (plus each thread has a qnode)

Sense-reversing centralized barrier
  Go over algorithm 7 (a sketch appears at the end of these notes)
  + Constant space
  + Good on cache-coherent machines with broadcast invalidation
      everyone just caches sense until a final broadcast invalidation
  - Not so good if you don't have coherent caches
      the spinning loads of sense would flood the memory system with traffic
  - Can maybe delay between probes with exponential backoff, but that adds latency

New tree-based barrier
  Go over algorithm 11
  Construct two trees containing all nodes:
    one with 4 children per node, for a "fan-in" of 4
    one with 2 children per node, for a "fan-out" of 2
  Loop while one or more children in the fan-in tree are not yet ready
  Then notify the parent: clear our "not ready" flag in its fan-in node
  If not the root, wait for our sense flag in the fan-out tree to flip
  Then flip the sense flags of our children in the fan-out tree
  Why use different fan-in and fan-out?
    Can probably check four bools with a single load instruction
    When waking up, a fan-out of two gives maximum concurrency
      total wakeup time is (time at one node) * (height of tree)
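For reference, a minimal sketch of the sense-reversing centralized barrier
(algorithm 7) discussed above.  It is an illustration under assumptions, not
the paper's code verbatim: it assumes GCC/Clang __atomic builtins, a fixed
thread count N, and made-up names (count, sense, barrier, local_sense).

    #include <stdbool.h>

    #define N 8                     /* number of participating threads (assumed) */

    static int  count = N;          /* threads still to arrive this episode */
    static bool sense = true;       /* flipped once per barrier episode */

    /* Each thread passes its own local_sense, initialized to true. */
    void barrier(bool *local_sense)
    {
        *local_sense = !*local_sense;            /* sense for this episode */

        if (__atomic_sub_fetch(&count, 1, __ATOMIC_ACQ_REL) == 0) {
            /* Last arriver: reset the count, then release everyone by
               flipping the shared sense -- one write, which on a
               cache-coherent machine is one broadcast invalidation. */
            __atomic_store_n(&count, N, __ATOMIC_RELAXED);
            __atomic_store_n(&sense, *local_sense, __ATOMIC_RELEASE);
        } else {
            /* Everyone else spins reading the shared sense until it matches
               their local sense; these are the loads that flood a machine
               without coherent caches. */
            while (__atomic_load_n(&sense, __ATOMIC_ACQUIRE) != *local_sense)
                ;
        }
    }

Each thread would simply call barrier(&my_sense) at every synchronization
point; the sense reversal is what lets the same two shared words be reused for
consecutive barrier episodes without an extra reset phase.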