Synchronization background
==========================

What happens if you run this function in two threads concurrently?

    void f () { x++; }

  Undefined -- e.g., could increment x by 1 or 2 depending on the interleaving
  Note that even a single instruction (incl) is not atomic on a multiprocessor
    (x86 allows atomic accesses, but requires an expensive lock prefix)

What might p2 return if run concurrently with p1?

    int data = 0, ready = 0;
    void p1 () { data = 2000; ready = 1; }
    int p2 () { while (!ready); return data; }

  Undefined -- e.g., could be 0 or could be 2000
  Depends on the memory model of the machine
  If the hardware has *sequential consistency*, always returns 2000

What is sequential consistency?
  Definition: the result of execution is as if all operations were executed
    in some sequential order, and the operations of each processor occurred
    in the order specified by the program [Lamport]
  Boils down to two requirements:
    1. Maintaining program order on individual processors
    2. Ensuring write atomicity

Need synchronization primitives to write correct concurrent code:
  * Monitor: exclusive code and associated data
    Shared variables in a monitor can only be accessed by its methods
    System ensures only one method of a given monitor executes at a time
    Accesses to variables are ordered across method invocations
  * Mutex (lock): bracket access to a shared variable with lock (m) and
      unlock (m)
    System ensures only one lock (m) returns before the next unlock (m)
    Memory ops after lock (m) appear to happen after the ones before lock (m)
    Memory ops before unlock (m) happen before the ones after it
  * Semaphore: two functions P(S) [a.k.a. wait] and V(S) [a.k.a. signal]
    Only N more waits will return than signals (for initialization parameter N)
    If N == 1, acts like a mutex
  * Interrupt masking: bracket code with x = spl{high,net,...}() and splx(x)
    For a 1-CPU Unix kernel, where threads are non-preemptive but interrupts
      are not
    Hierarchical -- e.g., splhigh implies splnet

Eraser
======

What is the goal of this work?
  Find data races by checking that the program follows a locking discipline

What is a data race?
  Two accesses to the same memory location in different threads
  At least one access is a write
  No synchronization mechanism prevents simultaneous execution

Why are races a hard problem?
  Highly timing dependent
  Often have to do with the interaction of two different modules
  Manifestation of the bug may be far away from, or long after, the buggy code
    E.g., a linked list gets corrupted; you discover it the next time you
      traverse the list

Happens-before model [Lamport]:
  Processes are an ordered sequence of events, including sends and receives
  A happens before B (A -> B) if:
    (1) A and B are events in the same process P, and A precedes B, or
    (2) A is the send of message m by P1, and B is the receive of the same
        m by P2
  A and B are concurrent if neither happened before the other
    Idea: neither event could have influenced the other

What does happens-before have to do with race conditions?
  Idea: say that action A happens before B if:
    (1) A and B are events in the same thread T, and A precedes B, or
    (2) A and B are accesses to the same synchronization object, and A
        preceded B
  If A and B access the same variable and neither happens before the other,
    it's a race
  So run the program over test cases, and see if you get any such races
  Example: see Figure 1

What are the deficiencies of the happens-before approach?
  Hard to implement efficiently (how would you detect races?)
  Depends on the particular interleaving of events -- see Figure 2

So what *locking discipline* does Eraser enforce?
  Every shared variable must be protected by some lock
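For concreteness, here is a hypothetical violation of that discipline
(illustrative code, not from the paper): count is meant to be protected
by m, but one reader skips the lock.

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int count;                  /* discipline: protected by m */

    void deposit (int n)
    {
      pthread_mutex_lock (&m);
      count += n;               /* ok: m held */
      pthread_mutex_unlock (&m);
    }

    int peek (void)
    {
      return count;             /* violation: reads count with no lock held */
    }

This is exactly the pattern Eraser is designed to flag, whether or not the
race ever manifests during a test run.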
How can we detect this?
  Use ATOM to implement the lockset algorithm

What is ATOM?
  Allows you to rewrite programs
  You create two .c files:
    eraser.inst.c - lets you navigate the structure of the program, including
      adding calls to your own functions
    eraser.anal.c - contains the functions you can call when the modified
      program runs

What is the lockset algorithm?
  For each variable, keep track of a set of candidate locks
  If the set becomes empty, no lock protects the data, so flag an error
  Example: Figure 3
  (A code sketch appears below, after the implementation details.)

Why is this better than happens-before?
  Because Eraser detects violations of the locking discipline, not races,
    it can detect possible races even if they never occur during testing!

What does Eraser instrument with ATOM?
  Every load and store (except stack-relative ones)
  Every mutex operation (so it knows what locks a thread holds)
  Thread initialization and finalization code
  Malloc and free operations (so it knows when memory is reused)

But the locking discipline is only necessary for data accessed by more than
one thread.  What memory locations should not be subject to the lockset
algorithm?
  Initialization - data accessed by only one thread during initialization
    Example?  Initialize a thread control block, then put it on the run queue
  Read-shared data - no one writes it, so no lock is needed
    Examples?  Version number of the program, a global configuration
      parameter, a constant string entered into a hash table, ...
  Read-write locks

How do we avoid generating false positives for these?
  Keep track of the state of each variable (Figure 4, p. 4)
  Modify the lockset algorithm slightly for read-write locks:
    On a read of v by t, C(v) gets intersected with locks_held(t)
    On a write of v by t, C(v) gets intersected with write_locks_held(t)

p. 4 bottom: "Our support for initialization makes Eraser's checking more
dependent on the scheduler than we would like."  Why?
  If a variable is made available before its initialization is complete,
    Eraser might not detect the race

What does the program report?  How do you find the bugs?  (Sec 3)
  Reports the line of code (backtrace, registers) where the lockset becomes
    the empty set
  Is that enough?  Say a variable is accessed with the correct lock in 100
    places, and without it in 1
    The lockset likely becomes empty on the offending access, unless that
      happened to be the first access
    But what if you are unlucky?  Ask Eraser to log all accesses that change
      a variable's lockset -- one of them has to be the incorrect line of code

Implementation details:
  How much memory?  Slightly more than doubles the heap
    For each 32-bit word, keep an extra 32 bits of state:
      2 bits for the state - Virgin (initialization), Exclusive, Shared,
        Shared-Modified
      30 bits for the thread ID (while in the Exclusive state) or a lockset
        index number
  What is a lockset index number, and why use one?
    The number of distinct locksets is much smaller than the maximum possible
    Keep only one sorted copy of each lockset in a table
      Sorting helps with comparison and intersection; intersections of
        locksets are also cached
    Hash the list of locks to look up a possible index number in a hash table
    So only a small amount of memory is needed for the lockset and hash tables
  Why does this work?
    The greatest number of locksets seen was about 10,000
    Could have been exponential in the number of locks -- why not?
      Lock usage is very stylized -- same patterns, same sets of locks, etc.
      E.g., each instance of object foo might have an internal lock, but foo
        objects don't call each other's methods, so a thread will only lock
        one foo at a time (# locksets = # locks)
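As promised above, a minimal sketch of the core lockset refinement, assuming
a bitmask representation and invented names (the real implementation instead
stores a lockset index in each shadow word and layers the Figure 4 state
machine on top):

    #include <stdio.h>

    enum { NTHREADS = 8, NVARS = 1024 };

    typedef unsigned int lockset;       /* one bit per lock */

    lockset locks_held[NTHREADS];       /* maintained at lock/unlock time */
    lockset C[NVARS];                   /* candidate locksets */

    void init_var (int v) { C[v] = ~0u; }       /* start as "all locks" */

    void on_lock (int t, int m)   { locks_held[t] |=  (1u << m); }
    void on_unlock (int t, int m) { locks_held[t] &= ~(1u << m); }

    void on_access (int t, int v)
    {
      C[v] &= locks_held[t];            /* intersect with locks held by t */
      if (C[v] == 0)                    /* no lock consistently protects v */
        printf ("possible race on variable %d in thread %d\n", v, t);
    }

Starting C(v) at "all locks" and reporting once it goes empty is what makes
the check independent of whether the racing interleaving actually occurred
during the run.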
What other possible causes of false positives are there?
  Memory reuse - private allocators
    Example: the Vesta log flush routine makes all entries private and
      empties the list
  Private locks - why would you do this?
    Pthreads does not include shared/exclusive locks
  Benign races - examples?
    Set kill_queries = 1; when the worker threads notice the flag, they exit
  Multiple locks - any one lock required for a read, all required for a write
    Design pattern for callbacks; helps avoid deadlocks
    How to fix Eraser?  Only reduce the lockset on writes; just check the
      lockset on reads
      But dangerous, because it might produce false negatives
  Producer/consumer (post/wait) synchronization - example?
    Pass a buffer to the disk driver; the interrupt handler notifies you
      when the read is done
    Might use a P/V-style semaphore mechanism to accomplish this
    Why is this hard?  Eraser doesn't know which semaphores are "owned" by
      the current thread
  Partial solution?  Allow annotations:
    - EraserIgnoreOn (), EraserIgnoreOff ()
    - EraserReuse (address, size)
    - EraserReadLock (lock), EraserReadUnlock (lock),
      EraserWriteLock (lock), EraserWriteUnlock (lock)

How does Eraser work on a kernel with splhigh () ... splx (), etc.?
  Pretend that each spl level has a lock
  At a particular spl level, or in an interrupt handler, the thread holds
    the locks for the current and all lower levels

p. 7 (Sec 4.1): "... it might seem safe to access the p->ip_fp field in the
rest of the procedure ...  But in fact this would be a mistake."  Why?
  p->ip_fp might point to uninitialized data
  The Alpha architecture does not offer sequential consistency
  There is no ordering guarantee on stores without the mb instruction
    in UNLOCK

p. 8 (Sec 4.2): Why is Combine::XorFPTag::FPVal incorrect?
  Again, no sequential consistency; it needs an mb before setting
    this->validFP = true (see the sketch at the end of these notes)

What to do about deadlock?
  Add partial-order checking to the enforced locking discipline
  Seemed to find one deadlock easily this way

How is the evaluation?
  Performance?  Factor of 10-30 slowdown (Sec 3.2)
  Utility?
    No serious bugs in AltaVista, but might have found bugs that had earlier
      taken several months to fix
    One bug in Vesta
    No serious bugs in Petal
  How about the claim that Eraser is less sensitive to the schedule than
    happens-before?
    Found the same bugs in AltaVista and Vesta using 2 or 10 threads --
      a good sign
  Undergraduate programs?  10% of seemingly working programs had bugs
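The Sec 4.2 pattern, reconstructed as a sketch (names, code, and the use of
C11 atomics in place of the Alpha mb instruction are mine, not the paper's):
a lazily computed value guarded by a valid flag.  The write-write race is the
intentionally "benign" part (every writer stores the same value); the actual
bug is the missing ordering between the two stores.

    #include <stdatomic.h>
    #include <stdbool.h>

    static int val;             /* lazily computed; same value every time */
    static atomic_bool valid;   /* a plain flag in the buggy original */

    extern int expensive_computation (void);

    int get_val (void)
    {
      if (!atomic_load_explicit (&valid, memory_order_acquire)) {
        val = expensive_computation ();
        /* Without ordering here (mb on the Alpha; a release store in C11),
         * another CPU may observe valid == true before it observes the
         * store to val, and return garbage. */
        atomic_store_explicit (&valid, true, memory_order_release);
      }
      return val;
    }

This is why the paper's point is sharp: the race was classified as benign,
yet without the barrier it is a real bug.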