Synchronization background
==========================

What happens if you run this function in two threads concurrently?

    void f () { x++; }

  Undefined -- e.g., could increment x by 1 or 2 depending on the interleaving
  Note that even a single instruction (incl) is not atomic on a multiprocessor
    (x86 allows atomic accesses, but requires an expensive lock prefix)

What might p2 return if run concurrently with p1?

    int data = 0, ready = 0;
    void p1 () { data = 2000; ready = 1; }
    int p2 () { while (!ready); return data; }

  Undefined -- e.g., could be 0 or could be 2000
  Depends on the memory model of the machine
  If the hardware has *sequential consistency*, always returns 2000

What is sequential consistency?
  Definition: the result of execution is as if all operations were executed
    in some sequential order, and the operations of each processor occurred
    in the order specified by the program [Lamport]
  Boils down to two requirements:
    1. Maintaining program order on individual processors
    2. Ensuring write atomicity

Need synchronization primitives to write correct concurrent code:
  * Monitor: exclusive code and associated data
    Shared variables in a monitor can only be accessed by its methods
    System ensures only one method of a given monitor executes at a time
    Accesses to variables are ordered across method invocations
  * Mutex (lock): bracket access to a shared variable with lock (m) and
      unlock (m)
    System ensures only one lock (m) returns before the next unlock (m)
    Memory ops after lock (m) appear to happen after the ones before lock (m)
    Memory ops before unlock (m) happen before the ones after it
  * Semaphore: two functions P(S) [a.k.a. wait] and V(S) [a.k.a. signal]
    Only N more waits will return than signals (for initialization parameter N)
    If N == 1, acts like a mutex
  * Interrupt masking: bracket code with x = spl{high,net,...}() and splx(x)
    For a 1-CPU Unix kernel, where threads are non-preemptive but interrupts
      are not
    Hierarchical -- e.g., splhigh implies splnet

Eraser
======

What is the goal of this work?
  Find data races by checking that the program follows a locking discipline

What is a data race?
  Two accesses to the same memory location in different threads
  At least one access is a write
  No synchronization mechanism prevents simultaneous execution

Why are races a hard problem?
  Highly timing dependent
  Often have to do with the interaction of two different modules
  Manifestation of the bug may be far away from, or long after, the buggy code
    E.g., a linked list gets corrupted; you discover it the next time you
      traverse the list

Happens-before model [Lamport]:
  Processes are an ordered sequence of events, including sends and receives
  A happens before B (A -> B) if:
    (1) A and B are events in the same process P, and A precedes B, or
    (2) A is the send of message m by P1, and B is the receive of the same
        m by P2
  A and B are concurrent if neither happened before the other
    Idea: neither event could have influenced the other

What does happens-before have to do with race conditions?
  Idea: say that action A happens before B if:
    (1) A and B are events in the same thread T, and A precedes B, or
    (2) A and B are accesses to the same synchronization object, and A
        preceded B
  If A and B access the same variable and neither happens before the other,
    it's a race
  So run the program over test cases, and see if you get any such races
  Example: see Figure 1

What are the deficiencies of the happens-before approach?
  Hard to implement efficiently (how would you detect races?)
  Depends on the particular interleaving of events -- see Figure 2

So what *locking discipline* does Eraser enforce?
  Every shared variable must be protected by some lock
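For concreteness, here is a hypothetical violation of that discipline
(illustrative code, not from the paper): count is meant to be protected
by m, but one reader skips the lock.

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int count;                  /* discipline: protected by m */

    void deposit (int n)
    {
      pthread_mutex_lock (&m);
      count += n;               /* ok: m held */
      pthread_mutex_unlock (&m);
    }

    int peek (void)
    {
      return count;             /* violation: reads count with no lock held */
    }

This is exactly the pattern Eraser is designed to flag, whether or not the
race ever manifests during a test run.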
How can we detect this?
  Use ATOM to implement the lockset algorithm

What is ATOM?
  Allows you to rewrite programs
  You create two .c files:
    eraser.inst.c - lets you navigate the structure of the program, including
      adding calls to your own functions
    eraser.anal.c - contains the functions you can call when the modified
      program runs

What is the lockset algorithm?
  For each variable, keep track of a set of candidate locks
  If the set becomes empty, no lock protects the data, so flag an error
  Example: Figure 3
  (A code sketch appears below, after the implementation details.)

Why is this better than happens-before?
  Because Eraser detects violations of the locking discipline, not races,
    it can detect possible races even if they never occur during testing!

What does Eraser instrument with ATOM?
  Every load and store (except stack-relative ones)
  Every mutex operation (so it knows what locks a thread holds)
  Thread initialization and finalization code
  Malloc and free operations (so it knows when memory is reused)

But the locking discipline is only necessary for data accessed by more than
one thread.  What memory locations should not be subject to the lockset
algorithm?
  Initialization - data accessed by only one thread during initialization
    Example?  Initialize a thread control block, then put it on the run queue
  Read-shared data - no one writes it, so no lock is needed
    Examples?  Version number of the program, a global configuration
      parameter, a constant string entered into a hash table, ...
  Read-write locks

How do we avoid generating false positives for these?
  Keep track of the state of each variable (Figure 4, p. 4)
  Modify the lockset algorithm slightly for read-write locks:
    On a read of v by t, C(v) gets intersected with locks_held(t)
    On a write of v by t, C(v) gets intersected with write_locks_held(t)

p. 4 bottom: "Our support for initialization makes Eraser's checking more
dependent on the scheduler than we would like."  Why?
  If a variable is made available before its initialization is complete,
    Eraser might not detect the race

What does the program report?  How do you find the bugs?  (Sec 3)
  Reports the line of code (backtrace, registers) where the lockset becomes
    the empty set
  Is that enough?  Say a variable is accessed with the correct lock in 100
    places, and without it in 1
    The lockset likely becomes empty on the offending access, unless that
      happened to be the first access
    But what if you are unlucky?  Ask Eraser to log all accesses that change
      a variable's lockset -- one of them has to be the incorrect line of code

Implementation details:
  How much memory?  Slightly more than doubles the heap
    For each 32-bit word, keep an extra 32 bits of state:
      2 bits for the state - Virgin (initialization), Exclusive, Shared,
        Shared-Modified
      30 bits for the thread ID (while in the Exclusive state) or a lockset
        index number
  What is a lockset index number, and why use one?
    The number of distinct locksets is much smaller than the maximum possible
    Keep only one sorted copy of each lockset in a table
      Sorting helps with comparison and intersection; intersections of
        locksets are also cached
    Hash the list of locks to look up a possible index number in a hash table
    So only a small amount of memory is needed for the lockset and hash tables
  Why does this work?
    The greatest number of locksets seen was about 10,000
    Could have been exponential in the number of locks -- why not?
      Lock usage is very stylized -- same patterns, same sets of locks, etc.
      E.g., each instance of object foo might have an internal lock, but foo
        objects don't call each other's methods, so a thread will only lock
        one foo at a time (# locksets = # locks)
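As promised above, a minimal sketch of the core lockset refinement, assuming
a bitmask representation and invented names (the real implementation instead
stores a lockset index in each shadow word and layers the Figure 4 state
machine on top):

    #include <stdio.h>

    enum { NTHREADS = 8, NVARS = 1024 };

    typedef unsigned int lockset;       /* one bit per lock */

    lockset locks_held[NTHREADS];       /* maintained at lock/unlock time */
    lockset C[NVARS];                   /* candidate locksets */

    void init_var (int v) { C[v] = ~0u; }       /* start as "all locks" */

    void on_lock (int t, int m)   { locks_held[t] |=  (1u << m); }
    void on_unlock (int t, int m) { locks_held[t] &= ~(1u << m); }

    void on_access (int t, int v)
    {
      C[v] &= locks_held[t];            /* intersect with locks held by t */
      if (C[v] == 0)                    /* no lock consistently protects v */
        printf ("possible race on variable %d in thread %d\n", v, t);
    }

Starting C(v) at "all locks" and reporting once it goes empty is what makes
the check independent of whether the racing interleaving actually occurred
during the run.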
What other possible causes of false positives are there?
  Memory reuse - private allocators
    Example: the Vesta log flush routine makes all entries private and
      empties the list
  Private locks - why would you do this?
    Pthreads does not include shared/exclusive locks
  Benign races - examples?
    Set kill_queries = 1; when the worker threads notice the flag, they exit
  Multiple locks - any one lock required for a read, all required for a write
    Design pattern for callbacks; helps avoid deadlocks
    How to fix Eraser?  Only reduce the lockset on writes; just check the
      lockset on reads
      But dangerous, because it might produce false negatives
  Producer/consumer (post/wait) synchronization - example?
    Pass a buffer to the disk driver; the interrupt handler notifies you
      when the read is done
    Might use a P/V-style semaphore mechanism to accomplish this
    Why is this hard?  Eraser doesn't know which semaphores are "owned" by
      the current thread
  Partial solution?  Allow annotations:
    - EraserIgnoreOn (), EraserIgnoreOff ()
    - EraserReuse (address, size)
    - EraserReadLock (lock), EraserReadUnlock (lock),
      EraserWriteLock (lock), EraserWriteUnlock (lock)

How does Eraser work on a kernel with splhigh () ... splx (), etc.?
  Pretend that each spl level has a lock
  At a particular spl level, or in an interrupt handler, the thread holds
    the locks for the current and all lower levels

p. 7 (Sec 4.1): "... it might seem safe to access the p->ip_fp field in the
rest of the procedure ...  But in fact this would be a mistake."  Why?
  p->ip_fp might point to uninitialized data
  The Alpha architecture does not offer sequential consistency
  There is no ordering guarantee on stores without the mb instruction
    in UNLOCK

p. 8 (Sec 4.2): Why is Combine::XorFPTag::FPVal incorrect?
  Again, no sequential consistency; it needs an mb before setting
    this->validFP = true (see the sketch at the end of these notes)

What to do about deadlock?
  Add partial-order checking to the enforced locking discipline
  Seemed to find one deadlock easily this way

How is the evaluation?
  Performance?  Factor of 10-30 slowdown (Sec 3.2)
  Utility?
    No serious bugs in AltaVista, but might have found bugs that had earlier
      taken several months to fix
    One bug in Vesta
    No serious bugs in Petal
  How about the claim that Eraser is less sensitive to the schedule than
    happens-before?
    Found the same bugs in AltaVista and Vesta using 2 or 10 threads --
      a good sign
  Undergraduate programs?  10% of seemingly working programs had bugs
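The Sec 4.2 pattern, reconstructed as a sketch (names, code, and the use of
C11 atomics in place of the Alpha mb instruction are mine, not the paper's):
a lazily computed value guarded by a valid flag.  The write-write race is the
intentionally "benign" part (every writer stores the same value); the actual
bug is the missing ordering between the two stores.

    #include <stdatomic.h>
    #include <stdbool.h>

    static int val;             /* lazily computed; same value every time */
    static atomic_bool valid;   /* a plain flag in the buggy original */

    extern int expensive_computation (void);

    int get_val (void)
    {
      if (!atomic_load_explicit (&valid, memory_order_acquire)) {
        val = expensive_computation ();
        /* Without ordering here (mb on the Alpha; a release store in C11),
         * another CPU may observe valid == true before it observes the
         * store to val, and return garbage. */
        atomic_store_explicit (&valid, true, memory_order_release);
      }
      return val;
    }

This is why the paper's point is sharp: the race was classified as benign,
yet without the barrier it is a real bug.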