The Scalable Commutativity Rule
--------------------------------------------------------------------------------

* First and foremost, MIDTERM on Monday.
  - covering all material that we've covered
  - actually reading the papers is EXTREMELY important
    - we obviously don't cover everything in lecture (see last lecture)
  - look at previous years for examples
  - can bring/use any printed material
    - papers, lecture notes, lab code, wikipedia, etc.
    - you won't have enough time to read a paper on the spot
  - excellent performance on the final can make up for a poor midterm
    - not totally, of course

* Lab grades will be posted tonight.

Last lecture ended a bit abruptly. KEY POINTS:
  1) minimize sharing => achieve scalability
     - focus was on false sharing, which an allocator can induce but shouldn't
  2) reuse unused resources => bound blow-up
     - strategy based on the empty fraction

--------------------------------------------------------------------------------

We continue to discuss scalability.

- What we want (scalability):

        |           /
      w |         /
      o |       /
      r |     /
      k |   /
        | /
        ----------------
           worker count

- What we actually (usually) get:

      w |
      o |
      r |      _____
      k |     /     \
        |    /       \
        ----------------
           worker count

Q) Why? (assuming there's more than enough work to do)
A) Workers are not working independently; time is spent communicating
   instead of performing useful work.

--------------------------------------------------------------------------------

The Scalable Commutativity Rule
MIT CSAIL + Harvard (Eddie); SOSP 2013

Best Paper Award, SOSP 2013
Best Thesis Award, OSDI 2014

Q) What does the paper have to say about scalability?
A) The paper provides a _rule_ that we can use to know when there exists a
   scalable implementation.

   informally:

     interface operations commute => they can be implemented in a scalable way

   formally:

     H = X || Y
     H ∈ S
     Y SIM-commutes in H
     ------------------------------------------------------
     ∃ implementation of S where Y is conflict-free

     (we don't actually need H ∈ S there, since SIM-commuting requires that
     as well)

   * we have histories (H, with concatenation of histories X and Y)
   * we have specifications (S)
   * we have "SIM-commutativity"
   * we have conflict freedom

Q) What does it mean for operations to commute?
A) It means that there's no way to distinguish the order in which they
   executed, given some interface.
A) Regardless of the order that those operations are performed in, the
   output will always be "the same" (as specified by some specification).

   Example:     creat("/a/b")  creat("/a/c")
   Example:     open(x) => rand_fd   open(y) => rand_fd
   Not Example: close(x), open(x)

Q) Why is this interesting? What might we do differently because of this rule?
A) We'd likely take a completely (!!!) different approach to designing a
   scalable system. Prior to this rule, designing for scalability was
   iterative: plot, fix, plot, fix. Now, we can design for scalability
   _before writing a single line of code!_ Very, VERY different.

Q) Doesn't "what scales" depend on the hardware/architecture?
A) Yes. Recall that we get scalability when workers are performing work
   independently. Dependence in modern systems occurs through the cache
   coherence protocol (MESI). So as long as we don't trigger cache
   coherence traffic, we're okay: workers can perform reads/writes
   independently. That is conflict freedom.
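To make "conflict-free" concrete before diving into MESI, here is a minimal
C sketch (mine, not from the paper; the names, thread count, and the 64-byte
line size are assumptions about the machine): each thread increments only
its own counter, padded so that no two counters share a cache line.

    /*
     * Minimal sketch of conflict-free memory accesses: per-thread
     * counters, each padded to its own (assumed 64-byte) cache line.
     * Build with: cc -O2 -pthread counters.c
     */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define ITERS      10000000
    #define CACHE_LINE 64          /* assumed cache-line size */

    struct percpu_counter {
        unsigned long count;
        char pad[CACHE_LINE - sizeof(unsigned long)]; /* own line per counter */
    };

    static struct percpu_counter counters[NTHREADS];

    static void *worker(void *arg)
    {
        struct percpu_counter *c = arg;
        for (long i = 0; i < ITERS; i++)
            c->count++;            /* touches only this thread's line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        unsigned long total = 0;

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        for (int i = 0; i < NTHREADS; i++)
            total += counters[i].count;  /* the only cross-thread reads */

        printf("total = %lu\n", total);
        return 0;
    }

Without the pad field, adjacent counters would sit on the same cache line,
and every increment would trigger the coherence traffic described next,
even though no data is logically shared: that's false sharing.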
--------------------------------------------------------------------------------

Quick Review of MESI and Cache Coherence
=========================================

Let's see why reading/writing to unshared cache lines leads to conflict
freedom and scalability in MESI-like architectures.

A simplified view of the memory hierarchy looks as follows:

    -----------------------------------------
    |              main memory              |
    -----------------------------------------
         |                          |
    -----------------------------------------
         |                          |
      P1 CACHE                   P2 CACHE
     -----------                -----------
     |         |                |         |   <- cache line = 64 bytes
     -----------                -----------
     |         |                |         |
     -----------                -----------
     |         |                |         |
     -----------                -----------
     |         |                |         |
     -----------                -----------
     |         |                |         |
     -----------                -----------
     |         |                |         |
     -----------                -----------

* All processors can communicate with each other as well as with main memory.
* Each processor has its own local cache.
* A cache is separated into cache lines.
* Data is read from main memory into the local cache in chunks of at least
  one cache line.
* Data is read and written directly from/to the cache.

--------------------------------------------------------------------------------

MESI Cache Coherence
================================================

* The goal of cache coherence is to keep caches in sync (coherent).
  - We want a coherent view of memory in an SMP system.
* MESI is a message-based (really, snooping) protocol for cache coherence.
* Each cache line is annotated with one of 4 states: M, E, S, or I.

  (I) Invalid   (can be reused)
      The cache line holds no valid data.
  (S) Shared    (read only)
      The cache line may be stored in other caches of the machine and is
      clean.
  (E) Exclusive (can read/write)
      The cache line is present only in the current cache and is clean.
  (M) Modified  (can read/write)
      The cache line is present only in the current cache and is dirty.

* Cache lines start in the I state.

    -----------------------------------------
    |              main memory              |
    -----------------------------------------
         |                          |
    -----------------------------------------
         |                          |
      P1 CACHE                   P2 CACHE
     -----------                -----------
     |    I    |                |    I    |   <- cache line = 64 bytes
     -----------                -----------
     |    I    |                |    I    |
     -----------                -----------
     |    I    |                |    I    |
     -----------                -----------
     |    I    |                |    I    |
     -----------                -----------
     |    I    |                |    I    |
     -----------                -----------
     |    I    |                |    I    |
     -----------                -----------

--------------------------------------------------------------------------------

Let's walk through one scenario:

* P1 reads 0x00 (CL0) -> CL0 becomes E in P1.
* P2 reads 0x10 (CL0) -> CL0 becomes S in P2.
  - P1 notices, changes its copy of CL0 to S, and probably forwards the line.
* P1 wants to write to 0x04 (CL0):
  - P1 communicates that it wants to upgrade CL0 to E
  - P1 waits for everyone to report they've invalidated
    - P2 marks CL0 as I
    - P2 reports the invalidation
  - P1 writes to CL0, marking it as M

Q) What would happen if P2 now wants to write to 0x14?
A) We need to go through another round of coherence traffic.

KEY POINT: A processor cannot perform work independently if it is sharing a
cache line with another processor. Scalability is not possible without
performing work independently!

--------------------------------------------------------------------------------

Let's walk through a different scenario:

* P1 reads 0x00 (CL0) -> CL0 becomes E in P1.
* P2 reads 0x80 (CL2) -> CL2 becomes E in P2.
* P1 wants to write to 0x04 (CL0):
  Q) What happens?
  A) P1 just writes to CL0 -> CL0 becomes M.
* P2 wants to write to 0x80 (CL2):
  Q) What happens?
  A) P2 just writes to CL2 -> CL2 becomes M.

KEY POINTS: A processor CAN perform work independently if it is reading and
writing only exclusive (non-shared) cache lines. Scalability IS possible.
Adding more cores to a system running software that has conflict-free memory
accesses should result in linear scaling, since work can be performed
independently on those new cores.
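For intuition, here is a toy model (my sketch, not real hardware) of the
MESI transitions for a single cache line in two caches, replaying the first
scenario above. The two-processor simplification and function names are
assumptions; real caches also handle write-backs, forwarding, and many
lines/cores.

    /* Toy MESI model: one cache line, two processors (P1 = 0, P2 = 1). */
    #include <stdio.h>

    enum mesi { I, S, E, M };
    static const char *name[] = { "I", "S", "E", "M" };

    static enum mesi line[2] = { I, I };   /* the line starts Invalid everywhere */

    static void read_line(int p)
    {
        int other = 1 - p;
        if (line[p] != I)
            return;                /* already readable locally */
        if (line[other] == I) {
            line[p] = E;           /* no other copy exists: Exclusive */
        } else {
            /* another copy exists (an M copy would write back first);
             * both copies become Shared */
            line[other] = S;
            line[p] = S;
        }
    }

    static void write_line(int p)
    {
        int other = 1 - p;
        if (line[p] != E && line[p] != M)
            line[other] = I;       /* broadcast invalidate, wait for ack */
        line[p] = M;               /* now the only, dirty copy */
    }

    int main(void)
    {
        read_line(0);   /* P1 reads 0x00  -> P1: E        */
        read_line(1);   /* P2 reads 0x10  -> P1: S, P2: S */
        write_line(0);  /* P1 writes 0x04 -> P1: M, P2: I */
        printf("P1=%s P2=%s\n", name[line[0]], name[line[1]]);
        return 0;
    }

Running it prints "P1=M P2=I", matching the end state of the walkthrough;
the write in the second scenario would skip the invalidate entirely, which
is exactly why conflict-free accesses scale.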
--------------------------------------------------------------------------------

SO FAR, we've built up:

  conflict-free implementation => implementation scales

But the rule is actually:

  interface operations commute => they can be implemented in a scalable way

KEY INSIGHT:

  interface operations commute => conflict-free implementation exists

We'll spend most of the rest of lecture building this up.

--------------------------------------------------------------------------------

To build this up, we need a more rigorous understanding of commutativity and
conflict freedom. Let's start with the former.

ACTIONS

  An action is either an invocation or a response.
    - in an OS, an invocation is: a system call with its arguments
    - in an OS, a response is: the return value of that syscall

  An action is comprised of exactly the following:
    1) an operation class (e.g., which syscall)
    2) arguments for invocations, the return value for responses
    3) the relevant thread
    4) a tag for uniqueness

  Actions come in pairs: invoked/responded by the same thread. We'll
  represent the invocation as a lowercase letter and the response to that
  invocation as the same lowercase letter with a '.

    a a' is the pair for action a

  We'll offset actions vertically when they occur in different threads.

    t1: a        a'
    t2:    b b'

  This means that a was invoked in t1, but before it could respond, b was
  invoked and responded in t2. Finally, a responds in t1.

HISTORIES

  A history is defined as a sequence of actions. Common letters used to
  represent a history are 'H', 'X', 'Y', 'Z'.

    H = X || Y    (|| means concatenation: history X then history Y)

  So for actions a and b, we might have:

    H = t1: a        a'
        t2:    b b'

  We only consider well-formed histories: one outstanding invocation per
  thread at a time.

PER-THREAD SUBHISTORIES

  Histories can be filtered for a single thread 't':

    H | t

  This gives you only the actions of that thread.

    H = t1: a        a'
        t2:    b b'

    H | t1 == a a'

REORDERINGS

  A history can be "reordered". A history H' is a "reordering" of history H
  when FOR ALL t, H | t = H' | t. In other words: both histories contain the
  same actions, in the same order on each thread, but the threads can
  interleave differently.

SPECIFICATION

  A "specification" declares which histories are "correct".
    - a prefix-closed set of well-formed histories
      - a set S of histories is prefix closed if for all histories s and t:
        t ∈ S and s ≤ t (s is a prefix of t) => s ∈ S
    - example: the history A1 = getpid(), A1' = 0 is not in the Unix spec,
      since no process has id 0

SI COMMUTATIVITY

  Given H = X || Y, Y SI-commutes in H when: given any reordering Y' of Y,
  and any sequence Z,

    (X || Y || Z) in spec IFF (X || Y' || Z) in spec

  - the presence of any Z requires reorderings to be indistinguishable by
    any future actions
    - in particular, this means the state created by Y (or any Y') has to be
      the same (as far as the specification goes), since Z could look at any
      component of it (commutativity)
  - ensuring all reorderings of Y, when concatenated with X, meet the spec
    ensures the return values are valid regardless of the order of actions
    (interface-based)
  - doesn't depend on the implementation, just the specification!
    (interface-based)
  - X sets up the state (state-dependent commutativity)
    - Note that state-dependence means Y can commute after some Xs but not
      others
      - good, because we don't get an all-or-nothing notion
      - can expand the set of situations that commute
      - we say that these "conditionally commute"
      - example: open("a", O_CREAT | O_EXCL), which commutes with another
        such open only when "a" already exists (see the sketch below)
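A brute-force way to see state-dependent commutativity: run two operations
in both orders against a tiny model of the specification and compare the
responses and final states. This is only a sketch of the idea (not the
paper's ANALYZER, which works symbolically), and the one-file model of
open("a", O_CREAT | O_EXCL) is a hypothetical, drastically simplified spec:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    struct state { bool a_exists; };   /* toy spec state: one possible file */

    /* model of open("a", O_CREAT|O_EXCL): 0 on success, -1 if "a" exists */
    static int excl_create(struct state *s)
    {
        if (s->a_exists)
            return -1;
        s->a_exists = true;
        return 0;
    }

    /* op1 and op2 commute in state s iff both orders yield the same
     * per-operation responses and the same final state */
    static bool commutes(struct state s, int (*op1)(struct state *),
                         int (*op2)(struct state *))
    {
        struct state s12 = s, s21 = s;
        int r1a = op1(&s12), r2a = op2(&s12);  /* op1 then op2 */
        int r2b = op2(&s21), r1b = op1(&s21);  /* op2 then op1 */
        return r1a == r1b && r2a == r2b &&
               memcmp(&s12, &s21, sizeof s12) == 0;
    }

    int main(void)
    {
        struct state absent  = { .a_exists = false };
        struct state present = { .a_exists = true  };

        /* X left "a" absent: whichever create runs first wins, so the
         * orders are distinguishable -> does NOT commute */
        printf("\"a\" absent:  %s\n",
               commutes(absent, excl_create, excl_create) ? "commutes" : "no");
        /* X already created "a": both orders fail identically -> commutes */
        printf("\"a\" present: %s\n",
               commutes(present, excl_create, excl_create) ? "commutes" : "no");
        return 0;
    }

This prints "no" for the absent case and "commutes" for the present case:
the same pair of operations conditionally commutes depending on the X that
preceded them.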
SIM-COMMUTATIVITY

  SI-commutativity doesn't suffice!
  - Y may be a concatenation of histories, Y = Y1 || Y2
  - it's possible for Y1 not to SI-commute in H by itself, even though Y does
  - example:
      Y1 = [(A1 = set(1), A1'), (B2 = set(2), B2')]
      Y2 = [(C1 = set(2), C1')]
    - all reorderings of Y leave the value at 2, so Y SI-commutes in any H
    - but Y1 does not SI-commute: the value could be 1 or 2 depending on
      the order
  - we say such a Y does not commute _monotonically_: its commutativity
    appears only at the end rather than holding as the history grows
  - so we need to ensure that every prefix P of Y SI-commutes in H = X || P
  - if they all do, then Y SIM-commutes in H = X || Y
  - the full SIM-commutativity statement is:

      Y SIM-commutes in a history H = X || Y when, for any prefix P of any
      reordering of Y, P SI-commutes in X || P

IMPLEMENTATION

  An implementation is a step function m: S x I -> S x R

    m(state, invocation) -> (state, response)

    S = set of implementation states
    I = set of valid invocations
    R = set of valid responses

  Each state is some tuple s ∈ S = (s.0, s.1, ..., s.m)
    - s(i<-x) is component replacement: s with s.i set to x, like a write

  Consider a step m(s, a) producing (s', r):
    - the step "writes" state component i when s.i != s'.i
    - the step "reads" state component i when for some y:
        m(s(i<-y), a) != (s'(i<-y), r)
      that is, changing s.i changes the step's outcome, so the step
      observes it

  An access conflict is defined as:
    - two steps on different threads where one writes a state component and
      the other reads or writes that same component

  A set of steps is conflict-free when no pair of steps in the set has an
  access conflict.

  MESI-based cache-coherent machines usually execute conflict-free sets of
  steps scalably.

PROOF

  The proof is by construction. Show that because Y SIM-commutes in
  H = X || Y, we can construct an implementation that maintains per-thread
  histories in the region of Y. It then handles requests for responses from
  different cores entirely independently. This is only possible because the
  region is commutative: even if the implementation gives responses in a
  different order than the "initial history", it'll be correct, since the
  order is irrelevant.

--------------------------------------------------------------------------------

The paper builds three tools to apply the rule in practice:

1) ANALYZER takes a symbolic model of an interface
   - computes under which conditions the interface's operations commute

2) TESTGEN converts ANALYZER's commutativity conditions into concrete tests
   - assigns concrete values to the symbolic state + arguments

3) MTRACE checks whether a particular implementation is conflict-free
   - runs the test cases generated by TESTGEN
   - runs the entire OS in QEMU
   - at the start of each test, MTRACE hypercalls to enable memory access
     recording
   - MTRACE logs all reads and writes by each core, along with
     - info about the currently executing thread (to filter irrelevant
       background work)
     - also a stack trace
   - after execution, MTRACE analyzes the log and reports conflicts (see
     the sketch below)
     - using DWARF debug info to resolve data types and names, etc.
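That last analysis step, finding conflicts in an access log, reduces to a
simple pairwise check. Here is a minimal sketch of the core idea (mine, not
MTRACE's actual code; the log format and 64-byte line size are assumptions):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define CACHE_LINE 64

    struct access {
        int       thread;
        uintptr_t addr;
        bool      is_write;
    };

    static uintptr_t line_of(uintptr_t addr) { return addr / CACHE_LINE; }

    /* report pairs of accesses on different threads that touch the same
     * cache line, where at least one access is a write */
    static void report_conflicts(const struct access *log, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (log[i].thread != log[j].thread &&
                    line_of(log[i].addr) == line_of(log[j].addr) &&
                    (log[i].is_write || log[j].is_write))
                    printf("conflict: t%d %s vs t%d %s on line %#lx\n",
                           log[i].thread, log[i].is_write ? "W" : "R",
                           log[j].thread, log[j].is_write ? "W" : "R",
                           (unsigned long)line_of(log[i].addr));
    }

    int main(void)
    {
        /* hypothetical log mirroring the earlier MESI scenarios */
        struct access log[] = {
            { 0, 0x00, false },  /* t0 reads  CL0                     */
            { 1, 0x04, true  },  /* t1 writes CL0 -> conflicts with t0 */
            { 1, 0x80, true  },  /* t1 writes CL2 -> no conflict       */
        };
        report_conflicts(log, 3);
        return 0;
    }

The real tool works from a per-core trace with thread metadata and stack
traces, but the conflict predicate it checks is essentially this one: two
steps on different threads, same line, at least one write.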