Hive
====

What are the top-level goals of the Hive and FLASH projects?
  Provide a huge shared-memory multiprocessor
  Only justified if the OS gives good support for shared memory
    and flexible allocation of CPUs
  Hive is an OS for the FLASH multiprocessor machine

Who needs to use a machine like FLASH?
  1. Highly parallel scientific applications
  2. Very high performance servers
     Want to support many users / Unix processes
  In general, lets you spread computation and memory over many nodes

What goes wrong if you try to use a regular SMP OS on FLASH?
  Performance goes down, because memory accesses are more expensive
  Failures go up. Why? (Shouldn't big expensive hardware be more reliable?)

DISCO was written by some of the same authors several years later...
  Should we just use DISCO w. SplashOS for #1 and IRIX for #2 above?

What does the FLASH hardware look like?
  System is structured as a set of nodes in a grid
  Each node has a CPU, memory, cache, and cache controller (Fig 2.1)
  Obviously some pairs of nodes have much higher latency than others

What memory consistency model?
  Provides sequential consistency, using a cc-NUMA architecture
  As in most cc-NUMA machines, every node can access any physical address
  Gives a single image of physical memory across all cells

What other features does the hardware offer?
  Firewall (more on this later)
  Special support for RPCs
    Separate request/reply receive queues
    How does this compare to standard IPIs?

What are the key failures Hive must deal with?
  A node stops working (fail stop)
    Makes that node's memory inaccessible
  A node returns bad values for memory reads
  Software issues wild writes that corrupt other nodes' memory

What would a truly fault-tolerant hardware design look like?
  Have to replicate all state
    Two memory banks for every region of memory
    At least two CPUs executing the same instruction stream in sync
    Would need three CPUs for non-fail-stop behavior... why?

Goal: not fault tolerance but "fault containment". What's this?
  If a node fails, they are willing to lose the programs/data on that node
  They don't want the problem to spread
  And they'd like policies that make a 1% failure affect only 1% of apps

How does fault containment help big parallel scientific applications?
  Probably not so much, since a single process is spread over all the machines
  Fortunately, authors of such software are already used to checkpointing
  (On conventional hardware, could still lose 3/4 of a month-long computation.)

How does fault containment help time-sharing apps?
  If the app's memory is not affected by the fault, no reason to kill it

When might an app's memory be affected by the fault?
  If the app is running on the faulty node (obviously); then you are toast
  If the app is sharing memory with an app on a faulty node; what kind of sharing?
    Sounds like they care about file-system based sharing
      Standard way for apps to interact in Unix
      Where does the disk reside? Sounds like some nodes just have disks
      So the node with the disk could fail, too
      Or some other node writing to the file's buffers could fail
  If the app's node has borrowed physical memory from another cell
    (happens under memory pressure)
  Other uses of shared memory:
    Copy-on-write fork where the child is in a different cell
    Memory-mapped files or otherwise explicitly shared memory

What are they willing to give up to make Hive happen?
  The SMP-style single kernel
    though they heavily re-use code from a traditional operating system
    and they hack the cell kernels to present a single system image
  Maybe a small performance penalty for dealing w. fault containment

What mechanisms do they propose?
  Careful reads. How does this work? (sec 4.1)
    Point: detect kernel data mangling due to nodes failing
    Not really protecting against arbitrary failures
    Is the point really a crash while updating some kernel data structure?
    Basically like call-by-value-result in Nooks; see the sketch below
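    A minimal sketch of the careful-read idea in C, with invented names:
    remote_proc, careful_copy, and the constants are illustrative, not Hive's
    actual code or data structures. The shape is the Nooks-style
    call-by-value-result: copy the remote cell's record into local memory,
    treat a failed copy as a dead cell, and sanity-check the copy before
    trusting it; later code never touches the remote pointer again.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* A per-process record owned by another cell; fields invented. */
    struct remote_proc {
        int    state;
        size_t npages;
    };

    enum { PROC_FREE, PROC_RUNNING, PROC_STOPPED };
    #define MAX_PAGES_PER_PROC (1 << 20)

    /* In Hive this copy would also have to survive a bus error from a
     * dead node (presumably via a trap handler); a plain memcpy stands
     * in here and just reports success. */
    static bool careful_copy(void *dst, const volatile void *src, size_t len)
    {
        memcpy(dst, (const void *)src, len);
        return true;
    }

    /* Careful read: returns true iff we got a self-consistent local
     * snapshot of the remote record.  All later work uses only *local. */
    bool careful_read_proc(const volatile struct remote_proc *remote,
                           struct remote_proc *local)
    {
        /* 1. Call-by-value: copy the whole record into local memory. */
        if (!careful_copy(local, remote, sizeof *local))
            return false;               /* remote cell died during the read */

        /* 2. Sanity-check the copy; a wild write or a crash in the middle
         *    of an update can leave garbage even when the copy succeeds. */
        if (local->state != PROC_FREE &&
            local->state != PROC_RUNNING &&
            local->state != PROC_STOPPED)
            return false;
        if (local->npages > MAX_PAGES_PER_PROC)
            return false;

        return true;                    /* safe to act on the local copy */
    }

    The check is deliberately narrow: it catches a cell that dies or scribbles
    mid-update, not arbitrary bad behavior, matching the note above that this
    is not protection against arbitrary failures.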
Firewall hardware helps protect against wild writes
  Where does the firewall hardware sit?
    Guards a memory module against remote writes
  What's in the firewall hardware?
    64 bits per phys mem page, one bit per node (or per cell on huge machines)
  When does the system set the firewall to allow writes?
    When any CPU on that node has mapped that page
  So it really just protects against wild writes to pages a node never mapped

OK, the firewall protects against some wild writes,
  but what about pages a failed node was allowed to write?
  They might have been corrupted before the crash!
  The firewall at least ensures you know which pages a failed node might have written

What about DMA? Is it like Nooks, where a bad driver can corrupt the kernel w. DMA?
  No! Especially since they are trying to deal with bad hardware, too
  Each device belongs to a node, and the firewall treats DMA writes the same as CPU writes

How does Hive deal w. wild writes to allowed pages after a crash?
  They detect damaged files
    Damaged = all user-level pages writeable by the failed node?
  Give I/O errors to processes that had those files open and later try to use them
    Presumably including LD/ST as well as read()/write()
    Looks like shared memory only occurs via shared mmap()ed files

What semantic changes might be visible to non-failed applications?
  Node A opens, writes file F
  Node B opens, reads, closes file F
  Node A fails. Buffers containing F are lost.
  Node B opens, reads F, and sees older data than the last time it read it

What if a page is recycled and some nodes still think it has its old meaning?
  Could be bad, so recovery involves a "double barrier" approach

Why is the firewall better than VM protection?
  VM is enforced by the writer's own (potentially faulty) h/w and o/s
  The firewall is enforced by the memory's owner

How do they detect failed cells?
  Every cell is supposed to keep updating a timer counter
  Also bus errors, firewall violations, etc.
  What if a bad node claims other nodes have failed?

How to minimize the impact of faults? (policies in 5.6)
  Try to place a process's pages on few cells
    to minimize the number of nodes that could crash the process
  Try to place a file's pages on few cells
    since an entire file is marked bad if a bad cell could have written even one page

How does the VM system work? (Fig 5.3)
  Three roles: client cell, memory home, data home
  Optimization: what if you lend a logical page back to its physical owner? (s 5.5)

What happens when a node thinks it detects a fault?
  They talk about a memory fault model; what is the model?
    The firewall keeps working & no partitions (p. 2)

What kinds of faults might go undetected?
  Before the fault is detected, a node might do a wild write, and then:
    - unmap the page before being detected, so no one knows the page was corrupted
    - a non-faulty node reads and acts on the bad data, propagating the failure

How do you evaluate a system like this?
  Can it contain faults?
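  One concrete way to frame "can it contain faults" is a fault-injection style
  test. This is a hypothetical harness, not the paper's actual experiments;
  run_on_cell(), inject_cell_failure(), and still_alive() are made-up hooks
  (e.g., something a simulator could provide). Pin a probe process on each
  cell, fail one cell, and count how many probes on the other cells keep
  making progress; perfect containment loses only the probes on the failed cell.

  #include <stdio.h>

  #define NCELLS 16

  /* Made-up hooks; in practice these would map to process placement,
   * fault injection, and a liveness check. */
  extern int  run_on_cell(int cell);          /* start a probe process  */
  extern void inject_cell_failure(int cell);  /* e.g., halt that cell   */
  extern int  still_alive(int probe);         /* probe making progress? */

  int main(void)
  {
      int probe[NCELLS];
      int victim = 3;                 /* the one cell we fail */

      for (int c = 0; c < NCELLS; c++)
          probe[c] = run_on_cell(c);

      inject_cell_failure(victim);

      int survivors = 0;
      for (int c = 0; c < NCELLS; c++)
          if (c != victim && still_alive(probe[c]))
              survivors++;

      /* Perfect containment: all NCELLS-1 probes on healthy cells survive. */
      printf("%d of %d probes on healthy cells survived\n",
             survivors, NCELLS - 1);
      return 0;
  }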