Hive
====

What are the top-level goals of the Hive and FLASH projects?
  Provide a huge shared-memory multiprocessor
  Only justified if the OS gives good support for shared memory
    and flexible allocation of CPUs
  Hive is an OS for the FLASH multiprocessor machine

Who needs to use a machine like FLASH?
  1. Highly parallel scientific applications
  2. Very high performance servers
     Want to support many users / Unix processes
  In general, lets you spread computation and memory over many nodes

What goes wrong if you try to use a regular SMP OS on FLASH?
  Performance goes down, because memory accesses are more expensive
  Failures go up. Why? (Shouldn't big expensive hardware be more reliable?)

DISCO was written by some of the same authors several years later...
  Should we just use DISCO w. SplashOS for #1 and IRIX for #2 above?

What does the FLASH hardware look like?
  System is structured as a set of nodes in a grid
  Each node has a CPU, memory, cache, and cache controller (Fig 2.1)
  Obviously some pairs of nodes have much higher latency than others

What memory consistency model?
  Provides sequential consistency, using a cc-NUMA architecture
  As in most cc-NUMA machines, every node can access any physical address
  Gives a single image of physical memory across all cells

What other features does the hardware offer?
  Firewall (more on this later)
  Special support for RPCs
    Separate request/reply receive queues
    How does this compare to standard IPIs?

What are the key failures Hive must deal with?
  A node stops working (fail stop)
    Makes that node's memory inaccessible
  A node returns bad values for memory reads
  Software issues wild writes that corrupt other nodes' memory

What would a truly fault-tolerant hardware design look like?
  Have to replicate all state
    Two memory banks for every region of memory
    At least two CPUs executing the same instruction stream in sync
    Would need three CPUs for non-fail-stop behavior... why?

Goal: not fault tolerance but "fault containment". What's this?
  If a node fails, they are willing to lose the programs/data on that node
  They don't want the problem to spread
  And they'd like policies that make a 1% failure affect only 1% of apps

How does fault containment help big parallel scientific applications?
  Probably not so much, since a single process is spread over all the machines
  Fortunately, authors of such software are already used to checkpointing
  (On conventional hardware, could still lose 3/4 of a month-long computation.)

How does fault containment help time-sharing apps?
  If the app's memory is not affected by the fault, no reason to kill it

When might an app's memory be affected by the fault?
  If the app is running on the faulty node (obviously); then you are toast
  If the app is sharing memory with an app on a faulty node; what kind of sharing?
    Sounds like they care about file-system based sharing
      Standard way for apps to interact in Unix
      Where does the disk reside? Sounds like some nodes just have disks
      So the node with the disk could fail, too
      Or some other node writing to the file's buffers could fail
  If the app's node has borrowed physical memory from another cell
    (happens under memory pressure)
  Other uses of shared memory:
    Copy-on-write fork where the child is in a different cell
    Memory-mapped files or otherwise explicitly shared memory

What are they willing to give up to make Hive happen?
  The SMP-style single kernel
    though they heavily re-use code from a traditional operating system
    and they hack the cell kernels to present a single system image
  Maybe a small performance penalty for dealing w. fault containment

What mechanisms do they propose?
  Careful reads. How does this work? (sec 4.1)
    Point: detect kernel data mangling due to nodes failing
    Not really protecting against arbitrary failures
    Is the point really a crash while updating some kernel data structure?
    Basically like call-by-value-result in Nooks; see the sketch below
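    A minimal sketch of the careful-read idea in C, with invented names:
    remote_proc, careful_copy, and the constants are illustrative, not Hive's
    actual code or data structures. The shape is the Nooks-style
    call-by-value-result: copy the remote cell's record into local memory,
    treat a failed copy as a dead cell, and sanity-check the copy before
    trusting it; later code never touches the remote pointer again.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* A per-process record owned by another cell; fields invented. */
    struct remote_proc {
        int    state;
        size_t npages;
    };

    enum { PROC_FREE, PROC_RUNNING, PROC_STOPPED };
    #define MAX_PAGES_PER_PROC (1 << 20)

    /* In Hive this copy would also have to survive a bus error from a
     * dead node (presumably via a trap handler); a plain memcpy stands
     * in here and just reports success. */
    static bool careful_copy(void *dst, const volatile void *src, size_t len)
    {
        memcpy(dst, (const void *)src, len);
        return true;
    }

    /* Careful read: returns true iff we got a self-consistent local
     * snapshot of the remote record.  All later work uses only *local. */
    bool careful_read_proc(const volatile struct remote_proc *remote,
                           struct remote_proc *local)
    {
        /* 1. Call-by-value: copy the whole record into local memory. */
        if (!careful_copy(local, remote, sizeof *local))
            return false;               /* remote cell died during the read */

        /* 2. Sanity-check the copy; a wild write or a crash in the middle
         *    of an update can leave garbage even when the copy succeeds. */
        if (local->state != PROC_FREE &&
            local->state != PROC_RUNNING &&
            local->state != PROC_STOPPED)
            return false;
        if (local->npages > MAX_PAGES_PER_PROC)
            return false;

        return true;                    /* safe to act on the local copy */
    }

    The check is deliberately narrow: it catches a cell that dies or scribbles
    mid-update, not arbitrary bad behavior, matching the note above that this
    is not protection against arbitrary failures.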
Firewall hardware helps protect against wild writes
  Where does the firewall hardware sit?
    Guards a memory module against remote writes
  What's in the firewall hardware?
    64 bits per phys mem page, one bit per node (or per cell on huge machines)
  When does the system set the firewall to allow writes?
    When any CPU on that node has mapped that page
  So it really just protects against wild writes to pages a node never mapped

OK, the firewall protects against some wild writes,
  but what about pages a failed node was allowed to write?
  They might have been corrupted before the crash!
  The firewall at least ensures you know which pages a failed node might have written

What about DMA? Is it like Nooks, where a bad driver can corrupt the kernel w. DMA?
  No! Especially since they are trying to deal with bad hardware, too
  Each device belongs to a node, and the firewall treats DMA writes the same as CPU writes

How does Hive deal w. wild writes to allowed pages after a crash?
  They detect damaged files
    Damaged = all user-level pages writeable by the failed node?
  Give I/O errors to processes that had those files open and later try to use them
    Presumably including LD/ST as well as read()/write()
    Looks like shared memory only occurs via shared mmap()ed files

What semantic changes might be visible to non-failed applications?
  Node A opens, writes file F
  Node B opens, reads, closes file F
  Node A fails. Buffers containing F are lost.
  Node B opens, reads F, and sees older data than the last time it read it

What if a page is recycled and some nodes still think it has its old meaning?
  Could be bad, so recovery involves a "double barrier" approach

Why is the firewall better than VM protection?
  VM is enforced by the writer's own (potentially faulty) h/w and o/s
  The firewall is enforced by the memory's owner

How do they detect failed cells?
  Every cell is supposed to keep updating a timer counter
  Also bus errors, firewall violations, etc.
  What if a bad node claims other nodes have failed?

How to minimize the impact of faults? (policies in 5.6)
  Try to place a process's pages on few cells
    to minimize the number of nodes that could crash the process
  Try to place a file's pages on few cells
    since an entire file is marked bad if a bad cell could have written even one page

How does the VM system work? (Fig 5.3)
  Three roles: client cell, memory home, data home
  Optimization: what if you lend a logical page back to its physical owner? (s 5.5)

What happens when a node thinks it detects a fault?
  They talk about a memory fault model; what is the model?
    The firewall keeps working & no partitions (p. 2)

What kinds of faults might go undetected?
  Before the fault is detected, a node might do a wild write, and then:
    - unmap the page before being detected, so no one knows the page was corrupted
    - a non-faulty node reads and acts on the bad data, propagating the failure

How do you evaluate a system like this?
  Can it contain faults?
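  One concrete way to frame "can it contain faults" is a fault-injection style
  test. This is a hypothetical harness, not the paper's actual experiments;
  run_on_cell(), inject_cell_failure(), and still_alive() are made-up hooks
  (e.g., something a simulator could provide). Pin a probe process on each
  cell, fail one cell, and count how many probes on the other cells keep
  making progress; perfect containment loses only the probes on the failed cell.

  #include <stdio.h>

  #define NCELLS 16

  /* Made-up hooks; in practice these would map to process placement,
   * fault injection, and a liveness check. */
  extern int  run_on_cell(int cell);          /* start a probe process  */
  extern void inject_cell_failure(int cell);  /* e.g., halt that cell   */
  extern int  still_alive(int probe);         /* probe making progress? */

  int main(void)
  {
      int probe[NCELLS];
      int victim = 3;                 /* the one cell we fail */

      for (int c = 0; c < NCELLS; c++)
          probe[c] = run_on_cell(c);

      inject_cell_failure(victim);

      int survivors = 0;
      for (int c = 0; c < NCELLS; c++)
          if (c != victim && still_alive(probe[c]))
              survivors++;

      /* Perfect containment: all NCELLS-1 probes on healthy cells survive. */
      printf("%d of %d probes on healthy cells survived\n",
             survivors, NCELLS - 1);
      return 0;
  }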