Hypervisor-based Fault-tolerance
Bressoud and Schneider
SOSP 1995

Why are we reading this paper?
  We're about to look at a bunch of systems that replicate for fault-tolerance
  The hypervisor is at one extreme of the spectrum
    Two machines run the exact same instructions
    If one fails, the other just keeps going; no time/data/work is lost
    Totally transparent, runs existing O/S and apps
    Only a factor of two slower
  Seems like an amazingly general/easy/practical solution to fault-tolerance

Where they are coming from
  Banks, telephone exchanges, NASA need fault-tolerant computers
  1980s saw lots of *hardware* fault-tolerant machines
    Assumption: the CPU is the part most likely to fail, due to complexity
    So two CPUs connected to a bus, single memory, disk, &c
    All in the same cabinet.  Not distributed over the Internet.

------------------------------------------------------------------------

goal: general-purpose replication of computations that can fail independently.

recipe:
  - have a pair.
  - start in the same state.
  - make the code deterministic.
  - feed the same inputs in the same order.
  - they will reach the same state.
  - can always switch from a dead primary to the backup.
  - restrict output so the environment can't observe the switch.

--------------------------------------------------------------------

Weird: they slip underneath the OS and fake out the entire machine to
add replication functionality.  this is probably not your first
reaction to the problem.  what is the story?
  1. most general: implement in hardware.
     - expensive
     - custom hardware lags the general-purpose performance curve
       [this was a common point].
  2. hacking the OS is difficult, since it's complicated.  and you must
     do it for every OS.
  3. doing it in the application:
     - difficult; programmers are not so sharp.
     - each app must do it.
so slip a VMM underneath.

[claimed] benefits?
  1. available to all HW in the same family
  2. a single implementation works for all OSes
  3. all apps get it for free, instantly.

cost?
  - 2x hardware
  - 2x slowdown
  - have to build the hypervisor.

they say it won't speed anything up.  when does duplication speed things up?
  - disk: can service reads from both copies
  - ATMs: can service people
  - network: could put traffic on both [sort of like the internet]
  - web servers: service multiple requests concurrently
  in those designs, failure just leads to slowdown.

why is emulating the entire machine easier than hacking the OS in a
couple of places?  you've never done this: if you want to add
functionality, you just go and hack your app.  why can they do this
interposition?  the instruction set is a very precise interface, so
you can compose at that level.

-------------------------------------------------------------------------

do they need such strong determinism?  the interface filters.
  web server?  read-only.  don't even need the inputs to arrive the
  same way.
  but the implementation is not deterministic!  memory is not in the
  same state!  doesn't this screw things up?
  no: the *interface* is deterministic.  it doesn't matter that the
  implementation state differs.
this paper focuses on making the *implementation* deterministic, which
is much, much harder.  * [pull this point out more]

-------------------------------------------------------------------------

Straw man: two identical machines in different locations...
  start with the same memory contents (program and data)
  just start them executing!
  if one fails, the other keeps going
  are we done?

Q: Will they perform the same computation?
  ADD instruction &c: probably deterministic
  instructions that read the time-of-day register, cycle counter, or
    priv level: not deterministic
  memory system?
  local devices like the hard disk?
    low level: interrupt timing
    high level: whether disk reads/writes are deterministic
  external input devices like network, keyboard?
    both CPUs need to get each input
  output devices like network, display, printer?
    exactly one CPU should produce each output

[*** look at picture ***]

Let's figure out how to deal with local devices (e.g. hard disk)
  why not replicate all I/O hardware on each machine?
    e.g. let both machines read/write their own disk
    writes should be identical, thus reads also
  maybe some h/w replicas don't behave the same.
    for example, the o/s knows how to re-map a failed disk sector, but
    the two disks won't fail at the same sectors
  so why did they replicate the main-memory system?
    doesn't fail?  would be too slow to share?

So they hook up all devices to both machines: a shared I/O bus
  sending commands and data to the device
    only the "primary" writes the device
    the hypervisor ignores the backup's I/O writes
  reading results back from the device
    only the primary reads
    the primary's hypervisor copies the value to the backup
    the backup's hypervisor stalls until the correct value arrives
      from the primary
  handling interrupts
    only the primary CPU sees real h/w interrupts
    the hypervisor sends them to the backup

-------------------------------------------------------------------------

use primary/backup
  - assume fail-stop.
  - so system failure requires that both are down.
  - go through the figure: play find-the-single-point-of-failure.

-------------------------------------------------------------------------

using these assumptions, what does the hypervisor get us?
  failure prob of one machine = P(A)
  then the prob that both fail = P(A) * P(A) = P(A)^2.
  so exponential!  of course this is a best case, since some of our
  assumptions are somewhat dubious.
  one: time to repair is << time to failure.  this one is actually
  defensible.  if they are near the same order, you are kind of in
  trouble.
let's go with non-independence first.

------------------------------------------------

play: where are the single points of failure?  1 thing that kills
everything.
  connector, cable.  the disk.  screen?  power, since they don't talk
  about it.  software, i think: both run the same copy, so 2 = 1.
you'd better duplicate where the failures actually are, so you can
divide MTTF; otherwise the improvement is negligible.
  error rate E = E1 + E2 + E3 + ...
  if you make E1 go to 0 but it didn't contribute much to E, then it's
  pointless.

------------------------------------------------

apply this to software: correlated failure.  how to make it
non-correlated?  N-version programming?
  expensive: completion time = max over all the projects.
  how independent are the versions really?
  the mean-IQ problem: most people get the easy bugs.  some will get
  the hard ones.  only one person will get the hardest, but will then
  get voted off the show.

[repair can be nontrivial.]
  - bohrbug: nice and deterministic, you always hit it.
  - heisenbug: you run, it shows up; you run again, it goes away.
  what's cool about a heisenbug?  rerun and it goes away.
  what about in this context?
  - even the non-determinism is deterministic.  ugh.

-------------------------------------------------------------------------

the protocols ensure:
  1. while the primary is up, the backup does not interact w/ the env.
     - do we need this for a deterministic state machine?  no: it's for
       the observer.
     - can they guarantee this?  no.  the backup can decide to become
       primary if the ethernet is down, and then it starts interacting.
       their assumption: this does not happen. ;-)
  2. after the primary fails, exactly one backup becomes primary, in
     such a way that the env does not know.
     - true?  well, only if duplicate interrupts don't matter.
     - actually: how to ensure that a request is not done twice?  [for
       disk, for net]

basic idea: execute exactly the same sequence of instructions.  is this
true?  no: the hypervisors do different things.

why have the I/O-device accessibility assumption?
  - seems obvious: the backup must be able to do whatever the primary
    does.
  - but it also defines away a single point of failure ;-)

virtual machine assumption: the OS runs correctly under the hypervisor.
  - it can observe different things.
  - assume that doesn't matter.

-------------------------------------------------------------------------

instruction fun.
internal state: virtual machine state the VM can control:
  main mem, program counter, registers
not:
  time of day, interval timer, dev I/O

what does "ordinary" mean?  run at a different time, on a different
machine, at a different priv level, on a VM or not: same result.  same
operands = same result.
  - exceptions at the same point.  precise.
  - e.g. alu ops, loads/stores to normal mem.

what are environmental instructions?
  - instructions that can read things the VM does not control:
    time of day, I/O information, priv level.
how to handle them?  pick the first value seen [any legal value works]
and ship it to the backup.
mechanics:
  - intercept
  - record the value at the primary.
  - ship it to the backup to replay.
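the record-and-replay mechanics above can be sketched in a few lines (this is our own illustration, not the paper's code; `read_tod_hw` is a stand-in for an environmental instruction such as reading the time-of-day register, and a queue stands in for the FIFO channel):

```python
# Sketch of environmental-instruction handling.  The primary executes
# the instruction for real and logs the value; the backup never touches
# the device and instead replays the primary's logged value.

import queue

env_log = queue.Queue()       # primary -> backup channel, assumed FIFO


def read_tod_hw():
    """Stand-in for the real time-of-day register (assumed value)."""
    return 1234567


def primary_env_instr():
    val = read_tod_hw()       # intercepted: execute on real hardware
    env_log.put(val)          # record the value, ship it to the backup
    return val                # goes into the destination register


def backup_env_instr():
    # interpret the instruction: don't touch the device; stall until
    # the primary's value arrives, then use it as the result
    return env_log.get()


assert primary_env_instr() == backup_env_instr()  # same register contents
```

both replicas end up with identical register contents, which is all the determinism argument needs; the hypervisor's extra instructions (the put/get here) are exactly the ones that must be filtered out of the instruction counts later.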
what is the assumption about env instructions?  that they trap, i
believe.  or that they are memory-mapped, so you can intercept the
access.
at the backup: interpret the instruction, write the value into the
destination register, and jump to the instruction after it (or to the
target of the branch).
basically, most generally: if you could observe [the environment], the
hypervisor must get a trap.  when is that not true?  how about for time?

is the machine virtualizable?
  we can always interpret SLOWLY
  the real question: can I run the o/s and apps on the real hardware?
  will the hardware trap to the hypervisor on *every* tricky
  instruction?  i.e. any instruction that might be non-deterministic OR
  reveal the existence of the hypervisor?
    time-of-day register
    memory-mapped I/O loads and stores
      need to trap and substitute load values
    HP branch-and-link instruction
    HP TLB replacement
    HP "space registers" (page mapping tables?) writable by users (!)

-------------------------------------------------------------------------

interrupts?  problem: you can't control when you get them.  how do you
control delivery?
  - defer them [batch up]
  - at the same point in the instruction sequence, deliver them all, in
    the same order.
the recovery-register hack: yield at the same point by setting a
counter identically on both machines; when it reaches zero it throws
an exception, and the hypervisor delivers.
assumptions:
  - when the counter reaches zero, interrupts are not off.  [else could
    not deliver]
  - somehow must filter the hypervisor's own instructions out of those
    counted.
  - the guest OS does not itself use the recovery register.

why assume the channel is FIFO?  we want both to see events in the same
order.  if it isn't FIFO?  just put a sequence # on each event.

-------------------------------------------------------------------------

High-level goal of P0-P6: basically trying to give a total order on
inputs, and make sure the inputs are the same [and that they are read
at the same point].
  - interrupts in the same order
  - delivered at the same time
  - env instructions the same
  - time the same
  - they get a total order on inputs b/c they use FIFO ;-)

P0: just says to record env values and send epoch + value to the
backup.
  - do we need the epoch?  i don't think so.  the backup is just going
    to listen on the channel til it gets an env instruction.
P6: just says the backup delivers this value.
P1: similar: send the interrupt.
P2: at the end of an epoch:
  - send the time so the backup can sync its TOD clock.
  - waits for an ack of everything [* needed?]
  - adds to the buffer any interrupts based on Time(p) --- don't get
    this.  NOTE: it does not seem to send these to the backup!  i
    think they mean this is a timer interrupt?  e.g., you say:
    interrupt every 30ms?
  - pushes the rest of the interrupts; note: more.
    - do we need this?  [i don't think so.  it's just to reduce
      latency.  we won't get them all anyway.]
  - sends [end, e]
P3: ignore all interrupts from the env: terminal, disk, network.
  [ugh: i don't think so: otherwise how would you ever get them?  must
  look at what the destination is]
P4: paired with P1 + P2.
P5: epoch end at the backup:
  sets the time.  drains the FIFO [can we pipeline?  i *think* so].
  delivers.  delivers all based on time [?]  *** this is confusing.
  - suppose P fails before delivering [end, e+1].  what does the backup
    do?  detects it, drains, delivers, continues.

guarantee: the backup "extends the sequence of instructions executed by
the primary"
  intuition?  it performed all events up to the last one, in the same
  order.  then it went forward on its own.

what is the problem?  what about fail-over?
  if the primary fails, the backup must start doing I/O
  how does the backup know the primary has failed?
  what about I/O writes in the last epoch: they might or might not
    have happened.
  what if the backup is waiting for an I/O read?
    did the primary even issue the read request?
    did the result come back, but not yet get sent to the backup?
  in the end, the O/S has to re-try the I/O!

--------------------------------------------------------------------

interrupt w/ a crash: what is the problem?
  P does the int, then sends the ack.
    - crash before the send: we don't know if it did it.
  P sends the ack, then does the int.
    - we got the ack, but don't know if it did it before the crash.
hack?  have "uncertain" interrupts.
rule: got one?  repeat the i/o instruction.
  - we record all the I/O initiations seen by the backup.
  - mark as uncertain all the *outstanding* ones.
there must be some way to tie a request to its response.
what devices support this?
  - retry must be ok: disk, network
  - not ok for keyboard, line printer, ATM

where have we seen this problem?  RPC: exactly once.  can't do it.
  if we know the primary did it: at most once.
  if it didn't, we're stuck.  so do at-least-once (actually: at most
  twice).
let's say the RPC was "lower reactor rod".  what is good about this
example?
  - good if we can check!
  - note: retry works if you say "put the reactor rod at 3 meters",
    not "lower it by 2 meters".  the latter assumes context; the
    former is idempotent.
  and i lied: we don't actually know whether it did it, since the rod
  could already have been at 3 meters.  but it doesn't matter.

--------------------------------------------------------------

what is this two-generals problem?
  example: A and B try to agree on "attack at noon"
  trouble: i never know if you got the last message.
  this means we cannot know for sure that we're both going to attack.
  proof sketch: the protocol must terminate in some finite number of
  messages, so there is a last message, sent by one of them.  but from
  the sender's point of view everything looks the same whether it got
  through or not (since there's no ack for it).  so this message
  couldn't be the decider, since otherwise the sender would be out of
  luck.

-------------------------------------------------------------------------

issues w/ PA-RISC.
have to kick the OS out of the most-privileged mode (so we can
intercept when it does a privileged operation, since these typically
let it observe the real machine rather than our faked-out version).
  VMM runs in ring 0.  the guest kernel runs in ring 1, so it can't
  run priv instructions.  user code stays in ring 3.
problems:
  - the priv level is visible to users!
    - probe, gate, branch-and-link.  branch-and-link: the priv level
      shows up in the lower bits.
    how to fix?  they hacked the kernel ;-)  and assume no one else
    looks: vulnerable.
  - the space registers are visible: sounds like they do segmentation.
    the guest can read these.  it can write them!
    why does this make it difficult to support multiple machines?
    can't you just context switch them?  or are they in terms of raw
    physical addresses?
    i don't get this: they say if [the hypervisor is] not invoked on a
    change (not a read), then something can be inferred.  wouldn't a
    context switch handle it?
the hypervisor is included in the address space of the VM kernel.
r/w to memory-mapped i/o?  the usual: set things up so you get traps.

basic performance intuition:
  - traps are expensive: the more you take, the more slowdown.  when
    do you take them?  i/o.
  - epochs are expensive: the more of them you run, the more slowdown.
    the fewer you run, the higher the interrupt latency [reads].

TLB problem?  replacement is non-deterministic.  so?  a different
number of instructions gets run on each machine.  solution?  eliminate
guest-visible TLB faults.

-------------------------------------------------------------------------

performance models

CPU intensive: the best case.
  what is disturbing about this graph?
  4K-instruction epochs = 6.5x slowdown.  32K = 1.84x (about 2x).
  how long is a 32K-instruction epoch?
    .02 usec * 32,000 = 640 usec.  this doesn't seem so bad.
  EL = epoch length in # of instructions.

I/O:
  - more instructions get emulated.
  - disk reads are propagated between primary and backup
  - sensitive to latency.  how much does it matter?  4K = 1.67x.
    pretty good.
  the good news: 26ms for a raw disk write, 27ms w/ their system.
  24.2ms for a raw disk read, 33.4ms on their system.
  need 9 messages for the data + 1 for the ack.  does this matter?
  doesn't seem to matter much: not much difference, a slight drift up.
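the epoch arithmetic above, spelled out (the 0.02 usec/instruction figure and the slowdown factors are the ones quoted in these notes; everything else here is just arithmetic):

```python
# Back-of-the-envelope epoch numbers: longer epochs amortize the
# end-of-epoch synchronization cost (less slowdown) but delay
# interrupt delivery by up to one epoch (more latency).

USEC_PER_INSTR = 0.02                    # assumed cost per instruction


def epoch_usec(epoch_len_instrs):
    """Wall-clock length of one epoch of EL instructions."""
    return epoch_len_instrs * USEC_PER_INSTR


# epoch length (instructions) -> CPU-intensive slowdown from the notes
slowdown = {4_000: 6.5, 32_000: 1.84}

for el in sorted(slowdown):
    print(f"EL={el}: {epoch_usec(el)} usec/epoch, {slowdown[el]}x slowdown")
```

so a 32K-instruction epoch adds at most ~640 usec of interrupt latency, which is small next to the ~25ms disk times quoted above; that's why the longer epoch is an acceptable trade.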