------------------------------------------------------------------------- how to manufactor reliability? general idea either have perfect components or redundancy. perfect components are too hard. and often not good enough: can just smash the assembly. lose it. throw away by mistake (rm -rf). which leaves redundancy simplest form is redundancy: have several copies/replicas, make sure all non-faulty do the same thing. * any computation can be expressed as a state machine. so can duplicate any computation. replicated state machine: - if transition is deterministic F(state, input) -> (new state, output) then several copies of F that start in the same state S0 and see the same input sequence will do the same thing. thus can replicate arbitrary computation. trivial to state. very powerful in reality. [note: need to make sure deterministic and inputs in same order] -------------------------------------------------------------------- fundamentally need duplication: in time, space, money. - trade latency for reliabilty. [TCP] - trade bandwidth for reliability. [checksum, ECC] - CPU cycles for reliability [vmware: run one copy of windows with each app rather than many apps on one windows --- more reliable] - trade space for reliability [RAID, backups] - money? [insurance: van gogh painting] - vote! CPUs. just use imperfect servers/cpus/whatever and vote on output. [assume: failure not correlated] can turn anything into reliable *if* you can make failures non-dependent. - *NOTE* this requires a copy is as good as original (exchangable). does not work as well for a van gogh painting. duplication vs perfection 1. make a perfect disk that never blows up, never loses a sector, never miswrites. good luck. plus, this is actually weak: just prevents data stored on disk from getting lost. 2. keep a backup of all data. restore as necessary. this protects against bad sectors. Also protects against all other ways to lose data! E.g., fire, theft of disk, legitimate (but accidental) deletion of a file, mis-edits of data, OS corruption, ... (given enough backwards state) * key: prevent a bunch of things *you did not anticipate*. perfection is expensive, often impossible and fragile. imperfection is cheap and most probably the only way. ---------------------------------------------------------------------- lets do some simple fault tolerance [ripped off from gray] they assume fail stop: if X breaks, it stops talking to you. what if device isn't fail stop. are we out of luck? no: duplicate + vote. if A != B stop: A or B broken! else continue. this is nice: clean semantics. easy to reason about. the alternative v hard to reason about. specification = output could be correct or could give any wrong answer whatsoever. what is bad? duplication cost. failure rate what happens to failure rate? increases by almost a factor of two. so almost 2x more likely to fail than one by itself. assume: prob that A fails = P(A) prob that B fails = P(B) say P(A) = .02 P(B) = .03 then prob that at least one is broken = P(A) + P(B) - P(A)*P(B) = .02 + .03 - .02 * .03 = .05 + .0006 note: in general, if failure rate << 1 (which you expect) that you can drop the last term since the others dominate. this is really useful for back-of-the-envelop when someone talking at you. the assumption here are the usual ones: 1. memoryless 2. independent 3. pr of failure << 1 given these than the mean-time-to-failure (MTTF) is 1 / P(A) so: A will fail in 50 days. B will fail in 33 days. mean for any to fail (also: the mean time for the first to fail): 20 mean time for both of them to fail: 1 / P(A)*P(B) = 1666 this last is really suggestive. ------------------------------------------------------------------------- why not independent? maintance: guy opens up and starts fussing. falls into it, cleaning lady kicks power cord. operations: type "rm -rf" by mistake. breakin: steals machine. hacks in and goes wild. environment: power failure (in US). building burns down. space aliens nuke planet. [social too: strike, company goes bankrupt, sabotage, laywers halt napster] hardware software ------------------------------------------------------------------------- N-plex idea: how to build failfast modules - N-plex (duplicate N times) - vote on output. now Pr that *some* up exponentially higher, Pr that *all* down exponentially lower. most common: - N = 1: "good luck" which inductively becomes N = 0 "too bad" - N = 2: pairing - N = 3: triple module redundancy : TMR trouble: - N = 10? then 5 dead = system dead. some tricks. first sense which are avail: then use output from these. - if dead, then don't count in N. - similarly, if gave bad output in past, drop from voting until repaired/reset. so if one goes flaky, drop. then if one dies, can still make progress. note: still play single point of failure game: comparator. connectors powersupply ... so far assume that if dead, stays dead. recall: MTTF: 1 / P(A) if N different identical modules = 1 / P(A) + P(A) ... = 1 / N * P(A) recall MT(A) = 1/P(A) so this becomes 1 / N * 1/MT(A) = MT(A) / N or, in other words, reliability goes down linearly w/ N. so if MTTF for one is 10 yrs: and you have pair: then MTTF = 5 days. TMR: 3.3 + 5 days. (first failure = 3.3 then 5 days) sucks: makes failure more likely. big caveat: if most failures are transient then it actually makes things much more reliable. repair: fast enoug = all failures transient. how to fix? repair. if time to repair << MTTF, then ignore. all three have to fail: P(A)^3 PR of .1 1,000 years! fix + resynchronize. soft errors: broke, then immediately "repaired" hard error: fix before goes further. ---------------------------------------------------------------------------