-------------------------------------------------------------------------
how to manufactor reliability?

   general idea
        either have perfect components or redundancy.

	perfect components are too hard.  and often not good enough:
	can just smash the assembly.  lose it.	throw away by mistake
	(rm -rf).

        which leaves redundancy

        simplest form is redundancy: have several copies/replicas, make
        sure all non-faulty do the same thing.

        * any computation can be expressed as a state machine.  so can
        duplicate any computation.

        replicated state machine:
                - if transition is deterministic
                        F(state, input) -> (new state, output)

                then several copies of F that start in the same state S0
                and see the same input sequence will do the same thing.
        
		thus can replicate arbitrary computation.

		trivial to state.  very powerful in reality.

        [note: need to make sure deterministic and inputs in same order]
--------------------------------------------------------------------
   fundamentally need duplication: in time, space, money.
        - trade latency for reliabilty.  [TCP]
        - trade bandwidth for reliability. [checksum, ECC]

        - CPU cycles for reliability [vmware: run one copy of windows
		with each app rather than many apps on one windows ---
		more reliable]
        - trade space for reliability [RAID, backups]
        - money?  [insurance: van gogh painting]

        - vote!  CPUs.  just use imperfect servers/cpus/whatever and
           vote on output.  [assume: failure not correlated]

           can turn anything into reliable *if* you can make failures
           non-dependent.

        - *NOTE* this requires a copy is as good as original (exchangable).
          does not work as well for a van gogh painting.


  duplication vs perfection
                1. make a perfect disk that never blows up, never loses a
                   sector, never miswrites.  good luck.  plus, this is
                   actually weak: just prevents data stored on disk from
                   getting lost.

                2. keep a backup of all data.  restore as necessary.
                   this protects against bad sectors.  Also protects
                   against all other ways to lose data!  E.g., fire,
                   theft of disk, legitimate (but accidental) deletion
                   of a file, mis-edits of data, OS corruption, ...
                   (given enough backwards state)

                   * key: prevent a bunch of things *you did not anticipate*.

           perfection is expensive, often impossible and fragile.
           imperfection is cheap and most probably the only way.

----------------------------------------------------------------------
lets do some simple fault tolerance [ripped off from gray]

	they assume fail stop: if X breaks, it stops talking to you.

	what if device isn't fail stop.  are we out of luck?
		
		no: duplicate + vote.

		if A != B 
			stop: A or B broken!
		else
			continue.

	this is nice: clean semantics.	easy to reason about.  the
	alternative v hard to reason about.  specification = output
	could be correct or could give any wrong answer whatsoever.

	what is bad?
		duplication cost.
		failure rate

	what happens to failure rate?  
	 	increases by almost a factor of two.
		so almost 2x more likely to fail than one by itself.

		assume: 
			prob that A fails = P(A)
			prob that B fails = P(B)

		say P(A) = .02
		    P(B) = .03

	then prob that at least one is broken
		= P(A) + P(B) - P(A)*P(B)
		= .02 + .03 - .02 * .03
		= .05 + .0006

	note: in general, if failure rate << 1 (which you expect) that you
	can drop the last term since the others dominate.  this is really
	useful for back-of-the-envelop when someone talking at you.

	the assumption here are the usual ones:
		1. memoryless
		2. independent
		3. pr of failure << 1

	given these than the mean-time-to-failure (MTTF) is
		1 / P(A)

	so: A will fail in 50 days.
	    B will fail in 33 days.

	mean for any to fail (also: the mean time for the first to fail):
		20

	mean time for both of them to fail:
		1 / P(A)*P(B) = 1666

	this last is really suggestive.


-------------------------------------------------------------------------
why not independent?
	maintance: guy opens up and starts fussing.  falls into it, 
	cleaning lady kicks power cord.

	operations: type "rm -rf" by mistake.

	breakin: steals machine.  hacks in and goes wild.

	environment: power failure (in US).  building burns down.
	space aliens nuke planet.  [social too: strike, company goes
	bankrupt, sabotage, laywers halt napster]

	hardware

	software 

-------------------------------------------------------------------------
N-plex idea: how to build failfast modules

	- N-plex (duplicate N times)
	- vote on output.

	now Pr that *some* up exponentially higher, Pr that *all* down
	exponentially lower.

	most common:
		- N = 1: "good luck"  which inductively becomes N = 0
		        "too bad"
		- N = 2: pairing
		- N = 3: triple module redundancy : TMR

	trouble:
		- N = 10?
		  then 5 dead = system dead.
	some tricks.

	first sense which are avail: then use output from these.

		- if dead, then don't count in N.

		- similarly, if gave bad output in past, drop
		  from voting until repaired/reset.

		  so if one goes flaky, drop. then if one dies,
		  can still make progress.
		
	note: still play single point of failure game:
		comparator.
		connectors
		powersupply
		...


  so far assume that if dead, stays dead.
	recall: MTTF:
		1 / P(A)

	if N different identical modules
		= 1 / P(A) + P(A) ...
		= 1 / N * P(A)

	recall
		MT(A) = 1/P(A)

	so this becomes
		1 / N * 1/MT(A) = MT(A) / N

	or, in other words, reliability goes down linearly w/ N.

	so if MTTF for one is 10 yrs:
		and you have pair: then MTTF = 5 days.
		             TMR:              3.3 + 5 days.
					(first failure = 3.3 then 5 days)

	sucks: makes failure more likely.

		big caveat: if most failures are transient then it actually
		makes things much more reliable.

		repair: fast enoug = all failures transient.  

	how to fix?
		repair.

		if time to repair << MTTF, then ignore.  
		
		all three have to fail: P(A)^3

		PR of .1
	
		1,000 years!

		fix + resynchronize.

	 soft errors: broke, then immediately "repaired"

	 hard error: fix before goes further.
---------------------------------------------------------------------------