Why disk arrays?
  CPUs improving faster than disks -- disks will increasingly be bottleneck
  New applications (audio/video) require big files (motivation for XFS)
  Disk arrays - make one logical disk out of many physical disks
  More disks results in two benefits
   - Higher data transfer rates on large data accesses
   - Higher I/O rates on small data accesses

What reliability implications of a disk array?
  JBOD array with 100 disks is 100 times more likely to fail!
    Each disk 200,000 hours MTBF -> array 2,000 hours (3 months)
      [approximately -- double counts two disks failing]
  Use redundancy to improve reliability
    But makes writes slower, since redundant info must be updated
    Raises issues of consistency in the face of power failures

Disk array basics:
  Data striping - balances load across disks
  Redundancy - improve reliability
  Many different schemes, depending on
   - granularity of data interleaving
      fine grained -> high transfer rates for all requests
      course grained -> allow parallel small requests to different disks
   - method in which redundant information computed & distributed

RAID 0 - nonredundant storage (JBOD - just a bunch of disks)
  E.g., Stripe sequential 128K logical regions to different disk
  Offers best possible write performance (only one write per write)
  Offers best possible storage efficiency (no redundancy)
  Offers good read performance
  Use if speed more important that reliability (scientific computing)

RAID 1 - mirrored storage
  Each block stored on two disks
  Writing slower (twice as many operations)
  Storage efficiency 1/2 optimal
  Small reads better than other scheme (can read on disk with shortest seek)
  
RAID 2 - memory-style ECC
  Use Hamming codes, like ECC memory.  Example:

    Generator matrix G is I (identity) and a parity generation matrix A:

    G = [ I : A ]

        1 0 0 0  1 1 1
    G = 0 1 0 0  0 1 1
        0 0 1 0  1 0 1
        0 0 0 1  1 1 0

    D = d1 d2 d3 d4

    E = G x D = (d1, d2, d3, d4, d1+d3+d4, d1+d2+d4, d1+d2+d3) [column vector]

    H = [ A^T : I ]

        1 0 1 1 1 0 0
    H = 1 1 0 1 0 1 0
        1 1 1 0 0 0 1

    H x E = (d1, d2, d3, d4, 0, 0, 0)

  Small reads about like RAID 0 (more contention for data disks)
  Writes must update multiple parity disks
  Storage more efficient than RAID 1 -- uses more than half of disks
  Can recover from any single bit error (RAID 1 assumes fail-stop)
  Is this overkill?  Most disks are fail-stop

RAID 3 - bit interleaved parity
  Assume fail stop disks.  Just add one parity disk
    D = (d1, d2, d3, d4)
    E = (d1, d2, d3, d4, d1+d2+d3+d4)
    Can recover, e.g., d2 = d1+d3+d4+(d1+d2+d3+d4)
  Any read touches all data disks
  Any write touches all data disks plus parity disk

RAID 4 - block-interleaved parity
  Interleave data in blocks instead of bits
  Reads smaller than striping unit access only one disk
  Writes must update data and compute and update parity block
    Small writes require two reads plus two writes
    Contention for parity disk

RAID 5 - block-interleaved distributed parity
  Distribute parity uniformly over all the disks
  Want to access disks sequentially when sequentially accessing logical blocks:
       0   1   2   3  p0
       5   6   7  p1   4
      10  11  p2   8   9
      15  p3  12  13  14
      p4  16  17  18  19
  Better load balancing than RAID 4
  Small writes still require read-modify-write

RAID 6 - P+Q Redundance (rarely implemented)
  Have two parity blocks, instead of one
  With Reed-Solomon codes, can lose any two blocks
  Now must read-modify-write two parity blocks plus data

Observations:
  RAID 2 and RAID 4 are strictly inferior to RAID 5
  RAID 1 is just RAID 5 with parity group size G=2
  RAID 3 is just like RAID 5 with a very small stripe unit

SUMMARY of maximum throughput:
		Sm R	Sm W		Big R 	Big W	Space
	RAID 0	1	1		1	1	1
	RAID 1	1	1/2		1	1/2	1/2
	RAID 3	1/G	1/G		(G-1)/G	(G-1)/G	(G-1)/G
	RAID 5	1	max(1/G, 1/4)	1	(G-1)/G	(G-1)/G
	RAID 6	1	max(1/G, 1/6)	1	(G-2)/G	(G-2)/G

When and how do disks fail?
  Disks can fail very early in their lifetimes (manufacturing errors)
  Also tend to fail late in their lifetimes (when disk wears out)
  Systematic manufacturing defect can make entire batch of disks fail early
    Beware disks with consecutive serial numbers!
  Environmental factors can kill a bunch of disks (air conditioning failure)
  Disks can fail when a bad block is read
    Bad block may exist for a while before being detected

What happens when a disk fails?
  Add new disk, reconstruct failed disk's state on new disk
  Must store metadata information during recovery
   - Which disks are failed?
   - How much of failed disk has been reconstructed?
  System crashes become very serious in conjunction with disk failure
    Parity may be inconsistent (particularly bad for P+Q)
    You could lose a block other than the one you were writing
    MUST log in NVRAM enough info to recover parity
      Makes software-only implementation of RAID risky
  Operating with a failed disk:
    Demand reconstruction
      Assumes spare disk immediately (or already) installed
      Reconstruct blocks as accessed
      Background thread reconstructs all blocks
    Parity sparing
      Replaces parity block with reconstructed data block
      Need extra metadata to keep track of this

When does a RAID array fail unrecoverably?
  Double disk failures (or tripple, if P+Q redundancy)
  System crash followed by disk failure
  Disk failure, then read and discover bad block during reconstruction

Other RAID improvement techniques
  Parity logging
  Declustered parity -- many parity groups, spread over many disks
  Parity sparing -- use spare disks to improve performance by spreading load

Tuning RAID
  What is optimal size of data stripe in RAID 0 disk array?
    Sqrt((PX(L-1)Z)/N)
      P - average positioning time
      X - disk transfer rate
      L - concurrency of workload
      Z - request size
      N - size of array in disks
  What about in RAID 5?
    Reads - similar to RAID 0
    Writes - optimal is a factor of 4 smaller than for reads (for 16 disks)
             seems to vary WITH #disks, while reads vary inversely!
  Conclusion:  Very workload dependent!