AutoRAID
========

Motivation: RAID systems are hard to use and tune
    Wrong tuning gives bad performance
    Reformatting is painful and error-prone
Goal: get the best of all worlds automatically

Approach:
    Assume part of the data in the disk array is write-active, part inactive
    Assume the active subset changes slowly over time
    Store the active subset in RAID 1, the remainder in more space-efficient RAID 5

How to migrate data between RAID 1 and RAID 5?
    Manually?  Yuck
    In the file system?  Good idea, but want to work with existing file systems
        Knows about files, which may be a good indicator of data activity
    In the array controller?  Convenient place for users (unobtrusive)
        Loses knowledge about files and directories

AutoRAID features
    Mapping of logical block numbers to the underlying RAID level
    Automatic adaptation between mirroring and RAID 5
        Based on the amount of data stored and on access patterns
    Hot-pluggable disks, fans, power supplies, and controllers
        Including on-line storage capacity expansion -- just add a disk
        Easy disk upgrades -- remove the old disk, add a new one
    Active hot spare
        Get performance out of your spare disk (more space for mirroring)
        Also makes it more likely to notice a disk when it goes bad
            (as opposed to noticing it when you really need the disk)
    Log-structured RAID 5 writes -- try to avoid the extra 2 reads
        (of old data and old parity)

What's in the box (Figure 2)?
    A bunch of SCSI controllers and disks
    DRAM and NVRAM
    SCSI connector -- appears as a single disk to the host computer system

Data layout
    - PEX (physical extent) -- 1MB on a single disk
    - PEG (PEX group) -- a group of 3 or more PEXes, used as RAID 5 or RAID 1
    - segment -- RAID 5 stripe unit, or half of a mirroring unit (128KB)
    - stripe -- a bunch of segments in a PEG
    - RB (relocation block) -- unit of data migration (64KB)

Mapping structures
    Virtual device table (one per LUN): maps RB -> PEG table
    PEG table: list of RBs in the PEG, list of PEXes used to store them
    PEX table (one per drive)
    Also track: access times, free space in PEGs, other stats
    RB -> PEG mapping is delayed until the first write, saving space

Disk I/O
    Reads -- straightforward (either satisfy from cache, or make room)
        If reading RAID 5 in degraded mode (disk failed), may need to reconstruct
    Writes
        Invalidate the cached copy of the data
        Copy the data into NVRAM (might need to wait for space)
        Return "OK" to the host system
        Might initiate a back-end write, depending on policy

Flushing data to the back end
    In the mirrored storage class, just write both copies
    In RAID 5: "promote" the RB from RAID 5 to mirrored
        May require old mirrored data to become RAID 5 to make space
        Write both mirrored copies
        Update the virtual device, PEG, and PEX tables

Flushing demoted data to RAID 5 (see the sketch after this list)
    Technique 1: per-RB writes
        Hold the old parity block in NVRAM
        Write the data block to disk (holding it in NVRAM too?)
        XOR the block with its old value in the cache and apply the result to the parity
        Write the parity block
    Technique 2: logging
        Append demoted RBs to the "current RAID 5 PEG"
        Write PEGs in batches
            Often will write complete PEGs
            May begin in the middle of a PEG from the previous batch
            Hold the old parity and the index of the highest-numbered valid RB in NVRAM
    In a space crunch, may resort to traditional read-modify-write
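A minimal sketch (in Python, not AutoRAID's firmware; the function names are made up) of the parity arithmetic behind the two flushing techniques:

    # Sketch only: parity math for the two RAID 5 flushing paths.
    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def per_rb_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
        # Technique 1: read-modify-write of one RB.  Needs the old data and old
        # parity (extra reads, unless already held in NVRAM), then two writes:
        # the new data block and the new parity block.
        return xor(xor(old_parity, old_data), new_data)

    def full_stripe_parity(new_blocks: list[bytes]) -> bytes:
        # Technique 2: batching demoted RBs into whole stripes of the current
        # RAID 5 PEG lets parity be computed from the new data alone, so the
        # extra reads of old data and old parity disappear.
        return reduce(xor, new_blocks)

This is why the log-structured writes "avoid the extra 2 reads" of a traditional RAID 5 small write.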
Cleaning and hole plugging
    Demoted RBs create holes in mirrored PEGs
        Mirrored PEGs can be cleaned and converted to RAID 5
            (stick the surviving blocks into holes in other PEGs)
    Promoted RBs create holes in RAID 5 PEGs
        For mostly empty PEGs, copy the data to the "current RAID 5 PEG"
        If a RAID 5 PEG is almost full, clean it by hole plugging
            Saves copying the bulk of the data
            Requires the parity and the old value to be read

Background load balancing when new disks are added
    Migrate PEXes to the new disks to balance load
    When a new disk is added, old PEGs will not be full width
        Copy data from old PEGs to new, full-width ones

How is performance?
    OLTP -- much better than the Clariion RAID array, noticeably slower than JBOD
    Fig. 6b: explain the shape
        Up to 8 drives, increasing room for RAID 1 decreases migration costs
        After 8 disks, smaller benefit from having more spindles
            (two reads are more likely to be able to be done in parallel)
    Micro-benchmarks (Fig. 7) -- all beat the Clariion
        Sequential reads and writes: good
        Random writes: not so good (but competing with non-redundant JBOD)
        Random reads -- worse than JBOD
            Why?  Mostly a measure of controller overhead
            AutoRAID has a slow cache-searching algorithm
        Random writes -- 1:2:4 ratio of I/Os
            JBOD -- 1 write
            AutoRAID -- RAID 1 requires 2 writes for mirrored storage
            Clariion -- RAID 5 requires 2 reads, 2 writes
    Disk selection for reads of mirrored data (Fig. 9)
        Possibilities: alternate, inner/outer, shortest queue, SPTF
        SPTF is best, but hard to implement; shortest queue is almost as good; random is the fallback
    Hole plugging is a big win (Sec. 4.6)!
        Reduced the number of moved RBs by 93% and 96%
            (mean I/O time improved 8.4% and 3.2%)

How should a file system get the performance improvement from mirroring?
    Say you create a big file system but it is never more than 30% full ...
        will you get mostly mirrored PEGs?
    Depends on the file system, but probably not (see the toy model below)
        The file system will eventually touch every disk block
        AutoRAID doesn't know which data blocks the file system considers free
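A toy model (not from the paper; it assumes a file system that reuses the least-recently-freed block first, with made-up sizes) of why the set of blocks AutoRAID has to keep eventually dwarfs the 30% of live data:

    # Toy model: the FS never holds more than 30% live data, but AutoRAID
    # must treat every block that has ever been written as "in use".
    from collections import deque

    DEVICE_BLOCKS = 10_000
    free_list = deque(range(DEVICE_BLOCKS))  # FS allocates the oldest-freed block first
    live = set()                             # blocks the FS currently considers allocated
    ever_written = set()                     # blocks AutoRAID must keep somewhere

    for _ in range(50_000):
        if len(live) < 0.3 * DEVICE_BLOCKS:  # create: write a new block
            b = free_list.popleft()
            live.add(b)
            ever_written.add(b)
        else:                                # delete: free an arbitrary live block
            b = live.pop()
            free_list.append(b)

    print(f"file system live data:          {len(live) / DEVICE_BLOCKS:.0%} of the device")
    print(f"blocks AutoRAID has seen written: {len(ever_written) / DEVICE_BLOCKS:.0%}")

Once more blocks have been touched than fit in the mirrored space, new writes force demotions to RAID 5 even though the file system is only 30% full.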