AutoRAID
========

Motivation: RAID systems are hard to use and tune
    Wrong tuning gives bad performance
    Reformatting is painful and error-prone
Goal: get the best of all worlds automatically

Approach:
    Assume part of the data in the disk array is write-active, part inactive
    Assume the active subset changes slowly over time
    Store the active subset in RAID 1, the remainder in more space-efficient RAID 5

How to migrate data between RAID 1 and RAID 5?
    Manually?  Yuck
    In the file system?  Good idea, but want to work with existing file systems
        Knows about files, which may be a good indicator of data activity
    In the array controller?  Convenient place for users (unobtrusive)
        Loses knowledge about files and directories

AutoRAID features
    Mapping of logical block numbers to the underlying RAID level
    Automatic adaptation between mirroring and RAID 5
        Based on the amount of data stored and on access patterns
    Hot-pluggable disks, fans, power supplies, and controllers
        Including on-line storage capacity expansion -- just add a disk
        Easy disk upgrades -- remove the old disk, add a new one
    Active hot spare
        Get performance out of your spare disk (more space for mirroring)
        Also makes it more likely to notice a disk when it goes bad
            (as opposed to noticing it when you really need the disk)
    Log-structured RAID 5 writes -- try to avoid the extra 2 reads
        (of old data and old parity)

What's in the box (Figure 2)?
    A bunch of SCSI controllers and disks
    DRAM and NVRAM
    SCSI connector -- appears as a single disk to the host computer system

Data layout
    - PEX (physical extent) -- 1MB on a single disk
    - PEG (PEX group) -- a group of 3 or more PEXes, used as RAID 5 or RAID 1
    - segment -- RAID 5 stripe unit, or half of a mirroring unit (128KB)
    - stripe -- a bunch of segments in a PEG
    - RB (relocation block) -- unit of data migration (64KB)

Mapping structures
    Virtual device table (one per LUN): maps RB -> PEG table
    PEG table: list of RBs in the PEG, list of PEXes used to store them
    PEX table (one per drive)
    Also track: access times, free space in PEGs, other stats
    RB -> PEG mapping is delayed until the first write, saving space

Disk I/O
    Reads -- straightforward (either satisfy from cache, or make room)
        If reading RAID 5 in degraded mode (disk failed), may need to reconstruct
    Writes
        Invalidate the cached copy of the data
        Copy the data into NVRAM (might need to wait for space)
        Return "OK" to the host system
        Might initiate a back-end write, depending on policy

Flushing data to the back end
    In the mirrored storage class, just write both copies
    In RAID 5: "promote" the RB from RAID 5 to mirrored
        May require old mirrored data to become RAID 5 to make space
        Write both mirrored copies
        Update the virtual device, PEG, and PEX tables

Flushing demoted data to RAID 5 (see the sketch after this list)
    Technique 1: per-RB writes
        Hold the old parity block in NVRAM
        Write the data block to disk (holding it in NVRAM too?)
        XOR the block with its old value in the cache and apply the result to the parity
        Write the parity block
    Technique 2: logging
        Append demoted RBs to the "current RAID 5 PEG"
        Write PEGs in batches
            Often will write complete PEGs
            May begin in the middle of a PEG from the previous batch
            Hold the old parity and the index of the highest-numbered valid RB in NVRAM
    In a space crunch, may resort to traditional read-modify-write
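A minimal sketch (in Python, not AutoRAID's firmware; the function names are made up) of the parity arithmetic behind the two flushing techniques:

    # Sketch only: parity math for the two RAID 5 flushing paths.
    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def per_rb_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
        # Technique 1: read-modify-write of one RB.  Needs the old data and old
        # parity (extra reads, unless already held in NVRAM), then two writes:
        # the new data block and the new parity block.
        return xor(xor(old_parity, old_data), new_data)

    def full_stripe_parity(new_blocks: list[bytes]) -> bytes:
        # Technique 2: batching demoted RBs into whole stripes of the current
        # RAID 5 PEG lets parity be computed from the new data alone, so the
        # extra reads of old data and old parity disappear.
        return reduce(xor, new_blocks)

This is why the log-structured writes "avoid the extra 2 reads" of a traditional RAID 5 small write.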
Cleaning and hole plugging
    Demoted RBs create holes in mirrored PEGs
        Mirrored PEGs can be cleaned and converted to RAID 5
            (stick the surviving blocks into holes in other PEGs)
    Promoted RBs create holes in RAID 5 PEGs
        For mostly empty PEGs, copy the data to the "current RAID 5 PEG"
        If a RAID 5 PEG is almost full, clean it by hole plugging
            Saves copying the bulk of the data
            Requires the parity and the old value to be read

Background load balancing when new disks are added
    Migrate PEXes to the new disks to balance load
    When a new disk is added, old PEGs will not be full width
        Copy data from old PEGs to new, full-width ones

How is performance?
    OLTP -- much better than the Clariion RAID array, noticeably slower than JBOD
    Fig. 6b: explain the shape
        Up to 8 drives, increasing room for RAID 1 decreases migration costs
        After 8 disks, smaller benefit from having more spindles
            (two reads are more likely to be able to be done in parallel)
    Micro-benchmarks (Fig. 7) -- all beat the Clariion
        Sequential reads and writes: good
        Random writes: not so good (but competing with non-redundant JBOD)
        Random reads -- worse than JBOD
            Why?  Mostly a measure of controller overhead
            AutoRAID has a slow cache-searching algorithm
        Random writes -- 1:2:4 ratio of I/Os
            JBOD -- 1 write
            AutoRAID -- RAID 1 requires 2 writes for mirrored storage
            Clariion -- RAID 5 requires 2 reads, 2 writes
    Disk selection for reads of mirrored data (Fig. 9)
        Possibilities: alternate, inner/outer, shortest queue, SPTF
        SPTF is best, but hard to implement; shortest queue is almost as good; random is the fallback
    Hole plugging is a big win (Sec. 4.6)!
        Reduced the number of moved RBs by 93% and 96%
            (mean I/O time improved 8.4% and 3.2%)

How should a file system get the performance improvement from mirroring?
    Say you create a big file system but it is never more than 30% full ...
        will you get mostly mirrored PEGs?
    Depends on the file system, but probably not (see the toy model below)
        The file system will eventually touch every disk block
        AutoRAID doesn't know which data blocks the file system considers free
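A toy model (not from the paper; it assumes a file system that reuses the least-recently-freed block first, with made-up sizes) of why the set of blocks AutoRAID has to keep eventually dwarfs the 30% of live data:

    # Toy model: the FS never holds more than 30% live data, but AutoRAID
    # must treat every block that has ever been written as "in use".
    from collections import deque

    DEVICE_BLOCKS = 10_000
    free_list = deque(range(DEVICE_BLOCKS))  # FS allocates the oldest-freed block first
    live = set()                             # blocks the FS currently considers allocated
    ever_written = set()                     # blocks AutoRAID must keep somewhere

    for _ in range(50_000):
        if len(live) < 0.3 * DEVICE_BLOCKS:  # create: write a new block
            b = free_list.popleft()
            live.add(b)
            ever_written.add(b)
        else:                                # delete: free an arbitrary live block
            b = live.pop()
            free_list.append(b)

    print(f"file system live data:          {len(live) / DEVICE_BLOCKS:.0%} of the device")
    print(f"blocks AutoRAID has seen written: {len(ever_written) / DEVICE_BLOCKS:.0%}")

Once more blocks have been touched than fit in the mirrored space, new writes force demotions to RAID 5 even though the file system is only 30% full.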