Project 8: Journaling File System (Write-Ahead Log)

In this project you will implement file system recovery based on journaled metadata. We will be using the Unix V6 file system format from 1975 that you learned about in the previous project, but bringing it (at least partially) into the 21st century by journaling all metadata updates to a write-ahead log and tracking free blocks in a bitmap instead of a linked list.

The goal of this project is to teach you about logging, which is an important concept in file systems, databases, and any flash devices that implement wear leveling.

Project overview

We will supply you with an implementation of the V6 file system using FUSE (file system in userspace). FUSE is a kernel feature allowing deveopers to implement file systems in ordinary user processes known as fuse drivers, without modifying the kernel. When applications make file system calls, the kernel passes the system calls through as messages to the user-level fuse driver, which then passes the results back to the kernel. Though this architecture incurs a bit of performance overhead, it is used by production file systems such as NTFS-3G and the exFAT implementation commonly used on linux.

Our fuse driver is called mountv6. After building the project files, if you have a V6 file system image called v6.img, you will be able to run ./mountv6 v6.img ./mnt to make this file system image v6.img available under the local directory ./mnt. (Don’t forget to make create an empty mnt directory first by running mkdir mnt.)

The catch is that the file system we provide you has an inode and block cache, and writes data back to the underlying file system images in an arbitrary order. If the file system crashes, this is likely to result in metadata inconsistencies. For example, blocks still allocated in files might be marked free, posing a danger of cross-allocation, or directory entries might point at unallocated inodes. You can force the program to crash by setting a write count in the CRASH_AT environment variable. For example, running CRASH_AT=1000 ./mountvf v6.img mnt causes mountv6 to crash abruptly at the 1000th block-write operation.

We supply you with a file system scavenger program fsckv6 that can restore a file system image to a consistent state, but often with heavy data loss. If there have been a lot of metadata changes, you will likely lose files. fsckv6 does not implement a lost+found mechanism but simply throws away unreachable files. In the presence of cross-allocation, it somewhat arbitrarily punches holes in files to fix the cross-allocation (without even checking which file has a later modification time, which might lessen the damage).

The good news is that we’ve augmented the file system to keep a write-ahead log of all metadata changes made. If you run mountv6 with the -j option (e.g., mountv6 -j v6.img mnt), it enables this journaling mode.

Your task: You must implement a program apply that repairs V6 file system images by applying the log to the file system from the point of the last checkpoint.

Journaling V6 file system layout

The on-disk layout of the journaling V6 file system consists of the following sections:

  1. The boot block (1 sector)
  2. The superblock (1 sector)
  3. Inodes (variable-length)
  4. Data blocks (variable-length)
  5. The log header (1 sector)
  6. The free-block bitmap or freemap (variable-length)
  7. The log (variable-length)

Sections 1-4 are identical or nearly identical to the original V6 file system layout. Their layout is defined by data structures and constants in the layout.hh header file.

Sections 5-7 were added for this project. Their layout is defined by the logentry.hh header file. Since nobody runs the V6 file system on raw hard disks any more—V6 file system images exist only as files on other file systems—we simply extend the file containing a V6 file system image to accommodate the new areas required by journaling.

The boot block

This block is essentially unused, but the first two bytes must be 4 and 7 or mountv6 will not recognize a file system image.

The superblock

The superblock is essentially the same as in 1975, except that we’ve coopted two of the unused bytes to add two more fields:

struct filsys {
    ...
    uint8_t  s_uselog;        // Use log
    uint8_t  s_dirty;         // File system was not cleanly shut down
    ...
};

The s_uselog byte, when non-zero, indicates that our new 21st-century logging functionality has been enabled for this particular file system volume.

The s_dirty field is set to 1 whenever the file system is opened, and set to 0 when it is closed cleanly. If your file system crashes, s_dirty will be 1, and mountv6 will refuse to mount it again until you have repaired the damage. (If you want to see how bad a corrupt file system can be in action, you can forcibly mount a dirty volume with the --force flag, as in ./mountv6 --force v6.img mnt.)

There’s one more change to the superblock. Originally, V6 stored free blocks in a linked list 100-wide. The superblock stored s_nfree free blocks (0–100) in an array s_free, with s_free[0] (if non-zero) pointing to a disk block containing 100 pointers to other free blocks and so on. This format is not convenient for journaling, so we set s_nfree to zero and instead dedicate a portion of the disk to a bitmap, containing one bit per data block, where 1 indicates the block is free and 0 indicates it is allocated. For brevity, we refer to the “free-block bitmap” as the freemap and store it in a structure field called freemap_.

V6 also stores a vector of up to 100 free inodes in the superblock, called s_inode. The number of valid entries is s_ninode, and when it reaches zero the V6 code simply re-scans the inode area to find 100 unallocated inodes. When repairing an unclean volume, we can avoid inconsistencies in inode allocation by just setting s_ninode to 0. As long as we have properly repaired the inodes and the IALLOC flags are all correct in the i_mode fields, the file system will just re-scan the inode area the next time it needs to allocate an inode.

Inodes

There is no change to the inode section. It’s a giant array of inode structures, where the entry 0 of block INODE_START_SECTOR (2) of the file system corresponds to inode number 1.

Data blocks

There is no change to this section of the file system.

Log header

The log header is defined by the loghdr structure in logentry.hh, and has the following fields:

struct loghdr {
    uint32_t l_magic;       // LOG_MAGIC_NUM
    uint32_t l_hdrblock;    // Block number containing loghdr structure
    uint16_t l_logsize;     // Total size (in blocks) of loghdr, freemap, and log area
    uint16_t l_mapsize;     // Number of blocks in freemap
    uint32_t l_checkpoint;  // Disk offset of log checkpoint
    lsn_t l_sequence;       // Expected LSN at checkpoint

    uint32_t mapstart() const { return l_hdrblock + 1; }
    uint32_t logstart() const { return mapstart() + l_mapsize; }
    uint32_t logend() const { return logstart() + l_logsize; }
    uint32_t logbytes() const {
        return SECTOR_SIZE * (l_logsize - l_mapsize - 1);
    }
};

The first two fields are sanity checks to ensure you really have a loghdr. The l_magic field contains the constant LOG_MAGIC_NUM (0x474c0636–a randomly-generated value), while l_hdrblock is the actual block number storing the loghdr structure. The first constant is unlikely to appear in random files, since it was randomly generated. The second constant ensures that if you copy a valid loghdr into a file, it still won’t look like a valid loghdr since it will be at the wrong disk offset.

l_logsize specifies how many blocks have been added to the V6 file system for the new feature set. This size includes sections 5–7 of the file system (the log header, freemap, and journal). With the new feature set enabled, the total number of 512-byte blocks of the disk image should now be s_fsize (from the superblock) plus l_logsize.

l_mapsize specifies how many blocks are used for the freemap. This must be large enough to contain at least 1 bit for each block number from the start of data blocks (at INODE_START_SECTOR+s_isize) to the end of the data blocks (at s_fsize).

Finally, l_checkpoint and l_sequence tell you where to start reading the log and what log sequence number to expect. l_checkpoint is the byte offset of the first log record you should read. (Because it’s a byte offset, it should reside somewhere in the half-open interval [logstart()*SECTOR_SIZE, logend()*SECTOR_SIZE).) Each log entry has a consecutively-assigned 32-bit log sequence number or LSN of type lsn_t. l_sequence tells you the LSN of the first log record you should expect to read at offset l_checkpoint. If you see the wrong LSN, even if it appears to be a valid log record, you must stop processing the log as you may be looking at an old log entry.

Freemap (free-block bitmap)

The free-block bitmap has one bit for every block in section 4 of the file system (data blocks). If the bit is 1, the block is free. If the bit is 0, the block is in use.

The log

The log consists of a series of log entires, each of which begins with a log-entry header, followed by a payload, and then a footer. Conceptually, the three fields look like this:

struct Header {
    lsn_t sequence;         // First copy of the LSN
    uint8_t type;           // What type this is (entry_type index)
} header;

entry_type payload;

struct Footer {
    uint32_t checksum;      // CRC-32 of header and payload
    lsn_t sequence;         // Another copy of the LSN
} footer;

It’s important to note that, while the beginning of the log will generally be valid, there isn’t a clean end. The log just turns to garbage at some point after the last write, which might not even be for a full log record if the file system crashed.

To help identify the end of the log, the header contains a log sequence number, or LSN. This is a consecutively assigned counter that is unique for each log entry. LSNs are very important because the log area is repeatedly recycled. Hence, whatever bytes are left over from the last iteration may be old log entries or may look like valid log entries. These old entries must not be applied to the file system; doing so could actually damage the file system. Even a valid old entry might overwrite file data in a current file data block that was previously a directory or indirect block.

The header also contains a type integer, which specifies the type of log entry, and hence how to parse the payload. The payload section depends on type, and is one one of the structures defined in the following subsections. The payload immediately follows the type.

Finally, the log entry ends with a footer. The footer contains another copy of the sequence number. The second copy helps detect situations in which a log entry spans a sector boundary, and hence a seemingly valid entry could actually consist of the first half of a new log entry followed by the second half of an old one. There’s still the possibility that the “right” log sequence number for the footer could appear in arbitrary payload data from an older log record. Hence, as an additional check, the footer also includes a checksum over the header and payload. The check reduces the probability of interpreting garbage as a log record by and additional factor of 2-32.

Transaction begin

struct LogBegin {
};

Ensuring file system consistency often requires writing to multiple disk locations atomically. For example, creating a file requires writing a directory entry and initialize the inode. Appending a block to a file requires writing a block pointer in the inode or indirect block and marking the block as no longer free in the freemap. Doing only one of these writes is harmful. For example, if an inode points to a block that is still marked free, the block may get cross-allocated to another inode. If a directory entry is written but the inode is not initialized, the directory will contain a link to an uninitialized inode.

To avoid such problems, log entries that modify file system state are grouped together in atomic transactions for which either all the log entries are applied or none are applied. Transactions are bracketed by a LogBegin at the beginning and LogCommit entry at the end. The LogBegin payload itself contains no data members. However, you must not apply any subsequent records unless there is also a matching LogCommit entry. If the LogCommit entry is missing, it means the system crashed in the middle of writing to the log, and hence applying subsequent records will likely put the file system in an inconsistent state.

Note, even though the LogBegin payload contains zero bytes, the entry still has a header and footer. The header contains an LSN and a type byte signifying that this is a LogBegin, while trailer ensures integrity of the log entry as usual.

Note: in the project, we do the check for a LogCommit for you and don’t even call your apply methods unless the log entries are part of a complete transaction. However, you should still understand what is happening.

Patch bytes

struct LogPatch {
    uint16_t blockno;           // Block number to patch
    uint16_t offset_in_block;   // Offset within block of patch
    std::vector<uint8_t> bytes; // Bytes to place into the block
};

This log entry specifies a change to the inode portion or data portion of the file system (sections 3 and 4 of the disk layout). It contains specific bytes that must be written to the disk at position offset_in_block of block blockno.

Allocate block

struct LogBlockAlloc {
    uint16_t blockno;       // Block number that was allocated
    uint8_t zero_on_replay; // Metadata--should zero out block on replay
};

Specifies that block number blockno, which was previously free, should now be marked as allocated (0 or false in the freemap). If zero_on_replay is non-zero, it means that the block is being used for metadata—i.e., as an indirect (or double-indirect) block, or for a directory.

During normal file system operation, any allocated block is first zeroed out using memset. Zeroing out a block is thus the most faithful way to replay an allocation action. It is also necessary to do when replaying metadata blocks. If you allocate an indirect block or directory block, all block pointers and all directory entries that are not explicitly patched must be zero, and there is no guarantee that whatever happened to be on disk at the time of the crash will in fact have zeros where required.

The reason it works to zero out metadata blocks when replaying the log is that all subsequent writes to those blocks are guaranteed to be included in the log. On the other hand, this is not the case for ordinary data blocks, as file writes themselves do not go to the log, only the block allocation. You might allocate a file block, write to the block, flush both the log records and data block to disk, and then crash without having made a checkpoint. In this situation, zeroing out the block will cause data loss. Hence, you must not zero out blocks that have the zero_on_replay == 0.

You may notice that whenever the log contains a LogBlockAlloc entry, there’s also a LogPatch entry setting a pointer to that block in the same transaction (i.e., between the same enclosing LogBegin and LogCommit entries). When allocating an indirect block, you may even find that the block containing the block pointer was just allocated, which all works fine because of atomic transactions.

Free block

struct LogBlockFree {
    uint16_t blockno;           // Block number of freed block
};

Specifies that block number blockno, which was previously allocated, should be marked free (1 or true in the freemap).

Transaction commit

struct LogCommit {
    uint32_t sequence;          // LSN of LogBegin
};

The LogCommit entry indicates the end of an atomic transaction, consisting of all log entries since the previous LogBegin. Note that the sequence field must include the LSN of the corresponding LogBegin log entry. Otherwise, it indicates the log has been corrupted and you must stop processing it.

Log rewind

struct LogRewind {
};

This indicates that the log has wrapped around and the next entry is back at the beginning of the log area. (An alternative is always to write to the very last byte of the log before wrapping around, but this would result in log entries that are not contiguous on disk. Contiguous entires facilitate debugging: you can always start examining the log from the beginning of the log area and see some valid operations that happened, whether before or after the latest checkpoint.)

Library interface

To implement the project, you will need to interact with the file system buffer cache and the freemap.

Buffer cache interface

You will access the file system and buffer cache through the file system interface defined in v6fs.hh. The V6Replay class you are implementing has a V6FS &fs_ data member you will use for all file I/O. The main methods and fields you need are:

struct V6FS {
    ...
    Ref<Buffer> bread(uint16_t blockno);
    Ref<Buffer> bget(uint16_t blockno);
    ...
};

Both of the above methods return a Ref<Buffer>. A Ref is a kind of smart pointer that keeps a reference count, similar to std::shared_ptr. The difference is that a std::shared_ptr deallocates whatever it is pointing to when the reference count goes to zero; Ref doesn’t do anything when the reference count goes to zero—the buffer continues to reside in the cache. However, when the buffer cache is full and must evict a block, only blocks with a zero reference count are eligible for eviction, so as to avoid evicting a buffer in the middle of its use.

A Buffer contains the cached contents of a disk block. The two things you need to do to a buffer are to access the memory, and to tell the system when the buffer is dirty. The memory is in a simple byte array called mem_, while the method bdwrite() is used to tell the system that you have modified a buffer and it should at some point be written back. (bdwrite is a commonly used function name in Unix kernels, where the “d” stands for “delayed.” A name such as mark_dirty() would arguably be more intuitive.)

struct Buffer {
    ...
    char mem_[SECTOR_SIZE];
    void bdwrite();
    ...
};

Bitmap interface

The freemap is a simple bitmap stored in V6Replay::freemap_. The freemap is compact enough to store the entire map in memory. It is read from disk for you in the constructor V6Replay::V6Replay, and written back to disk for you at the end of the V6Replay::replay method.

The freemap is implemented by the Bitmap structure defined in bitmap.hh, which is similar to a std::vector<bool>. Unlike std::vector<bool>, however, Bitmap offers a feature in which valid indices start at an arbitrary number. Since the first data block is INODE_START_SECTOR + s_isize (a.k.a. datastart()) rather than block zero, the first bit physically on disk in the freemap area corresponds to this “datastart” block. The Bitmap structure handles the translation because we construct it with a min_index. What this means in practice is that to mark block bn free, you just say freemap_.at(bn) = true; to mark it allocated, you say freemap_.at(bn) = false. Bitmap itself will do the work of translating bn to the appropriate bit.

The assignment

Your task is to implement the program apply that applies the log to a file system image (e.g., ./apply v6.img). We supply you a reasonable main function in apply.cc, and a framework to extract only the log entries with valid sequence numbers between valid LogBegin and LogCommit log entries. You will only need to edit the code for the various V6Replay::apply methods in replay.hh. There is one method for each type of payload in the log. Some types of log record may not require any action, so some of these methods could stay empty.

Developing and testing

To get started, login to the myth cluster and clone the starter repo with this command:

git clone https://web.stanford.edu/class/cs111/starters/rwfs.git cs111_p8

[Note that the assignment requires fuse3. Depending on your OS distribution, this may mean a package called fuse3-dev or libfuse3-dev or sometimes just fuse3. This is installed for you on the myth cluster but not rice (and since fuse requires a setuid root program, we cannot install it for you on rice).]

Run the following commands to create and mount a file system:

make
./mkfsv6 v6.img
./mountv6 -j v6.img mnt &

You’ve now created a file system image named v6.img and mounted it on mnt. You can see if the file system builds on itself as follows:

cp *.{hh,cc} Makefile mnt
make -C mnt

Please remember to unmount your file system when you are done. You can unmount it with fusermount -u mnt or pkill mountv6. To crash the file system, run pkill -9 mountv6, where the -9 says to use signal 9 (SIGKILL), which instantly terminates a process without allowing any cleanup. You can also use the CRASH_AT environment variable:

fusermount -u mnt
CRASH_AT=10000 ./mountv6 -j v6.img mnt &
cd mnt
touch `seq 1000`
rm `seq 1000`
cd $PWD/..

Now you’ve got a corrupt file system image. If you run ./fsckv6 without the -y flag, it will tell you some things that are wrong with the file system without actually fixing them:

./fsckv6 v6.img

You can fix these by running ./fsck -y v6.img, but then lots of files will be lost. Once you are done with the lab, you should be able to recover much more cleanly by running:

./apply v6.img

Before you start coding, play with the file system and try to understand what it is doing by performing metadata operations then examining the log with the dumplog tool (see below).

We have supplied some small file system images on which to test your code. As usual, run ./run_tests to invoke these tests. To run a test individually, you can run ./examine.sh -a test-imagetest-case.img and compare what you get to the desired output in test-images/test-case.out.

Optional: When you are done testing apply, if you would like mountv6 to apply the log automatically when given the -j flag, you can uncomment the line flags |= V6FS::V6_REPLAY; in the main function of mountv6.cc. This will integrate the code you developed for your standalone apply program into the file system. Many journaling file systems automatically replay the log on mount with no need to run an external fsck program unless you had media failure or catastrophic software bugs. Integrating your replay code in the file system won’t improve your score on the project, may be satisfying to do.

Helpful tools

Troubleshooting

Submitting Your Work

To submit your solution, cd to the directory containing your work, delete or move any large v6.img file out of the directory to avoid submitting it, then and invoke the command

./submit

If you discover problems or improvements after you have submitted, you may resubmit; only the latest submit will be graded.