Project 8: Journaling File System (Write-Ahead Log)
In this project you will implement file system recovery based on journaled metadata. We will be using the Unix V6 file system format from 1975 that you learned about in the previous project, but bringing it (at least partially) into the 21st century by journaling all metadata updates to a write-ahead log and tracking free blocks in a bitmap instead of a linked list.
The goal of this project is to teach you about logging, which is an important concept in file systems, databases, and any flash devices that implement wear leveling.
Project overview
We will supply you with an implementation of the V6 file system using FUSE (file system in userspace). FUSE is a kernel feature allowing deveopers to implement file systems in ordinary user processes known as fuse drivers, without modifying the kernel. When applications make file system calls, the kernel passes the system calls through as messages to the user-level fuse driver, which then passes the results back to the kernel. Though this architecture incurs a bit of performance overhead, it is used by production file systems such as NTFS-3G and the exFAT implementation commonly used on linux.
Our fuse driver is called mountv6
. After building the project files, if you have a V6 file system image called v6.img
, you will be able to run ./mountv6 v6.img ./mnt
to make this file system image v6.img
available under the local directory ./mnt
. (Don’t forget to make create an empty mnt
directory first by running mkdir mnt
.)
The catch is that the file system we provide you has an inode and block cache, and writes data back to the underlying file system images in an arbitrary order. If the file system crashes, this is likely to result in metadata inconsistencies. For example, blocks still allocated in files might be marked free, posing a danger of cross-allocation, or directory entries might point at unallocated inodes. You can force the program to crash by setting a write count in the CRASH_AT
environment variable. For example, running CRASH_AT=1000 ./mountvf v6.img mnt
causes mountv6
to crash abruptly at the 1000th block-write operation.
We supply you with a file system scavenger program fsckv6
that can restore a file system image to a consistent state, but often with heavy data loss. If there have been a lot of metadata changes, you will likely lose files. fsckv6
does not implement a lost+found
mechanism but simply throws away unreachable files. In the presence of cross-allocation, it somewhat arbitrarily punches holes in files to fix the cross-allocation (without even checking which file has a later modification time, which might lessen the damage).
The good news is that we’ve augmented the file system to keep a write-ahead log of all metadata changes made. If you run mountv6
with the -j
option (e.g., mountv6 -j v6.img mnt
), it enables this journaling mode.
Your task: You must implement a program apply
that repairs V6 file system images by applying the log to the file system from the point of the last checkpoint.
Journaling V6 file system layout
The on-disk layout of the journaling V6 file system consists of the following sections:
- The boot block (1 sector)
- The superblock (1 sector)
- Inodes (variable-length)
- Data blocks (variable-length)
- The log header (1 sector)
- The free-block bitmap or freemap (variable-length)
- The log (variable-length)
Sections 1-4 are identical or nearly identical to the original V6 file system layout. Their layout is defined by data structures and constants in the layout.hh
header file.
Sections 5-7 were added for this project. Their layout is defined by the logentry.hh
header file. Since nobody runs the V6 file system on raw hard disks any more—V6 file system images exist only as files on other file systems—we simply extend the file containing a V6 file system image to accommodate the new areas required by journaling.
The boot block
This block is essentially unused, but the first two bytes must be 4 and 7 or mountv6
will not recognize a file system image.
The superblock
The superblock is essentially the same as in 1975, except that we’ve coopted two of the unused bytes to add two more fields:
struct filsys {
...
uint8_t s_uselog; // Use log
uint8_t s_dirty; // File system was not cleanly shut down
...
};
The s_uselog
byte, when non-zero, indicates that our new 21st-century logging functionality has been enabled for this particular file system volume.
The s_dirty
field is set to 1 whenever the file system is opened, and set to 0 when it is closed cleanly. If your file system crashes, s_dirty
will be 1, and mountv6
will refuse to mount it again until you have repaired the damage. (If you want to see how bad a corrupt file system can be in action, you can forcibly mount a dirty volume with the --force
flag, as in ./mountv6 --force v6.img mnt
.)
There’s one more change to the superblock. Originally, V6 stored free blocks in a linked list 100-wide. The superblock stored s_nfree
free blocks (0–100) in an array s_free
, with s_free[0]
(if non-zero) pointing to a disk block containing 100 pointers to other free blocks and so on. This format is not convenient for journaling, so we set s_nfree
to zero and instead dedicate a portion of the disk to a bitmap, containing one bit per data block, where 1 indicates the block is free and 0 indicates it is allocated. For brevity, we refer to the “free-block bitmap” as the freemap and store it in a structure field called freemap_
.
V6 also stores a vector of up to 100 free inodes in the superblock, called s_inode
. The number of valid entries is s_ninode
, and when it reaches zero the V6 code simply re-scans the inode area to find 100 unallocated inodes. When repairing an unclean volume, we can avoid inconsistencies in inode allocation by just setting s_ninode
to 0. As long as we have properly repaired the inodes and the IALLOC
flags are all correct in the i_mode
fields, the file system will just re-scan the inode area the next time it needs to allocate an inode.
Inodes
There is no change to the inode section. It’s a giant array of inode
structures, where the entry 0 of block INODE_START_SECTOR
(2) of the file system corresponds to inode number 1.
Data blocks
There is no change to this section of the file system.
Log header
The log header is defined by the loghdr
structure in logentry.hh
, and has the following fields:
struct loghdr {
uint32_t l_magic; // LOG_MAGIC_NUM
uint32_t l_hdrblock; // Block number containing loghdr structure
uint16_t l_logsize; // Total size (in blocks) of loghdr, freemap, and log area
uint16_t l_mapsize; // Number of blocks in freemap
uint32_t l_checkpoint; // Disk offset of log checkpoint
lsn_t l_sequence; // Expected LSN at checkpoint
uint32_t mapstart() const { return l_hdrblock + 1; }
uint32_t logstart() const { return mapstart() + l_mapsize; }
uint32_t logend() const { return logstart() + l_logsize; }
uint32_t logbytes() const {
return SECTOR_SIZE * (l_logsize - l_mapsize - 1);
}
};
The first two fields are sanity checks to ensure you really have a loghdr
. The l_magic
field contains the constant LOG_MAGIC_NUM
(0x474c0636
–a randomly-generated value), while l_hdrblock
is the actual block number storing the loghdr
structure. The first constant is unlikely to appear in random files, since it was randomly generated. The second constant ensures that if you copy a valid loghdr
into a file, it still won’t look like a valid loghdr
since it will be at the wrong disk offset.
l_logsize
specifies how many blocks have been added to the V6 file system for the new feature set. This size includes sections 5–7 of the file system (the log header, freemap, and journal). With the new feature set enabled, the total number of 512-byte blocks of the disk image should now be s_fsize
(from the superblock) plus l_logsize
.
l_mapsize
specifies how many blocks are used for the freemap. This must be large enough to contain at least 1 bit for each block number from the start of data blocks (at INODE_START_SECTOR+s_isize
) to the end of the data blocks (at s_fsize
).
Finally, l_checkpoint
and l_sequence
tell you where to start reading the log and what log sequence number to expect. l_checkpoint
is the byte offset of the first log record you should read. (Because it’s a byte offset, it should reside somewhere in the half-open interval [logstart()*SECTOR_SIZE, logend()*SECTOR_SIZE)
.) Each log entry has a consecutively-assigned 32-bit log sequence number or LSN of type lsn_t
. l_sequence
tells you the LSN of the first log record you should expect to read at offset l_checkpoint
. If you see the wrong LSN, even if it appears to be a valid log record, you must stop processing the log as you may be looking at an old log entry.
Freemap (free-block bitmap)
The free-block bitmap has one bit for every block in section 4 of the file system (data blocks). If the bit is 1, the block is free. If the bit is 0, the block is in use.
The log
The log consists of a series of log entires, each of which begins with a log-entry header, followed by a payload, and then a footer. Conceptually, the three fields look like this:
struct Header {
lsn_t sequence; // First copy of the LSN
uint8_t type; // What type this is (entry_type index)
} header;
entry_type payload;
struct Footer {
uint32_t checksum; // CRC-32 of header and payload
lsn_t sequence; // Another copy of the LSN
} footer;
It’s important to note that, while the beginning of the log will generally be valid, there isn’t a clean end. The log just turns to garbage at some point after the last write, which might not even be for a full log record if the file system crashed.
To help identify the end of the log, the header contains a log sequence number, or LSN. This is a consecutively assigned counter that is unique for each log entry. LSNs are very important because the log area is repeatedly recycled. Hence, whatever bytes are left over from the last iteration may be old log entries or may look like valid log entries. These old entries must not be applied to the file system; doing so could actually damage the file system. Even a valid old entry might overwrite file data in a current file data block that was previously a directory or indirect block.
The header also contains a type
integer, which specifies the type of log entry, and hence how to parse the payload. The payload section depends on type
, and is one one of the structures defined in the following subsections. The payload immediately follows the type.
Finally, the log entry ends with a footer. The footer contains another copy of the sequence number. The second copy helps detect situations in which a log entry spans a sector boundary, and hence a seemingly valid entry could actually consist of the first half of a new log entry followed by the second half of an old one. There’s still the possibility that the “right” log sequence number for the footer could appear in arbitrary payload data from an older log record. Hence, as an additional check, the footer also includes a checksum over the header and payload. The check reduces the probability of interpreting garbage as a log record by and additional factor of 2-32.
Transaction begin
struct LogBegin {
};
Ensuring file system consistency often requires writing to multiple disk locations atomically. For example, creating a file requires writing a directory entry and initialize the inode. Appending a block to a file requires writing a block pointer in the inode or indirect block and marking the block as no longer free in the freemap. Doing only one of these writes is harmful. For example, if an inode points to a block that is still marked free, the block may get cross-allocated to another inode. If a directory entry is written but the inode is not initialized, the directory will contain a link to an uninitialized inode.
To avoid such problems, log entries that modify file system state are grouped together in atomic transactions for which either all the log entries are applied or none are applied. Transactions are bracketed by a LogBegin
at the beginning and LogCommit
entry at the end. The LogBegin
payload itself contains no data members. However, you must not apply any subsequent records unless there is also a matching LogCommit
entry. If the LogCommit
entry is missing, it means the system crashed in the middle of writing to the log, and hence applying subsequent records will likely put the file system in an inconsistent state.
Note, even though the LogBegin
payload contains zero bytes, the entry still has a header and footer. The header contains an LSN and a type byte signifying that this is a LogBegin
, while trailer ensures integrity of the log entry as usual.
Note: in the project, we do the check for a LogCommit
for you and don’t even call your apply
methods unless the log entries are part of a complete transaction. However, you should still understand what is happening.
Patch bytes
struct LogPatch {
uint16_t blockno; // Block number to patch
uint16_t offset_in_block; // Offset within block of patch
std::vector<uint8_t> bytes; // Bytes to place into the block
};
This log entry specifies a change to the inode portion or data portion of the file system (sections 3 and 4 of the disk layout). It contains specific bytes that must be written to the disk at position offset_in_block
of block blockno
.
Allocate block
struct LogBlockAlloc {
uint16_t blockno; // Block number that was allocated
uint8_t zero_on_replay; // Metadata--should zero out block on replay
};
Specifies that block number blockno
, which was previously free, should now be marked as allocated (0 or false in the freemap). If zero_on_replay
is non-zero, it means that the block is being used for metadata—i.e., as an indirect (or double-indirect) block, or for a directory.
During normal file system operation, any allocated block is first zeroed out using memset
. Zeroing out a block is thus the most faithful way to replay an allocation action. It is also necessary to do when replaying metadata blocks. If you allocate an indirect block or directory block, all block pointers and all directory entries that are not explicitly patched must be zero, and there is no guarantee that whatever happened to be on disk at the time of the crash will in fact have zeros where required.
The reason it works to zero out metadata blocks when replaying the log is that all subsequent writes to those blocks are guaranteed to be included in the log. On the other hand, this is not the case for ordinary data blocks, as file writes themselves do not go to the log, only the block allocation. You might allocate a file block, write to the block, flush both the log records and data block to disk, and then crash without having made a checkpoint. In this situation, zeroing out the block will cause data loss. Hence, you must not zero out blocks that have the zero_on_replay == 0
.
You may notice that whenever the log contains a LogBlockAlloc
entry, there’s also a LogPatch
entry setting a pointer to that block in the same transaction (i.e., between the same enclosing LogBegin
and LogCommit
entries). When allocating an indirect block, you may even find that the block containing the block pointer was just allocated, which all works fine because of atomic transactions.
Free block
struct LogBlockFree {
uint16_t blockno; // Block number of freed block
};
Specifies that block number blockno
, which was previously allocated, should be marked free (1 or true in the freemap).
Transaction commit
struct LogCommit {
uint32_t sequence; // LSN of LogBegin
};
The LogCommit
entry indicates the end of an atomic transaction, consisting of all log entries since the previous LogBegin
. Note that the sequence
field must include the LSN of the corresponding LogBegin
log entry. Otherwise, it indicates the log has been corrupted and you must stop processing it.
Log rewind
struct LogRewind {
};
This indicates that the log has wrapped around and the next entry is back at the beginning of the log area. (An alternative is always to write to the very last byte of the log before wrapping around, but this would result in log entries that are not contiguous on disk. Contiguous entires facilitate debugging: you can always start examining the log from the beginning of the log area and see some valid operations that happened, whether before or after the latest checkpoint.)
Library interface
To implement the project, you will need to interact with the file system buffer cache and the freemap.
Buffer cache interface
You will access the file system and buffer cache through the file system interface defined in v6fs.hh
. The V6Replay
class you are implementing has a V6FS &fs_
data member you will use for all file I/O. The main methods and fields you need are:
struct V6FS {
...
<Buffer> bread(uint16_t blockno);
Ref<Buffer> bget(uint16_t blockno);
Ref...
};
bread
reads a block from disk, and returns the contents in a buffer.bget
returns a buffer for a particular disk block, but doesn’t actually read it from disk. The contents of the buffer could be garbage. However, in the case that you are going to overwrite an entire block (for instance by zeroing it out), there is no reason to read the old contents from disk, sobget
is preferable.
Both of the above methods return a Ref<Buffer>
. A Ref
is a kind of smart pointer that keeps a reference count, similar to std::shared_ptr
. The difference is that a std::shared_ptr
deallocates whatever it is pointing to when the reference count goes to zero; Ref
doesn’t do anything when the reference count goes to zero—the buffer continues to reside in the cache. However, when the buffer cache is full and must evict a block, only blocks with a zero reference count are eligible for eviction, so as to avoid evicting a buffer in the middle of its use.
A Buffer
contains the cached contents of a disk block. The two things you need to do to a buffer are to access the memory, and to tell the system when the buffer is dirty. The memory is in a simple byte array called mem_
, while the method bdwrite()
is used to tell the system that you have modified a buffer and it should at some point be written back. (bdwrite
is a commonly used function name in Unix kernels, where the “d” stands for “delayed.” A name such as mark_dirty()
would arguably be more intuitive.)
struct Buffer {
...
char mem_[SECTOR_SIZE];
void bdwrite();
...
};
Bitmap interface
The freemap is a simple bitmap stored in V6Replay::freemap_
. The freemap is compact enough to store the entire map in memory. It is read from disk for you in the constructor V6Replay::V6Replay
, and written back to disk for you at the end of the V6Replay::replay
method.
The freemap is implemented by the Bitmap
structure defined in bitmap.hh
, which is similar to a std::vector<bool>
. Unlike std::vector<bool>
, however, Bitmap
offers a feature in which valid indices start at an arbitrary number. Since the first data block is INODE_START_SECTOR + s_isize
(a.k.a. datastart()
) rather than block zero, the first bit physically on disk in the freemap area corresponds to this “datastart” block. The Bitmap
structure handles the translation because we construct it with a min_index
. What this means in practice is that to mark block bn
free, you just say freemap_.at(bn) = true
; to mark it allocated, you say freemap_.at(bn) = false
. Bitmap
itself will do the work of translating bn
to the appropriate bit.
The assignment
Your task is to implement the program apply
that applies the log to a file system image (e.g., ./apply v6.img
). We supply you a reasonable main function in apply.cc
, and a framework to extract only the log entries with valid sequence numbers between valid LogBegin
and LogCommit
log entries. You will only need to edit the code for the various V6Replay::apply
methods in replay.hh
. There is one method for each type of payload in the log. Some types of log record may not require any action, so some of these methods could stay empty.
Developing and testing
To get started, login to the myth cluster and clone the starter repo with this command:
git clone https://web.stanford.edu/class/cs111/starters/rwfs.git cs111_p8
[Note that the assignment requires fuse3. Depending on your OS distribution, this may mean a package called fuse3-dev
or libfuse3-dev
or sometimes just fuse3
. This is installed for you on the myth cluster but not rice (and since fuse requires a setuid root program, we cannot install it for you on rice).]
Run the following commands to create and mount a file system:
make
./mkfsv6 v6.img
./mountv6 -j v6.img mnt &
You’ve now created a file system image named v6.img
and mounted it on mnt
. You can see if the file system builds on itself as follows:
cp *.{hh,cc} Makefile mnt
make -C mnt
Please remember to unmount your file system when you are done. You can unmount it with fusermount -u mnt
or pkill mountv6
. To crash the file system, run pkill -9 mountv6
, where the -9
says to use signal 9 (SIGKILL
), which instantly terminates a process without allowing any cleanup. You can also use the CRASH_AT
environment variable:
fusermount -u mnt
CRASH_AT=10000 ./mountv6 -j v6.img mnt &
cd mnt
touch `seq 1000`
rm `seq 1000`
cd $PWD/..
Now you’ve got a corrupt file system image. If you run ./fsckv6
without the -y
flag, it will tell you some things that are wrong with the file system without actually fixing them:
./fsckv6 v6.img
You can fix these by running ./fsck -y v6.img
, but then lots of files will be lost. Once you are done with the lab, you should be able to recover much more cleanly by running:
./apply v6.img
Before you start coding, play with the file system and try to understand what it is doing by performing metadata operations then examining the log with the dumplog
tool (see below).
We have supplied some small file system images on which to test your code. As usual, run ./run_tests
to invoke these tests. To run a test individually, you can run ./examine.sh -a test-image
test-case.img
and compare what you get to the desired output in test-images/
test-case.out
.
Optional: When you are done testing apply
, if you would like mountv6
to apply the log automatically when given the -j
flag, you can uncomment the line flags |= V6FS::V6_REPLAY;
in the main
function of mountv6.cc
. This will integrate the code you developed for your standalone apply
program into the file system. Many journaling file systems automatically replay the log on mount with no need to run an external fsck
program unless you had media failure or catastrophic software bugs. Integrating your replay code in the file system won’t improve your score on the project, may be satisfying to do.
Helpful tools
mkfsv6
image-file [num-blocks [num-inodes [num-journalblocks]]] – This tool creates a new file system. It uses the maximum size of 65,535 blocks (~32 MiB) and one inode per 2KiB by default, but you can specify different parameters if you prefer. By default it does not create a log area, expecting you to do this separately withmountv6 -j
. However, you can specify a number of journal blocks, and it will create the journal area. If you specify 0, it will use a default size for the journal.dumplog
image-file [offset |c
] – Pretty-prints the log from the beginning, or from a specified numeric file offset. If you use the letterc
instead of a numeric offset, prints from the checkpoint.v6
- A tool with various subcommands for examining the file system state. It expects your file system image to be calledv6.img
. You can change this with theV6IMG
environment variable (e.g.,V6IMG=test.img ./v6 ls /
for a single invocation orexport V6IMG=test.img
for all future invocations). Some useful subcommands:v6 dump
– look at the superblock and log header.v6 stat
{path |#
i-number}… – shows the contents of particular inodes. You can specify inodes as pathnames or as#
prefixed to the i-number. Be aware that#
is a comment character for most shells, so you may need to quote it. (Example:./v6 stat /
or./v6 stat '#1'
to see the root directory inode.)v6 ls
[-a
] path… – lists a file or directory in a format similar tols -alni
. With the-a
flag, shows time of last access instead of modification (similar tols -alniu
).v6 cat
path… – look at the contents of a file.v6 iblock
block-number… – dumps the contents of indirect blocks, showing the non-zero block pointers.v6 block
block-number … – dumps the raw contents of file system blocks in a side-by-size hex and ASCII format with 16 bytes per line. This is convenient for directories because thedirentv6
structure is exactly 16 bytes. (If you find this format generally useful, you can achieve something similar with the unix shell commandod --endian=big -t x4z -N 512 -j
block-numberb
. However,v6 block
replaces unprintable characters with space instead of dot, making it easier to identify the.
and..
entries in directories.)v6 usedblocks
– lists all allocated block numbers. If you are leaking blocks, this can help you identify which blocks you are failing to mark free.v6 usedinodes
– likeusedblocks
, but lists all allocated inodes. Useful if you are leaking inodes.
fsckv6
[-y
] image-file - Reports what’s wrong with the file system. If you use the-y
flag, it will bring the file system back to a consistent state. However, it doesn’t have any support for logging and will disable the log. (You can always recreate the log by supplying the-j
flag tomountv6
.)
Troubleshooting
If your file system ever crashes and gets linux into a bad state, you may need to remove the fuse mount. The command to do so is:
fusermount -u mnt
(assuming
mnt
is the name of your mount point). This won’t work if you still have a process in the directory, so make sure tocd
out of themnt
directory if you are having problems.Remember that
..
won’t work if your file system has crashed, so you’ll need to givecd
an absolute pathname. A convenient choice iscd $PWD
orcd $PWD/..
, since your shell maintains the$PWD
environment variable to the path to the current working directory.seq
is a simple linux utility that prints a series of integers. However, if you want to generate lots of metadata activity, you can create 1000 files withtouch `seq 1000`
and then delete them withrm `seq 1000`
.You can actually build the file system on itself (i.e.,
mountv6
is “self-hosting”). However, depending on the version ofgcc
you use, it may require more than 32MB (the maximum disk space given 16-bit block pointers to 512-byte sectors). If you remove the-ggdb
option fromCXXFLAGS
in the Makefile, it will make the object file sizes smaller and the build should work.If you accidentally modify the test file system images, you can always restore that subdirectory to a pristine state by running:
git checkout origin/master -- test-images
Submitting Your Work
To submit your solution, cd
to the directory containing your work, delete or move any large v6.img
file out of the directory to avoid submitting it, then and invoke the command
./submit
If you discover problems or improvements after you have submitted, you may resubmit; only the latest submit will be graded.