Project 5: Memory-Mapped Encrypted Files

In this project you will implement memory-mapped access to data that is stored in encrypted files. Normally, code like this would be written inside the operating system kernel; for this project, we have used the Linux mmap facility to simulate features that are typically available in the kernel, so you can write your code at user level. The goal of this project is to teach you about the following concepts:

How demand paging works
How page protection can be used to emulate hardware features such as dirty bits
How to use intrusive containers, in contrast to the non-intrusive containers found in the C++ library.

Project Overview

You will implement a class MCryptFile that provides the following methods:

MCryptFile(Key key, std::string path)
Constructs an MCryptFile object that can be used to access data in the file given by path. The file is encrypted. When data is read from the file into memory, it will be decrypted using key; when data is written from memory back to the file, it will be encrypted using key. Key objects can be constructed from strings. If the file doesn’t currently exist, a new file will be created. Throws std::system_error if the file cannot be opened or created. Most of the functionality of this constructor is actually implemented in the CryptFile superclass constructor (see below).
char *map(std::size_t min_size = 0)
Maps the associated file into a contiguous region of virtual memory and returns the virtual address of the first byte of that region. Byte i of the file will henceforth be directly accessible at offset i into the region (in plaintext). The min_size argument can be used to create a new file or grow an existing file. The actual size of the virtual region will be the larger of min_size and the original file length; if bytes are written in the region beyond the original file length, the file will be extended when flush is called. If you want to grow a file beyond the current size of the region, invoke unmap to unmap the region, then invoke map to map it again with a larger size. Note: file sizes are always rounded up to full-page boundaries.
void flush()
Encrypt all pages that have been modified since the last call to flush and write them back to the associated file. All pages currently in memory should remain in memory. This operation may grow the size of the encrypted file in the file system.
void unmap()
Sync any dirty pages and remove the mapping created by map. After this method returns, the caller must no longer use any references into the previously mapped region.
char *map_base()
Returns the address of the first byte of the memory mapped file, or NULL if the associated file is not currently mapped.
std::size_t map_size()
Returns the current size of the mapped region, or 0 if the associated file is not currently mapped.
static void set_memory_size(std::size_t npages)
Invoked to specify how many pages should be in the pool of physical memory that is shared by all MCryptFile objects. This method should only be invoked before the first MCryptFile is created; it will have no effect after that.

Supporting Code

We have written several classes for you to use in implementing MCryptFile:

CryptFile

The CryptFile class provides basic mechanisms for reading and writing encrypted files (but not for memory-mapping them). Your MCryptFile class should be a subclass of CryptFile. The CryptFile class has the following methods:

CryptFile(Key key, std::string path)
The API for this method is identical to that for the MCryptFile constructor (see above).
int aligned_pread(void *dst, std::size_t len, std::size_t offset)

Reads and decrypts len bytes of data at position offset in the file. Both len and offset must be multiples of the AES encryption algorithm’s block size (16), which is accessible via CrypteFile::blocksize. The block size will not be an issue for this project, because you will only be reading and writing full pages aligned on page-size boundaries. Returns the number of bytes read or -1 on error.
int aligned_pwrite(const void *src, std::size_t len, std::size_t offset)

Encrypts len bytes starting at src and writes them to the associated file at position offset. Returns len on success and -1 on error. Both len and offset must be multiples of CryptFile::blocksize.

VMRegion

The VMRegion class provides basic mechanisms for mapping pages into a region of virtual memory, taking page faults, and managing permissions. A VMRegion corresponds roughly to a contiguous range of page table entries for one process in an operating system. You will create one VMRegion object for each mapped MCryptFile. The VMRegion class is defined in the header file vm.hh and has the following methods:

VMRegion(std::size_t nbytes, std::function<void(char *)> handler)
Constructs a VMRegion object and allocates a region of virtual memory in the process that is currently unused. The size of the region will be nbytes; if nbytes isn’t a multiple of the page size, the VMRegion will behave as if nbytes were rounded up to the next higher multiple of the page size. The handler parameter specifies a function to invoke for page faults in the region. Page faults occur whenever an unmapped page is referenced or an attempt is made to write a page that is currently read-only. handler is invoked with a single argument giving the address of the address that triggered the page fault.
VPage get_base()
Returns the address of the first page in the virtual region. VPage is a type that refers to the first byte of a virtual page in a VMRegion. It is equivalent to char *, so you can add an offset to it to get the address of a value in the middle of a page. You can also add multiples of the page size to the value returned by get_base to produce VPages for other pages in the virtual region.
static void map(VPage va, PPage pa, Prot prot)
Sets the mapping for a particular VPage inside a VMRegion, so that accesses to that page will be directed to pa (PPages are obtained using the PhysMem class discussed below). If a different page was previously mapped at VPage, the old mapping is removed. The prot argument specifies what sort of accesses are allowed; it should be either PROT_NONE to prohibit both loads and stores, PROT_READ to allow loads but not stores, or PROT_READ|PROT_WRITE (bitwise OR of two values) to allow both loads and stores. This function’s behavior is equivalent to setting the contents of a page table entry.
static void unmap(VPage va)}
Removes the mapping for va, if there is one; future references to the page will cause page faults. This function’s behavior is roughly equivalent to clearing the present but in a page table entry.

In addition to these methods, VMRegion also exports a variable page_size, which contains the number of bytes in each page on the current machine. Page sizes are 4096 bytes on the myth cluster as well as Windows or MacBook laptops, but you should use the page_size variable to ensure portability; you can assume page_size always be an even power of two.

PhysMem and PPages

The PhysMem class provides a mechanism for allocating and freeing pages of physical memory. It implements PPage objects. A PPage is a token for a physical page, which you can pass to VMRegion::map. In addition, you can use a PPage to access the bytes of the page: a PPage is actually a char * pointer referring to the first byte of the page, and you can add offsets to this to access other values in the page. Unlike VPages, a PPage is always accessible and writable; references to it will never generate page faults. PPages are useful because they allow your MCryptFile implementation to access physical pages that may not currently be mapped, such as when transferring page contents to or from encrypted files. PPages are mapped into virtual memory by the PhysMem class, at virtual addresses different from those in VMRegions. This is an example of aliasing, where the same physical page can appear at multiple virtual addresses.

The PhysMem class is defined in the header file vm.hh and has the following methods:

PhysMem(std::size_t npages)
Allocates npages physical memory pages, each of which may be mapped into any VMRegion.
PPage page_alloc()
Allocates a page and returns its PPage, or nullptr if there are no free pages.
void page_free(PPage p)
Returns p to the free page pool for this PhysMem. The caller must ensure that this page is not mapped in any VMRegion.
std::size_t npages()
Returns the total number of pages in this object (free or allocated).
std::size_t nfree()
Returns the number of pages that are not currently allocated.
PPage pool_base()
Returns the address of the first page in the pool (the pages in the pool occupy a range of contiguous PPage addresses).

Intrusive Containers

An intrusive container is a data structure such as a list or a map that contains data structures that were specifically designed to be included in the contained objects — the container “intrudes” into the design of the element type. By contrast, a non-intrusive container can contain any type meeting some basic requirements (such as being able to be moved). Because non-intrusive containers are more general, the C++ standard template library (STL) provides only non-intrusive containers. For example, you can create a std::list<int>, even though integers were not designed to be on a list and don’t contain a “next pointer.”

There are several reasons that operating systems often make use of intrusive data structures. One is that intrusive data structures require no memory allocation: all the links to insert an object in a list or map are already members of the object. By contrast, when inserting an element in a std::list<int>, the library must dynamically allocate memory for an object that contains the int as well as some next and previous pointers. Because dynamic memory allocation can fail, there are places in the operating system that cannot or should not be dynamically allocating memory, and hence cannot manipulate non-intrusive containers.

A second benefit of intrusive data structures is that the items are the iterators; this simplifies some operations. For example, suppose an object has been inserted in a list and you wish to remove the object from that list. If the list is nonintrusive and all you have is a pointer to the object, you must first get an iterator to the object; this will require searching list, which is expensive. With an intrusive list, the object is the iterator, so unlinking it can be done without searching the list.

We would like you to use intrusive containers in this project, and we have provided you an intrusive list in ilist.hh and an intrusive map in itree.hh. To use these containers, your structures must include a field of type ilist_entry or itree_entry, respectively. The type of the container specifies a member pointer to this field, rather than just the type of the structure. For example:

struct MyStruct {
    int val_;
    ilist_entry list_link_;
    itree_entry itree_link_;
};
ilist<&MyStruct::list_link_> my_list;
itree<&MyStruct::val_, &MyStruct::tree_link_> my_map;

Since the items are the iterators, most methods return pointers rather than references, where nullptr indicates a missing item (e.g., the end of a list). When you delete an object, the destructors of any “entry” fields automatically remove the object from any containers it might be in. (The flip side of this is that it is an error to delete a container that still contains elements.)

To make your life easier, we designed ilist and itree to resemble STL containers. In particular, ilist has the following methods, which behave just like the corresponding methods in std::list: back, begin, delete_all, empty, end, front, insert, pop_back, pop_front, push_back, push_front, remove, and remove_all. itree objects have the following methods, which behave just like the corresponding methods in std::map: begin, delete_all, empty, end, find, insert, lower_bound, operator[], pop_back, pop_front, push_back, push_front, remove, remove_all, and upper_bound.

There are also a few differences. In particular, since the items are the iterators, there is no real end() iterator. There’s a method end() that returns nullptr, just to make range-for syntax work, but you can’t walk an end() iterator backwards to find the last item. Use back() to get the last item in an ilist and max() to get the greatest element in an itree. itree also provides min() to get the smallest element.

Error handling

In general you should use exceptions to handle errors. It makes code more readable by segregating the error handling. More importantly, it reduces the number of occasions for bugs where you forget to check for an error return value. However, make sure that you don’t leak resources when an exception may be thrown. A good way to avoid leaking resources is to rely exclusively on destructors for reclaiming resources. For example, in the implementation of CryptFiloe we have a unique_fd that automatically closes a file descriptor on destruction.

The one place where you must not throw exceptions is inside your page fault handler, unless you also catch the exceptions in the handler. If an exception escapes past your page fault handler, VMRegion will crash the program.

Developing and Testing

To get started on this project, login to the myth cluster and clone the starter repo with this command:

git clone https://web.stanford.edu/class/cs111/starters/cmap.git cs111_p5

This will create a new directory cs111_p5. Do your work for the project in this directory. The files mcryptfile.hh and mcryptfile.cc contain a skeleton for all of the code you need to write. Add to the declarations in mcryptfile.h and fill in the bodies of the methods in mcryptfile.cc to implement the facilities described above. You can also additional methods or classes if needed to implement your solution.

The directory also contains a Makefile; if you type make, it will compile your code along with a test program, creating an executable test. You can then invoke ./run_tests map_tests, which will run a simple set of tests on your code. You can also invoke test with an argument specifying a particular test name; this will run a single test and print out its results. Invoke test with a bogus test name to print out all of the available test names.

As usual, we do not guarantee that the tests we have provided are exhaustive, so passing all of the tests is not necessarily sufficient to ensure a perfect score (CAs may discover other problems in reading through your code).

Miscellaneous Notes

For this project you may assume that there are enough physical pages to accommodate all of the pages in all of the open MCryptFiles: you need not worry about page replacement. You can abort the program if PhysMem::page_alloc returns nullptr.
Use an itree for each MCryptFile to keep track of the VPages in the region for that file. There should only be entries in the itree for VPages that have associated PPages.
Create a single PhysMem object and share it across all MCryptFiles. Don’t create this object until the first time an MCryptFile is mapped: MCryptFile::set_memory_size may be invoked before then to specify how large the pool of physical memory should be. If MCryptFile::set_memory_size has not been invoked, use 1000 pages in the PhysMem object.
Load pages into memory on demand, so that no memory is wasted on pages that are never accessed.
You must ensure that modified pages eventually get written back to the file. To do this, you will need to keep track of which pages are dirty (clean pages should not be written back). Paging hardware usually provides a “dirty” bit in page table entries, which is set by the hardware when a page is written. Unfortunately, this information is not passed through by the mmap mechanism we are using for this project, so you will have to detect when pages are written. You can use page protections for this: if a page’s protection is set to PROT_READ, then a page fault will occur the first time the page is written (and a page fault will not occur unless the page is written).
You may assume that your code is used only in single-threaded environments; you do not need to worry about synchronization for this project.
If you use gdb to debug your project, you will notice that by default it catches the SIGSEGV signals used to signal page faults. As a result, the signals will not get through to your code when you run under the debugger. However, you can disable this behavior with the following gdb command:
```
handle SIGSEGV noprint nostop pass
```
The arguments indicate that, when SIGSEGV signals occur, they should be passed to the application; gdb will not stop the application or print any indication that the signal occurred.

Submitting Your Work

To submit your solution, cd to the directory containing your work and invoke the command

./submit

If you discover problems or improvements after you have submitted, you may resubmit; only the latest submit will be graded.