Project 5: Memory-Mapped Encrypted Files
In this project you will implement memory-mapped access to data that is stored in encrypted files. Normally, code like this would be written inside the operating system kernel; for this project, we have used the Linux mmap facility to simulate features that are typically available in the kernel, so you can write your code at user level. The goal of this project is to teach you about the following concepts:
- How demand paging works
- How page protection can be used to emulate hardware features such as dirty bits
- How to use intrusive containers, in contrast to the non-intrusive containers found in the C++ library.
Project Overview
You will implement a class MCryptFile that provides the following methods:
MCryptFile(Key key, std::string path)
Constructs anMCryptFileobject that can be used to access data in the file given bypath. The file is encrypted. When data is read from the file into memory, it will be decrypted usingkey; when data is written from memory back to the file, it will be encrypted usingkey.Keyobjects can be constructed from strings. If the file doesn’t currently exist, a new file will be created. Throwsstd::system_errorif the file cannot be opened or created. Most of the functionality of this constructor is actually implemented in theCryptFilesuperclass constructor (see below).char *map(std::size_t min_size = 0)
Maps the associated file into a contiguous region of virtual memory and returns the virtual address of the first byte of that region. Byteiof the file will henceforth be directly accessible at offsetiinto the region (in plaintext). Themin_sizeargument can be used to create a new file or grow an existing file. The actual size of the virtual region will be the larger ofmin_sizeand the original file length; if bytes are written in the region beyond the original file length, the file will be extended whenflushis called. If you want to grow a file beyond the current size of the region, invokeunmapto unmap the region, then invokemapto map it again with a larger size. Note: file sizes are always rounded up to full-page boundaries.void flush()
Encrypt all pages that have been modified since the last call toflushand write them back to the associated file. All pages currently in memory should remain in memory. This operation may grow the size of the encrypted file in the file system.void unmap()
Sync any dirty pages and remove the mapping created bymap. After this method returns, the caller must no longer use any references into the previously mapped region.char *map_base()
Returns the address of the first byte of the memory mapped file, or NULL if the associated file is not currently mapped.std::size_t map_size()
Returns the current size of the mapped region, or 0 if the associated file is not currently mapped.static void set_memory_size(std::size_t npages)
Invoked to specify how many pages should be in the pool of physical memory that is shared by allMCryptFileobjects. This method should only be invoked before the firstMCryptFileis created; it will have no effect after that.
Supporting Code
We have written several classes for you to use in implementing MCryptFile:
CryptFile
The CryptFile class provides basic mechanisms for reading and writing encrypted files (but not for memory-mapping them). Your MCryptFile class should be a subclass of CryptFile. The CryptFile class has the following methods:
CryptFile(Key key, std::string path)
The API for this method is identical to that for theMCryptFileconstructor (see above).int aligned_pread(void *dst, std::size_t len, std::size_t offset)Reads and decrypts
lenbytes of data at positionoffsetin the file. Bothlenandoffsetmust be multiples of the AES encryption algorithm’s block size (16), which is accessible viaCrypteFile::blocksize. The block size will not be an issue for this project, because you will only be reading and writing full pages aligned on page-size boundaries. Returns the number of bytes read or -1 on error.int aligned_pwrite(const void *src, std::size_t len, std::size_t offset)Encrypts
lenbytes starting atsrcand writes them to the associated file at positionoffset. Returnslenon success and -1 on error. Bothlenandoffsetmust be multiples ofCryptFile::blocksize.
VMRegion
The VMRegion class provides basic mechanisms for mapping pages into a region of virtual memory, taking page faults, and managing permissions. A VMRegion corresponds roughly to a contiguous range of page table entries for one process in an operating system. You will create one VMRegion object for each mapped MCryptFile. The VMRegion class is defined in the header file vm.hh and has the following methods:
VMRegion(std::size_t nbytes, std::function<void(char *)> handler)
Constructs aVMRegionobject and allocates a region of virtual memory in the process that is currently unused. The size of the region will benbytes; ifnbytesisn’t a multiple of the page size, theVMRegionwill behave as ifnbyteswere rounded up to the next higher multiple of the page size. Thehandlerparameter specifies a function to invoke for page faults in the region. Page faults occur whenever an unmapped page is referenced or an attempt is made to write a page that is currently read-only.handleris invoked with a single argument giving the address of the address that triggered the page fault.VPage get_base()
Returns the address of the first page in the virtual region.VPageis a type that refers to the first byte of a virtual page in aVMRegion. It is equivalent tochar *, so you can add an offset to it to get the address of a value in the middle of a page. You can also add multiples of the page size to the value returned byget_baseto produceVPages for other pages in the virtual region.static void map(VPage va, PPage pa, Prot prot)
Sets the mapping for a particular VPage inside a VMRegion, so that accesses to that page will be directed topa(PPages are obtained using thePhysMemclass discussed below). If a different page was previously mapped atVPage, the old mapping is removed. Theprotargument specifies what sort of accesses are allowed; it should be eitherPROT_NONEto prohibit both loads and stores,PROT_READto allow loads but not stores, orPROT_READ|PROT_WRITE(bitwise OR of two values) to allow both loads and stores. This function’s behavior is equivalent to setting the contents of a page table entry.static void unmap(VPage va)}
Removes the mapping forva, if there is one; future references to the page will cause page faults. This function’s behavior is roughly equivalent to clearing thepresentbut in a page table entry.
In addition to these methods, VMRegion also exports a variable page_size, which contains the number of bytes in each page on the current machine. Page sizes are 4096 bytes on the myth cluster as well as Windows or MacBook laptops, but you should use the page_size variable to ensure portability; you can assume page_size always be an even power of two.
PhysMem and PPages
The PhysMem class provides a mechanism for allocating and freeing pages of physical memory. It implements PPage objects. A PPage is a token for a physical page, which you can pass to VMRegion::map. In addition, you can use a PPage to access the bytes of the page: a PPage is actually a char * pointer referring to the first byte of the page, and you can add offsets to this to access other values in the page. Unlike VPages, a PPage is always accessible and writable; references to it will never generate page faults. PPages are useful because they allow your MCryptFile implementation to access physical pages that may not currently be mapped, such as when transferring page contents to or from encrypted files. PPages are mapped into virtual memory by the PhysMem class, at virtual addresses different from those in VMRegions. This is an example of aliasing, where the same physical page can appear at multiple virtual addresses.
The PhysMem class is defined in the header file vm.hh and has the following methods:
PhysMem(std::size_t npages)
Allocatesnpagesphysical memory pages, each of which may be mapped into anyVMRegion.PPage page_alloc()
Allocates a page and returns itsPPage, ornullptrif there are no free pages.void page_free(PPage p)
Returnspto the free page pool for thisPhysMem. The caller must ensure that this page is not mapped in anyVMRegion.std::size_t npages()
Returns the total number of pages in this object (free or allocated).std::size_t nfree()
Returns the number of pages that are not currently allocated.PPage pool_base()
Returns the address of the first page in the pool (the pages in the pool occupy a range of contiguous PPage addresses).
Intrusive Containers
An intrusive container is a data structure such as a list or a map that contains data structures that were specifically designed to be included in the contained objects — the container “intrudes” into the design of the element type. By contrast, a non-intrusive container can contain any type meeting some basic requirements (such as being able to be moved). Because non-intrusive containers are more general, the C++ standard template library (STL) provides only non-intrusive containers. For example, you can create a std::list<int>, even though integers were not designed to be on a list and don’t contain a “next pointer.”
There are several reasons that operating systems often make use of intrusive data structures. One is that intrusive data structures require no memory allocation: all the links to insert an object in a list or map are already members of the object. By contrast, when inserting an element in a std::list<int>, the library must dynamically allocate memory for an object that contains the int as well as some next and previous pointers. Because dynamic memory allocation can fail, there are places in the operating system that cannot or should not be dynamically allocating memory, and hence cannot manipulate non-intrusive containers.
A second benefit of intrusive data structures is that the items are the iterators; this simplifies some operations. For example, suppose an object has been inserted in a list and you wish to remove the object from that list. If the list is nonintrusive and all you have is a pointer to the object, you must first get an iterator to the object; this will require searching list, which is expensive. With an intrusive list, the object is the iterator, so unlinking it can be done without searching the list.
We would like you to use intrusive containers in this project, and we have provided you an intrusive list in ilist.hh and an intrusive map in itree.hh. To use these containers, your structures must include a field of type ilist_entry or itree_entry, respectively. The type of the container specifies a member pointer to this field, rather than just the type of the structure. For example:
struct MyStruct {
int val_;
ilist_entry list_link_;
itree_entry itree_link_;
};
ilist<&MyStruct::list_link_> my_list;
itree<&MyStruct::val_, &MyStruct::tree_link_> my_map;Since the items are the iterators, most methods return pointers rather than references, where nullptr indicates a missing item (e.g., the end of a list). When you delete an object, the destructors of any “entry” fields automatically remove the object from any containers it might be in. (The flip side of this is that it is an error to delete a container that still contains elements.)
To make your life easier, we designed ilist and itree to resemble STL containers. In particular, ilist has the following methods, which behave just like the corresponding methods in std::list: back, begin, delete_all, empty, end, front, insert, pop_back, pop_front, push_back, push_front, remove, and remove_all. itree objects have the following methods, which behave just like the corresponding methods in std::map: begin, delete_all, empty, end, find, insert, lower_bound, operator[], pop_back, pop_front, push_back, push_front, remove, remove_all, and upper_bound.
There are also a few differences. In particular, since the items are the iterators, there is no real end() iterator. There’s a method end() that returns nullptr, just to make range-for syntax work, but you can’t walk an end() iterator backwards to find the last item. Use back() to get the last item in an ilist and max() to get the greatest element in an itree. itree also provides min() to get the smallest element.
Error handling
In general you should use exceptions to handle errors. It makes code more readable by segregating the error handling. More importantly, it reduces the number of occasions for bugs where you forget to check for an error return value. However, make sure that you don’t leak resources when an exception may be thrown. A good way to avoid leaking resources is to rely exclusively on destructors for reclaiming resources. For example, in the implementation of CryptFiloe we have a unique_fd that automatically closes a file descriptor on destruction.
The one place where you must not throw exceptions is inside your page fault handler, unless you also catch the exceptions in the handler. If an exception escapes past your page fault handler, VMRegion will crash the program.
Developing and Testing
To get started on this project, login to the myth cluster and clone the starter repo with this command:
git clone https://web.stanford.edu/class/cs111/starters/cmap.git cs111_p5
This will create a new directory cs111_p5. Do your work for the project in this directory. The files mcryptfile.hh and mcryptfile.cc contain a skeleton for all of the code you need to write. Add to the declarations in mcryptfile.h and fill in the bodies of the methods in mcryptfile.cc to implement the facilities described above. You can also additional methods or classes if needed to implement your solution.
The directory also contains a Makefile; if you type make, it will compile your code along with a test program, creating an executable test. You can then invoke ./run_tests map_tests, which will run a simple set of tests on your code. You can also invoke test with an argument specifying a particular test name; this will run a single test and print out its results. Invoke test with a bogus test name to print out all of the available test names.
As usual, we do not guarantee that the tests we have provided are exhaustive, so passing all of the tests is not necessarily sufficient to ensure a perfect score (CAs may discover other problems in reading through your code).
Miscellaneous Notes
For this project you may assume that there are enough physical pages to accommodate all of the pages in all of the open
MCryptFiles: you need not worry about page replacement. You can abort the program ifPhysMem::page_allocreturnsnullptr.Use an
itreefor eachMCryptFileto keep track of theVPages in the region for that file. There should only be entries in theitreeforVPages that have associatedPPages.Create a single
PhysMemobject and share it across allMCryptFiles. Don’t create this object until the first time anMCryptFileis mapped:MCryptFile::set_memory_sizemay be invoked before then to specify how large the pool of physical memory should be. IfMCryptFile::set_memory_sizehas not been invoked, use 1000 pages in thePhysMemobject.Load pages into memory on demand, so that no memory is wasted on pages that are never accessed.
You must ensure that modified pages eventually get written back to the file. To do this, you will need to keep track of which pages are dirty (clean pages should not be written back). Paging hardware usually provides a “dirty” bit in page table entries, which is set by the hardware when a page is written. Unfortunately, this information is not passed through by the
mmapmechanism we are using for this project, so you will have to detect when pages are written. You can use page protections for this: if a page’s protection is set toPROT_READ, then a page fault will occur the first time the page is written (and a page fault will not occur unless the page is written).You may assume that your code is used only in single-threaded environments; you do not need to worry about synchronization for this project.
If you use
gdbto debug your project, you will notice that by default it catches the SIGSEGV signals used to signal page faults. As a result, the signals will not get through to your code when you run under the debugger. However, you can disable this behavior with the followinggdbcommand:handle SIGSEGV noprint nostop passThe arguments indicate that, when SIGSEGV signals occur, they should be passed to the application;
gdbwill not stop the application or print any indication that the signal occurred.
Submitting Your Work
To submit your solution, cd to the directory containing your work and invoke the command
./submit
If you discover problems or improvements after you have submitted, you may resubmit; only the latest submit will be graded.