Project 5: Memory-Mapped Encrypted Files
In this project you will implement memory-mapped access to data that is stored in encrypted files. Normally, code like this would be written inside the operating system kernel; for this project, we have used the Linux mmap
facility to simulate features that are typically available in the kernel, so you can write your code at user level. The goal of this project is to teach you about the following concepts:
- How demand paging works
- How page protection can be used to emulate hardware features such as dirty bits
- How to use intrusive containers, in contrast to the non-intrusive containers found in the C++ library.
Project Overview
You will implement a class MCryptFile
that provides the following methods:
MCryptFile(Key key, std::string path)
Constructs anMCryptFile
object that can be used to access data in the file given bypath
. The file is encrypted. When data is read from the file into memory, it will be decrypted usingkey
; when data is written from memory back to the file, it will be encrypted usingkey
.Key
objects can be constructed from strings. If the file doesn’t currently exist, a new file will be created. Throwsstd::system_error
if the file cannot be opened or created. Most of the functionality of this constructor is actually implemented in theCryptFile
superclass constructor (see below).char *map(std::size_t min_size = 0)
Maps the associated file into a contiguous region of virtual memory and returns the virtual address of the first byte of that region. Bytei
of the file will henceforth be directly accessible at offseti
into the region (in plaintext). Themin_size
argument can be used to create a new file or grow an existing file. The actual size of the virtual region will be the larger ofmin_size
and the original file length; if bytes are written in the region beyond the original file length, the file will be extended whenflush
is called. If you want to grow a file beyond the current size of the region, invokeunmap
to unmap the region, then invokemap
to map it again with a larger size. Note: file sizes are always rounded up to full-page boundaries.void flush()
Encrypt all pages that have been modified since the last call toflush
and write them back to the associated file. All pages currently in memory should remain in memory. This operation may grow the size of the encrypted file in the file system.void unmap()
Sync any dirty pages and remove the mapping created bymap
. After this method returns, the caller must no longer use any references into the previously mapped region.char *map_base()
Returns the address of the first byte of the memory mapped file, or NULL if the associated file is not currently mapped.std::size_t map_size()
Returns the current size of the mapped region, or 0 if the associated file is not currently mapped.static void set_memory_size(std::size_t npages)
Invoked to specify how many pages should be in the pool of physical memory that is shared by allMCryptFile
objects. This method should only be invoked before the firstMCryptFile
is created; it will have no effect after that.
Supporting Code
We have written several classes for you to use in implementing MCryptFile:
CryptFile
The CryptFile
class provides basic mechanisms for reading and writing encrypted files (but not for memory-mapping them). Your MCryptFile
class should be a subclass of CryptFile
. The CryptFile
class has the following methods:
CryptFile(Key key, std::string path)
The API for this method is identical to that for theMCryptFile
constructor (see above).int aligned_pread(void *dst, std::size_t len, std::size_t offset)
Reads and decrypts
len
bytes of data at positionoffset
in the file. Bothlen
andoffset
must be multiples of the AES encryption algorithm’s block size (16), which is accessible viaCrypteFile::blocksize
. The block size will not be an issue for this project, because you will only be reading and writing full pages aligned on page-size boundaries. Returns the number of bytes read or -1 on error.int aligned_pwrite(const void *src, std::size_t len, std::size_t offset)
Encrypts
len
bytes starting atsrc
and writes them to the associated file at positionoffset
. Returnslen
on success and -1 on error. Bothlen
andoffset
must be multiples ofCryptFile::blocksize
.
VMRegion
The VMRegion
class provides basic mechanisms for mapping pages into a region of virtual memory, taking page faults, and managing permissions. A VMRegion
corresponds roughly to a contiguous range of page table entries for one process in an operating system. You will create one VMRegion
object for each mapped MCryptFile. The VMRegion
class is defined in the header file vm.hh
and has the following methods:
VMRegion(std::size_t nbytes, std::function<void(char *)> handler)
Constructs aVMRegion
object and allocates a region of virtual memory in the process that is currently unused. The size of the region will benbytes
; ifnbytes
isn’t a multiple of the page size, theVMRegion
will behave as ifnbytes
were rounded up to the next higher multiple of the page size. Thehandler
parameter specifies a function to invoke for page faults in the region. Page faults occur whenever an unmapped page is referenced or an attempt is made to write a page that is currently read-only.handler
is invoked with a single argument giving the address of the address that triggered the page fault.VPage get_base()
Returns the address of the first page in the virtual region.VPage
is a type that refers to the first byte of a virtual page in aVMRegion
. It is equivalent tochar *
, so you can add an offset to it to get the address of a value in the middle of a page. You can also add multiples of the page size to the value returned byget_base
to produceVPage
s for other pages in the virtual region.static void map(VPage va, PPage pa, Prot prot)
Sets the mapping for a particular VPage inside a VMRegion, so that accesses to that page will be directed topa
(PPage
s are obtained using thePhysMem
class discussed below). If a different page was previously mapped atVPage
, the old mapping is removed. Theprot
argument specifies what sort of accesses are allowed; it should be eitherPROT_NONE
to prohibit both loads and stores,PROT_READ
to allow loads but not stores, orPROT_READ|PROT_WRITE
(bitwise OR of two values) to allow both loads and stores. This function’s behavior is equivalent to setting the contents of a page table entry.static void unmap(VPage va)}
Removes the mapping forva
, if there is one; future references to the page will cause page faults. This function’s behavior is roughly equivalent to clearing thepresent
but in a page table entry.
In addition to these methods, VMRegion
also exports a variable page_size
, which contains the number of bytes in each page on the current machine. Page sizes are 4096 bytes on the myth
cluster as well as Windows or MacBook laptops, but you should use the page_size
variable to ensure portability; you can assume page_size
always be an even power of two.
PhysMem and PPages
The PhysMem
class provides a mechanism for allocating and freeing pages of physical memory. It implements PPage
objects. A PPage
is a token for a physical page, which you can pass to VMRegion::map
. In addition, you can use a PPage
to access the bytes of the page: a PPage
is actually a char *
pointer referring to the first byte of the page, and you can add offsets to this to access other values in the page. Unlike VPage
s, a PPage
is always accessible and writable; references to it will never generate page faults. PPage
s are useful because they allow your MCryptFile
implementation to access physical pages that may not currently be mapped, such as when transferring page contents to or from encrypted files. PPage
s are mapped into virtual memory by the PhysMem
class, at virtual addresses different from those in VMRegion
s. This is an example of aliasing, where the same physical page can appear at multiple virtual addresses.
The PhysMem
class is defined in the header file vm.hh
and has the following methods:
PhysMem(std::size_t npages)
Allocatesnpages
physical memory pages, each of which may be mapped into anyVMRegion
.PPage page_alloc()
Allocates a page and returns itsPPage
, ornullptr
if there are no free pages.void page_free(PPage p)
Returnsp
to the free page pool for thisPhysMem
. The caller must ensure that this page is not mapped in anyVMRegion
.std::size_t npages()
Returns the total number of pages in this object (free or allocated).std::size_t nfree()
Returns the number of pages that are not currently allocated.PPage pool_base()
Returns the address of the first page in the pool (the pages in the pool occupy a range of contiguous PPage addresses).
Intrusive Containers
An intrusive container is a data structure such as a list or a map that contains data structures that were specifically designed to be included in the contained objects — the container “intrudes” into the design of the element type. By contrast, a non-intrusive container can contain any type meeting some basic requirements (such as being able to be moved). Because non-intrusive containers are more general, the C++ standard template library (STL) provides only non-intrusive containers. For example, you can create a std::list<int>
, even though integers were not designed to be on a list and don’t contain a “next pointer.”
There are several reasons that operating systems often make use of intrusive data structures. One is that intrusive data structures require no memory allocation: all the links to insert an object in a list or map are already members of the object. By contrast, when inserting an element in a std::list<int>
, the library must dynamically allocate memory for an object that contains the int
as well as some next and previous pointers. Because dynamic memory allocation can fail, there are places in the operating system that cannot or should not be dynamically allocating memory, and hence cannot manipulate non-intrusive containers.
A second benefit of intrusive data structures is that the items are the iterators; this simplifies some operations. For example, suppose an object has been inserted in a list and you wish to remove the object from that list. If the list is nonintrusive and all you have is a pointer to the object, you must first get an iterator to the object; this will require searching list, which is expensive. With an intrusive list, the object is the iterator, so unlinking it can be done without searching the list.
We would like you to use intrusive containers in this project, and we have provided you an intrusive list in ilist.hh
and an intrusive map in itree.hh
. To use these containers, your structures must include a field of type ilist_entry
or itree_entry
, respectively. The type of the container specifies a member pointer to this field, rather than just the type of the structure. For example:
struct MyStruct {
int val_;
list_link_;
ilist_entry itree_link_;
itree_entry };
<&MyStruct::list_link_> my_list;
ilist<&MyStruct::val_, &MyStruct::tree_link_> my_map; itree
Since the items are the iterators, most methods return pointers rather than references, where nullptr
indicates a missing item (e.g., the end of a list). When you delete an object, the destructors of any “entry” fields automatically remove the object from any containers it might be in. (The flip side of this is that it is an error to delete a container that still contains elements.)
To make your life easier, we designed ilist
and itree
to resemble STL containers. In particular, ilist
has the following methods, which behave just like the corresponding methods in std::list
: back
, begin
, delete_all
, empty
, end
, front
, insert
, pop_back
, pop_front
, push_back
, push_front
, remove
, and remove_all
. itree
objects have the following methods, which behave just like the corresponding methods in std::map
: begin
, delete_all
, empty
, end
, find
, insert
, lower_bound
, operator[]
, pop_back
, pop_front
, push_back
, push_front
, remove
, remove_all
, and upper_bound
.
There are also a few differences. In particular, since the items are the iterators, there is no real end()
iterator. There’s a method end()
that returns nullptr
, just to make range-for syntax work, but you can’t walk an end()
iterator backwards to find the last item. Use back()
to get the last item in an ilist
and max()
to get the greatest element in an itree
. itree
also provides min()
to get the smallest element.
Error handling
In general you should use exceptions to handle errors. It makes code more readable by segregating the error handling. More importantly, it reduces the number of occasions for bugs where you forget to check for an error return value. However, make sure that you don’t leak resources when an exception may be thrown. A good way to avoid leaking resources is to rely exclusively on destructors for reclaiming resources. For example, in the implementation of CryptFiloe
we have a unique_fd
that automatically closes a file descriptor on destruction.
The one place where you must not throw exceptions is inside your page fault handler, unless you also catch the exceptions in the handler. If an exception escapes past your page fault handler, VMRegion
will crash the program.
Developing and Testing
To get started on this project, login to the myth cluster and clone the starter repo with this command:
git clone https://web.stanford.edu/class/cs111/starters/cmap.git cs111_p5
This will create a new directory cs111_p5
. Do your work for the project in this directory. The files mcryptfile.hh
and mcryptfile.cc
contain a skeleton for all of the code you need to write. Add to the declarations in mcryptfile.h
and fill in the bodies of the methods in mcryptfile.cc
to implement the facilities described above. You can also additional methods or classes if needed to implement your solution.
The directory also contains a Makefile
; if you type make
, it will compile your code along with a test program, creating an executable test
. You can then invoke ./run_tests map_tests
, which will run a simple set of tests on your code. You can also invoke test
with an argument specifying a particular test name; this will run a single test and print out its results. Invoke test
with a bogus test name to print out all of the available test names.
As usual, we do not guarantee that the tests we have provided are exhaustive, so passing all of the tests is not necessarily sufficient to ensure a perfect score (CAs may discover other problems in reading through your code).
Miscellaneous Notes
For this project you may assume that there are enough physical pages to accommodate all of the pages in all of the open
MCryptFiles
: you need not worry about page replacement. You can abort the program ifPhysMem::page_alloc
returnsnullptr
.Use an
itree
for eachMCryptFile
to keep track of theVPage
s in the region for that file. There should only be entries in theitree
forVPage
s that have associatedPPage
s.Create a single
PhysMem
object and share it across allMCryptFiles
. Don’t create this object until the first time anMCryptFile
is mapped:MCryptFile::set_memory_size
may be invoked before then to specify how large the pool of physical memory should be. IfMCryptFile::set_memory_size
has not been invoked, use 1000 pages in thePhysMem
object.Load pages into memory on demand, so that no memory is wasted on pages that are never accessed.
You must ensure that modified pages eventually get written back to the file. To do this, you will need to keep track of which pages are dirty (clean pages should not be written back). Paging hardware usually provides a “dirty” bit in page table entries, which is set by the hardware when a page is written. Unfortunately, this information is not passed through by the
mmap
mechanism we are using for this project, so you will have to detect when pages are written. You can use page protections for this: if a page’s protection is set toPROT_READ
, then a page fault will occur the first time the page is written (and a page fault will not occur unless the page is written).You may assume that your code is used only in single-threaded environments; you do not need to worry about synchronization for this project.
If you use
gdb
to debug your project, you will notice that by default it catches the SIGSEGV signals used to signal page faults. As a result, the signals will not get through to your code when you run under the debugger. However, you can disable this behavior with the followinggdb
command:handle SIGSEGV noprint nostop pass
The arguments indicate that, when SIGSEGV signals occur, they should be passed to the application;
gdb
will not stop the application or print any indication that the signal occurred.
Submitting Your Work
To submit your solution, cd
to the directory containing your work and invoke the command
./submit
If you discover problems or improvements after you have submitted, you may resubmit; only the latest submit will be graded.