NFS
===

We are going from threads to distributed systems and back again
    Both topics are squarely in scope at OS conferences
    Similar kinds of reasoning apply to parallelism & distributed systems
        E.g., already saw happens-before applied to race detection
    Different "gotchas" in each case
        Parallel systems: weak memory consistency can be very unintuitive
        Distributed systems: node & network failures complicate reasoning
    If you like distributed systems, consider CS244b next Fall...

How did Unix systems access network storage before NFS?
    ND - network disk
        Looks like a disk device on the client (e.g., /dev/nd0)
        I/O requests go over the net to a server, which has a file storing the disk image
        Note: the concept still exists in the form of NBD under Linux

What were the goals of NFS? (p. 119)
    1. Machine and OS independence
        Required adding a new system call, getdirentries (p. 124).  Why?
            Previously, readdir used read(2) to look at raw directory contents
            getdirentries abstracts directory formats across machines
    2. Crash recovery
        When is a local file system allowed to lose data in a crash?
            Data written recently (<30 sec ago), with no call to fsync
            Okay because processes writing files are killed in the crash, too
        When should NFS be allowed to lose data?
            Client crash?  Same as a local file system
            Server crash?  Never, because processes on the client are not killed
    3. Transparent access
        "Programs should not be able to tell whether a file is remote or local."
        What would be the alternative?
            Hack libc so open("/nfs/server/file") doesn't do a real open syscall
            Would break applications--e.g., inheriting file descriptors
    4. UNIX semantics maintained on client
        Some examples of tricky UNIX semantics?
            Most permission checks happen when a file is initially opened
            Can delete an open file and still use it
    5. Reasonable performance
        No worse than ND, or 80% as good as a local disk

Which of the above goals are met by ND?  All but #1
    Why isn't ND good enough?
        NFS also wants *sharing* of file system resources
        Can't share a read-write file system stored on ND
            Ordinary file systems don't expect the disk to change out from under them
            Might cross-allocate blocks/inodes, create duplicate file names, etc.

What does NFS look like to the administrator?
    Unix namespace is composed of *mount points*
        A root file system is mounted on "/" (the root directory) at boot time
        Other file systems are mounted on other directories
            Created with the mount command:  mount /dev/sda3 /usr
    Server admin can *export* some file systems over NFS in /etc/exports
    Client admin can mount a remote FS "server:/dir" instead of a device
        E.g.:  mount my-server:/disk/u1 /home/u1
        Now /home/u1 on the client is the same FS as /disk/u1 on the server!

Two kinds of NFS mount, hard and soft (p. 124)--what's the difference?
    A hard mount hangs forever if the server crashes
    With a soft mount, syscalls eventually return an error if there is no server reply
    Note: more recently NFS also offers intr vs. nointr
        Kernel assumed disk requests were fast, so FS syscalls were not interruptible
        So you couldn't Ctrl-C a process that accidentally accessed an unavailable server
        Took many years to fix this by adding the intr option

Until NFS, there was only one kind of file system in the kernel.  How to abstract?
    Two new abstractions: VFS and vnodes (p. 123)
        Ersatz C++ abstract classes, implemented with C function pointers
    Have one VFS (virtual file system) struct per mount point
        unmount(), root(), statfs(), sync()
    Have one vnode for each active inode
        lookup, open, close, create, rdwr, inactive, ...
        Also contains a pointer to its VFS struct
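    A minimal sketch of the "abstract class via C function pointers" idea; the struct
    layout and member names below are illustrative assumptions, not the actual Sun
    vnode interface:

        /* Sketch only: names are hypothetical, not the real Sun interface. */
        #include <stddef.h>
        #include <sys/types.h>

        struct vnode;
        struct vfs;

        struct vnodeops {               /* "vtable" of C function pointers */
            int  (*lookup)(struct vnode *dir, const char *name, struct vnode **out);
            int  (*rdwr)(struct vnode *vp, void *buf, size_t len, off_t off, int iswrite);
            int  (*create)(struct vnode *dir, const char *name, int mode,
                           struct vnode **out);
            void (*inactive)(struct vnode *vp);
            /* ... open, close, remove, etc. ... */
        };

        struct vnode {
            struct vnodeops *v_ops;     /* filled in by UFS, NFS, ... */
            struct vfs      *v_vfsp;    /* the mount point this vnode belongs to */
            void            *v_data;    /* FS-private data: inode, NFS file handle, ... */
        };

        /* Generic code dispatches through the pointers, so namei etc. work
           identically whether the underlying file system is local or remote. */
        static int vn_lookup(struct vnode *dir, const char *name, struct vnode **out)
        {
            return dir->v_ops->lookup(dir, name, out);
        }

    The point of the indirection: a local FS fills v_ops with functions that touch
    the disk, while the NFS client fills it with functions that issue RPCs.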
Rewrote namei (p. 124) [routine mapping (dir vnode, path) -> vnode].  Why?
    Only the client knows which directories are mount points
    Hence, cannot have the server translate more than one directory at a time
    So namei must walk the file system one vnode lookup() at a time

VFS+vnode routines for NFS make RPCs to the server--see pp. 120-121
    What happens if I say "cat dir/file" on an NFS file system?
        lookup (fh1, "dir") -> fh2
        lookup (fh2, "file") -> fh3
        read (fh3, [offset] 0, [len] 8192)
        read (fh3, [offset] 8192, [len] 8192)
        ...
    Now what if I "cd dir; cat file" while the contents are cached?
        (fh2, "file") -> fh3 might already be in the name cache
        stat(fh3) -> attr
        Compare mtime/ctime in attr to the cached version.  Only re-read if no match.

p. 120: "NFS uses a stateless protocol"--what does this mean?
    Obviously a file system cannot be stateless
    But some protocols (e.g., CIFS, 9P) keep client state on the server
        E.g., each file open and close is sent to the server
        Server keeps the equivalent of a file descriptor table for each client
        Remote descriptors may be invalidated by a server reboot or network outage
            May be hard for the client to rebuild state if files were renamed
        Or... the client crashes and the server is stuck keeping useless descriptors
    NFS requests are self-contained; no per-client context on the server

What happens if the server crashes?
    Client keeps retransmitting requests until the server answers
    Once the server reboots, it will eventually get the client's request
    Because "stateless", the server can execute the request with no prior context
What happens if a client->server request is dropped by the network?
    Client keeps retransmitting (over UDP), so no issue
What happens if a server->client reply is dropped by the network?
    Client keeps retransmitting (over UDP), sends a second copy of the request
    What does the server do upon receiving a second copy of a request?
        Each RPC starts with a 32-bit XID
        Server keeps a map of XID -> reply message in a *replay cache*
        Reply to duplicate requests from the replay cache, rather than reexecuting
            (see the server-side sketch below)
        But the cache is state... so the protocol is not stateless after all!
What happens if a server->client reply is dropped and the server reboots?
    Reboot wipes the server's replay cache, so the server reexecutes the request
    Saving grace: the protocol makes operations as *idempotent* as possible
        Means executing twice has the same effect as executing once
        Example: "x = 1" is idempotent; "++x" is not
    Of the calls on pp. 120-121, which are *not* idempotent?
        Not idempotent: remove, rename, link, mkdir, rmdir
        create is idempotent (makes exclusive create unreliable)
        Spec suggests symlink is idempotent (maybe no error if content matches)

What if I say "echo message > file" on an NFS file system?
    create (fh1, "file") -> fh2
    write (fh2, 0 [offset], "message") -> 2
    What if the server crashes during the write?
        Okay, the client will keep retransmitting until it gets a reply
    What if the server crashes after replying to the write?
        Client will stop retransmitting (since it got the write reply)
        Means the server cannot reply to a write until the data is safely on disk
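    A rough sketch tying together the two server-side rules above: answer duplicate
    XIDs from the replay cache instead of reexecuting, and don't reply to a write
    until the data is on disk.  Everything here (cache size, struct layout, names)
    is a made-up illustration, not the actual Sun server code:

        #include <stdint.h>
        #include <string.h>

        #define CACHE_SLOTS 64          /* toy size; real servers bound and expire entries */

        struct cached_reply {
            uint32_t xid;               /* 32-bit RPC transaction ID */
            int      valid;
            char     reply[128];        /* saved (encoded) reply message */
        };

        static struct cached_reply replay_cache[CACHE_SLOTS];

        /* Return the saved reply for a retransmitted request, or NULL if it's new. */
        static const char *replay_lookup(uint32_t xid)
        {
            struct cached_reply *s = &replay_cache[xid % CACHE_SLOTS];
            return (s->valid && s->xid == xid) ? s->reply : NULL;
        }

        static void replay_record(uint32_t xid, const char *reply)
        {
            struct cached_reply *s = &replay_cache[xid % CACHE_SLOTS];
            s->xid = xid;
            s->valid = 1;
            strncpy(s->reply, reply, sizeof s->reply - 1);
            s->reply[sizeof s->reply - 1] = '\0';
        }

        /* Request-handling outline for a WRITE:
             if ((saved = replay_lookup(xid)) != NULL)
                 resend saved reply;          // duplicate: do NOT reexecute
             else {
                 do the write;
                 fsync the file;              // data must be stable on disk...
                 replay_record(xid, reply);
                 send reply;                  // ...before the client sees the reply
             }
           A reboot wipes replay_cache, which is why the operations themselves
           need to be as idempotent as possible.
        */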
Why did the authors need to work on the IP fragmentation code?
    Maximum Ethernet packet size is 1500 bytes [no jumbo frames in 1984]
    Each request/response constitutes one UDP packet
    What if you break large sequential writes into ~1400-byte requests?
        Smaller than the FS block size, so may have to read the surrounding block
        Will likely lock the buffer when the first write request comes in
        Because writes are synchronous, have to send each one to disk
        Almost certainly pay a full disk rotation before servicing the next write
    Solution: up to 9000-byte UDP packets (8192 bytes of data plus some header)
        Must be broken into multiple IP fragments to send over Ethernet

Even back-to-back 8KiB synchronous writes are likely to be slow
    For good performance, want more concurrent write requests at the server
    What do they do?  Added block I/O daemons on the client (p. 125)
        Not really good support for "asynchronous RPC" in the kernel
        Instead fork off 4-16 block I/O daemons (biod)
        On a write system call:
            If a biod is available, hand it the request
            Return from the syscall immediately
            The biod will keep retransmitting until it gets a reply
        On fsync (and later close):
            Block waiting for all write requests to go through
    Server hack: the nfsd daemon makes a system call that never returns--why?
        Want to handle multiple incoming requests concurrently
        Again, async I/O is not so easy, and want to integrate with the scheduler
            Easiest way to integrate with the scheduler is to be a process
        But NFS is implemented in the kernel, so just do it all in one big syscall

What happens when you run "mount server:/dir /mnt"?
    User-level mount program sends MOUNTPROC_MNT("/dir") to the server
    User-level mountd program on the server returns a 32-byte NFS file handle
    mount program makes a mount(2) syscall with the server IP, fhandle, and "/mnt"
    NFS mount handler allocates a VFS and hangs it on /mnt's vnode (in the root FS)

What's in an NFS file handle?
    Whatever the server wants--it's opaque to the client
    Must uniquely identify the file, so:
        - File system ID (new field in the superblock)
        - File system major/minor device number (usually)
        - Inode number
        - Generation number (new field in the inode)
    What's the point of a generation number?
        Generation number changes each time an inode is recycled
        Say the server deletes a file while a client has it open
            If the client uses the old handle, the generation number is wrong, gets ESTALE
            Not great, but better than reading/writing an unrelated file!
        Also makes file handles hard to guess (access control at mount time)

What is the security model?  Who enforces what permissions?
    Assume numeric user/group IDs are the same on client and server
    On the server, mountd restricts which clients may mount (/etc/exports)
        If a client sends a valid file handle, the NFS server assumes it is authorized
    Client tags each RPC with its local user/group IDs
        Server enforces access control based on the claimed credentials
        But maps root to -2
    Does this pose problems for UNIX semantics (goal #4)?
        UNIX checks permissions only at file open.  How does NFS handle:
        - fd = creat("lockfile", 0444); write(fd, ...); close(fd);
            Server allows the write if the requester owns the file, even with 0444 perms
        - fd = open(...); setuid(pw->pw_uid) /* drop privs */; read(fd, ...);
            Client sends the credentials from when the file was originally opened (p. 127)

How is the evaluation?

Close-to-open consistency
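    "Close-to-open consistency" is the usual NFS client caching guarantee: flush a
    file's writes to the server when it is closed, and revalidate cached data
    against the server's attributes when it is next opened, so an open that follows
    another client's close sees that client's writes.  A minimal sketch of the idea,
    using the mtime comparison mentioned earlier; all names below are hypothetical,
    not the real client code:

        #include <time.h>

        /* Hypothetical per-file client cache state. */
        struct nfs_node {
            struct timespec cached_mtime;   /* server mtime when we last cached data */
            int             have_cache;     /* do we hold cached blocks for this file? */
        };

        /* Assumed helpers standing in for the real RPC and buffer-cache machinery. */
        struct timespec nfs_getattr_mtime(struct nfs_node *n);  /* GETATTR RPC */
        void nfs_flush_dirty(struct nfs_node *n);   /* push writes, wait for replies */
        void nfs_invalidate(struct nfs_node *n);    /* drop cached blocks */

        /* On open: if the file changed on the server, discard our stale cache. */
        void nfs_open_revalidate(struct nfs_node *n)
        {
            struct timespec now = nfs_getattr_mtime(n);
            if (n->have_cache &&
                (now.tv_sec != n->cached_mtime.tv_sec ||
                 now.tv_nsec != n->cached_mtime.tv_nsec))
                nfs_invalidate(n);
            n->cached_mtime = now;
            n->have_cache = 1;
        }

        /* On close: make our writes visible before any subsequent open elsewhere. */
        void nfs_close_flush(struct nfs_node *n)
        {
            nfs_flush_dirty(n);
        }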