Flash
=====

This paper discusses a bunch of web server architectures.
What are each of the following?
  MP
  MT (user threads? no, kernel threads)
  SPED (event-driven, like 2nd lab)

What do existing web servers use?
  Apache: MP
  Zeus: SPED, except can spawn multiple (e.g., 2) processes

New proposed architecture: AMPED

What is the AMPED idea? [draw picture w/ helpers]
What operations does the helper perform?
  disk read()
  how about open()?  stat()?

Why is cached vs disk a big issue?
  Disk accesses much, much slower
What do we need to get good performance for disk-bound workloads?
  Must process cached requests while waiting for disk
  Must send disk driver parallel requests to get good disk arm scheduling

Where might the cache be?
  Kernel disk block cache, or
  User-level cache maintained by web server[s]
  Danger: Caching in two places halves effective amount of memory
  How does Flash avoid this?
    Use *mmapped* file cache.
  What is mmap()?
    shares buffers with the kernel

How does main Flash process handle requests for URLs?
  1. First, use name translation cache (/~bob -> file)
  2. Check if response header is cached
     If not, create response header
     Make header a multiple of 32 bytes.  Why?
  3. Write header to socket asynchronously, whenever socket writable
  4. Memory map file if it's not already mapped from a previous request
  5. Start writing file contents asynchronously
     Is data in the buffer cache?  (how do you know?  mincore())
       Yes: call write (sockfd, map_addr, size)
       No: Ask helper process to bring it into memory, repeat 5

How does the helper process mechanism work?
  Main process sends requests to helper over pipe
  Helper accesses necessary page (blocks, causes page to be faulted in)
  Helper responds to main proc. over pipe.
  Now page should be resident, so access by main might cause page
    fault but no disk I/O required
How does the main process ask a helper to read data?
  pipe.  why?  select()able.
What's a reasonable number of helpers?
  one or two per disk?
  many per disk for disk-arm scheduling

How might OS make helper processes unnecessary?
  What if we had scheduler activations?
    Example of how scheduler activations good for more than just threads
    Could implement asynchronous file system requests on sched act.
    SPED architecture could be extended to deal with async FS I/O
  What if we didn't change OS interface?
    Maybe make write of mmapped file to async TCP socket return immediately?
  In fact, lately OSes have weird system calls like sendfile...

What performance do we expect?
  Disk-bound: AMPED > MT > MP/Apache >> SPED/Zeus
  Cacheable: SPED/AMPED/Zeus > MT > MP/Apache.
What if Zeus had same number of processes as Flash has helpers?
  Disk-bound: Would expect same as MP (since procs block on disk requests)
  Cacheable: Worse than regular Zeus, because cache partitioned
What about using async network I/O to access disk through NFS interface?
  Disk-bound: Would probably smoke AMPED; no need for context switches
  Cacheable: Not as good as AMPED--extra copies of data

Optimizations:
  Pathname translation caching
  Response header caching
  Mapped files
  Byte position alignment
  Memory residency testing
    Feedback-based heuristic on p.7 if no mincore?  yuck

Evaluation:

What's the test setup?
  Real server and lots of clients.
  How many clients?  Is one enough?
  Clients run fake web browsers that issue concurrent requests.

Figure 6:
  Why does b/w go up with file size?
  What's the limiting factor for small files?
    Disk?  Net?  RAM?  I/O bus?  CPU?  Client's ability to generate requests?
  What's the limiting factor for large files?
  Why does the curve have the shape it does?
    x = file size
    a = time to process zero-length request
    b = bytes-per-second limit
    y = bytes/time = x / (a + x/b)
  What are a and b?
    Figure 6(b) suggests a is about 1 millisecond.
    Figure 6(a) suggests b is about 100 mbits/second.
  What new information does Figure 6(b) contain?
    requests/time = y/x = 1 / (a + x/b); for small x this is ~1/a,
      so b mostly drops out: less information.
    Shows small-file info more clearly.
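
A worked version of the Figure 6 model may help here. This is only a
sketch: the function names are made up for illustration, and the values
of a and b are the rough estimates from the notes above (about 1 ms and
about 100 mbits/second), not measurements.

```python
def bandwidth(x_bytes, a_sec, b_bytes_per_sec):
    """y = x / (a + x/b): bytes per second delivered for files of size x."""
    return x_bytes / (a_sec + x_bytes / b_bytes_per_sec)


def request_rate(x_bytes, a_sec, b_bytes_per_sec):
    """y/x = 1 / (a + x/b): requests per second for files of size x."""
    return 1.0 / (a_sec + x_bytes / b_bytes_per_sec)


# Assumed constants, taken from the rough readings of Figure 6 in the notes.
A = 1e-3           # seconds to process a zero-length request
B = 100e6 / 8      # 100 mbits/second expressed in bytes/second

for size in (100, 1_000, 10_000, 100_000, 1_000_000):
    mbits = bandwidth(size, A, B) * 8 / 1e6
    rate = request_rate(size, A, B)
    print(f"{size:>9} bytes: {mbits:6.1f} mbits/s, {rate:7.1f} requests/s")
```

Under these assumptions the per-request cost a dominates for small
files and the byte-rate limit b dominates for large ones; the curve
reaches half of b at x = a*b = 12,500 bytes.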
Why is there no MT line in Figure 7?
Why is FreeBSD faster than Solaris?  Same hardware...
  Solaris is a commercial O/S, you'd expect it to be faster?

Why does the paper present Figures 6 and 7?
  Is the workload realistic?  no.  only one file, no disk...
What have we learned?
  Apache is slow.
What would we still like to learn about?
  Disk-bound performance.
  "Realistic" performance with typical mix of big/small, cached/disk.
  Effect of various parameters (mem size, # of processes, &c)
Why don't they show us a simple disk-only graph like Figure 6?
  Maybe the answers are too obvious?
    But it would show effect of disk scheduling; SPED doesn't allow this.
  Maybe would require huge # files to defeat caching.
    No longer a simple experiment...

Why is performance only 40 mbits in Figure 8, was ca. 100 in Figure 6?
  avg file size apparently 10 kBytes.
  or too many files to fit in cache.
  They don't tell us.
What can we conclude from Figure 8?
  Realistic traces.
  Flash is a bit faster, but not radically.
  Presumably this is a mix of cached/disk requests.
  But actual mix is not known, so we don't really know what we're testing.

How do figures 9 and 10 shed light on cached/disk performance?
  by varying data set size, control how well data fits in ~100 MB disk cache.
  How do they vary the data set size?
  How does that affect cache vs disk?
Why is there a discontinuity at around 100 mbytes in Figure 9?
  Physical memory 128 MB, so buffer cache probably tops out at ~96MB
Why is b/w around 50..100 mbits for large data set sizes?
  What is average file size?
    Compare figure 9 to 7--looks like maybe 10KB
  How many requests per second?  500 to 1000...
  Is this workload diskbound?  No.
  What would b/w be if diskbound?
    Say 10ms seek for 10KB file -> 8 mbits/second...
  Do they in fact ever evaluate disk-bound behavior?
At right of Figure 9, why is MP < SPED?
  user-level cache is small in MP

Figure 9/10, Flash vs MP.
  Why does Flash beat MP for small data set?
    (MP has partitioned cache)
  Why does Flash beat MP for large data set?
    (event-driven is more efficient)
Flash vs SPED
  Why is SPED slightly better for small data sets?
  Why does Flash beat SPED for large data set?
Flash vs MT (Figure 10)
  Flash and MT have about the same behavior for all data set sizes.  Why?
  What does this mean w.r.t. whether Flash is worthwhile?
    Cynical view: Should just use MT, not Flash.
    Practical view: Flash far easier to implement than kernel-supported threads!
      Much better use of programmer time.

Is AMPED a general-purpose programming model?
  Web servers are read-only--could this work for read-write workloads?
    mmap could backfire--every write to mmapped block would require a read
    Maybe call write on aligned blocks if mincore fails
      could block process if kernel runs out of clean buffers
  What about for CPU-bound workloads?
  Does AMPED work on SMP machines?
    wouldn't expect much speedup from SMP hardware
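
As a recap of the helper mechanism described earlier (main sends a
request over a pipe, helper touches the page, helper replies over a
pipe, main waits with select()), here is a minimal sketch. It is not
Flash's code: serve_range and helper_loop are hypothetical names, the
8-byte offset message format is invented, and because the Python
standard library has no mincore() the sketch always asks the helper
instead of testing residency first. It assumes a Unix system with
fork().

```python
import mmap
import os
import select
import struct


def helper_loop(req_r, resp_w, mapping):
    """Helper side: touch each requested page, then acknowledge it."""
    while True:
        msg = os.read(req_r, 8)
        if len(msg) < 8:              # main closed its end of the pipe
            return
        (page,) = struct.unpack("!Q", msg)
        mapping[page]                 # touching a byte blocks here if on disk
        os.write(resp_w, msg)         # page should now be resident


def serve_range(path, offset, length):
    """Main side: have a helper fault the pages in, then read the data."""
    with open(path, "rb") as f:
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    req_r, req_w = os.pipe()
    resp_r, resp_w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: the helper process
        try:
            os.close(req_w)
            os.close(resp_r)
            helper_loop(req_r, resp_w, mapping)
        finally:
            os._exit(0)
    os.close(req_r)
    os.close(resp_w)
    try:
        first = offset - offset % mmap.PAGESIZE
        for page in range(first, offset + length, mmap.PAGESIZE):
            os.write(req_w, struct.pack("!Q", page))
            # A real server would select() on client sockets here too,
            # so cached requests keep flowing while the helper blocks.
            select.select([resp_r], [], [])
            (done,) = struct.unpack("!Q", os.read(resp_r, 8))
            assert done == page
        return mapping[offset:offset + length]
    finally:
        os.close(req_w)               # EOF tells the helper to exit
        os.close(resp_r)
        os.waitpid(pid, 0)
        mapping.close()
```

The pipe matters because its read end is an ordinary descriptor, so
helper replies arrive through the same select() loop as network
events. Forking one helper per call, as done here for brevity, would
defeat the purpose in a real server; Flash keeps a pool of helpers.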