1/17/2001
G22.3033-010: Advanced Topics in Operating Systems
Professor: David Mazières
----
Administrivia
* All assignments are on the web page
  http://www.scs.cs.nyu.edu/G22.3033-010/
* Check web page often for changes
* Always read the papers before class
  - Grade based on participation!
* First lab is due Wednesday
----
What is an operating system?
* Makes hardware useful to the programmer
* Provides abstractions for applications
  - Manages and hides details of hardware
* Provides protection
  - Prevents one process/user from clobbering another
----
What is a distributed operating system?
* The holy grail: Transparency
  - Have a bunch of machines look just like one machine
  - As easy to manage as one machine
  - Save applications & users from worrying about it
  - Just add more machines to scale to higher workloads
* The reality: Numerous complications
  - Failures, especially partial & network failures
  - Concurrency
  - Long latencies
  - Security issues
----
Successful distributed system architectures
* Client/server architecture
  - Clients request services from servers with network messages
  - Modular architecture, isolates servers from client faults
  - Can potentially scale by adding more servers
* Peer-to-peer (decentralized)
  - Ad hoc configuration can survive loss of any machine
  - Potential scalability problems (e.g., can't broadcast)
* Single address space
----
Class topics
* Distributed programming
  - TCP/IP & socket programming
  - Remote procedure calls
* Concurrency
  - Threads
  - Asynchronous programming
* Kernel design
  - I/O abstractions & implementations
* Storage
* Scalability
----
Implementing real systems
* Build systems you can use
* Design for real workloads
  - Highly-efficient event-driven programming
* Perform actual systems research
  - Study recent papers
  - Undertake a research project of your choosing
----
High-performance servers are an operating system issue
Example: A video server
* Hardware capabilities
  - 20 MByte/sec SCSI disk
  - 100 Mbit/sec Ethernet
* Server requirements
  - 200 Kbit/sec video streams
  - Many users spread around the Internet
  - Access control
* Maximum capacity: 500 clients
  (100 Mbit/sec Ethernet / 200 Kbit/sec per stream = 500 streams)
----
In reality, capacity typically much less
* CPU bottleneck
  - Software structure may impose many context switches
  - Concurrency may introduce lock contention
* Disk I/O limitations
  - Multiple video streams can introduce disk seeks:
    5ms seek per 8K read -> 1.6 MByte/sec
  - Must pipeline disk requests: prefetching
  - Must deal with OS buffer cache (may fill memory and cause paging)
* Network complications
  - OS may buffer stale data (dropped frames)
  - Introduces latency to congestion feedback (received packets not prioritized)
----
Concurrency
* Goal: Maximize throughput
  - Service the maximum number of clients over time
* Benefit: Overlap latencies
  - Dedicate CPU time to other clients during network transmission/client computation
  - Present disk with simultaneous requests -> achieve better disk arm scheduling
  - Amortize interrupts (e.g., network transmit) over multiple packets
* Dangers: Reducing throughput with overload
  - Introducing context switches
  - Increasing cache misses
  - Increased memory/buffer cache usage -> paging / thrashing
* Two basic approaches
  - Threads
  - Asynchronous I/O
----
Threads
* Write sequential-looking code:

    for (;;) {
      read_from_disk;
      write_to_network;
      wait_until_next_frame_needed;
    }

* Run multiple instances of code in parallel
  - While one thread is paused/waiting for I/O, schedule another
  - Protect shared data with locks
* Benefit: threaded code can exploit a multiprocessor
----
Limitations of threads
* High memory overhead
  - Need one stack per thread
* High context switch overhead for kernel threads
  - But user threads suck, too (no multiprocessing)
* Lock contention can kill performance
  - Even uncontended synchronization operations are expensive
  - Coarse-grained locking kills concurrency
  - Fine-grained locking costs CPU time
  - Preemption may happen at inopportune moments -> priority inversion
* Brutally hard to
program!
----
Why thread programming is hard
* Data races
* Deadlock
* Threads break abstraction
  - Must worry about what locks modules assume & acquire

      T1 ---> Module A ---> Module B ---> wait
      T2 ---> Module A ---> Module B ---> signal
      DEADLOCK!

  - Breaks callbacks

      T1 ---> Module A ---> Module B ---> Module A
      DEADLOCK!

* Hard to debug
  - Non-determinism based on internal scheduler
----
Asynchronous I/O
* I/O operations never block
  - e.g., if no data, read immediately returns an error
* Single blocking operation: select/poll
  - Returns list of I/O operations that are ready
* Event driven architecture
  - Maintain list of callbacks awaiting I/O events
  - Main dispatch loop makes callbacks when event happens

    struct callback {
      void (*cb) (void *);
      void *arg;
    };

    main ()
    {
      initialize_callbacks;
      for (;;)
        foreach (pending I/O)
          run_callback;
    }

----
Benefits of Asynchronous programming
* Low overhead
  - Callback typically much smaller than thread stack
  - No context switch overhead (just a procedure call)
* Implicit coordination
  - No data races
  - No deadlock
  - No priority inversion
----
Limitations of Asynchronous programming
* Cannot have long-running callbacks
* Not automatically scalable to multiprocessor
* Hard to program
  - Must explicitly package up state across callbacks
    (cannot share stack-allocated state)
  - Lots of dynamic memory allocation (who is responsible for freeing what?)
  - Logical flow of events broken into many event handlers
----
Other issues for high-performance servers:
* Coordination & scheduling
* Disk allocation & scheduling
* Memory management (including buffer cache)
* Address spaces (VM)
* Distributed system abstractions
* Efficient data movement
  - Kernel effectively a data mover
  - IPC, memory -> network, disk -> network, network -> network, etc.
----
Example: Coordination - Polling vs.
Interrupts
* Interrupts are expensive (microseconds)
  - Under heavy load, can spend all time servicing interrupts
  - Receiver livelock occurs when more packets arrive than can be processed
* Polling
  - Amortize one driver invocation over many packets
  - Adds latency (unreasonable under low loads)
  - Fits naturally into asynchronous I/O model
* Solution: switch dynamically between interrupts & polling
----
Scaling to multiple CPUs
* Multiprocessors help if user-level CPU bottleneck
  - Might hurt system time, though
  - Non-linear cost vs. speedup
* Server clusters
  - Inexpensive if scalable
* Distributed server clusters
  - When client-server bandwidth is low
----
Clusters
* Naming transparency
  - Should client be aware of cluster?
* Server selection
* Consistency
  - Multiple servers must agree on state of things
* Availability
  - Chances of one node failing increase
  - Replication helps availability, complicates consistency
----
Distributed clusters
* Many issues:
  - Replication policies
  - Efficient data distribution
  - Consistency
  - Network monitoring and modeling
  - Global load-balancing
* Rethink traditional OS abstractions
  - File system semantics, etc.
  - Trade-off between accuracy, latency, and network load
----
Summary
* High performance servers are an OS issue
  - Pipelining disk & network requests
  - Coordination
  - Caching
* True scalability requires a distributed system
  - Reliability/Availability
  - Security
  - Consistency
  - Tolerating latency
* Difficult
  - If a fast server bypasses OS abstractions, can we do it for another application?
----
G22.3033-010 Lab
* Multifinger
* TCP proxy
* Web proxy
* File server
* Project
====
BREAK

Intro to sockets
----
Intro to libasync
* Asynchronous programming hard for several reasons:
  - Must explicitly package up state across callbacks
    (cannot share stack-allocated state)
  - Lots of dynamic memory allocation (who is responsible for freeing what?)
  - Logical flow of events broken into many event handlers
* Libasync makes these things much easier
----
Reference counted garbage collection
* Problem: async_send (buf, size, cb)
  - Who is responsible for freeing what when?
* Find problems with debugging malloc (http://www.dmalloc.com)
* Avoid problems with refcounted template class

    class foo : public bar { ... };
    ...
    ref<foo> f = New refcounted<foo> ( ... );
    ptr<foo> b = f;
    f = New refcounted<foo> ( ... );
    b = NULL;

----
Type-safe callbacks (function currying)
* callback<R, A1, A2, A3>::ref cb;
  R r = (*cb) (a1, a2, a3);
* Create callbacks with wrap

    R fn (A1, A2, A3);
    wrap (fn)             -> callback<R, A1, A2, A3>::ref
    wrap (fn, a1)         -> callback<R, A2, A3>::ref
    wrap (fn, a1, a2)     -> callback<R, A3>::ref
    wrap (fn, a1, a2, a3) -> callback<R>::ref

* Method callbacks:

    class foo {
      void bar () {
        ...
        setcb (wrap (this, &foo::bar));
        ...
      }
    };

* Common typedefs:
    cbv - callback<void>::ref
    cbi - callback<void, int>::ref
----
Libasync core:
* Register callbacks, call amain () from main
  - event dispatch loop
* fdcb (fd, selread/selwrite, cbv::ptr)
* timecb_t *delaycb (time_t sec, cbv cb);
* void timecb_remove (timecb_t *);
* chldcb (pid_t, cbi), sigcb (int sig, cbv::ptr), ...