1/17/2001
G22.3033-010: Advanced Topics in Operating Systems
Professor: David Mazières
----
Administrivia
* All assignments are on the web page
  http://www.scs.cs.nyu.edu/G22.3033-010/
* Check web page often for changes
* Always read the papers before class
  - Grade based on participation!
* First lab is due Wednesday
----
What is an operating system?
* Makes hardware useful to the programmer
* Provides abstractions for applications
  - Manages and hides details of hardware
* Provides protection
  - Prevents one process/user from clobbering another
----
What is a distributed operating system?
* The holy grail: Transparency
  - Have a bunch of machines look just like one machine
  - As easy to manage as one machine
  - Save applications & users from worrying about it
  - Just add more machines to scale to higher workloads
* The reality: Numerous complications
  - Failures, especially partial & network failures
  - Concurrency
  - Long latencies
  - Security issues
----
Successful distributed system architectures
* Client/server architecture
  - Clients request services from servers with network messages
  - Modular architecture, isolates servers from client faults
  - Can potentially scale by adding more servers
* Peer-to-peer (decentralized)
  - Ad hoc configuration can survive loss of any machine
  - Potential scalability problems (e.g., can't broadcast)
* Single address space
----
Class topics
* Distributed programming
  - TCP/IP & socket programming
  - Remote procedure calls
* Concurrency
  - Threads
  - Asynchronous programming
* Kernel design
  - I/O abstractions & implementations
* Storage
* Scalability
----
Implementing real systems
* Build systems you can use
* Design for real workloads
  - Highly-efficient event-driven programming
* Perform actual systems research
  - Study recent papers
  - Undertake a research project of your choosing
----
High-performance servers are an operating system issue
Example: A video server
* Hardware capabilities
  - 20 MByte/sec SCSI disk
  - 100 Mbit/sec Ethernet
* Server requirements
  - 200 Kbit/sec video streams
  - Many users spread around the Internet
  - Access control
* Maximum capacity: 500 clients
  (100 Mbit/sec Ethernet / 200 Kbit/sec per stream = 500 streams)
----
In reality, capacity typically much less
* CPU bottleneck
  - Software structure may impose many context switches
  - Concurrency may introduce lock contention
* Disk I/O limitations
  - Multiple video streams can introduce disk seeks:
    5ms seek per 8K read -> 1.6 MByte/sec
  - Must pipeline disk requests: prefetching
  - Must deal with OS buffer cache (may fill memory and cause paging)
* Network complications
  - OS may buffer stale data (dropped frames)
  - Introduces latency to congestion feedback (received packets not prioritized)
----
Concurrency
* Goal: Maximize throughput
  - Service the maximum number of clients over time
* Benefit: Overlap latencies
  - Dedicate CPU time to other clients during network transmission/client computation
  - Present disk with simultaneous requests -> achieve better disk arm scheduling
  - Amortize interrupts (e.g., network transmit) over multiple packets
* Dangers: Reducing throughput with overload
  - Introducing context switches
  - Increasing cache misses
  - Increased memory/buffer cache usage -> paging / thrashing
* Two basic approaches
  - Threads
  - Asynchronous I/O
----
Threads
* Write sequential-looking code:

    for (;;) {
      read_from_disk;
      write_to_network;
      wait_until_next_frame_needed;
    }

* Run multiple instances of code in parallel
  - While one thread is paused/waiting for I/O, schedule another
  - Protect shared data with locks
* Benefit: threaded code can exploit a multiprocessor
----
Limitations of threads
* High memory overhead
  - Need one stack per thread
* High context switch overhead for kernel threads
  - But user threads suck, too (no multiprocessing)
* Lock contention can kill performance
  - Even uncontended synchronization operations are expensive
  - Coarse-grained locking kills concurrency
  - Fine-grained locking costs CPU time
  - Preemption may happen at inopportune moments -> priority inversion
* Brutally hard to
program!
----
Why thread programming is hard
* Data races
* Deadlock
* Threads break abstraction
  - Must worry about what locks modules assume & acquire

      T1 ---> Module A ---> Module B ---> wait
      T2 ---> Module A ---> Module B ---> signal
      DEADLOCK!

  - Breaks callbacks

      T1 ---> Module A ---> Module B ---> Module A
      DEADLOCK!

* Hard to debug
  - Non-determinism based on internal scheduler
----
Asynchronous I/O
* I/O operations never block
  - e.g., if no data, read immediately returns an error
* Single blocking operation: select/poll
  - Returns list of I/O operations that are ready
* Event driven architecture
  - Maintain list of callbacks awaiting I/O events
  - Main dispatch loop makes callbacks when event happens

    struct callback {
      void (*cb) (void *);
      void *arg;
    };

    main ()
    {
      initialize_callbacks;
      for (;;)
        foreach (pending I/O)
          run_callback;
    }

----
Benefits of Asynchronous programming
* Low overhead
  - Callback typically much smaller than thread stack
  - No context switch overhead (just a procedure call)
* Implicit coordination
  - No data races
  - No deadlock
  - No priority inversion
----
Limitations of Asynchronous programming
* Cannot have long-running callbacks
* Not automatically scalable to multiprocessor
* Hard to program
  - Must explicitly package up state across callbacks
    (cannot share stack-allocated state)
  - Lots of dynamic memory allocation (who is responsible for freeing what?)
  - Logical flow of events broken into many event handlers
----
Other issues for high-performance servers:
* Coordination & scheduling
* Disk allocation & scheduling
* Memory management (including buffer cache)
* Address spaces (VM)
* Distributed system abstractions
* Efficient data movement
  - Kernel effectively a data mover
  - IPC, memory -> network, disk -> network, network -> network, etc.
----
Example: Coordination - Polling vs.
Interrupts
* Interrupts are expensive (microseconds)
  - Under heavy load, can spend all time servicing interrupts
  - Receiver livelock occurs when more packets arrive than can be processed
* Polling
  - Amortize one driver invocation over many packets
  - Adds latency (unreasonable under low loads)
  - Fits naturally into asynchronous I/O model
* Solution: switch dynamically between interrupts & polling
----
Scaling to multiple CPUs
* Multiprocessors help if user-level CPU bottleneck
  - Might hurt system time, though
  - Non-linear cost vs. speedup
* Server clusters
  - Inexpensive if scalable
* Distributed server clusters
  - When client-server bandwidth is low
----
Clusters
* Naming transparency
  - Should client be aware of cluster?
* Server selection
* Consistency
  - Multiple servers must agree on state of things
* Availability
  - Chances of one node failing increase
  - Replication helps availability, complicates consistency
----
Distributed clusters
* Many issues:
  - Replication policies
  - Efficient data distribution
  - Consistency
  - Network monitoring and modeling
  - Global load-balancing
* Rethink traditional OS abstractions
  - File system semantics, etc.
  - Trade-off between accuracy, latency, and network load
----
Summary
* High performance servers are an OS issue
  - Pipelining disk & network requests
  - Coordination
  - Caching
* True scalability requires a distributed system
  - Reliability/Availability
  - Security
  - Consistency
  - Tolerating latency
* Difficult
  - If a fast server bypasses OS abstractions, can we do it for another application?
----
G22.3033-010 Lab
* Multifinger
* TCP proxy
* Web proxy
* File server
* Project
====
BREAK

Intro to sockets
----
Intro to libasync
* Asynchronous programming hard for several reasons:
  - Must explicitly package up state across callbacks
    (cannot share stack-allocated state)
  - Lots of dynamic memory allocation (who is responsible for freeing what?)
  - Logical flow of events broken into many event handlers
* Libasync makes these things much easier
----
Reference counted garbage collection
* Problem: async_send (buf, size, cb)
  - Who is responsible for freeing what when?
* Find problems with debugging malloc (http://www.dmalloc.com)
* Avoid problems with refcounted template class

    class foo : public bar { ... };
    ...
    ref<foo> f = New refcounted<foo> ( ... );
    ptr<foo> b = f;
    f = New refcounted<foo> ( ... );
    b = NULL;

----
Type-safe callbacks (function currying)
* callback<R, A1, A2, A3>::ref cb;
  R r = (*cb) (a1, a2, a3);
* Create callbacks with wrap

    R fn (A1, A2, A3);
    wrap (fn)             -> callback<R, A1, A2, A3>::ref
    wrap (fn, a1)         -> callback<R, A2, A3>::ref
    wrap (fn, a1, a2)     -> callback<R, A3>::ref
    wrap (fn, a1, a2, a3) -> callback<R>::ref

* Method callbacks:

    class foo {
      void bar () {
        ...
        setcb (wrap (this, &foo::bar));
        ...
      }
    };

* Common typedefs:
    cbv - callback<void>::ref
    cbi - callback<void, int>::ref
----
Libasync core:
* Register callbacks, call amain () from main
  - event dispatch loop
* fdcb (fd, selread/selwrite, cbv::ptr)
* timecb_t *delaycb (time_t sec, cbv cb);
* void timecb_remove (timecb_t *);
* chldcb (pid_t, cbi), sigcb (int sig, cbv::ptr), ...