Eliminating Receive Livelock in an Interrupt-Driven Kernel
==========================================================

What kind of interface does a NIC offer the kernel?
  - Interface to tell card about ring buffers
  - Poke hardware to transmit buffered packets
  - Interrupts when packet received/transmit complete

What is the point of interrupts?
  Receive interrupt - process packet as soon as it arrives (low latency)
  Transmit interrupt - amortize "poking" over multiple packets (p. 244)

What types of devices use interrupts?
  Clock/timer, serial port, keyboard/mouse, disk, network

Note that after an interrupt, interrupts typically disabled
  Have to re-enable explicitly when ready.  Why?
    Avoid arbitrarily nested interrupts (blow out stack)
    Enable exclusive access to data structures without spinlocks

What is the problem the paper addresses?
  Receive livelock:
    Example: IP forwarding
    Example: Run tcpdump to see who's hosing network & machine wedges

Explain Figure 2 (p. 228).
  Why do black dots (no screend) go up?  (more packets to forward)
  What determines how high the peak is?  (when CPU saturates)
  Why do they go down?  (wasting CPU time on discarded packets)
  What determines how fast it goes down?
    (fraction of packets discarded)
    (work invested in discarded packets) / (total work CPU is capable of)
  What happens when we add screend?  (need more CPU time per packet)

Do dedicated routers suffer the same problem?
  Usually not, because routing usually on fast path entirely in line cards
  But Slammer worm sometimes caused problems:
    "because one of the routers in this network was starved for memory, it
    didn't run the Cisco Express Forwarding (CEF) algorithm... this made the
    problem worse because without CEF, the router must take time to create a
    "route cache" for each destination in order to forward packets at high
    speed.  Since the worm generates random destination addresses, the router
    ended up spending most of its time creating these route cache entries,
    and it ran out of memory to boot."
      - http://www.onlamp.com/pub/a/onlamp/2003/01/28/msworm.html

Suppose I wanted to test an NFS server for livelock.
  Run client with this loop:
    while(1) { send NFS READ RPC; wait for response; }
  What would I see?  Is the NFS server subject to livelock under this load?
    Not really--offered load subject to feedback
    But could livelock if many, many clients retransmitting
  Is it possible to generate livelock just from disk interrupts?
    No--because if system is starved for CPU, won't issue more disk requests
  Key point: flow control/back pressure avoids livelock-inducing overload

What other situations might cause livelock?
  Maybe very unlucky scheduling on an SMP or distributed system
  (wait for Paxos exercises)

What other problems are we trying to address?
  Increased latency for packet delivery and forwarding
    E.g., start disk head moving when first NFS read request comes
  Transmit starvation
  User-level CPU starvation

What does typical Unix network stack look like?
  - "Top-half" kernel process context (read/write syscalls)
  - Socket buffering
  Look at Figure 3 (p. 229)
  - Receive processing, appends packets to ipintrq, the IP input queue
  - IP forwarding layer (IPL softnet, or just kernel thread, depends on OS)
  - Device output queue
  - Transmit processing, takes packets from output queue
  Look at Figure 11 (p. 240)
  - Device interrupt (lnintr->lnrint/lntint) - highest priority
  - Software interrupt (ipintr) - below device interrupts, but above
    user-level and "top-half" processing
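
Roughly, this traditional path looks like the C sketch below.  This is an
illustrative paraphrase of Figures 3 and 11, not the paper's or BSD's actual
code: the names lnrint, ipintr, and ipintrq come from the figures, but
everything else (if_snd, lnstart, the ring/queue helpers) is an invented
stand-in, and mbufs, SPLs, and locking are omitted.

  #include <stddef.h>

  struct pkt;
  struct pktq { struct pkt *head, *tail; int len, maxlen; };

  extern struct pkt *ring_next_received(void);   /* next packet from the NIC ring */
  extern void enqueue(struct pktq *, struct pkt *);
  extern struct pkt *dequeue(struct pktq *);
  extern void drop(struct pkt *);
  extern void ip_forward(struct pkt *);
  extern void lnstart(void);                     /* poke the device to transmit */
  extern void sched_ip_softintr(void);           /* request the ipintr soft interrupt */

  struct pktq ipintrq;                           /* IP input queue */
  struct pktq if_snd;                            /* device output queue */

  /* Device receive interrupt: highest priority, runs for every packet. */
  void lnrint(void)
  {
      struct pkt *p;
      while ((p = ring_next_received()) != NULL) {
          if (ipintrq.len >= ipintrq.maxlen)
              drop(p);            /* overload drop -- but CPU time was already
                                     spent taking the interrupt for this packet */
          else
              enqueue(&ipintrq, p);
      }
      sched_ip_softintr();        /* defer protocol work to ipintr */
  }

  /* Software interrupt: below device interrupts, above user code.  Under a
   * packet burst, back-to-back lnrint invocations starve this routine, so
   * ipintrq fills and throughput collapses -- receive livelock. */
  void ipintr(void)
  {
      struct pkt *p;
      while ((p = dequeue(&ipintrq)) != NULL) {
          ip_forward(p);          /* routing decision */
          enqueue(&if_snd, p);    /* onto the device output queue */
          lnstart();              /* kick the transmitter if idle */
      }
  }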
What would this picture look like with a packet burst?
  ip_forward might get interrupted by another receive interrupt

What's wrong with typical Unix interrupt usage?
  What is overhead of an interrupt?  ~10 usec
    Can't that support 100,000 packets/sec?
  Priority
    Why do receive interrupts have high priority?  Is this good?
      Old network cards could not buffer many packets, don't want to lose burst
      Today's cards can, so not really needed any more
  What does kernel do on input overload?
    Queues fill up, packets dropped
    When are packets dropped?  When should they be dropped?
      Usually dropped at ipintrq.  Should be dropped earlier.
    Causes throughput to go down as offered load goes up!

Why not just tell scheduler to give interrupts lower priority?
  Interrupts are non-preemptible--otherwise hard to synchronize on queues

Why not completely process (i.e., forward) each packet in interrupt handler?
  Other parts of kernel don't expect to run at high interrupt level
    E.g., some packet processing code might call sleep (e.g., in mem allocator)
  Still might want an output queue

What about using polling exclusively instead of interrupts?
  Solves overload problem
  Killer latency

What's the paper's solution?  (sketched in code after the figure discussion below)
  Do most packet processing in "top half" (scheduler-controlled) kernel thread
  Receive interrupt arrives, poke thread, keep interrupts disabled
  Thread processes packets fully, puts them on output queue
  Thread checks for more work, re-enables interrupts otherwise
  Eliminates IP input queue (packets dropped before work invested)

Why does this work?
  What happens when packets arrive slowly?
  What happens when packets arrive too fast?

Explain Figure 5 (p. 231).
  Why does "Polling (no quota)" work badly?
    Input still starves xmit-complete processing (lntint).
  Why does it do even worse than unmodified kernel?
    Because now packets are dropped at the device output queue
    Actually investing more work in each dropped packet!

Explain Figure 6 (p. 232).  Here white squares have quota of 10 packets
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend.
    We can still give 100% to input thread, 0% to screend.
  Why does "Polling w/ feedback" behave well?
    Input thread yields when queue to screend fills.
  What if screend hangs, what about other consumers of packets?
    E.g., can you ssh to machine to fix screend?
    Fortunately screend typically only application
    Also, re-enable input after timeout

What about Figure 10 (p. 235) - CPU-bound app
  Why doesn't screend solution solve this?
    No packet queues from which to get feedback--process is CPU-bound
  What is solution?
    Every 10 msec clear counter
    Track CPU usage in receive thread, after a fraction disable input
  What if there is no CPU-bound process?  Waste of resources?
    No, re-enable receive interrupts in idle thread
  Why aren't the numbers exactly what you would expect?  (25%, 50%, 75%, ...)
    Other background processes may use CPU
    Time to process interrupts not charged against quotas
  Why is there a small dip in most of the curves?
    More interrupts happen at lower pkt rate, and not charged against quota
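
Putting this section's pieces together, here is a hedged C sketch of the
paper's design (a paraphrase, not the authors' source): the receive interrupt
only disables itself and wakes a kernel thread; the thread processes packets
to completion under a packet quota, backs off on feedback from the screend
queue or the CPU quota, and re-enables receive interrupts only once the
device ring is drained.  All helper names below are invented stand-ins.

  struct pkt;
  extern struct pkt *ring_next_received(void);      /* pull one packet off the NIC ring */
  extern int  ring_empty(void);
  extern void disable_rx_interrupts(void), enable_rx_interrupts(void);
  extern void process_to_completion(struct pkt *);  /* ip_forward + device output queue */
  extern int  screend_q_full(void);                 /* feedback from next stage (Fig. 6) */
  extern int  rx_cpu_quota_exceeded(void);          /* CPU-usage quota check (Fig. 10) */
  extern void wakeup(void *), sleep_on(void *), yield(void);

  static int rx_work;                               /* wakeup channel for the thread */

  void rx_interrupt(void)            /* runs once per burst, not once per packet */
  {
      disable_rx_interrupts();       /* stay off until the thread has caught up */
      wakeup(&rx_work);              /* poke the polling kernel thread */
  }

  void rx_poll_thread(void)
  {
      for (;;) {
          int quota = 10;                      /* packet quota per round (Fig. 6) */
          while (quota-- > 0 && !ring_empty()) {
              if (screend_q_full() || rx_cpu_quota_exceeded())
                  break;                       /* leave packets in the ring; the NIC
                                                  drops overflow with no CPU invested
                                                  (also re-enable input after a
                                                  timeout in case the consumer hangs) */
              process_to_completion(ring_next_received());
          }
          if (ring_empty()) {
              enable_rx_interrupts();          /* real code must re-check the ring
                                                  here to avoid a missed wakeup */
              sleep_on(&rx_work);
          } else {
              yield();                         /* quota spent: let consumers/user run */
          }
      }
  }

Note how a full screend queue or an exhausted quota simply leaves packets
sitting in the ring, so overflow is discarded by the device before the host
invests any work in it.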
Why are the three fairness mechanisms different?
  1. Polling thread *with packet quotas*
  2. Feedback from full queue
  3. CPU quotas
  Could we unify all of these with a better scheduling mechanism?
    Maybe should have used #2 (queue feedback) instead of #1
    For #3, could just use an advanced scheduler--remember BVT from CS140?
      *Share* gives receive processing fixed share of CPU over long run
      But use high *warp* value to give receive processing low latency
        (while still keeping it under its fair share for long run)
    Would be nice somehow to integrate queue feedback with warp & share
      E.g., if just wasted CPU time on a packet you had to drop at a full
      queue, maybe should temporarily donate some of your share to the queue
      consumer
      (so producer and consumers still can't exceed their total share)

If we apply their fixes, does the phenomenon totally go away?
  E.g., for web server, NFS server threads that wait for disk, etc.
  Can the net device throw away packets without slowing down host?
    Problem: We want to drop packets for applications with big queues
      But requires work to determine which application a packet belongs to
    Possible solution: NI-LRP (have network interface hardware sort packets)

What about latency question?
  Look at Figure 14 (p. 243).
  1st packet looks like an improvement over non-polling.  Why?
    Because packet 1 can be processed and handed to output before all three
    packets have been received
    Also save about 10 usec by not moving packet on and off ipintrq
  But 2nd packet transmitted later with polling.  Why?
    No new packets added to xmit buffer until xmit interrupt
      Why?  In traditional BSD, to amortize cost of poking device
      Maybe better to poke a second time anyway
    Note rescheduling polling thread also slightly more work if not overloaded

How is the paper?  Claims/evaluation/contribution.
  Do we care about this problem?  Yes, e.g., used by NASDAQ
  What should a network stack deliver?
    - Throughput
    - Good latency
    - Graceful degradation under high load
  Do they convincingly show their system delivers this?  Yes.

Is the work still relevant, or are CPUs now fast enough?
  Network speeds growing as fast or faster than CPU speeds
  Memory speeds not growing proportionally, so cache misses bigger factor
    (so processing one packet to completion should be comparatively bigger win)
  Interrupt cost & device access not scaling with straight-line code
  Multi-processor systems add a new dimension
    Dedicating just one core to receive processing could avoid livelock
    But for 10G+ networks, actually want multiple cores interacting with card

Can we formulate any general principles from the paper?
  Don't spend time on new work before completing existing work.
  Or give new work lower priority than partially-completed work.
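
As a toy, user-space illustration of that principle (my own example, not from
the paper): accept new work only when there is room left after finishing the
work already accepted, and reject overflow before spending any effort on it.

  #include <stdio.h>

  #define PENDING_MAX 8

  static int pending[PENDING_MAX];   /* partially-completed work items */
  static int npending;

  static void complete(int item) { printf("completed %d\n", item); }
  static void reject(int item)   { printf("rejected %d (no work invested)\n", item); }

  /* Accept a new item only if there is room; otherwise drop it immediately,
   * before spending any processing effort on it. */
  static void arrive(int item)
  {
      if (npending < PENDING_MAX)
          pending[npending++] = item;
      else
          reject(item);
  }

  int main(void)
  {
      int next = 0;
      for (int round = 0; round < 3; round++) {
          /* 1. Old work first: finish everything already accepted. */
          while (npending > 0)
              complete(pending[--npending]);
          /* 2. Only then take new work, a bounded batch per round. */
          for (int i = 0; i < 12; i++)
              arrive(next++);
      }
      return 0;
  }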