Afterburner
===========

What does the interface to the network card look like? (Fig 2)
  Control interface
  1 MByte VRAM - all sent/received packets must be in here
  4 FIFOs:
    Tx_ready (host->card: here's a buffer address, transmit it)
    Tx_free  (card->host: I'm done with this buffer you gave me)
    Rx_ready (card->host: there's now a packet in this buffer)
    Rx_free  (host->card: here's a free buffer to put a received packet in)

How does data move around in a traditional TCP implementation? (Fig. 1)
  1. application makes send() system call
  2. data copied from user space into mbufs
       mbufs contain small amounts of data (up to 100 bytes)
       cluster mbufs can hold bulk data
       typically packets stored as a linked list of mbufs/clusters
  3. data copied from mbufs into card buffer space
  4. card poked - data eventually transmitted
  5. remote card receives data, interrupts
  6. data copied from card buffer to mbufs
  7. application calls recv() (at some point)
  8. data copied from mbufs to user space
What about TCP acknowledgments?
  Sometimes generated when a packet is received
  recv() system call can also cause the window to open up

Where does the time go on old HP workstations?
  Calling send system call                               -> 40 usec
  Copying data (50 Mbytes/sec)                           -> 19 nsec/byte
  Computing TCP checksum (127 Mbytes/sec)                -> 7.6 nsec/byte
  Other output packet processing (e.g., acknowledgments) -> 110 usec/pkt
  Copying data from card (32 Mbytes/sec)                 -> 30 nsec/byte
  Other input packet processing                          -> 90 usec
  recv system call                                       -> 15 usec
So data movement is a major part of the cost!
  187/337 usec for sending a 4K packet
  356/361 usec for receiving a 4K packet

Do we care?  Aren't computers getting faster anyway?
  Paper claims DRAM improving 7%/year, processors 50%/year
  Is this true?  Paper was written 10 years ago
    Then: HP 9000/730 had a 66 MHz processor, memcpy 50 MB/sec
    Now:  1.75 GHz Athlon, memcpy 250 MB/sec
  So memory improved somewhat faster than predicted, but still an issue
    Would take 100% of the CPU to bcopy 1 Gbit/sec twice!

What can we do about this?  Two suggestions:
  "Two-copy" -- what is this?
    Combine checksum calculation with bcopy
    (see the copy+checksum sketch at the end of this send-side discussion)
  "Single-copy" -- somehow avoid making two copies

What are possible ways to achieve single-copy?
  * Eliminate the system call bcopy:
    - Copy-on-write.  Optimizes the send() system call.
      Idea: after send, keep a pointer, only copy data if it is modified
            (or variation: sleep-on-write)
      Problem: must change applications to use aligned buffers
    - Page remapping.  Optimizes the recv() system call.
      Idea: split the header off incoming packets, put data on its own page,
            remap the page into the application's address space
  * Eliminate the driver bcopy:
    - Single-copy.  Optimizes both send and recv.
      Idea: copy straight from user code into device memory

How is single-copy implemented?
  New type of mbuf--"Single-Copy Clusters" (SCCs)--use the card's VRAM
    include a field for the checksum of the data
  Special bcopy-like routines optimized to copy to/from single-copy clusters
    also use the card's hardware to calculate TCP checksums

How does sending packets work now?
  Copy data into an SCC, leaving room for the header
  tcp_output makes each packet one SCC, or a list of regular mbufs/clusters
  (In the future, the card will accept packets spread over groups of VRAM bufs)

How does the OS know how big to make the packets?
  General algorithm: min(MSS, window)
  But we don't know what the window will be if buffering lots of data!
  Maybe estimate what the window will be based on recent history

When can we free a transmit SCC?
  Must wait for the ACK from the remote machine.
  Is this problematic?  Yes -- may run you out of VRAM space!

What is the TCP_NODELAY problem?
  Causes many packets with small payloads
  What to do?  Detect/repair.  Can always convert clusters, but open problem
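Aside: a minimal sketch of what "combine checksum calculation with bcopy" looks
like, assuming a plain C copy loop (illustrative only, not the paper's or
HP-UX's actual routine; the name copy_and_checksum is made up).  The point is
that the bytes are loaded once, and the copy and the Internet checksum both
come out of the same pass:

    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes and return the 16-bit one's-complement sum of the data.
     * The caller folds in the pseudo-header/header sums and takes the final
     * complement.  For brevity this assumes len is even and both buffers are
     * 16-bit aligned; real code also handles the odd tail and byte order. */
    static uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
    {
        const uint16_t *s = (const uint16_t *)src;
        uint16_t *d = (uint16_t *)dst;
        uint32_t sum = 0;

        for (size_t i = 0; i < len / 2; i++) {
            d[i] = s[i];        /* the copy...                          */
            sum += s[i];        /* ...and the checksum, in the same pass */
        }
        while (sum >> 16)                        /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
    }

Single-copy goes one step further on this path: the special bcopy-like routines
copy user data straight into an SCC in VRAM and let the card hardware produce
the payload checksum, so even this one software pass over the bytes goes away.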
How about receiving packets in the new system?
  4 packet categories: non-IP, small IP (< 100 B), large TCP/IP, other IP
  When do we put data in an SCC?  Optimize if all of the following hold:
    - Large TCP/IP.  Why?  (small packets have tiny payloads, VRAM bufs scarce)
    - Packet has no control information (e.g., FIN, RST bits, etc.)
    - Packet passes "header prediction"--i.e., is the expected packet, in order
  If optimizing, put the data in an SCC
    Calculate the checksum of the TCP header (but not the payload!)
  Do we ACK the data?  Not yet, because we haven't checked the checksum
  Application calls recv() - now we calculate the checksum
    What if the checksum fails?  Have to block or return EAGAIN
    How does this interact with select?
      (Assumes applications can deal with "false positives" from select)
  Must also fix up if the application asks for a non-integral # of buffers
    If req < 1 buffer, calculate the whole checksum, remainder in a regular mbuf
    If n < req < n+1, return a short read, process just n buffers
  What happens if the app doesn't call recv() quickly?
    Exploit the delayed-ACK mechanism: convert SCC to regular, check cksum, & ACK

How are the performance results?  What questions should we be asking?
  Compare Fig 4 back to Table 1 -- looks like they predicted roughly right
  So did they achieve what they wanted?  What about all the "1 Gbit/sec" talk?
    What they did bought a factor of 2, but still 5x away from 1 Gbit/sec

Are the ideas still useful today?  How are network interface cards different today?
  DMA straight into and out of main memory
  Typically have a linked ring of packet descriptor structures in memory
    (see the descriptor-ring sketch at the end of these notes)
    Poke the card with the first descriptor for transmission/reception
  Like Afterburner, cards can compute TCP checksums in hardware
  So would single-copy clusters be useful today?
    Could just put the mbuf memory into the packet descriptors for the card
    But typically isn't done, for several reasons:
      Convenient to arrange buffers in a ring and not change them
      Some cards require 4-byte-aligned buffer boundaries
        Inconvenient--14-byte ethernet header misaligns 32-bit numbers in the IP header
      Small packets received in big buffers waste memory
  But could potentially have zero-copy I/O with a specialized OS!
    Ideas for a better system call interface?  ...
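Aside: a rough sketch of the descriptor-ring style of interface mentioned above
for modern cards.  The struct names, field layout, and flag values are invented
for illustration (no particular NIC), but the shape is typical: descriptors in
main memory point at packet buffers, the card DMAs to/from those buffers, and
the host writes a doorbell register to tell the card new descriptors are ready:

    #include <stdint.h>

    #define RING_SIZE 256

    #define DESC_OWNED_BY_NIC 0x1    /* card may use this descriptor      */
    #define DESC_EOP          0x2    /* buffer holds a complete packet    */

    /* One descriptor: where a packet buffer lives in host memory, plus
     * per-packet status such as a hardware-computed checksum. */
    struct pkt_desc {
        uint64_t buf_addr;   /* physical address of the packet buffer     */
        uint16_t len;        /* buffer size (rx) or packet length (tx)    */
        uint16_t flags;      /* ownership / end-of-packet bits            */
        uint16_t csum;       /* TCP/IP checksum filled in by the card     */
        uint16_t pad;
    };

    /* The ring: a fixed, contiguous array of descriptors that host and
     * card walk in order, which is why drivers like to leave the buffers
     * in place rather than remap them per packet. */
    struct tx_ring {
        struct pkt_desc desc[RING_SIZE];
        uint32_t head;       /* next slot the host will fill */
    };

    /* Queue one buffer for transmission: fill the next descriptor, hand
     * it to the card, and write the doorbell so the card fetches it. */
    static void tx_enqueue(struct tx_ring *r, volatile uint32_t *doorbell,
                           uint64_t buf_phys, uint16_t pkt_len)
    {
        struct pkt_desc *d = &r->desc[r->head];

        d->buf_addr = buf_phys;
        d->len      = pkt_len;
        d->flags    = DESC_OWNED_BY_NIC | DESC_EOP;

        r->head = (r->head + 1) % RING_SIZE;
        *doorbell = r->head;   /* "poke card" with the new position */
    }

The objections listed above map onto this structure: the ring wants stable,
reusable buffers; the alignment constraints come from the DMA hardware; and a
small packet landing in a full-size receive buffer wastes the rest of that
buffer.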