Afterburner
===========

What does the interface to the network card look like? (Fig 2)
  Control interface
  1 MByte VRAM - all sent/received packets must be in here
  4 FIFOs:
    Tx_ready (host->card: here's a buffer address, transmit it)
    Tx_free  (card->host: I'm done with this buffer you gave me)
    Rx_ready (card->host: there's now a packet in this buffer)
    Rx_free  (host->card: here's a free buffer to put a received packet in)

How does data move around in a traditional TCP implementation? (Fig. 1)
  1. application makes send() system call
  2. data copied from user space into mbufs
       mbufs contain small amounts of data (up to 100 bytes)
       cluster mbufs can hold bulk data
       typically packets stored as a linked list of mbufs/clusters
  3. data copied from mbufs into card buffer space
  4. card poked - data eventually transmitted
  5. remote card receives data, interrupts
  6. data copied from card buffer to mbufs
  7. application calls recv() (at some point)
  8. data copied from mbufs to user space
What about TCP acknowledgments?
  Sometimes generated when a packet is received
  recv() system call can also cause the window to open up

Where does the time go on old HP workstations?
  Calling send system call                               -> 40 usec
  Copying data (50 Mbytes/sec)                           -> 19 nsec/byte
  Computing TCP checksum (127 Mbytes/sec)                -> 7.6 nsec/byte
  Other output packet processing (e.g., acknowledgments) -> 110 usec/pkt
  Copying data from card (32 Mbytes/sec)                 -> 30 nsec/byte
  Other input packet processing                          -> 90 usec
  recv system call                                       -> 15 usec
So data movement is a major part of the cost!
  187/337 usec for sending a 4K packet
  356/361 usec for receiving a 4K packet

Do we care?  Aren't computers getting faster anyway?
  Paper claims DRAM improving 7%/year, processors 50%/year
  Is this true?  Paper was written 10 years ago
    Then: HP 9000/730 had a 66 MHz processor, memcpy 50 MB/sec
    Now:  1.75 GHz Athlon, memcpy 250 MB/sec
  So memory improved somewhat faster than predicted, but still an issue
    Would take 100% of the CPU to bcopy 1 Gbit/sec twice!

What can we do about this?  Two suggestions:
  "Two-copy" -- what is this?
    Combine checksum calculation with bcopy
    (see the copy+checksum sketch at the end of this send-side discussion)
  "Single-copy" -- somehow avoid making two copies

What are possible ways to achieve single-copy?
  * Eliminate the system call bcopy:
    - Copy-on-write.  Optimizes the send() system call.
      Idea: after send, keep a pointer, only copy data if it is modified
            (or variation: sleep-on-write)
      Problem: must change applications to use aligned buffers
    - Page remapping.  Optimizes the recv() system call.
      Idea: split the header off incoming packets, put data on its own page,
            remap the page into the application's address space
  * Eliminate the driver bcopy:
    - Single-copy.  Optimizes both send and recv.
      Idea: copy straight from user code into device memory

How is single-copy implemented?
  New type of mbuf--"Single-Copy Clusters" (SCCs)--use the card's VRAM
    include a field for the checksum of the data
  Special bcopy-like routines optimized to copy to/from single-copy clusters
    also use the card's hardware to calculate TCP checksums

How does sending packets work now?
  Copy data into an SCC, leaving room for the header
  tcp_output makes each packet one SCC, or a list of regular mbufs/clusters
  (In the future, the card will accept packets spread over groups of VRAM bufs)

How does the OS know how big to make the packets?
  General algorithm: min(MSS, window)
  But we don't know what the window will be if buffering lots of data!
  Maybe estimate what the window will be based on recent history

When can we free a transmit SCC?
  Must wait for the ACK from the remote machine.
  Is this problematic?  Yes -- may run you out of VRAM space!

What is the TCP_NODELAY problem?
  Causes many packets with small payloads
  What to do?  Detect/repair.  Can always convert clusters, but open problem
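Aside: a minimal sketch of what "combine checksum calculation with bcopy" looks
like, assuming a plain C copy loop (illustrative only, not the paper's or
HP-UX's actual routine; the name copy_and_checksum is made up).  The point is
that the bytes are loaded once, and the copy and the Internet checksum both
come out of the same pass:

    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes and return the 16-bit one's-complement sum of the data.
     * The caller folds in the pseudo-header/header sums and takes the final
     * complement.  For brevity this assumes len is even and both buffers are
     * 16-bit aligned; real code also handles the odd tail and byte order. */
    static uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
    {
        const uint16_t *s = (const uint16_t *)src;
        uint16_t *d = (uint16_t *)dst;
        uint32_t sum = 0;

        for (size_t i = 0; i < len / 2; i++) {
            d[i] = s[i];        /* the copy...                          */
            sum += s[i];        /* ...and the checksum, in the same pass */
        }
        while (sum >> 16)                        /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
    }

Single-copy goes one step further on this path: the special bcopy-like routines
copy user data straight into an SCC in VRAM and let the card hardware produce
the payload checksum, so even this one software pass over the bytes goes away.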
How about receiving packets in the new system?
  4 packet categories: non-IP, small IP (< 100 B), large TCP/IP, other IP
  When do we put data in an SCC?  Optimize if all of the following hold:
    - Large TCP/IP.  Why?  (small packets have tiny payloads, VRAM bufs scarce)
    - Packet has no control information (e.g., FIN, RST bits, etc.)
    - Packet passes "header prediction"--i.e., is the expected packet, in order
  If optimizing, put the data in an SCC
    Calculate the checksum of the TCP header (but not the payload!)
  Do we ACK the data?  Not yet, because we haven't checked the checksum
  Application calls recv() - now we calculate the checksum
    What if the checksum fails?  Have to block or return EAGAIN
    How does this interact with select?
      (Assumes applications can deal with "false positives" from select)
  Must also fix up if the application asks for a non-integral # of buffers
    If req < 1 buffer, calculate the whole checksum, remainder in a regular mbuf
    If n < req < n+1, return a short read, process just n buffers
  What happens if the app doesn't call recv() quickly?
    Exploit the delayed-ACK mechanism: convert SCC to regular, check cksum, & ACK

How are the performance results?  What questions should we be asking?
  Compare Fig 4 back to Table 1 -- looks like they predicted roughly right
  So did they achieve what they wanted?  What about all the "1 Gbit/sec" talk?
    What they did bought a factor of 2, but still 5x away from 1 Gbit/sec

Are the ideas still useful today?  How are network interface cards different today?
  DMA straight into and out of main memory
  Typically have a linked ring of packet descriptor structures in memory
    (see the descriptor-ring sketch at the end of these notes)
    Poke the card with the first descriptor for transmission/reception
  Like Afterburner, cards can compute TCP checksums in hardware
  So would single-copy clusters be useful today?
    Could just put the mbuf memory into the packet descriptors for the card
    But typically isn't done, for several reasons:
      Convenient to arrange buffers in a ring and not change them
      Some cards require 4-byte-aligned buffer boundaries
        Inconvenient--14-byte ethernet header misaligns 32-bit numbers in the IP header
      Small packets received in big buffers waste memory
  But could potentially have zero-copy I/O with a specialized OS!
    Ideas for a better system call interface?  ...
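Aside: a rough sketch of the descriptor-ring style of interface mentioned above
for modern cards.  The struct names, field layout, and flag values are invented
for illustration (no particular NIC), but the shape is typical: descriptors in
main memory point at packet buffers, the card DMAs to/from those buffers, and
the host writes a doorbell register to tell the card new descriptors are ready:

    #include <stdint.h>

    #define RING_SIZE 256

    #define DESC_OWNED_BY_NIC 0x1    /* card may use this descriptor      */
    #define DESC_EOP          0x2    /* buffer holds a complete packet    */

    /* One descriptor: where a packet buffer lives in host memory, plus
     * per-packet status such as a hardware-computed checksum. */
    struct pkt_desc {
        uint64_t buf_addr;   /* physical address of the packet buffer     */
        uint16_t len;        /* buffer size (rx) or packet length (tx)    */
        uint16_t flags;      /* ownership / end-of-packet bits            */
        uint16_t csum;       /* TCP/IP checksum filled in by the card     */
        uint16_t pad;
    };

    /* The ring: a fixed, contiguous array of descriptors that host and
     * card walk in order, which is why drivers like to leave the buffers
     * in place rather than remap them per packet. */
    struct tx_ring {
        struct pkt_desc desc[RING_SIZE];
        uint32_t head;       /* next slot the host will fill */
    };

    /* Queue one buffer for transmission: fill the next descriptor, hand
     * it to the card, and write the doorbell so the card fetches it. */
    static void tx_enqueue(struct tx_ring *r, volatile uint32_t *doorbell,
                           uint64_t buf_phys, uint16_t pkt_len)
    {
        struct pkt_desc *d = &r->desc[r->head];

        d->buf_addr = buf_phys;
        d->len      = pkt_len;
        d->flags    = DESC_OWNED_BY_NIC | DESC_EOP;

        r->head = (r->head + 1) % RING_SIZE;
        *doorbell = r->head;   /* "poke card" with the new position */
    }

The objections listed above map onto this structure: the ring wants stable,
reusable buffers; the alignment constraints come from the DMA hardware; and a
small packet landing in a full-size receive buffer wastes the rest of that
buffer.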