High Speed Switch Scheduling for Local Area Networks
====================================================

What is the goal of the paper?
  - Support speeds of up to 1 Gb/sec on LANs

Advantages of a switched network
  - High aggregate throughput (faster than link speed)
  - Can add throughput incrementally
  - Lower latency (don't wait to acquire control of the net to xmit)
  - Potential for redundant paths for availability

How big is their switch design supposed to scale?  From 16x16 to 64x64.  Why?
  - Too big would make the switch too expensive for a small network
    Point is you can incrementally build the net by buying more switches
  - But optical drivers are the most expensive part of the switch
    Don't want too many fibers between switches

What is the switch architecture?
* ATM network, uses fixed-size cells
* Does the data-forwarding hardware also do switch scheduling?
  - Traditionally yes, but NOT for this switch
  - Argument: specialize the scheduling hardware for the task
    (including, here, dedicated wires from all input ports to all output ports)
* Crossbar
  - Simple
  - Requires O(N^2) hardware, but still only 5% of the cost of the switch
    (probably because the optical components are so expensive)
  - Could the switch have used a Banyan network instead?
    Requirement is that the fabric not block; a Batcher-Banyan would be OK
  - Could the switch have used shared memory?  No -- not enough bandwidth

What is Head-of-Line (HOL) blocking?
* When each input must send data in FIFO order
* Why is this bad?
  - Limits throughput to about 58% of the link when destinations are
    uniformly distributed!
  - Can be even worse for periodic traffic.  Stationary blocking:

                   ____
        44332211 -|    |-
        44332211 -|    |-
        44332211 -|    |-
        44332211 -|    |-
                   ----

    (Every input's head cell wants the same output at the same time,
     so the inputs keep colliding.)
* Can this be solved by output buffering?
  - Use k replicas of the switch fabric to forward k packets to the same output
  - Buffer packets at the output port when there is contention
  - Drop packets when buffers overflow -- rare with a uniform workload
  - But LAN traffic is not uniform!  (Think client-server)
  - Bad to drop packets in a network with a bit error rate of 10^{-12}
* What is a re-circulating queue?
  - Feeds dropped packets back into the fabric, but eventually drops them anyway
* Other alternative: try the first k packets (non-FIFO)
  - Try the head of line; if that fails, try the second packet, etc.
  - Only tries k packets, so it mitigates but doesn't solve the HOL problem

Parallel Iterative Matching
* What is the difference between a Maximum and a Maximal match?
  - Maximum: pairs the highest possible number of inputs and outputs
  - Maximal: cannot add any new pairing (without removing an existing one)
* What are the advantages of each?
  - Maximum: best throughput (matching)
    Worst-case maximal is only 50% as good, but expect 85%-100%
  - Maximal: potentially easier to compute (especially in a distributed way)
    Maximum matching for an NxN graph with M links takes O(N*(N+M))
  - Maximal: avoids starvation.  Consider the optimum match for:

                   ____
        11111111 -|    |-
        12121212 -|    |-
                   ----

    (The maximum match is always 1->1, 2->2, so input 2's cells for
     output 1 are starved.)
* How does parallel iterative matching work?  (See the sketch below.)
  - Goal: find a maximal match
  1. Each input requests all outputs for which it has cells
  2. Each output randomly grants to one of the requesting inputs
  3. Each input with a grant randomly selects one grant to accept
  4. Repeat steps 1-3 for inputs and outputs not yet matched
* How many times does the algorithm repeat?  4.  Why?
  - 1 Gb/s, 53-byte cells = 424 nsec per cell;
    only time for 4 iterations with the FPGA hardware they used
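A minimal Python sketch of the request/grant/accept loop described above
(not the paper's FPGA implementation; the function name, data layout, and
the 4-iteration default are illustrative assumptions):

    import random

    def pim_schedule(requests, iterations=4):
        # requests[i] = set of output ports for which input i has queued cells.
        # Returns a dict mapping each matched input port to its output port.
        match = {}                  # input -> output pairs fixed so far
        matched_outputs = set()

        for _ in range(iterations):
            # Request phase: every still-unmatched input asks every
            # still-unmatched output it has cells for.
            requesters = {}         # output -> list of requesting inputs
            for i, outs in requests.items():
                if i in match:
                    continue
                for o in outs:
                    if o not in matched_outputs:
                        requesters.setdefault(o, []).append(i)

            # Grant phase: each requested output grants one input at random.
            grants = {}             # input -> set of outputs that granted it
            for o, inputs in requesters.items():
                grants.setdefault(random.choice(inputs), set()).add(o)

            # Accept phase: each input with grants accepts one at random;
            # the losing grants are simply wasted this iteration.
            for i, outs in grants.items():
                o = random.choice(sorted(outs))
                match[i] = o
                matched_outputs.add(o)

        return match

    # Example: inputs 0-2 all want output 0; input 3 has cells for every output.
    print(pim_schedule({0: {0}, 1: {0}, 2: {0}, 3: {0, 1, 2, 3}}))

In the real switch the three phases run in parallel across all ports each
iteration; the sequential loops here are only for readability.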
* How fast would we expect the algorithm to converge?
  - Consider output port Q with n requesting (unmatched) inputs,
    of which k get no grant from any other output
    If Q grants to one of the k, all n requests at Q are resolved:
      that input surely accepts, and no further edge can be added at this output
    If Q grants to one of the other n-k, the k inputs may remain unresolved
    Average unresolved = (k/n)*0 + (1-k/n)*k = k(n-k)/n
    Maximized (worst case) at k = n/2 (differentiate), giving (n/2)(n-n/2)/n = n/4
    So on average at least 3/4 of the requests at an output are resolved each
    iteration => converges in O(log N) iterations
    (details in the appendix, though there is a typo there)
  - Simulations show it achieves a maximal match 98% of the time
* p.5-6: "the switch maintains a FIFO queue for each flow".  Why?
  - E.g., recall the AAL5 CS-PDU from last lecture
    (cells of one CS-PDU must arrive in order to be reassembled)
* Note: no centralized scheduler for the algorithm

How to accommodate CBR (constant bit rate) traffic?
* Divide time up into *frames*, consisting of a fixed number of *slots*
  - A slot is the time to transmit one cell
  - Frame doesn't mean a physical-layer frame -- strictly an internal timing unit
* Bandwidth request expressed as # of cells/frame
* How big are frames?  1000 slots in the prototype.  Trade-offs?
  - Small frame -- low latency (fast frame-turnover time)
  - Large frame -- fine reservation granularity:
    smallest unit is (1/frame size) * (link bandwidth)
* How to determine if you can schedule a b/w request?
  - Find a path through the network on which the new flow doesn't exceed
    any link's bandwidth
  - At each switch, just check that the in and out ports have enough unreserved b/w
* If a path exists, is parallel iterative matching good enough?
  - No.  E.g., because of randomness, nothing is guaranteed
  - So keep an explicit per-port schedule of reserved slots
  - May need to shuffle slots around (e.g., adding flow 2->4 in Fig. 3)
* How expensive is static scheduling?  Is this a problem?
  - O((frame size) * N) -- but okay because it only happens when a reservation
    changes, not per cell

What are the limitations of parallel iterative matching?
* Not good for CBR traffic that doesn't always use its full bit rate
* Not fair.  What is fairness?
  "Every user should receive an equal share of every network resource
   that does not have enough capacity to satisfy all user requests"
  - How is it unfair?  Consider:

                   ____
        11111111 -|    |-
        11111111 -|    |-
        11111111 -|    |-
        43214321 -|    |-
                   ----

    The 4th port's traffic will go to output 1 only 1/5 as often as to
    the other outputs:
    P1 = (1/4)*(1/4) = 6.25%  (output 1 grants port 4 w.p. 1/4, and port 4
    then accepts that grant w.p. 1/4)
    P2 = P3 = P4 = (1/4)*(1/4) + (3/4)*(1/3) = 31.25%
  - Even if the switch is fair, the network might not be (see Fig. 6)

Statistical matching
* Rough idea: weight the probabilities of grants & accepts to achieve fairness
  (see the rough sketch at the end of these notes)
  - No request phase.  Grant in proportion to reserved b/w
  - Similarly, probabilistically weight accepts
  - Repeat a second time, discarding any 2nd-round assignments that conflict

What is the contribution?  How is the evaluation?
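To make the statistical matching idea above a bit more concrete, here is a
rough Python sketch of the two-round, reservation-weighted grant/accept
process as described in these notes.  The function name, the reserved[i][o]
matrix layout, and the strictly sequential rounds are illustrative
assumptions, not the paper's exact algorithm:

    import random

    def statistical_match(reserved, rounds=2):
        # reserved[i][o] = bandwidth (cells/frame) reserved by input i
        # through output o.  No request phase: outputs grant inputs with
        # probability proportional to reserved bandwidth, inputs accept
        # grants the same way, and a second round tries to fill in ports
        # left unmatched by the first.
        n = len(reserved)
        match = {}                          # input -> output
        taken = set()                       # outputs already matched

        for _ in range(rounds):
            free_inputs = [i for i in range(n) if i not in match]

            # Grant phase: each free output picks a free input, weighted
            # by how much bandwidth that input has reserved through it.
            grants = {}                     # input -> list of granting outputs
            for o in range(n):
                if o in taken:
                    continue
                weights = [reserved[i][o] for i in free_inputs]
                if not free_inputs or sum(weights) == 0:
                    continue
                i = random.choices(free_inputs, weights=weights)[0]
                grants.setdefault(i, []).append(o)

            # Accept phase: each granted input accepts one output, again
            # weighted by its reservations.  (In hardware the two rounds
            # overlap and conflicting 2nd-round picks are discarded; in
            # this sequential sketch conflicts cannot arise.)
            for i, outs in grants.items():
                o = random.choices(outs, weights=[reserved[i][x] for x in outs])[0]
                match[i] = o
                taken.add(o)

        return match

The sketch only models reserved bandwidth; it says nothing about how
unreserved (best-effort) traffic gets served alongside it.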