High Speed Switch Scheduling for Local Area Networks
====================================================

What is the goal of the paper?
  - Support speeds of up to 1 Gb/sec on LANs

Advantages of a switched network
  - High aggregate throughput (faster than link speed)
  - Can add throughput incrementally
  - Lower latency (don't wait to acquire control of the net to xmit)
  - Potential for redundant paths for availability

How big is their switch design supposed to scale?  From 16x16 to 64x64.  Why?
  - Too big would make the switch too expensive for a small network
    Point is you can incrementally build the net by buying more switches
  - But optical drivers are the most expensive part of the switch
    Don't want too many fibers between switches

What is the switch architecture?
* ATM network, uses fixed-size cells
* Does the data-forwarding hardware also do switch scheduling?
  - Traditionally yes, but NOT for this switch
  - Argument: specialize the scheduling hardware for the task
    (including, here, dedicated wires from all input ports to all output ports)
* Crossbar
  - Simple
  - Requires O(N^2) hardware, but still only 5% of the cost of the switch
    (probably because the optical components are so expensive)
  - Could the switch have used a Banyan network instead?
    Requirement is that the fabric not block; a Batcher-Banyan would be OK
  - Could the switch have used shared memory?  No -- not enough bandwidth

What is Head-of-Line (HOL) blocking?
* When each input must send data in FIFO order
* Why is this bad?
  - Limits throughput to about 58% of the link when destinations are
    uniformly distributed!
  - Can be even worse for periodic traffic.  Stationary blocking:

                   ____
        44332211 -|    |-
        44332211 -|    |-
        44332211 -|    |-
        44332211 -|    |-
                   ----

    (Every input's head cell wants the same output at the same time,
     so the inputs keep colliding.)
* Can this be solved by output buffering?
  - Use k replicas of the switch fabric to forward k packets to the same output
  - Buffer packets at the output port when there is contention
  - Drop packets when buffers overflow -- rare with a uniform workload
  - But LAN traffic is not uniform!  (Think client-server)
  - Bad to drop packets in a network with a bit error rate of 10^{-12}
* What is a re-circulating queue?
  - Feeds dropped packets back into the fabric, but eventually drops them anyway
* Other alternative: try the first k packets (non-FIFO)
  - Try the head of line; if that fails, try the second packet, etc.
  - Only tries k packets, so it mitigates but doesn't solve the HOL problem

Parallel Iterative Matching
* What is the difference between a Maximum and a Maximal match?
  - Maximum: pairs the highest possible number of inputs and outputs
  - Maximal: cannot add any new pairing (without removing an existing one)
* What are the advantages of each?
  - Maximum: best throughput (matching)
    Worst-case maximal is only 50% as good, but expect 85%-100%
  - Maximal: potentially easier to compute (especially in a distributed way)
    Maximum matching for an NxN graph with M links takes O(N*(N+M))
  - Maximal: avoids starvation.  Consider the optimum match for:

                   ____
        11111111 -|    |-
        12121212 -|    |-
                   ----

    (The maximum match is always 1->1, 2->2, so input 2's cells for
     output 1 are starved.)
* How does parallel iterative matching work?  (See the sketch below.)
  - Goal: find a maximal match
  1. Each input requests all outputs for which it has cells
  2. Each output randomly grants to one of the requesting inputs
  3. Each input with a grant randomly selects one grant to accept
  4. Repeat steps 1-3 for inputs and outputs not yet matched
* How many times does the algorithm repeat?  4.  Why?
  - 1 Gb/s, 53-byte cells = 424 nsec per cell;
    only time for 4 iterations with the FPGA hardware they used
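A minimal Python sketch of the request/grant/accept loop described above
(not the paper's FPGA implementation; the function name, data layout, and
the 4-iteration default are illustrative assumptions):

    import random

    def pim_schedule(requests, iterations=4):
        # requests[i] = set of output ports for which input i has queued cells.
        # Returns a dict mapping each matched input port to its output port.
        match = {}                  # input -> output pairs fixed so far
        matched_outputs = set()

        for _ in range(iterations):
            # Request phase: every still-unmatched input asks every
            # still-unmatched output it has cells for.
            requesters = {}         # output -> list of requesting inputs
            for i, outs in requests.items():
                if i in match:
                    continue
                for o in outs:
                    if o not in matched_outputs:
                        requesters.setdefault(o, []).append(i)

            # Grant phase: each requested output grants one input at random.
            grants = {}             # input -> set of outputs that granted it
            for o, inputs in requesters.items():
                grants.setdefault(random.choice(inputs), set()).add(o)

            # Accept phase: each input with grants accepts one at random;
            # the losing grants are simply wasted this iteration.
            for i, outs in grants.items():
                o = random.choice(sorted(outs))
                match[i] = o
                matched_outputs.add(o)

        return match

    # Example: inputs 0-2 all want output 0; input 3 has cells for every output.
    print(pim_schedule({0: {0}, 1: {0}, 2: {0}, 3: {0, 1, 2, 3}}))

In the real switch the three phases run in parallel across all ports each
iteration; the sequential loops here are only for readability.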
* How fast would we expect the algorithm to converge?
  - Consider output port Q with n requesting (unmatched) inputs,
    of which k get no grant from any other output
    If Q grants to one of the k, all n requests at Q are resolved:
      that input surely accepts, and no further edge can be added at this output
    If Q grants to one of the other n-k, the k inputs may remain unresolved
    Average unresolved = (k/n)*0 + (1-k/n)*k = k(n-k)/n
    Maximized (worst case) at k = n/2 (differentiate), giving (n/2)(n-n/2)/n = n/4
    So on average at least 3/4 of the requests at an output are resolved each
    iteration => converges in O(log N) iterations
    (details in the appendix, though there is a typo there)
  - Simulations show it achieves a maximal match 98% of the time
* p.5-6: "the switch maintains a FIFO queue for each flow".  Why?
  - E.g., recall the AAL5 CS-PDU from last lecture
    (cells of one CS-PDU must arrive in order to be reassembled)
* Note: no centralized scheduler for the algorithm

How to accommodate CBR (constant bit rate) traffic?
* Divide time up into *frames*, consisting of a fixed number of *slots*
  - A slot is the time to transmit one cell
  - Frame doesn't mean a physical-layer frame -- strictly an internal timing unit
* Bandwidth request expressed as # of cells/frame
* How big are frames?  1000 slots in the prototype.  Trade-offs?
  - Small frame -- low latency (fast frame-turnover time)
  - Large frame -- fine reservation granularity:
    smallest unit is (1/frame size) * (link bandwidth)
* How to determine if you can schedule a b/w request?
  - Find a path through the network on which the new flow doesn't exceed
    any link's bandwidth
  - At each switch, just check that the in and out ports have enough unreserved b/w
* If a path exists, is parallel iterative matching good enough?
  - No.  E.g., because of randomness, nothing is guaranteed
  - So keep an explicit per-port schedule of reserved slots
  - May need to shuffle slots around (e.g., adding flow 2->4 in Fig. 3)
* How expensive is static scheduling?  Is this a problem?
  - O((frame size) * N) -- but okay because it only happens when a reservation
    changes, not per cell

What are the limitations of parallel iterative matching?
* Not good for CBR traffic that doesn't always use its full bit rate
* Not fair.  What is fairness?
  "Every user should receive an equal share of every network resource
   that does not have enough capacity to satisfy all user requests"
  - How is it unfair?  Consider:

                   ____
        11111111 -|    |-
        11111111 -|    |-
        11111111 -|    |-
        43214321 -|    |-
                   ----

    The 4th port's traffic will go to output 1 only 1/5 as often as to
    the other outputs:
    P1 = (1/4)*(1/4) = 6.25%  (output 1 grants port 4 w.p. 1/4, and port 4
    then accepts that grant w.p. 1/4)
    P2 = P3 = P4 = (1/4)*(1/4) + (3/4)*(1/3) = 31.25%
  - Even if the switch is fair, the network might not be (see Fig. 6)

Statistical matching
* Rough idea: weight the probabilities of grants & accepts to achieve fairness
  (see the rough sketch at the end of these notes)
  - No request phase.  Grant in proportion to reserved b/w
  - Similarly, probabilistically weight accepts
  - Repeat a second time, discarding any 2nd-round assignments that conflict

What is the contribution?  How is the evaluation?
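To make the statistical matching idea above a bit more concrete, here is a
rough Python sketch of the two-round, reservation-weighted grant/accept
process as described in these notes.  The function name, the reserved[i][o]
matrix layout, and the strictly sequential rounds are illustrative
assumptions, not the paper's exact algorithm:

    import random

    def statistical_match(reserved, rounds=2):
        # reserved[i][o] = bandwidth (cells/frame) reserved by input i
        # through output o.  No request phase: outputs grant inputs with
        # probability proportional to reserved bandwidth, inputs accept
        # grants the same way, and a second round tries to fill in ports
        # left unmatched by the first.
        n = len(reserved)
        match = {}                          # input -> output
        taken = set()                       # outputs already matched

        for _ in range(rounds):
            free_inputs = [i for i in range(n) if i not in match]

            # Grant phase: each free output picks a free input, weighted
            # by how much bandwidth that input has reserved through it.
            grants = {}                     # input -> list of granting outputs
            for o in range(n):
                if o in taken:
                    continue
                weights = [reserved[i][o] for i in free_inputs]
                if not free_inputs or sum(weights) == 0:
                    continue
                i = random.choices(free_inputs, weights=weights)[0]
                grants.setdefault(i, []).append(o)

            # Accept phase: each granted input accepts one output, again
            # weighted by its reservations.  (In hardware the two rounds
            # overlap and conflicting 2nd-round picks are discarded; in
            # this sequential sketch conflicts cannot arise.)
            for i, outs in grants.items():
                o = random.choices(outs, weights=[reserved[i][x] for x in outs])[0]
                match[i] = o
                taken.add(o)

        return match

The sketch only models reserved bandwidth; it says nothing about how
unreserved (best-effort) traffic gets served alongside it.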