Multicast ========= Suppose you want to send message to many recipients. Examples: Internet radio Stock quote information Internet multi-way chat / video conferencing Multi-player games Can just send many copies of same message (iterated unicast). Drawbacks? 128Kbps MP3 stream x 9,000 listeners = 1.1Gbps ==> huge bandwidth costs Same data sent many times over same link. Known as *link stress* Want some kind of multicast service. What is this? Receivers join a multicast group G Senders send packets to address G Network routes and delivers packets to all members of G On Internet, class D addresses (start 1110) 224.0.0.0 - 239.255.255.255 What can you do over a single Ethernet? Use multicast -- send one copy over entire network (easy w. shared medium) Use Ethernet multicast address range 00:00:5e:00:00:00 -- 00:00:5e:7f:ff:ff Simply set low 23-bits of Ethernet address to low 23 of IP address But can't do that over Internet (across Ethernets) Form a tree out of receivers Each radio station listener forwards data to two other listeners Still have some link stress--but pushed out closer to receivers Might be bad for recipients who don't have much upstream b/w to spare Now messages have to travel over more network hops for many recipients Will add latency to message delivery This is known as *stretch* Can we reduce upstream b/w for receivers in tree scenario? Notice that internal nodes send 2x bit rate, leaves send 0x bit rate So can form a "forest" of multicast trees, splitting data E.g., 9 trees, each send 1 bit per byte, plus one tree for parity Nodes are leaves in some trees, internal in others Can reduce total upstream utilization to 1x bit rate What about eliminating stretch? Optimize topology of your tree? Would help Change the *routers* to send copies of packets out multiple interfaces Now routers themselves form the multicast tree This is what IP multicast does Two approaches to forming trees Source-specific trees For each sender to multicast group figure out tree with shortest path to each recipient minimizes both stress and stretch Shared trees Figure out one tree To send message, forward to root of tree What are advantages/disadvantages of each? Hard to find one shared tree that's best for many senders But required state in routers much larger for source-specific How to form trees? Exploit existing unicast routing infrastructure Each router figures out what multicast groups are being subscribed to Hosts on LAN periodically send IGMP packets with groups they belong to Somehow combine this info w. routing protocol to form multicast trees Augmenting link state routing protocols Recall each router floods network with its link state information (LSP) Allows each router to compute global picture and find best routes Add list of groups subscribed to on each lan to LSP Now can use Dijkstra's algorithm to find shortest-path multicast tree from any source to any group Expensive... so only calculate for active senders Augmenting distance vector protocols Recall each router keeps list of triples How would you do broadcast without loops? Reverse path broadcast (RPB): Router receives packet from source S on link L If L is the next hop for packets *to* S Then forward the packet out all links except L Why does this work? Only forwards packets farther from S Drawbacks Duplicate packets on one LAN, if multiple routers connected. Fix? Designate one router as parent for particular LAN/source pair E.g., This is broadcast, and we want multicast Reverse path multicast (RPM) Again, use IGMP to announce when you are interested in multicast group Propagate "no members of G here" info up the shortest-path tree But requires a lot of bandwidth/info So use RPB (broadcast) until some source starts sending Only for active sources do you prune tree Protocol-independent multicast (PIM) Motivation 1: Want multicast that works regardless of routing (LS/DV/etc.) Motivation 2: Want to handle sparse and dense groups efficiently E.g., RPM not very efficient for very small groups Sparse-mode PIM (PIM-SM) Each group has an explicit Rendezvous Point (RP) Routers send explicit join and leave ("prune") messages to RP Join message contains (*,G), where * means all senders Used to construct a shared multicast tree with RP as root Routers forwarding join requests see which links to send packets down But we know shared trees not optimal... so separate mechanism allows construction of source-specific trees Recipients notice lots of traffic from a source Send source-specific join messages (S,G) towards S Routers can now construct new source-specific tree for S But must keep RP for when new users join Problem: Not all routers implement IP multicast Backbone routers in particular Even if routers support feature, ISP probably scared to turn it on How to use multicast over wide-area network? Use overlay network MBone: groups of multicast-enabled routers, connected by tunnels Can encapsulate multicast IP packets entirely within unicast packets What are the scalability issues for the MBone Bandwidth utilization State at routers Solutions Aggregate network addresses (as with class A/B/C & CIDR) Regions (like ASes, reduces routing state) Scoped addresses 224.0.0.0 - 224.0.0.255 only on local network 224.0.1.0 - 238.255.255.255 globally-scoped 239.0.0.0 - 239.255.255.255 limited scope stay within organization Site-local scope: 239.253.0.0/16 Organization-local scope: 239.192.0/14 routers typically filter these address at the border 233.0.0.0/8 "glop" addressing - second and third bytes are AS numbers Traffic rate limiting We've seen three layers at which to implement multicast - Data link layer (Ethernet) - Network layer (IP multicast) - Application layer (End-system multicast) Which is best? Depends on the situation Data link layer: Efficient, but doesn't work across networks IP multicast: Doesn't scall well with large # of groups Supporting higher-level functionality (e.g., reliability) hard Not available everywhere Application multicast: Most generally available, but: Applications can implement layered multicast I.e., fast hosts get higher-resolution video "Downsample" when forwarding to lower-levels of tree with less b/w More stress and stretch than IP multicast With NAT might be still complicated to implement Maybe can trust routers more than end-systems--possibly less reliable Say you have many senders in a multicast group... What is the order of messages? (Does this question even make sense?) Would at least like to preserve causality If message X caused message Y, then X precedes Y What is causality? We want a notion of X "happens before" Y. Powerful enough to help us write distributed applications. Weak enough to be efficiently implemented. I.e. no global clock or central time-stamp server. - What does "happening before" mean in a distributed system? - In normal life, we look at a physical clock and say something happens before or after. - Using a physical clocks in a distributed system to determine whether an event happens before or after is difficult----it requires having physical clocks on each computer, and synchronizing clocks accurately is difficult. - The "happens-before" relationship in a distributed system: Model: Processes are a sequence of events, where events are an abstract notion, which could be a single instruction, a procedure, sending a message, receiving an interrupt, etc. Happens before within a process: an event A that precedes event B in the sequence is said to happen before B. Lets assume two events: send and receiving messages. A happens before B (1) if A and B are events in the same process and A comes before B, or (2) if A is the sending of a message by one process and B is the receipt of the same message by another process. Notation: A happens before B is written as A -> B. Two events are concurrent if A didn't happen before B and B didn't happen before A. Happens-before defines an irreflexive partial ordering. We want *other* nodes to see events in an order that obeys ->. - Logical clocks -- called Lamport Clocks - A distributed algorithm to order events in a way that preserves causality. - Using the happens-before relationship we can define a logical clock. - Define Ci, a *logical* clock for process Pi as follows: Ci(A) = a number for event A. The number can be a counter. - The global Clock C for the system must adhere to: if A -> B, then C(A) < C(B) This condition is satisfied if: (1) if A and B are events in Pi, then Ci(A) < Ci(B) (2) if A is the sending of a message by process Pi and B is the receipt of that message by process Pj, then Ci(A) < Cj(B). - Implementation rules: (1) Pi increments Ci between any two successive events (2)(a) If event A is the sending of a message M by process Pi, the the message contains a timestamp Tm = Ci(a). (b) Upon receiving a message M, process Pj sets Cj >= its present value and greater than Tm. - Example: use logical clocks to order messages. - A -> B implies C(A) < C(B), but C(A) < C(B) DOES NOT IMPLY A -> B (events may happen concurrently) - Note that two concurrent events may have the same time stamp - Total ordering of events - Order events A -> B by their logical clocks, Break ties by ordering them in some arbitrary order - A ==> B if and only if: (1) Ci(B) < Cj(B), or (2) Ci(A) == Cj(B) and Pi < Pj. ==> is a total order.