Borrowed Virtual Time
=====================

What is the goal of this work?
  They propose "a candidate 'universal' processor scheduler"
    Should support a mix of real-time and best-effort tasks
    Should be simple for programmers to use
    Should be easy and cheap to implement

What's wrong with previous schedulers?
  Conventional schedulers - bad for real-time tasks
  Priority schedulers
    Get into serious trouble if a high-priority thread over-consumes CPU
  Deadline-based schedulers
    Require the application to predict when and where it will need CPU
    Penalty for overestimating CPU needs - might get refused
    Penalty for underestimating CPU needs - might miss deadline
    Don't work at all for things like network packet receive code

What is the basic idea of the BVT algorithm?
  Monitor the "virtual time" A_i of each process
  Schedule the process with the lowest "effective virtual time" E_i
    What's this?  A special "warp" factor borrows against future CPU usage:
      E_i = A_i - (warp_i ? W_i : 0)
    ... hence the name of the algorithm

How to assign different priorities to threads?  (e.g., gcc vs. bigsim)
  Each thread has a "weight" determining its fraction of the CPU
    Intuitively, thread i should get a w_i / (Sigma_j w_j) fraction of the CPU
  If thread i runs for time t, charge it A_i += t/w_i virtual time
    (to implement, pre-compute m_i = N/w_i for some fixed N, so that
    running for k ticks costs A_i += k*m_i; a C sketch of this
    bookkeeping appears below, after the L_i/U_i discussion)

So let's simulate gcc (w=2) and bigsim (w=1), with no warping
  When do we context switch?  On timer (clock) interrupts?
    clock 0 => gcc       clock 3 => gcc
    clock 1 => gcc       clock 4 => gcc
    clock 2 => bigsim    clock 5 => bigsim
  Is this a good idea?  No - too many context switches
    E.g., if a context switch costs 5 µsec and clock interrupts come
    every 100 µsec, that's 5% overhead
    (Other context-switch costs include TLB and cache misses after the switch)
  Okay, so why not switch every 100 msec (the traditional Unix quantum)?
    100 msec may be too much latency for a real-time, high-weight thread
    E.g., don't want bigsim delaying mpeg_play for longer than the frame time

When do they actually context switch?
  Add a context-switch allowance value, C (e.g., 200 msec)
  Switch from i -> j when E_j <= E_i - C/w_i

With the algorithm so far, what happens if thread i wakes up after 30 seconds?
  A_i is much smaller than the other running threads'--so thread i could
  monopolize the CPU
  How to fix?  Bound the lag with Scheduler Virtual Time (SVT):
    After waking up, set A_i = max(A_i, SVT)
    SVT is min A_j over all runnable threads j - see Fig 2
  For which threads should the scheduler consider A_j in computing SVT?
    Blocked in socket read?  No
    Blocked in file read?  Maybe (probably not if over NFS)
    Blocked in page fault?  Probably, as anything else would be unfair
      If a thread is descheduled for a while, it might also be paged out
      It should be allowed to run to page itself back in

What is the warp parameter W_i, and how does it get set?
  Intuitively, W_i gives a thread dispatch preference when it is scheduled
    Doesn't affect a thread's long-term share of the CPU, which its
    weight specifies
    Affects priority on wakeup when the thread is not using its full share
  W_i is set by syscall; sounds like you must be root to increase it
  But recall W_i only affects E_i if warp_i = true.  How to set warp_i?
    Can set it with a system call, or
    Can set it for a signal handler with the SA_BVT_WARP flag
  Look at Fig 3 to see the effect on mpeg_player

What are the L_i and U_i values?
  L_i limits how long you can run warped - why might you want this?
    Example: a packet receive thread
      Usually want low-latency handling of received packets
      But under heavy load, it might overly delay downstream processing
    Note L_i = 0 means no limit (okay if the W_i value is small anyway)
  U_i says how often the L_i limit is reset
    Example: a task that runs for 1 msec every 100 msec
      Usually want low latency
      But it could malfunction and call setitimer with a 2 msec interval
      That would add too much latency to other processes, so set U_i = 100 msec
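To make the bookkeeping concrete, here is a minimal sketch in C of the
pieces above: effective virtual time, per-tick charging, the
context-switch allowance test, and the SVT adjustment on wakeup.  All
names here (struct bvt_thread, bvt_evt, BVT_N, etc.) are invented for
illustration, and the L_i/U_i warp limits are omitted; the paper's
actual Linux patch is organized differently.

    #include <stdbool.h>
    #include <stdint.h>

    #define BVT_N 1000          /* scale factor: m_i = BVT_N / w_i */

    struct bvt_thread {
        int64_t avt;            /* A_i: actual virtual time (scaled by BVT_N) */
        int64_t mcu;            /* m_i = BVT_N / w_i, precomputed from weight */
        int64_t warp;           /* W_i: warp value, in the same scaled units */
        bool    warped;         /* warp_i: is the warp currently in effect? */
    };

    /* E_i = A_i - (warp_i ? W_i : 0) */
    static int64_t bvt_evt(const struct bvt_thread *t)
    {
        return t->avt - (t->warped ? t->warp : 0);
    }

    /* Charge a thread for running k clock ticks: A_i += k * m_i,
     * the precomputed form of A_i += t / w_i. */
    static void bvt_charge(struct bvt_thread *t, int64_t ticks)
    {
        t->avt += ticks * t->mcu;
    }

    /* Switch from current thread i to candidate j only when
     * E_j <= E_i - C/w_i.  With A scaled by BVT_N, C/w_i becomes
     * c_ticks * m_i, where c_ticks is the allowance in clock ticks. */
    static bool bvt_should_switch(const struct bvt_thread *cur,
                                  const struct bvt_thread *cand,
                                  int64_t c_ticks)
    {
        return bvt_evt(cand) <= bvt_evt(cur) - c_ticks * cur->mcu;
    }

    /* On wakeup, bound the sleeper's lag with the Scheduler Virtual
     * Time, SVT = min A_j over runnable threads: A_i = max(A_i, SVT). */
    static void bvt_wakeup(struct bvt_thread *t, int64_t svt)
    {
        if (t->avt < svt)
            t->avt = svt;
    }

With m_i = BVT_N/w_i, the gcc/bigsim trace falls out of bvt_charge: at
w=2, each gcc tick costs half the virtual time of a bigsim tick, so gcc
runs roughly two ticks for every one of bigsim's.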
How might BVT help Google's web servers?
  Common queries run 150 times faster than uncommon ones
  Have a pool of 10 threads to handle requests
  Assign a W_i value sufficient to process a fast query (say 50)
  What happens with 1 slow query and a small trickle of fast queries?
    Fast queries come in, warped by 50, and execute immediately
    The slow query runs in the background
  What happens with 1 slow query but many fast queries?
    At first, only fast queries run
    But SVT is bounded by the A_i of slow-query thread i
    Eventually a fast-query thread j gets A_j = max(A_j, SVT) = A_j
      (wakeups no longer reset its virtual time downward, so
      E_j = A_j - W_j keeps growing until it passes A_i)
    At that point thread i will run again, so no starvation

What is multi-level scheduling, and why do you want it?
  Idea: treat one scheduler like a single task in its parent scheduler
    Whenever the child scheduler is dispatched, run its first task as usual
  Allows easy mixing of real-time and conventional tasks
    E.g., two real-time tasks A, B need 10% and 20% of the CPU each
    Assign weights w_A = 1, w_B = 2, w_{child-scheduler} = 7
    Conventional processes are inserted on the child scheduler
    Give A and B very high warp values W_A, W_B
  What happens if B goes nuts and starts chewing up CPU time?
    It might monopolize the CPU for the duration of W_B
    But then the child scheduler still has 70% of the CPU
    So you can actually type "kill" to your shell (very important!)
  What do you do with warp values in a child scheduler?  Two options
  (sketched in code below):
    "threshold mode"
      child scheduler C has its own fixed warp value W_C
      set warp_C = true if the selected thread is warped over a certain amount
      note warp values are still used for decisions within C
    "direct mode"
      always run as if warp_C = warp_i and W_C = W_i for the current thread i
  When would you use the two modes?
    If the parent is real-time and the child is best-effort, want threshold mode
      (otherwise, a conventional task could mess up a real-time task)
    If the child is real-time, then use direct mode
      Might happen if you want to give a bunch of RT tasks a fixed
      fraction of the CPU
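A small sketch of the two modes, reusing struct bvt_thread from the
earlier sketch; the type and function names here are likewise invented:

    enum bvt_warp_mode { BVT_THRESHOLD_MODE, BVT_DIRECT_MODE };

    struct bvt_child_sched {
        enum bvt_warp_mode mode;
        int64_t fixed_warp;      /* W_C: the child's own warp (threshold mode) */
        int64_t warp_threshold;  /* "warped over a certain amount" cutoff */
        /* ... plus the child's own run queue of struct bvt_thread ... */
    };

    /* Compute the (warp_C, W_C) pair the parent scheduler should see,
     * given the thread `next` that the child would dispatch if run now. */
    static void bvt_child_warp(const struct bvt_child_sched *cs,
                               const struct bvt_thread *next,
                               bool *warp_c, int64_t *w_c)
    {
        if (cs->mode == BVT_DIRECT_MODE) {
            /* Direct mode: run as if warp_C = warp_i and W_C = W_i for
             * the child's current thread i. */
            *warp_c = next->warped;
            *w_c = next->warp;
        } else {
            /* Threshold mode: the child keeps its own fixed W_C and
             * claims to be warped only if its selected thread is warped
             * past the threshold.  (Warp values are still used for
             * dispatch decisions inside the child.) */
            *warp_c = next->warped && next->warp >= cs->warp_threshold;
            *w_c = cs->fixed_warp;
        }
    }

Threshold mode insulates the parent from whatever warp values tasks
inside the child set for themselves; direct mode lets a real-time
child's urgency propagate upward unchanged.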
Linux implementation details
  Pre-compute the m_i value based on w_i
  Always determine both the current & next thread, to keep a "countdown" value
    (saves running the algorithm on each timer interrupt)
  Always run the thread with the lowest E_i if it just woke up
    In other words, C does not apply after a wakeup event--why?

What questions should we be asking in evaluating BVT?
  Is the algorithm itself expensive to run?  No - costs at most 0.3%
  Do processes get their fair share of the CPU?  Yes - this is easy
  Are dispatch latency & response time good?  More complicated - discussed below
  Is BVT a candidate for a "universal" processor scheduler?
    I.e., can we do anything we need?

Look at Figure 5
  Why does BVT(diff) do better than BVT(same)?
    With equal weights, if the 2 processes are not schedulable, both will
    miss their deadlines (with different weights, at least the
    higher-weight process still makes its deadline)
  Looks like DBS (deadline-based scheduler) does best... ditch BVT?
    No - the point is that utilization is hard to predict
    This graph only shows e_i (prediction error) = 0
    That's why we have Figs 6-8
  In Fig 6, DBS-lower does better than DBS-upper, but not always in Fig 8.  Why?
    Less contention (Fig 6) means no penalty for over-allocating
  BVT does better with error and contention.  Why?
    Because BVT doesn't have error--apps don't need to predict their usage
    There's a higher penalty for error with more contention

How to calculate weights in a hard real-time system?  Sec 6.1.1, p. 10

Can you emulate priority-based scheduling with BVT?  Sec 6.1.3, p. 12
  High level: use warp to emulate priority (see the sketch at the end of
  these notes)
  Works fine as long as a thread doesn't fail and chew up too much CPU

What are the shortcomings of BVT?  What can't it emulate?
  Tickets & donation (e.g., to avoid priority inversion in RPC)

p. 16: "It is better to provide a single simple measure, such as virtual
  time, that provides well-specified behavior relative to other tasks and
  have applications map their requirements onto this measure than to
  require applications to specify their requirements in detail."
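Appendix: a sketch of the warp-as-priority idea from Sec 6.1.3,
referenced above.  The spacing constant and helper are invented for
illustration (the paper gives the idea, not this code); struct
bvt_thread is from the first sketch.  Each priority level gets a warp
value spaced widely enough that a higher level's E_i is always lower on
wakeup, while equal weights mean a runaway "high-priority" thread is
still charged virtual time and cannot starve the rest forever.

    /* One warp step per priority level, chosen >> any plausible lag
     * between runnable threads (illustrative value only). */
    #define BVT_WARP_PER_LEVEL (INT64_C(1) << 20)

    static void bvt_set_priority(struct bvt_thread *t, int level)
    {
        t->warp = (int64_t)level * BVT_WARP_PER_LEVEL;
        t->warped = true;       /* run warped all the time */
    }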