Borrowed Virtual Time
=====================

What is the goal of this work?
  They propose "a candidate 'universal' processor scheduler"
    Should support a mix of real-time and best-effort tasks
    Should be simple for programmers to use
    Should be easy and cheap to implement

What's wrong with previous schedulers?
  Conventional schedulers - bad for real-time tasks
  Priority schedulers
    Get into serious trouble if a high-priority thread over-consumes CPU
  Deadline-based schedulers
    Require the application to predict when and where it will need CPU
    Penalty for overestimating CPU needs - might get refused
    Penalty for underestimating CPU needs - might miss deadline
    Don't work at all for things like network packet receive code

What is the basic idea of the BVT algorithm?
  Monitor the "virtual time" A_i of each process
  Schedule the process with the lowest "effective virtual time" E_i
    What's this?  A special "warp" factor borrows against future CPU usage:
      E_i = A_i - (warp_i ? W_i : 0)
    ... hence the name of the algorithm

How to assign different priorities to threads?  (e.g., gcc vs. bigsim)
  Each thread has a "weight" determining its fraction of the CPU
    Intuitively, thread i should get a w_i / (Sigma_j w_j) fraction of the CPU
  If thread i runs for time t, charge it A_i += t/w_i virtual time
    (to implement, pre-compute m_i = N/w_i for some fixed N, so that
    running for k ticks costs A_i += k*m_i; a C sketch of this
    bookkeeping appears below, after the L_i/U_i discussion)

So let's simulate gcc (w=2) and bigsim (w=1), with no warping
  When do we context switch?  On timer (clock) interrupts?
    clock 0 => gcc       clock 3 => gcc
    clock 1 => gcc       clock 4 => gcc
    clock 2 => bigsim    clock 5 => bigsim
  Is this a good idea?  No - too many context switches
    E.g., if a context switch costs 5 µsec and clock interrupts come
    every 100 µsec, that's 5% overhead
    (Other context-switch costs include TLB and cache misses after the switch)
  Okay, so why not switch every 100 msec (the traditional Unix quantum)?
    100 msec may be too much latency for a real-time, high-weight thread
    E.g., don't want bigsim delaying mpeg_play for longer than the frame time

When do they actually context switch?
  Add a context-switch allowance value, C (e.g., 200 msec)
  Switch from i -> j when E_j <= E_i - C/w_i

With the algorithm so far, what happens if thread i wakes up after 30 seconds?
  A_i is much smaller than the other running threads'--so thread i could
  monopolize the CPU
  How to fix?  Bound the lag with Scheduler Virtual Time (SVT):
    After waking up, set A_i = max(A_i, SVT)
    SVT is min A_j over all runnable threads j - see Fig 2
  For which threads should the scheduler consider A_j in computing SVT?
    Blocked in socket read?  No
    Blocked in file read?  Maybe (probably not if over NFS)
    Blocked in page fault?  Probably, as anything else would be unfair
      If a thread is descheduled for a while, it might also be paged out
      It should be allowed to run to page itself back in

What is the warp parameter W_i, and how does it get set?
  Intuitively, W_i gives a thread dispatch preference when it is scheduled
    Doesn't affect a thread's long-term share of the CPU, which its
    weight specifies
    Affects priority on wakeup when the thread is not using its full share
  W_i is set by syscall; sounds like you must be root to increase it
  But recall W_i only affects E_i if warp_i = true.  How to set warp_i?
    Can set it with a system call, or
    Can set it for a signal handler with the SA_BVT_WARP flag
  Look at Fig 3 to see the effect on mpeg_player

What are the L_i and U_i values?
  L_i limits how long you can run warped - why might you want this?
    Example: a packet receive thread
      Usually want low-latency handling of received packets
      But under heavy load, it might overly delay downstream processing
    Note L_i = 0 means no limit (okay if the W_i value is small anyway)
  U_i says how often the L_i limit is reset
    Example: a task that runs for 1 msec every 100 msec
      Usually want low latency
      But it could malfunction and call setitimer with a 2 msec interval
      That would add too much latency to other processes, so set U_i = 100 msec
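To make the bookkeeping concrete, here is a minimal sketch in C of the
pieces above: effective virtual time, per-tick charging, the
context-switch allowance test, and the SVT adjustment on wakeup.  All
names here (struct bvt_thread, bvt_evt, BVT_N, etc.) are invented for
illustration, and the L_i/U_i warp limits are omitted; the paper's
actual Linux patch is organized differently.

    #include <stdbool.h>
    #include <stdint.h>

    #define BVT_N 1000          /* scale factor: m_i = BVT_N / w_i */

    struct bvt_thread {
        int64_t avt;            /* A_i: actual virtual time (scaled by BVT_N) */
        int64_t mcu;            /* m_i = BVT_N / w_i, precomputed from weight */
        int64_t warp;           /* W_i: warp value, in the same scaled units */
        bool    warped;         /* warp_i: is the warp currently in effect? */
    };

    /* E_i = A_i - (warp_i ? W_i : 0) */
    static int64_t bvt_evt(const struct bvt_thread *t)
    {
        return t->avt - (t->warped ? t->warp : 0);
    }

    /* Charge a thread for running k clock ticks: A_i += k * m_i,
     * the precomputed form of A_i += t / w_i. */
    static void bvt_charge(struct bvt_thread *t, int64_t ticks)
    {
        t->avt += ticks * t->mcu;
    }

    /* Switch from current thread i to candidate j only when
     * E_j <= E_i - C/w_i.  With A scaled by BVT_N, C/w_i becomes
     * c_ticks * m_i, where c_ticks is the allowance in clock ticks. */
    static bool bvt_should_switch(const struct bvt_thread *cur,
                                  const struct bvt_thread *cand,
                                  int64_t c_ticks)
    {
        return bvt_evt(cand) <= bvt_evt(cur) - c_ticks * cur->mcu;
    }

    /* On wakeup, bound the sleeper's lag with the Scheduler Virtual
     * Time, SVT = min A_j over runnable threads: A_i = max(A_i, SVT). */
    static void bvt_wakeup(struct bvt_thread *t, int64_t svt)
    {
        if (t->avt < svt)
            t->avt = svt;
    }

With m_i = BVT_N/w_i, the gcc/bigsim trace falls out of bvt_charge: at
w=2, each gcc tick costs half the virtual time of a bigsim tick, so gcc
runs roughly two ticks for every one of bigsim's.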
How might BVT help Google's web servers?
  Common queries run 150 times faster than uncommon ones
  Have a pool of 10 threads to handle requests
  Assign a W_i value sufficient to process a fast query (say 50)
  What happens with 1 slow query and a small trickle of fast queries?
    Fast queries come in, warped by 50, and execute immediately
    The slow query runs in the background
  What happens with 1 slow query but many fast queries?
    At first, only fast queries run
    But SVT is bounded by the A_i of slow-query thread i
    Eventually a fast-query thread j gets A_j = max(A_j, SVT) = A_j
      (wakeups no longer reset its virtual time downward, so
      E_j = A_j - W_j keeps growing until it passes A_i)
    At that point thread i will run again, so no starvation

What is multi-level scheduling, and why do you want it?
  Idea: treat one scheduler like a single task in its parent scheduler
    Whenever the child scheduler is dispatched, run its first task as usual
  Allows easy mixing of real-time and conventional tasks
    E.g., two real-time tasks A, B need 10% and 20% of the CPU each
    Assign weights w_A = 1, w_B = 2, w_{child-scheduler} = 7
    Conventional processes are inserted on the child scheduler
    Give A and B very high warp values W_A, W_B
  What happens if B goes nuts and starts chewing up CPU time?
    It might monopolize the CPU for the duration of W_B
    But then the child scheduler still has 70% of the CPU
    So you can actually type "kill" to your shell (very important!)
  What do you do with warp values in a child scheduler?  Two options
  (sketched in code below):
    "threshold mode"
      child scheduler C has its own fixed warp value W_C
      set warp_C = true if the selected thread is warped over a certain amount
      note warp values are still used for decisions within C
    "direct mode"
      always run as if warp_C = warp_i and W_C = W_i for the current thread i
  When would you use the two modes?
    If the parent is real-time and the child is best-effort, want threshold mode
      (otherwise, a conventional task could mess up a real-time task)
    If the child is real-time, then use direct mode
      Might happen if you want to give a bunch of RT tasks a fixed
      fraction of the CPU
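A small sketch of the two modes, reusing struct bvt_thread from the
earlier sketch; the type and function names here are likewise invented:

    enum bvt_warp_mode { BVT_THRESHOLD_MODE, BVT_DIRECT_MODE };

    struct bvt_child_sched {
        enum bvt_warp_mode mode;
        int64_t fixed_warp;      /* W_C: the child's own warp (threshold mode) */
        int64_t warp_threshold;  /* "warped over a certain amount" cutoff */
        /* ... plus the child's own run queue of struct bvt_thread ... */
    };

    /* Compute the (warp_C, W_C) pair the parent scheduler should see,
     * given the thread `next` that the child would dispatch if run now. */
    static void bvt_child_warp(const struct bvt_child_sched *cs,
                               const struct bvt_thread *next,
                               bool *warp_c, int64_t *w_c)
    {
        if (cs->mode == BVT_DIRECT_MODE) {
            /* Direct mode: run as if warp_C = warp_i and W_C = W_i for
             * the child's current thread i. */
            *warp_c = next->warped;
            *w_c = next->warp;
        } else {
            /* Threshold mode: the child keeps its own fixed W_C and
             * claims to be warped only if its selected thread is warped
             * past the threshold.  (Warp values are still used for
             * dispatch decisions inside the child.) */
            *warp_c = next->warped && next->warp >= cs->warp_threshold;
            *w_c = cs->fixed_warp;
        }
    }

Threshold mode insulates the parent from whatever warp values tasks
inside the child set for themselves; direct mode lets a real-time
child's urgency propagate upward unchanged.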
Linux implementation details
  Pre-compute the m_i value based on w_i
  Always determine both the current & next thread, to keep a "countdown" value
    (saves running the algorithm on each timer interrupt)
  Always run the thread with the lowest E_i if it just woke up
    In other words, C does not apply after a wakeup event--why?

What questions should we be asking in evaluating BVT?
  Is the algorithm itself expensive to run?  No - costs at most 0.3%
  Do processes get their fair share of the CPU?  Yes - this is easy
  Are dispatch latency & response time good?  More complicated - discussed below
  Is BVT a candidate for a "universal" processor scheduler?
    I.e., can we do anything we need?

Look at Figure 5
  Why does BVT(diff) do better than BVT(same)?
    With equal weights, if the 2 processes are not schedulable, both will
    miss their deadlines (with different weights, at least the
    higher-weight process still makes its deadline)
  Looks like DBS (deadline-based scheduler) does best... ditch BVT?
    No - the point is that utilization is hard to predict
    This graph only shows e_i (prediction error) = 0
    That's why we have Figs 6-8
  In Fig 6, DBS-lower does better than DBS-upper, but not always in Fig 8.  Why?
    Less contention (Fig 6) means no penalty for over-allocating
  BVT does better with error and contention.  Why?
    Because BVT doesn't have error--apps don't need to predict their usage
    There's a higher penalty for error with more contention

How to calculate weights in a hard real-time system?  Sec 6.1.1, p. 10

Can you emulate priority-based scheduling with BVT?  Sec 6.1.3, p. 12
  High level: use warp to emulate priority (see the sketch at the end of
  these notes)
  Works fine as long as a thread doesn't fail and chew up too much CPU

What are the shortcomings of BVT?  What can't it emulate?
  Tickets & donation (e.g., to avoid priority inversion in RPC)

p. 16: "It is better to provide a single simple measure, such as virtual
  time, that provides well-specified behavior relative to other tasks and
  have applications map their requirements onto this measure than to
  require applications to specify their requirements in detail."
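Appendix: a sketch of the warp-as-priority idea from Sec 6.1.3,
referenced above.  The spacing constant and helper are invented for
illustration (the paper gives the idea, not this code); struct
bvt_thread is from the first sketch.  Each priority level gets a warp
value spaced widely enough that a higher level's E_i is always lower on
wakeup, while equal weights mean a runaway "high-priority" thread is
still charged virtual time and cannot starve the rest forever.

    /* One warp step per priority level, chosen >> any plausible lag
     * between runnable threads (illustrative value only). */
    #define BVT_WARP_PER_LEVEL (INT64_C(1) << 20)

    static void bvt_set_priority(struct bvt_thread *t, int level)
    {
        t->warp = (int64_t)level * BVT_WARP_PER_LEVEL;
        t->warped = true;       /* run warped all the time */
    }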