Capriccio
=========

What is goal of Capriccio?  Gain benefits of even-driven systems
  Co-operative (non-preemptive) threads package
  Scalability to 100,000 threads - O(1) efficient ops, minimize stack space
  "Resource-aware" scheduling

Look at Figure 1.  Explain graph -- Why do NPTL/LinuxThreads tank?
  Capriccio non-preemptive, so no lock contention!

What are real scalability bottlenecks?
Virtual and Physical memory for stacks
  Why do we care about VM?  (32-bit machines, max stack might be 2MB)
    Note:  Newer 64-bit PCs typically have 48-bit virtual address space
  Why do we care about PM?
    Page size typically 4K - would waste average of 2KB per thread
    Also, OS might not be clever enough to reclaim thread stack space
      Could do something with mmap (), but would add more syscall overhead
Algorithms?
  Bit fuzzier, but generally want O(1), not O(# threads)
  Examples: Prioritizing run threads and stride scheduling in S4.2

What is Linked stack management?
  Break stack into pieces
  Can code deal with non-contiguous stack?
    Yes, if discontinuity across procedure boundary
    return usually restores stack pointer based on frame pointer
      can just restore to different place
    But have to know max space used within function
      What about alloca() -- oops, probably can't allow
  Compute graph like Fig. 5 - place checkpoints where might overflow
  What is internal wasted space?
    Skipped rest of chunk, added bookkeeping to get new chunk
  What is external wasted space?
    Bottom of current chunk is unused (maybe didn't take fork in graph)
    This is what traditional threads packages have with one big stack
  How do these get tuned?
    MaxPath - increase for fewer checkpoints, more internal wasted space
    MinChunk - increase for less stack linking, more external wasted space
  How to deal with library code?
  How to do function pointers?  (guess based on type signature)

3.5 compares to guard pages - what are these?
  If have enough VM, use page faults to allocate physical memory as needed
    Idea: overflowing stack always gives page fault on guard page
  Advantage?
    Common case is cheap (no extra code)
  Drawbacks?
    Huge multi-page buffers might undetectedly trample other thread's stack
    Overhead for handling page fault probably higher than checkpoint
    Might use more physical memory?  How much of a problem is this?
      Datapoint, S2.6 uses 4KB for MinChunk, which is also x86 page size

What is resource-aware scheduling?
  What resources?  CPU, memory, file descriptors
  What about disk?  Hard because request rate depends on locality
  Compute blocking graph (Fig. 8)
  Track resource utilization
  Maintain separate run queue for each node in graph
  "Natural admission control" p. 9?

What about tasks that don't yield?
  p. 10:  "use multiple kernel threads"
  Think this would just work?

Eval?  Convincing?