Capriccio ========= What is goal of Capriccio? Gain benefits of even-driven systems Co-operative (non-preemptive) threads package Scalability to 100,000 threads - O(1) efficient ops, minimize stack space "Resource-aware" scheduling Look at Figure 1. Explain graph -- Why do NPTL/LinuxThreads tank? Capriccio non-preemptive, so no lock contention! What are real scalability bottlenecks? Virtual and Physical memory for stacks Why do we care about VM? (32-bit machines, max stack might be 2MB) Note: Newer 64-bit PCs typically have 48-bit virtual address space Why do we care about PM? Page size typically 4K - would waste average of 2KB per thread Also, OS might not be clever enough to reclaim thread stack space Could do something with mmap (), but would add more syscall overhead Algorithms? Bit fuzzier, but generally want O(1), not O(# threads) Examples: Prioritizing run threads and stride scheduling in S4.2 What is Linked stack management? Break stack into pieces Can code deal with non-contiguous stack? Yes, if discontinuity across procedure boundary return usually restores stack pointer based on frame pointer can just restore to different place But have to know max space used within function What about alloca() -- oops, probably can't allow Compute graph like Fig. 5 - place checkpoints where might overflow What is internal wasted space? Skipped rest of chunk, added bookkeeping to get new chunk What is external wasted space? Bottom of current chunk is unused (maybe didn't take fork in graph) This is what traditional threads packages have with one big stack How do these get tuned? MaxPath - increase for fewer checkpoints, more internal wasted space MinChunk - increase for less stack linking, more external wasted space How to deal with library code? How to do function pointers? (guess based on type signature) 3.5 compares to guard pages - what are these? If have enough VM, use page faults to allocate physical memory as needed Idea: overflowing stack always gives page fault on guard page Advantage? Common case is cheap (no extra code) Drawbacks? Huge multi-page buffers might undetectedly trample other thread's stack Overhead for handling page fault probably higher than checkpoint Might use more physical memory? How much of a problem is this? Datapoint, S2.6 uses 4KB for MinChunk, which is also x86 page size What is resource-aware scheduling? What resources? CPU, memory, file descriptors What about disk? Hard because request rate depends on locality Compute blocking graph (Fig. 8) Track resource utilization Maintain separate run queue for each node in graph "Natural admission control" p. 9? What about tasks that don't yield? p. 10: "use multiple kernel threads" Think this would just work? Eval? Convincing?