sched-ext.rst 12 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326
  1. ==========================
  2. Extensible Scheduler Class
  3. ==========================
  4. sched_ext is a scheduler class whose behavior can be defined by a set of BPF
  5. programs - the BPF scheduler.
  6. * sched_ext exports a full scheduling interface so that any scheduling
  7. algorithm can be implemented on top.
  8. * The BPF scheduler can group CPUs however it sees fit and schedule them
  9. together, as tasks aren't tied to specific CPUs at the time of wakeup.
  10. * The BPF scheduler can be turned on and off dynamically anytime.
  11. * The system integrity is maintained no matter what the BPF scheduler does.
  12. The default scheduling behavior is restored anytime an error is detected,
  13. a runnable task stalls, or on invoking the SysRq key sequence
  14. :kbd:`SysRq-S`.
  15. * When the BPF scheduler triggers an error, debug information is dumped to
  16. aid debugging. The debug dump is passed to and printed out by the
  17. scheduler binary. The debug dump can also be accessed through the
  18. `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
  19. triggers a debug dump. This doesn't terminate the BPF scheduler and can
  20. only be read through the tracepoint.
  21. Switching to and from sched_ext
  22. ===============================
  23. ``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
  24. ``tools/sched_ext`` contains the example schedulers. The following config
  25. options should be enabled to use sched_ext:
  26. .. code-block:: none
  27. CONFIG_BPF=y
  28. CONFIG_SCHED_CLASS_EXT=y
  29. CONFIG_BPF_SYSCALL=y
  30. CONFIG_BPF_JIT=y
  31. CONFIG_DEBUG_INFO_BTF=y
  32. CONFIG_BPF_JIT_ALWAYS_ON=y
  33. CONFIG_BPF_JIT_DEFAULT_ON=y
  34. CONFIG_PAHOLE_HAS_SPLIT_BTF=y
  35. CONFIG_PAHOLE_HAS_BTF_TAG=y
  36. sched_ext is used only when the BPF scheduler is loaded and running.
  37. If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
  38. treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
  39. loaded.
  40. When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
  41. in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
  42. ``SCHED_EXT`` tasks are scheduled by sched_ext.
  43. However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
  44. set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
  45. by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
  46. ``SCHED_IDLE`` policies are scheduled by CFS.
  47. Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
  48. detection of any internal error including stalled runnable tasks aborts the
  49. BPF scheduler and reverts all tasks back to CFS.
  50. .. code-block:: none
  51. # make -j16 -C tools/sched_ext
  52. # tools/sched_ext/build/bin/scx_simple
  53. local=0 global=3
  54. local=5 global=24
  55. local=9 global=44
  56. local=13 global=56
  57. local=17 global=72
  58. ^CEXIT: BPF scheduler unregistered
  59. The current status of the BPF scheduler can be determined as follows:
  60. .. code-block:: none
  61. # cat /sys/kernel/sched_ext/state
  62. enabled
  63. # cat /sys/kernel/sched_ext/root/ops
  64. simple
  65. You can check if any BPF scheduler has ever been loaded since boot by examining
  66. this monotonically incrementing counter (a value of zero indicates that no BPF
  67. scheduler has been loaded):
  68. .. code-block:: none
  69. # cat /sys/kernel/sched_ext/enable_seq
  70. 1
  71. ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
  72. detailed information:
  73. .. code-block:: none
  74. # tools/sched_ext/scx_show_state.py
  75. ops : simple
  76. enabled : 1
  77. switching_all : 1
  78. switched_all : 1
  79. enable_state : enabled (2)
  80. bypass_depth : 0
  81. nr_rejected : 0
  82. enable_seq : 1
  83. If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
  84. be determined as follows:
  85. .. code-block:: none
  86. # grep ext /proc/self/sched
  87. ext.enabled : 1
  88. The Basics
  89. ==========
  90. Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
  91. programs that implement ``struct sched_ext_ops``. The only mandatory field
  92. is ``ops.name`` which must be a valid BPF object name. All operations are
  93. optional. The following modified excerpt is from
  94. ``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
  95. .. code-block:: c
  96. /*
  97. * Decide which CPU a task should be migrated to before being
  98. * enqueued (either at wakeup, fork time, or exec time). If an
  99. * idle core is found by the default ops.select_cpu() implementation,
  100. * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
  101. * ops.enqueue() callback.
  102. *
  103. * Note that this implementation has exactly the same behavior as the
  104. * default ops.select_cpu implementation. The behavior of the scheduler
  105. * would be exactly same if the implementation just didn't define the
  106. * simple_select_cpu() struct_ops prog.
  107. */
  108. s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
  109. s32 prev_cpu, u64 wake_flags)
  110. {
  111. s32 cpu;
  112. /* Need to initialize or the BPF verifier will reject the program */
  113. bool direct = false;
  114. cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
  115. if (direct)
  116. scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
  117. return cpu;
  118. }
  119. /*
  120. * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
  121. * callback will only be invoked if we failed to find a core to dispatch
  122. * to in ops.select_cpu() above.
  123. *
  124. * Note that this implementation has exactly the same behavior as the
  125. * default ops.enqueue implementation, which just dispatches the task
  126. * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same
  127. * if the implementation just didn't define the simple_enqueue struct_ops
  128. * prog.
  129. */
  130. void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
  131. {
  132. scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  133. }
  134. s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
  135. {
  136. /*
  137. * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
  138. * SCHED_BATCH tasks should use sched_ext.
  139. */
  140. return 0;
  141. }
  142. void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
  143. {
  144. exit_type = ei->type;
  145. }
  146. SEC(".struct_ops")
  147. struct sched_ext_ops simple_ops = {
  148. .select_cpu = (void *)simple_select_cpu,
  149. .enqueue = (void *)simple_enqueue,
  150. .init = (void *)simple_init,
  151. .exit = (void *)simple_exit,
  152. .name = "simple",
  153. };
  154. Dispatch Queues
  155. ---------------
  156. To match the impedance between the scheduler core and the BPF scheduler,
  157. sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
  158. priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
  159. and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
  160. an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
  161. ``scx_bpf_destroy_dsq()``.
  162. A CPU always executes a task from its local DSQ. A task is "dispatched" to a
  163. DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
  164. local DSQ.
  165. When a CPU is looking for the next task to run, if the local DSQ is not
  166. empty, the first task is picked. Otherwise, the CPU tries to consume the
  167. global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
  168. is invoked.
  169. Scheduling Cycle
  170. ----------------
  171. The following briefly shows how a waking task is scheduled and executed.
  172. 1. When a task is waking up, ``ops.select_cpu()`` is the first operation
  173. invoked. This serves two purposes. First, CPU selection optimization
  174. hint. Second, waking up the selected CPU if idle.
  175. The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
  176. binding. The actual decision is made at the last step of scheduling.
  177. However, there is a small performance gain if the CPU
  178. ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
  179. A side-effect of selecting a CPU is waking it up from idle. While a BPF
  180. scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
  181. using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
  182. A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
  183. calling ``scx_bpf_dispatch()``. If the task is dispatched to
  184. ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
  185. local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
  186. Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
  187. ``ops.enqueue()`` callback to be skipped.
  188. Note that the scheduler core will ignore an invalid CPU selection, for
  189. example, if it's outside the allowed cpumask of the task.
  190. 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
  191. task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
  192. can make one of the following decisions:
  193. * Immediately dispatch the task to either the global or local DSQ by
  194. calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
  195. ``SCX_DSQ_LOCAL``, respectively.
  196. * Immediately dispatch the task to a custom DSQ by calling
  197. ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
  198. * Queue the task on the BPF side.
  199. 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
  200. empty, it then looks at the global DSQ. If there still isn't a task to
  201. run, ``ops.dispatch()`` is invoked which can use the following two
  202. functions to populate the local DSQ.
  203. * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
  204. be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
  205. ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
  206. currently can't be called with BPF locks held, this is being worked on
  207. and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
  208. rather than performing them immediately. There can be up to
  209. ``ops.dispatch_max_batch`` pending tasks.
  210. * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
  211. to the dispatching DSQ. This function cannot be called with any BPF
  212. locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
  213. before trying to consume the specified DSQ.
  214. 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
  215. the CPU runs the first one. If empty, the following steps are taken:
  216. * Try to consume the global DSQ. If successful, run the task.
  217. * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
  218. * If the previous task is an SCX task and still runnable, keep executing
  219. it (see ``SCX_OPS_ENQ_LAST``).
  220. * Go idle.
  221. Note that the BPF scheduler can always choose to dispatch tasks immediately
  222. in ``ops.enqueue()`` as illustrated in the above simple example. If only the
  223. built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
  224. a task is never queued on the BPF scheduler and both the local and global
  225. DSQs are consumed automatically.
  226. ``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
  227. ``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
  228. ``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
  229. dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
  230. function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
  231. more information.
  232. Where to Look
  233. =============
  234. * ``include/linux/sched/ext.h`` defines the core data structures, ops table
  235. and constants.
  236. * ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
  237. The functions prefixed with ``scx_bpf_`` can be called from the BPF
  238. scheduler.
  239. * ``tools/sched_ext/`` hosts example BPF scheduler implementations.
  240. * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
  241. custom DSQ.
  242. * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
  243. levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
  244. ABI Instability
  245. ===============
  246. The APIs provided by sched_ext to BPF schedulers programs have no stability
  247. guarantees. This includes the ops table callbacks and constants defined in
  248. ``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
  249. ``kernel/sched/ext.c``.
  250. While we will attempt to provide a relatively stable API surface when
  251. possible, they are subject to change without warning between kernel
  252. versions.