
=========================
Capacity Aware Scheduling
=========================

1. CPU Capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given amount of instructions attainable
per Hz, leading to::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``original capacity`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. This original capacity is returned by
the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
capacity`` from which some loss of available performance (e.g. time spent
handling IRQs) is subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``original capacity`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.
1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

  CPU0 work ^
            |     ____                ____               ____
            |    |    |              |    |             |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

  CPU1 work ^
            |     _________           _________          ____
            |    |         |         |         |        |
            +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.
1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

  CPU0 work ^
            |     ____                ____               ____
            |    |    |              |    |             |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

                      workload on CPU1
  CPU1 work ^
            |     ______________      ______________     ____
            |    |              |    |              |   |
            +----+----+----+----+----+----+----+----+----+----+-> time
1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.
2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regards to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on these.
2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________           _________          ____
           |    |         |         |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above yields a frequency invariant
task utilization of 25%.
2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2, i.e.::

  - capacity(CPU0) = C
  - capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

  CPU0 work ^
            |     ____                ____               ____
            |    |    |              |    |             |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

  CPU1 work ^
            |     ______________      ______________     ____
            |    |              |    |              |   |
            +----+----+----+----+----+----+----+----+----+----+-> time

IOW,

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the above example yields a CPU invariant
task utilization of 25%.
2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                       curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                       max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.
2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.
3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own; this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm, arm64, and RISC-V architectures directly map this to the arch_topology
driver CPU scaling data, which is derived from the capacity-dmips-mhz CPU
binding; see Documentation/devicetree/bindings/cpu/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out at which frequency each CPU
has been running. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).
4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
  level that spans all unique CPU capacity values.
- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
  CPUs with any range of asymmetry.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity    C/2          C
            ________    ________
           /        \  /        \
  CPUs     0  1  2  3  4  5  6  7
           \__/  \______________/
  cpusets   cs0         cs1

Which could be created via:

.. code-block:: sh

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
set in that hierarchy, it describes an SMP island and should be treated as such.

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)
5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need more processing
power than its CPU can provide: it will be CPU-bound.

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used
to clamp task_util() in the previous criterion.
5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above. On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.
5.1.3 Load balancing
~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all. Consider::

  w == wakeup event

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

                     workload on CPU0
  CPU work ^
           |     _________           _________          ____
           |    |         |         |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time
                w                   w                  w

                     workload on CPU1
  CPU work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->
                w

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity scheduling criterion is violated, and there may not be any more
wakeup event to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks). When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.
5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(task_cpu(p))

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.
5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(task_cpu(p))

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.