.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Design overview
===============
Objectives
----------
The design objectives are:

* Good representation of access recency
* Try to profit from spatial locality
* Fast paths to make obvious choices
* Simple self-correcting heuristics

The representation of access recency is at the core of all LRU
implementations. In the multi-gen LRU, each generation represents a
group of pages with similar access recency. Generations establish a
(time-based) common frame of reference and therefore help make better
choices, e.g., between different memcgs on a computer or different
computers in a data center (for job scheduling).

Exploiting spatial locality improves efficiency when gathering the
accessed bit. A rmap walk targets a single page and does not try to
profit from discovering a young PTE. A page table walk can sweep all
the young PTEs in an address space, but the address space can be too
sparse to make a profit. The key is to optimize both methods and use
them in combination.

Fast paths reduce code complexity and runtime overhead. Unmapped pages
do not require TLB flushes; clean pages do not require writeback.
These facts are only helpful when other conditions, e.g., access
recency, are similar. With generations as a common frame of reference,
additional factors stand out. But obvious choices might not be good
choices; thus self-correction is necessary.

The benefits of simple self-correcting heuristics are self-evident.
Again, with generations as a common frame of reference, this becomes
attainable. Specifically, pages in the same generation can be
categorized based on additional factors, and a feedback loop can
statistically compare the refault percentages across those categories
and infer which of them are better choices.

Assumptions
-----------
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:

* Accesses through page tables
* Accesses through file descriptors

The protection of the former channel is by design stronger because:

1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns:

* Accesses exhibiting temporal locality
* Accesses not exhibiting temporal locality

For the reasons listed above, the former channel is assumed to follow
the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
present, and the latter channel is assumed to follow the latter
pattern unless outlying refaults have been observed.

Workflow overview
=================
Evictable pages are divided into multiple generations for each
``lruvec``. The youngest generation number is stored in
``lrugen->max_seq`` for both anon and file types as they are aged on
an equal footing. The oldest generation numbers are stored in
``lrugen->min_seq[]`` separately for anon and file types as clean file
pages can be evicted regardless of swap constraints. These three
variables are monotonically increasing.

Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
bits in order to fit into the gen counter in ``folio->flags``. Each
truncated generation number is an index to ``lrugen->folios[]``. The
sliding window technique is used to track at least ``MIN_NR_GENS`` and
at most ``MAX_NR_GENS`` generations. The gen counter stores a value
within ``[1, MAX_NR_GENS]`` while a page is on one of
``lrugen->folios[]``; otherwise it stores zero.
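
To illustrate, the truncation arithmetic can be modeled in userspace
as follows; ``MAX_NR_GENS`` is assumed to be 4 here, and
``seq_to_gen()`` is a sketch, not the kernel's implementation::

  #include <stdio.h>

  #define MAX_NR_GENS 4UL /* illustrative value */

  /* Map a monotonic sequence number to the stored gen counter. */
  static unsigned long seq_to_gen(unsigned long seq)
  {
          /* a page on lrugen->folios[] stores [1, MAX_NR_GENS] */
          return seq % MAX_NR_GENS + 1;
  }

  int main(void)
  {
          unsigned long seq;

          for (seq = 20; seq < 24; seq++)
                  printf("seq %lu: folios[] index %lu, gen counter %lu\n",
                         seq, seq % MAX_NR_GENS, seq_to_gen(seq));
          return 0;
  }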

Each generation is divided into multiple tiers. A page accessed ``N``
times through file descriptors is in tier ``order_base_2(N)``. Unlike
generations, tiers do not have dedicated ``lrugen->folios[]``. In
contrast to moving across generations, which requires the LRU lock,
moving across tiers only involves atomic operations on
``folio->flags`` and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors refaults over all the tiers
from anon and file types and decides which tiers from which types to
evict or protect. The desired effect is to balance refault percentages
between anon and file types proportional to the swappiness level.
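
The tier computation can be sketched the same way; ``tier_of()`` below
is a userspace stand-in for ``order_base_2()``, not the kernel's
helper::

  #include <stdio.h>

  /* Stand-in for order_base_2(): ceil(log2(refs)), with 1 -> 0. */
  static unsigned int tier_of(unsigned int refs)
  {
          unsigned int tier = 0;

          while ((1u << tier) < refs)
                  tier++;
          return tier;
  }

  int main(void)
  {
          unsigned int refs;

          for (refs = 1; refs <= 5; refs++)
                  printf("%u accesses through file descriptors -> tier %u\n",
                         refs, tier_of(refs));
          return 0;
  }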

There are two conceptually independent procedures: the aging and the
eviction. They form a closed-loop system, i.e., the page reclaim.

Aging
-----
The aging produces young generations. Given an ``lruvec``, it
increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
``MIN_NR_GENS``. The aging promotes hot pages to the youngest
generation when it finds them accessed through page tables; the
demotion of cold pages happens consequently when it increments
``max_seq``. The aging uses page table walks and rmap walks to find
young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
and calls ``walk_page_range()`` with each ``mm_struct`` on this list
to scan PTEs, and after each iteration, it increments ``max_seq``. For
the latter, when the eviction walks the rmap and finds a young PTE,
the aging scans the adjacent PTEs. For both, on finding a young PTE,
the aging clears the accessed bit and updates the gen counter of the
page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
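
A condensed model of these two decisions, with made-up helper names
and ``MIN_NR_GENS``/``MAX_NR_GENS`` assumed to be 2 and 4::

  #include <stdbool.h>

  #define MIN_NR_GENS 2UL /* illustrative values */
  #define MAX_NR_GENS 4UL

  /* The aging creates a new generation when too few are left. */
  static bool need_aging(unsigned long max_seq, unsigned long min_seq)
  {
          return max_seq - min_seq + 1 <= MIN_NR_GENS;
  }

  /* On finding a young PTE, the accessed bit is cleared and the page
   * mapped by it gets the gen counter of the youngest generation. */
  static unsigned long promoted_gen(unsigned long max_seq)
  {
          return max_seq % MAX_NR_GENS + 1;
  }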

Eviction
--------
The eviction consumes old generations. Given an ``lruvec``, it
increments ``min_seq`` when ``lrugen->folios[]`` indexed by
``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
evict from, it first compares ``min_seq[]`` to select the older type.
If both types are equally old, it selects the one whose first tier has
a lower refault percentage. The first tier contains single-use
unmapped clean pages, which are the best bet. The eviction sorts a
page according to its gen counter if the aging has found this page
accessed through page tables and updated its gen counter. It also
moves a page to the next generation, i.e., ``min_seq+1``, if this page
was accessed multiple times through file descriptors and the feedback
loop has detected outlying refaults from the tier this page is in. To
this end, the feedback loop uses the first tier as the baseline, for
the reason stated earlier.
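
The type selection can be sketched as follows, assuming two types
(0 for anon, 1 for file) and per-type refault statistics for the first
tier; the helper name and layout are illustrative::

  /* Pick the type to evict from: the older one, or on a tie, the
   * one whose first tier has the lower refault percentage. */
  static int select_type(const unsigned long min_seq[2],
                         const unsigned long refaulted[2],
                         const unsigned long evicted[2])
  {
          if (min_seq[0] != min_seq[1])
                  return min_seq[0] < min_seq[1] ? 0 : 1;
          /* compare refaulted/evicted by cross-multiplication */
          return refaulted[0] * evicted[1] <= refaulted[1] * evicted[0] ?
                 0 : 1;
  }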

Working set protection
----------------------
Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is
set, an ``lruvec`` is protected from the eviction when its oldest
generation was born within ``lru_gen_min_ttl`` milliseconds. In other
words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
from getting evicted. The OOM killer is triggered if this working set
cannot be kept in memory.
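
A sketch of the protection check, with times in plain milliseconds
rather than the jiffies the kernel uses::

  #include <stdbool.h>

  /* An lruvec is protected while its oldest generation is younger
   * than lru_gen_min_ttl; zero disables the protection. */
  static bool lruvec_protected(unsigned long birth_ms,
                               unsigned long now_ms,
                               unsigned long min_ttl_ms)
  {
          return min_ttl_ms && now_ms - birth_ms < min_ttl_ms;
  }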

This time-based approach has the following advantages:

1. It is easier to configure because it is agnostic to applications
   and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.

``mm_struct`` list
------------------
An ``mm_struct`` list is maintained for each memcg, and an
``mm_struct`` follows its owner task to the new memcg when this task
is migrated.

A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
``walk_page_range()`` with each ``mm_struct`` on this list to scan
PTEs. When multiple page table walkers iterate the same list, each of
them gets a unique ``mm_struct``, and therefore they can run in
parallel.
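
A toy model of this hand-off, using a mutex where the kernel has its
own synchronization; the types and names below are made up::

  #include <pthread.h>
  #include <stddef.h>

  struct mm_list {
          void **mms; /* stand-ins for the mm_struct pointers */
          size_t nr, next;
          pthread_mutex_t lock;
  };

  /* Each walker claims the next unvisited mm_struct, so no two
   * walkers scan the same address space concurrently. */
  static void *get_next_mm(struct mm_list *list)
  {
          void *mm = NULL;

          pthread_mutex_lock(&list->lock);
          if (list->next < list->nr)
                  mm = list->mms[list->next++];
          pthread_mutex_unlock(&list->lock);
          return mm; /* NULL: this iteration is complete */
  }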

Page table walkers ignore any misplaced pages, e.g., if an
``mm_struct`` was migrated, pages left in the previous memcg will be
ignored when the current memcg is under reclaim. Similarly, page table
walkers will ignore pages from nodes other than the one under reclaim.

This infrastructure also tracks the usage of ``mm_struct`` between
context switches so that page table walkers can skip processes that
have been sleeping since the last iteration.

Rmap/PT walk feedback
---------------------
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, searching the rmap
can incur the highest CPU cost in the reclaim path.

``lru_gen_look_around()`` exploits spatial locality to reduce the
trips into the rmap. It scans the adjacent PTEs of a young PTE and
promotes hot pages. If the scan was done cacheline efficiently, it
adds the PMD entry pointing to the PTE table to the Bloom filter. This
forms a feedback loop between the eviction and the aging.
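
A toy version of the look-around; the window size and the density
threshold are invented for illustration::

  #include <stdbool.h>
  #include <stddef.h>

  #define WINDOW 64 /* illustrative number of adjacent PTEs */
  #define DENSE   4 /* illustrative density threshold */

  /* Scan the PTEs around a young one; return whether the PTE table
   * was dense enough to be worth adding to the Bloom filter. */
  static bool look_around(bool young[WINDOW], void (*promote)(size_t))
  {
          unsigned int hits = 0;
          size_t i;

          for (i = 0; i < WINDOW; i++) {
                  if (!young[i])
                          continue;
                  young[i] = false; /* model clearing the accessed bit */
                  promote(i);       /* model updating the gen counter */
                  hits++;
          }
          return hits >= DENSE;
  }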

Bloom filters
-------------
Bloom filters are a space-efficient data structure for set membership
tests, i.e., testing whether an element is definitely not in the set
or may be in the set.

In the eviction path, specifically, in ``lru_gen_look_around()``, if a
PMD has a sufficient number of hot pages, its address is placed in the
filter. In the aging path, set membership means that the PTE range
will be scanned for young pages.

Note that Bloom filters are probabilistic on set membership. If a test
returns a false positive, the cost is an additional scan of a range of
PTEs, which may yield hot pages anyway. Parameters of the filter itself
can control the false positive rate in the limit.
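
A minimal two-hash Bloom filter over PMD addresses; the filter size
and the hash mixing below are illustrative, not the kernel's::

  #include <stdbool.h>
  #include <stdint.h>

  #define NR_BITS (1u << 15) /* illustrative filter size */

  static uint8_t filter[NR_BITS / 8];

  static uint32_t hash(uint64_t key, uint32_t seed)
  {
          key ^= seed;
          key ^= key >> 33;
          key *= 0xff51afd7ed558ccdULL;
          key ^= key >> 33;
          return (uint32_t)key % NR_BITS;
  }

  static void filter_add(uint64_t pmd_addr)
  {
          uint32_t h1 = hash(pmd_addr, 0), h2 = hash(pmd_addr, 1);

          filter[h1 / 8] |= 1u << (h1 % 8);
          filter[h2 / 8] |= 1u << (h2 % 8);
  }

  /* false: definitely not in the set; true: may be in the set */
  static bool filter_test(uint64_t pmd_addr)
  {
          uint32_t h1 = hash(pmd_addr, 0), h2 = hash(pmd_addr, 1);

          return (filter[h1 / 8] & 1u << (h1 % 8)) &&
                 (filter[h2 / 8] & 1u << (h2 % 8));
  }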

PID controller
--------------
A feedback loop modeled after the Proportional-Integral-Derivative
(PID) controller monitors refaults over anon and file types and
decides which type to evict when both types are available from the
same generation.

The PID controller uses generations rather than the wall clock as the
time domain because a CPU can scan pages at different rates under
varying memory pressure. It calculates a moving average for each new
generation to avoid being permanently locked in a suboptimal state.
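
The moving average can be sketched as follows; the struct and the
halving decay are illustrative simplifications::

  struct tier_stats {
          unsigned long refaulted, evicted;         /* this generation */
          unsigned long avg_refaulted, avg_evicted; /* moving averages */
  };

  /* Fold the finished generation into the averages when a new one
   * is created, so stale observations decay instead of dominating. */
  static void fold_generation(struct tier_stats *ts)
  {
          ts->avg_refaulted = (ts->avg_refaulted + ts->refaulted) / 2;
          ts->avg_evicted = (ts->avg_evicted + ts->evicted) / 2;
          ts->refaulted = 0;
          ts->evicted = 0;
  }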

Memcg LRU
---------
A memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
global reclaim, which is critical to system-wide memory overcommit in
data centers. Note that the memcg LRU only applies to global reclaim.

The basic structure of a memcg LRU can be understood by an analogy to
the active/inactive LRU (of folios):

1. It has the young and the old (generations), i.e., the counterparts
   to the active and the inactive;
2. The increment of ``max_seq`` triggers promotion, i.e., the
   counterpart to activation;
3. Other events trigger similar operations, e.g., offlining a memcg
   triggers demotion, i.e., the counterpart to deactivation.

In terms of global reclaim, it has two distinct features:

1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
   and reduces latency without affecting fairness over some time.

In terms of traversing memcgs during global reclaim, it improves the
best-case complexity from O(n) to O(1) and does not affect the
worst-case complexity O(n). Therefore, on average, it has a sublinear
complexity.

Summary
-------
The multi-gen LRU (of folios) can be disassembled into the following
parts:

* Generations
* Rmap walks
* Page table walks via the ``mm_struct`` list
* Bloom filters for rmap/PT walk feedback
* PID controller for refault feedback

The aging and the eviction form a producer-consumer model;
specifically, the latter drives the former by the sliding window over
generations. Within the aging, rmap walks drive page table walks by
inserting hot, densely populated page tables into the Bloom filters.
Within the eviction, the PID controller uses refaults as the feedback
to select types to evict and tiers to protect.