.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
using huge pages for the backing of virtual memory. It supports the
automatic promotion and demotion of page sizes and does not have the
shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem,
but in the future it can expand to other filesystems.

.. note::
   in the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
of significant interest because it also has the downside of requiring
larger clear-page copy-page operations in page faults, which is a
potentially negative effect. The first factor consists in taking a
single page fault for each 2M virtual region touched by userland (so
reducing the enter/exit kernel frequency by a 512 times factor). This
only matters the first time the memory is accessed for the lifetime of
a memory mapping. The second, long lasting and much more important
factor affects all subsequent accesses to the memory for the whole
runtime of the application. The second factor consists of two
components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory, in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB can map a larger size
   only if both KVM and the Linux guest are using hugepages, but a
   significant speedup already happens if only one of the two is using
   hugepages, just because the TLB miss is going to run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is the ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and the madvise(2) and prctl(2) system calls.

Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows
paging and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, like for example they've been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by far not mandatory and khugepaged already can take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no good reason. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::

	echo always >/sys/kernel/mm/transparent_hugepage/enabled
	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the defrag efforts in the VM to generating
anonymous hugepages, in case they're not immediately free, to madvise
regions only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if we
spend CPU time to defrag memory, we would expect to gain even more by
the fact we use hugepages later instead of regular pages. This isn't
always guaranteed, but it may be more likely in case the allocation is
for a MADV_HUGEPAGE region.

::

	echo always >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
	means that an application requesting THP will stall on
	allocation failure and directly reclaim pages and compact
	memory in an effort to allocate a THP immediately. This may be
	desirable for virtual machines that benefit heavily from THP
	use and are willing to delay the VM start to utilise them.

defer
	means that an application will wake kswapd in the background
	to reclaim pages and wake kcompactd to compact memory so that
	THP is available in the near future. It's the responsibility
	of khugepaged to then install the THP pages later.

defer+madvise
	will enter direct reclaim and compaction like ``always``, but
	only for regions that have used madvise(MADV_HUGEPAGE); all
	other regions will wake kswapd in the background to reclaim
	pages and wake kcompactd to compact memory so that THP is
	available in the near future.

madvise
	will enter direct reclaim like ``always`` but only for regions
	that have used madvise(MADV_HUGEPAGE). This is the default
	behaviour.

never
	should be self-explanatory.

By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::

	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

Khugepaged controls
-------------------

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during page faults, it should be
worth invoking defrag at least in khugepaged. However it's also
possible to disable defrag in khugepaged by writing 0 or enable defrag
in khugepaged by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of completed scan passes::

	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value means programs may end up using additional memory; a
lower value means less of the THP performance benefit is gained. The
value of max_ptes_none has very little effect on CPU time, so you can
ignore it in that respect.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
	Attempt to allocate huge pages every time we need a new page;

never
	Do not allocate huge pages;

within_size
	Only allocate huge page if it will be fully within i_size.
	Also respect fadvise()/madvise() hints;

advise
	Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

There's also a sysfs knob to control hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem.

In addition to the policies listed above, shmem_enabled allows two
further values:

deny
	For use in emergencies, to force the huge option off from
	all mounts;

force
	Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only affect
future behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to the
regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify which applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping.

The number of file transparent huge pages mapped to userspace is available
by reading the ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify which applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
	is incremented every time a huge page is successfully
	allocated to handle a page fault. This applies to both the
	first time a page is faulted and for COW faults.

thp_collapse_alloc
	is incremented by khugepaged when it has found
	a range of pages to collapse into one huge page and has
	successfully allocated a new huge page to store the data.

thp_fault_fallback
	is incremented if a page fault fails to allocate
	a huge page and instead falls back to using small pages.

thp_collapse_alloc_failed
	is incremented if khugepaged found a range
	of pages that should be collapsed into one huge page but failed
	the allocation.

thp_file_alloc
	is incremented every time a file huge page is successfully
	allocated.

thp_file_mapped
	is incremented every time a file huge page is mapped into
	user address space.

thp_split_page
	is incremented every time a huge page is split into base
	pages. This can happen for a variety of reasons but a common
	reason is that a huge page is old and is being reclaimed.
	This action implies splitting all PMDs the page was mapped with.

thp_split_page_failed
	is incremented if the kernel fails to split a huge
	page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
	is incremented when a huge page is put onto the split
	queue. This happens when a huge page is partially unmapped and
	splitting it would free up some memory. Pages on the split queue
	are going to be split under memory pressure.

thp_split_pmd
	is incremented every time a PMD is split into a table of PTEs.
	This can happen, for instance, when an application calls
	mprotect() or munmap() on part of a huge page. It doesn't split
	the huge page, only the page table entry.

thp_zero_page_alloc
	is incremented every time a huge zero page is
	successfully allocated. It includes allocations which were
	dropped due to a race with other allocations. Note, it doesn't
	count every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed
	is incremented if the kernel fails to allocate a
	huge zero page and falls back to using small pages.

thp_swpout
	is incremented every time a huge page is swapped out in one
	piece without splitting.

thp_swpout_fallback
	is incremented if a huge page has to be split before swapout,
	usually because the kernel failed to allocate some contiguous
	swap space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
	is incremented every time a process stalls to run
	memory compaction so that a huge page is free for use.

compact_success
	is incremented if the system compacted memory and
	freed a huge page for use.

compact_fail
	is incremented if the system tries to compact memory
	but failed.

compact_pages_moved
	is incremented each time a page is moved. If
	this value is increasing rapidly, it implies that the system
	is copying a lot of data to satisfy the huge page allocation.
	It is possible that the cost of copying exceeds any savings
	from reduced TLB misses.

compact_pagemigrate_failed
	is incremented when the underlying mechanism
	for moving a page failed.

compact_blocks_moved
	is incremented each time memory compaction examines
	a huge page aligned range of pages.

It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages_nodemask and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.