.. _mm_concepts:

=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over the
years and has accumulated more and more functionality to support a
variety of systems, from MMU-less microcontrollers to supercomputers.
Memory management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will be
eventually written. Yet, although some of the concepts are the same,
here we assume that an MMU is available and the CPU can translate a
virtual address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different
views of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex and
to avoid this complexity the concept of virtual memory was developed.

Virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an
appropriate kernel configuration option.

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to a physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index into
that level's page table. The lowest bits in the virtual address define
the offset inside the actual page.

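The walk described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: a four-level
hierarchy with 9 index bits per level and 4 KiB pages (a layout
resembling x86-64), nested dictionaries standing in for page tables,
and hypothetical function names that do not correspond to any kernel
interface.

```python
PAGE_SHIFT = 12       # assumption: 4 KiB pages, so 12 offset bits
BITS_PER_LEVEL = 9    # assumption: 512 entries per table
LEVELS = 4            # assumption: four-level hierarchy


def split_virtual_address(vaddr):
    """Return (indices, offset); indices are ordered from the top level down."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def translate(vaddr, top_table):
    """Walk nested dicts that stand in for page tables."""
    indices, offset = split_virtual_address(vaddr)
    entry = top_table
    for index in indices:
        entry = entry[index]   # a KeyError models a missing mapping
    return entry + offset      # lowest-level entry holds the page base address
```

Real hardware performs this walk itself and stores physical addresses
and flag bits in each entry; the dictionaries here only mirror the
indexing scheme.
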
Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). Usually the TLB is a pretty scarce resource and applications with
a large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit-rate and thus improves overall system performance.

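A little arithmetic shows why larger pages relieve the TLB. The sketch
below counts how many translations are needed to cover a region and how
much memory a TLB of a given size can cover at once. The helper names
are hypothetical, and any TLB entry count passed in is an illustrative
assumption rather than a property of a particular CPU.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def mappings_needed(region_bytes, page_bytes):
    """Translations needed to cover a region (ceiling division)."""
    return -(-region_bytes // page_bytes)


def tlb_reach(entries, page_bytes):
    """Amount of memory a TLB with `entries` slots can cover at once."""
    return entries * page_bytes
```

For example, ``mappings_needed(GIB, 4 * KIB)`` and
``mappings_needed(GIB, 2 * MIB)`` compare the number of translations
required for the same 1 GiB region with base pages and with 2M huge
pages.
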
There are two mechanisms in Linux that enable mapping of the physical
memory with huge pages. The first one is the `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem, the data resides in
memory and is mapped using huge pages. The hugetlbfs is described at
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

Another, more recent, mechanism that enables the use of huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by huge pages, THP
manages such mappings transparently to the user, hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.

Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent, as not all
architectures define all zones, and the requirements for DMA are
different for different platforms.

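As a rough illustration of the grouping, the sketch below assigns a
physical address to a zone by comparing it against fixed boundaries.
The 16 MiB and 896 MiB limits resemble the classic 32-bit x86 layout
and are used here only as an example; the table and the ``zone_for``
helper are hypothetical, and real layouts differ between architectures
and platforms.

```python
MIB = 1024 * 1024

# Illustrative upper bounds (exclusive); not valid for every platform.
ZONE_LIMITS = [
    ("ZONE_DMA", 16 * MIB),
    ("ZONE_NORMAL", 896 * MIB),
]


def zone_for(phys_addr):
    """Return the name of the zone an address would fall into."""
    for name, limit in ZONE_LIMITS:
        if phys_addr < limit:
            return name
    return "ZONE_HIGHMEM"   # everything above the last boundary
```
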
Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.

Page cache
==========

The physical memory is volatile and the common case for getting data
into memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for a program's stack and heap or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual memory
areas that the program is allowed to access. Read accesses will result
in the creation of a page table entry that references a special
physical page filled with zeroes. When the program performs a write, a
regular physical page will be allocated to hold the written data. The
page will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.

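From user space, an explicit anonymous mapping can be requested through
Python's standard ``mmap`` module, which wraps mmap(2). The sketch
below is a minimal example for Unix-like systems; the function name is
hypothetical. It shows only the visible behaviour (untouched memory
reads as zeroes, written data reads back); the shared zero page and the
later allocation of a regular page happen inside the kernel and cannot
be observed this way.

```python
import mmap


def demo_anonymous_mapping(pages=4):
    """Create a private anonymous mapping, read it, write it, read it back."""
    length = pages * mmap.PAGESIZE
    # fileno -1 together with MAP_ANONYMOUS means there is no backing file.
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    with mmap.mmap(-1, length, flags=flags) as region:
        before = bytes(region[:8])    # never written: reads as zeroes
        region[:5] = b"hello"         # first write to this page
        after = bytes(region[:5])
    return before, after
```
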
Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA-able buffers for device driver use, data read from a filesystem,
memory allocated by user space processes, etc.

Depending on the page usage, it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache data that is available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of
reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and the pages
used as DMA buffers cannot be repurposed, and they remain pinned until
freed by their user. Such pages are called `unreclaimable`. However, in
certain circumstances, even pages occupied with kernel data structures
can be reclaimed. For instance, in-memory caches of filesystem metadata
can be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is
free and allocation requests will be satisfied immediately from the
free pages supply. As the load increases, the amount of free pages goes
down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing
storage device (remember those dirty pages?). As memory usage increases
even more and reaches another threshold - min watermark - an allocation
will trigger `direct reclaim`. In this case the allocation is stalled
until enough memory pages are reclaimed to satisfy the request.

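The decision between the fast path, asynchronous reclaim and direct
reclaim can be modelled as a comparison of the free page count against
two thresholds. The sketch below is a simplification with hypothetical
names: it assumes ``kswapd`` is woken once free pages drop to the low
watermark and that direct reclaim starts at the min watermark, and it
ignores per-zone accounting and every other detail of the real
allocator.

```python
def allocation_path(free_pages, min_wmark, low_wmark):
    """Classify how an allocation would proceed for a given free page count."""
    if free_pages > low_wmark:
        return "fast path"            # plenty of free pages
    if free_pages > min_wmark:
        return "wake kswapd"          # asynchronous background reclaim
    return "direct reclaim"           # the allocating task stalls and reclaims
```

With illustrative watermarks of 100 (min) and 200 (low) pages, a system
with 500 free pages stays on the fast path, 150 free pages wake
``kswapd``, and 50 free pages force direct reclaim.
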
Compaction
==========

As the system runs, tasks allocate and free memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it
is necessary to allocate large physically contiguous memory areas. Such
a need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

Like reclaim, compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.

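The idea of moving occupied pages upwards so that free pages collect at
the start of a zone can be shown with a toy model. In the sketch below
a zone is a Python list in which ``None`` marks a free page; one cursor
scans up from the bottom looking for occupied pages and another scans
down from the top looking for free ones. The names are hypothetical and
the model ignores unmovable pages, page blocks and everything else the
real implementation has to handle.

```python
def compact(zone):
    """Return a compacted copy of `zone` (None = free page)."""
    pages = list(zone)
    migrate = 0                 # scans upwards for occupied pages
    free = len(pages) - 1       # scans downwards for free pages
    while True:
        while migrate < free and pages[migrate] is None:
            migrate += 1
        while free > migrate and pages[free] is not None:
            free -= 1
        if migrate >= free:     # the two scanners met: done
            break
        # Move the occupied page up and leave a free page behind.
        pages[free], pages[migrate] = pages[migrate], None
    return pages


def largest_free_run(zone):
    """Length of the longest run of contiguous free pages."""
    best = run = 0
    for page in zone:
        run = run + 1 if page is None else 0
        best = max(best, run)
    return best
```

Comparing ``largest_free_run()`` before and after ``compact()`` shows
how a larger contiguous allocation becomes possible without freeing any
page.
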
OOM killer
==========

It is possible that on a loaded machine memory will be exhausted. When
the kernel detects that the system has run out of memory (OOM), it
invokes the `OOM killer`. Its mission is simple: all it has to do is to
select a task to sacrifice for the sake of the overall system health.
The selected task is killed in the hope that after it exits enough
memory will be freed to continue normal operation.