.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
time.  Once enabled at compile-time, it can be disabled at boot with
the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
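
Whether the mitigation is active on a running system can be checked
from userspace.  The sketch below is only an illustration: it assumes
that the kernel reports the Meltdown status in
/sys/devices/system/cpu/vulnerabilities/meltdown and that the string
names PTI when this mitigation is in use (for example
"Mitigation: PTI").  The helper name pti_reported() is hypothetical.

```python
from pathlib import Path

# Assumed location of the Meltdown status string on x86 Linux systems.
MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def pti_reported(status_text: str) -> bool:
    """Return True if a status string such as 'Mitigation: PTI' names PTI."""
    text = status_text.strip()
    return text.startswith("Mitigation") and "PTI" in text


if __name__ == "__main__":
    if MELTDOWN_STATUS.exists():
        status = MELTDOWN_STATUS.read_text()
        verdict = "PTI active" if pti_reported(status) else "PTI not reported"
        print(status.strip(), "->", verdict)
    else:
        print("status file not present on this system")
```

A kernel booted with 'nopti' would be expected to report something
other than a PTI mitigation here, so the check is a quick way to
confirm that a boot parameter took effect.
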

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although *complete*, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure which is placed in the fixmap which gives
each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage.  One PTE to lock, one set of
accessed bits, dirty bits, etc...
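
The PGD handling described above can be modelled in a few lines.  This
is a toy model, not kernel code: the class PgdPair, the split of the
PGD into a lower userspace half, and the example entry values are all
assumptions made for illustration, and the small set of kernel entries
that the real user PGD needs for cpu_entry_area is left out.

```python
PTRS_PER_PGD = 512      # entries in one top-level table (x86-64, 4-level paging)
USER_PGD_SLOTS = 256    # assumption: the lower half of the PGD covers userspace
NX = 1 << 63            # no-execute bit in a page table entry


class PgdPair:
    """Toy model of the kernel and userspace PGDs of one process."""

    def __init__(self):
        self.kernel_pgd = [0] * PTRS_PER_PGD
        self.user_pgd = [0] * PTRS_PER_PGD

    def set_pgd(self, index, entry):
        """Install a top-level entry the way the text describes."""
        if index < USER_PGD_SLOTS:
            # Userspace mapping: copy the entry into the user PGD unchanged.
            self.user_pgd[index] = entry
            # The kernel copy gets NX, so user code that runs on the kernel
            # page tables after a missed CR3 switch faults at once.
            self.kernel_pgd[index] = entry | NX
        else:
            # Kernel mapping: only the kernel PGD sees it.
            self.kernel_pgd[index] = entry


pair = PgdPair()
pair.set_pgd(3, 0x12345067)     # hypothetical userspace entry
pair.set_pgd(300, 0x9ABCD063)   # hypothetical kernel entry
```

Both copies of a userspace entry hold the same address of the next
level table, which is why everything below the PGD is shared and only
one set of lower-level tables has to be maintained.
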

Overhead
========

Protection against side-channel attacks is important.  But,
this protection comes at a cost:

1. Increased Memory Use

  a. Each process now needs an order-1 PGD instead of order-0.
     (Consumes an additional 4k per process).
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry.  This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost

  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (it can be skipped when the kernel is interrupted,
     though).  Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. Percpu TSS is mapped into the user page tables to allow SYSCALL64 path
     to work under PTI. This doesn't have a direct runtime cost but it can
     be argued it opens certain timing attack scenarios.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables.  This
     feature of the MMU allows different processes to share TLB
     entries mapping the kernel.  Losing the feature means more
     TLB misses after a context switch.  The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed.  This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper.  But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB.  The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process.  Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process.  But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures.  At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace.  This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB.  That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs.  Some systems support
     PCIDs, but do not support INVPCID.  On these systems, addresses
     can only be flushed from the TLB for the current PCID.  When
     flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.
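
The order-1 PGD and the PCID handling from the list above can be tied
together with a small arithmetic sketch.  The bit positions used here
are assumptions stated for illustration (user PGD in the second 4k
page of the pair, user PCID formed by setting bit 11 of the kernel
PCID, bit 63 of a CR3 write acting as the "do not flush" request), and
the function names are hypothetical rather than taken from the kernel.

```python
PAGE_SIZE = 4096
PTI_USER_PGTABLE_BIT = 12   # assumption: user PGD is the second page of the pair
PTI_USER_PCID_BIT = 11      # assumption: user PCID = kernel PCID with bit 11 set
CR3_NOFLUSH = 1 << 63       # assumption: set to keep TLB entries for the PCID


def kernel_cr3(pgd_phys, pcid):
    """CR3 value for the kernel page tables of one process."""
    assert pgd_phys % (2 * PAGE_SIZE) == 0, "order-1 PGD must be 8k aligned"
    assert 0 <= pcid < (1 << PTI_USER_PCID_BIT)
    return pgd_phys | pcid


def user_cr3(kcr3, noflush=True):
    """Derive the userspace CR3 value from the kernel one."""
    cr3 = kcr3 | (1 << PTI_USER_PGTABLE_BIT) | (1 << PTI_USER_PCID_BIT)
    # With a deferred user flush pending, the write must not ask the CPU
    # to preserve the stale user PCID entries.
    return (cr3 | CR3_NOFLUSH) if noflush else cr3


def extra_pgd_bytes(nr_processes):
    """Additional PGD memory from going from order-0 to order-1."""
    return nr_processes * PAGE_SIZE


kcr3 = kernel_cr3(0x10000, 1)
print(hex(kcr3), hex(user_cr3(kcr3, noflush=False)), extra_pgd_bytes(1000))
```

Under these assumptions the switch between the two copies is a matter
of flipping two bits in CR3, and 1000 processes cost about 4MB of
additional PGD memory.
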

Possible Future Work
====================

1. Avoid writing to CR3 unless its value actually changes.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
=======

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).  This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

     while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.
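
To confirm that the perf loop in step 3 really generates NMIs, the
"NMI" row of /proc/interrupts can be sampled before and after a run.
The helper below is a minimal sketch with a hypothetical name,
nmi_total(); it assumes the row consists of one counter per CPU
followed by a text description.

```python
def nmi_total(interrupts_text):
    """Sum the per-CPU counters on the NMI row of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() != "NMI":
            continue
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break               # the description text starts here
            total += int(field)
        return total
    return 0                        # no NMI row found


if __name__ == "__main__":
    import time

    with open("/proc/interrupts") as f:
        before = nmi_total(f.read())
    time.sleep(10)
    with open("/proc/interrupts") as f:
        after = nmi_total(f.read())
    print("NMIs in the last 10 seconds:", after - before)
```

A count that barely moves while perf is running suggests that the
chosen events or sample period are not producing the NMI load the
test is meant to create.
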

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

* Failures of the selftests/x86 code.  Usually a bug in one of the
  more obscure corners of entry_64.S.
* Crashes in early boot, especially around CPU bringup.  Bugs
  in the mappings cause these.
* Crashes at the first interrupt.  Caused by bugs in entry_64.S,
  like screwing up a page table switch.  Also caused by
  incorrectly mapping the IRQ handler entry code.
* Crashes at the first NMI.  The NMI code is separate from main
  interrupt handlers and can have bugs that do not affect
  normal interrupts.  Also caused by incorrectly mapping NMI
  code.  NMIs that interrupt the entry code must be very
  careful and can be the cause of crashes that show up when
  running perf.
* Kernel crashes at the first exit to userspace.  entry_64.S
  bugs, or failing to map some of the exit code.
* Crashes at first interrupt that interrupts userspace.  The paths
  in entry_64.S that return to userspace are sometimes separate
  from the ones that return to the kernel.
* Double faults: overflowing the kernel stack because of page
  faults upon page faults.  Caused by touching non-pti-mapped
  data in the entry code, or forgetting to switch to kernel
  CR3 before calling into C functions which are not pti-mapped.
* Userspace segfaults early in boot, sometimes manifesting
  as mount(8) failing to mount the rootfs.  These have
  tended to be TLB invalidation issues.  Usually invalidating
  the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf