
.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.
TDX Host Kernel Support
=======================

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
a new isolated range pointed to by the SEAM Range Register (SEAMRR). A
CPU-attested software module called 'the TDX module' runs inside the new
isolated range to provide the functionalities to manage and run protected
VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within the SEAM mode.
BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.

Before the TDX module can be used to create and run protected VMs, it
must be loaded into the isolated range and properly initialized. The TDX
architecture doesn't require the BIOS to load the TDX module, but the
kernel assumes it is loaded by the BIOS.
TDX boot-time detection
-----------------------

The kernel detects TDX by detecting TDX private KeyIDs during kernel
boot. Below dmesg shows when TDX is enabled by BIOS::

  [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
TDX module initialization
-------------------------

The kernel talks to the TDX module via the new SEAMCALL instruction. The
TDX module implements SEAMCALL leaf functions to allow the kernel to
initialize it.

If the TDX module isn't loaded, the SEAMCALL instruction fails with a
special error. In this case the kernel fails the module initialization
and reports the module isn't loaded::

  [..] virt/tdx: module not loaded
Initializing the TDX module consumes roughly 1/256th of system RAM to
use as 'metadata' for the TDX memory. It also takes additional CPU
time to initialize that metadata along with the TDX module itself. Both
are non-trivial. The kernel initializes the TDX module at runtime on
demand.
Besides initializing the TDX module, a per-cpu initialization SEAMCALL
must be done on one cpu before any other SEAMCALLs can be made on that
cpu.

The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
allow the user of TDX to enable the TDX module and enable TDX on the
local cpu, respectively.
Making a SEAMCALL requires VMXON to have been done on that CPU. Currently
only KVM implements VMXON. For now, neither tdx_enable() nor
tdx_cpu_enable() does VMXON internally (it is not trivial); both depend
on the caller to guarantee that.

To enable TDX, the caller of TDX should: 1) temporarily disable CPU
hotplug; 2) do VMXON and tdx_cpu_enable() on all online cpus; 3) call
tdx_enable(). For example::

        cpus_read_lock();
        on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
        ret = tdx_enable();
        cpus_read_unlock();
        if (ret)
                goto no_tdx;
        /* TDX is ready to use */
And the caller of TDX must guarantee that tdx_cpu_enable() has been
successfully done on a cpu before making any other SEAMCALL on that cpu.
A typical usage is to do both VMXON and tdx_cpu_enable() in the CPU
hotplug online callback, and refuse to online the cpu if tdx_cpu_enable()
fails.
The user can consult dmesg to see whether the TDX module has been
initialized. If the TDX module is initialized successfully, dmesg shows
something like below::

  [..] virt/tdx: 262668 KBs allocated for PAMT
  [..] virt/tdx: module initialized

If the TDX module failed to initialize, dmesg also shows it failed to
initialize::

  [..] virt/tdx: module initialization failed ...
TDX Interaction with Other Kernel Components
--------------------------------------------

TDX Memory Policy
~~~~~~~~~~~~~~~~~

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible. The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass those
regions to the TDX module. Once this is done, those "TDX-usable" memory
regions are fixed during the module's lifetime.

To keep things simple, currently the kernel simply guarantees all pages
in the page allocator are TDX memory. Specifically, the kernel uses all
system memory in the core-mm "at the time of TDX module initialization"
as TDX memory, and in the meantime, refuses to online any non-TDX memory
via memory hotplug.
Physical Memory Hotplug
~~~~~~~~~~~~~~~~~~~~~~~

Note TDX assumes convertible memory is always physically present during
the machine's runtime. A non-buggy BIOS should never support hot-removal
of any convertible memory. This implementation doesn't handle ACPI memory
removal but depends on the BIOS to behave correctly.

CPU Hotplug
~~~~~~~~~~~

The TDX module requires that the per-cpu initialization SEAMCALL be done
on a cpu before any other SEAMCALLs can be made on that cpu. The kernel
provides tdx_cpu_enable() to let the user of TDX do it when the user
wants to use a new cpu for a TDX task.

TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
TDX verifies all boot-time present logical CPUs are TDX compatible before
enabling TDX. A non-buggy BIOS should never support hot-add/removal of
physical CPUs. Currently the kernel doesn't handle physical CPU hotplug,
but depends on the BIOS to behave correctly.

Note TDX works with CPU logical online/offline, thus the kernel still
allows offlining a logical CPU and onlining it again.
Kexec()
~~~~~~~

TDX host support currently lacks the ability to handle kexec. For
simplicity only one of them (kexec or TDX) can be enabled in the Kconfig.
This will be fixed in the future.

Erratum
~~~~~~~

The first few generations of TDX hardware have an erratum. A partial
write to a TDX private memory cacheline will silently "poison" the
line. Subsequent reads will consume the poison and generate a machine
check.

A partial write is a memory write where a write transaction of less than
a cacheline lands at the memory controller. The CPU does these via
non-temporal write instructions (like MOVNTI), or through UC/WC memory
mappings. Devices can also do partial writes via DMA.

Theoretically, a kernel bug could do a partial write to TDX private
memory and trigger an unexpected machine check. What's more, the machine
check code will present these as "Hardware error" when they were, in
fact, a software-triggered issue. But in the end, this issue is hard to
trigger.

If the platform has this erratum, the kernel prints an additional message
in the machine check handler to tell the user the machine check may be
caused by a kernel bug on TDX private memory.
Interaction vs S3 and deeper states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDX cannot survive S3 and deeper states. The hardware resets and
disables TDX completely when the platform goes to S3 and deeper. Both
TDX guests and the TDX module get destroyed permanently.

The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
hibernation. Currently, for simplicity, the kernel chooses to make TDX
mutually exclusive with S3 and hibernation.

The kernel disables TDX during early boot when hibernation support is
available::

  [..] virt/tdx: initialization failed: Hibernation support is enabled

Add 'nohibernate' to the kernel command line to disable hibernation in
order to use TDX.

ACPI S3 is disabled during kernel early boot if TDX is enabled. The user
needs to turn off TDX in the BIOS in order to use S3.
TDX Guest Support
=================

Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This
is implemented using a Virtualization Exception (#VE) that is handled by
the guest kernel. Some #VEs are handled entirely inside the guest kernel,
but others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.
New TDX Exceptions
------------------

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.

Instruction-based #VE
~~~~~~~~~~~~~~~~~~~~~

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
~~~~~~~~~~~~~~~~~~~~~

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*
RDMSR/WRMSR Behavior
~~~~~~~~~~~~~~~~~~~~

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests. Their use likely
indicates a bug in the guest. The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs are typically able to be handled by the hypervisor. Guests
can make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling. They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
CPUID Behavior
~~~~~~~~~~~~~~

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the
  guest TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0. For these bit
  fields, the hypervisor can mask off the native values, but it can not
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module
does not know how to handle. The guest kernel may ask the hypervisor for
the value with a hypercall.
#VE on Memory Accesses
----------------------

There are essentially two classes of TDX memory: private and shared.

Private memory receives full TDX protections. Its content is protected
against access from the hypervisor. Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared. It selects the behavior with a bit in its page table
entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.

#VE on Shared Memory
~~~~~~~~~~~~~~~~~~~~

Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must
be careful to only reference shared pages for which it can safely handle
a #VE. For instance, the guest should be careful not to access shared
memory in the #VE handler before it reads the #VE info structure
(TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The
guest should only use shared mappings for communicating with the
hypervisor. Shared mappings must never be used for sensitive memory
content like kernel stacks. A good rule of thumb is that
hypervisor-shared memory should be treated the same as memory mapped to
userspace. Both the hypervisor and userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory. The guest must
be careful not to access device MMIO regions unless it is also prepared
to handle a #VE.
#VE on Private Pages
~~~~~~~~~~~~~~~~~~~~

An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses. This is not feasible,
so TDX guests ensure that all guest memory has been "accepted" before
memory is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the
firmware before the kernel runs to ensure that the kernel can start up
without being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate
a #VE. It will, instead, cause a "TD Exit" where the hypervisor is
required to handle the exception.
Linux #VE handler
-----------------

Just like page faults or #GPs, #VE exceptions can be either handled or
be fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business. A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues. The TDX #VE architecture anticipated this scenario and includes
a feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts
(including NMIs) are blocked. The block remains in place until the guest
makes a TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when
interrupts or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
MMIO handling
-------------

In non-TDX VMs, MMIO is usually implemented by giving a guest access to
a mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access. That is not possible in TDX guests because a VMEXIT
will expose the register state to the host. TDX guests don't trust the
host and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest. The
guest #VE handler then emulates the MMIO instruction inside the guest
and converts it into a controlled TDCALL to the host, rather than
exposing guest state to the host.

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.
Shared Memory Conversions
-------------------------

All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
drivers might have a need to share data with the hypervisor. To do this,
memory must be converted between shared and private. This can be
accomplished using some existing memory encryption helpers:

* set_memory_decrypted() converts a range of pages to shared.
* set_memory_encrypted() converts memory back to private.
Device drivers are the primary user of shared memory, but there's no
need to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer gets converted on the
allocation. Check force_dma_unencrypted() for details.
Attestation
===========

Attestation is used to verify the TDX guest trustworthiness to other
entities before provisioning secrets to the guest. For example, a key
server may want to use attestation to verify that the guest is the
desired one before releasing the encryption keys to mount the encrypted
rootfs or a secondary drive.

The TDX module records the state of the TDX guest in various stages of
the guest boot process using the build time measurement register (MRTD)
and runtime measurement registers (RTMR). Measurements related to the
guest initial configuration and firmware image are recorded in the MRTD
register. Measurements related to initial state, kernel image, firmware
image, command line options, initrd, ACPI tables, etc. are recorded in
the RTMR registers. For more details, as an example, please refer to the
TDX Virtual Firmware design specification, section titled "TD
Measurement". At TDX guest runtime, the attestation process is used to
attest to these measurements.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

The TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT
(TDREPORT_STRUCT) from the TDX module. TDREPORT is a fixed-size data
structure generated by the TDX module which contains guest-specific
information (such as build and boot measurements), platform security
version, and the MAC to protect the integrity of the TDREPORT. A
user-provided 64-byte REPORTDATA is used as input and included in the
TDREPORT. Typically it can be some nonce provided by the attestation
service so the TDREPORT can be verified uniquely. More details about the
TDREPORT can be found in the Intel TDX Module specification, section
titled "TDG.MR.REPORT Leaf".

After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) to generate the Quote. The
TDREPORT by design can only be verified on the local platform as the MAC
key is bound to the platform. To support remote verification of the
TDREPORT, TDX leverages the Intel SGX Quoting Enclave to verify the
TDREPORT locally and convert it to a remotely verifiable Quote. The
method of sending the TDREPORT to the QE is implementation specific.
Attestation software can choose whatever communication channel is
available (e.g. vsock or TCP/IP) to send the TDREPORT to the QE and
receive the Quote.
References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html