- .. SPDX-License-Identifier: GPL-2.0
- =====================================
- Intel Trust Domain Extensions (TDX)
- =====================================
- Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
- the host and physical attacks by isolating the guest register state and by
- encrypting the guest memory. In TDX, a special module running in a special
- mode sits between the host and the guest and manages the guest/host
- separation.
- TDX Host Kernel Support
- =======================
- TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
- a new isolated range pointed to by the SEAM Range Register (SEAMRR). A
- CPU-attested software module called 'the TDX module' runs inside the new
- isolated range to provide the functionality to manage and run protected
- VMs.
- TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
- provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs
- as TDX private KeyIDs, which are only accessible within the SEAM mode.
- BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
- Before the TDX module can be used to create and run protected VMs, it
- must be loaded into the isolated range and properly initialized. The TDX
- architecture doesn't require the BIOS to load the TDX module, but the
- kernel assumes it is loaded by the BIOS.
- TDX boot-time detection
- -----------------------
- The kernel detects TDX by detecting TDX private KeyIDs during kernel
- boot. The following dmesg line is shown when TDX is enabled by the BIOS::
- [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
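The range in the dmesg line uses half-open notation: `[16, 64)` includes KeyID 16 and excludes KeyID 64. The following is a minimal sketch of that interpretation. The struct and helper names are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open KeyID range [start, end), as printed in the dmesg line. */
struct keyid_range {
	uint32_t start;	/* first TDX private KeyID */
	uint32_t end;	/* one past the last TDX private KeyID */
};

/* Number of TDX private KeyIDs in the range. */
static uint32_t keyid_range_count(struct keyid_range r)
{
	assert(r.start <= r.end);
	return r.end - r.start;
}

/* True if keyid falls inside the TDX private range. */
static bool keyid_is_private(struct keyid_range r, uint32_t keyid)
{
	return keyid >= r.start && keyid < r.end;
}
```

With the example range above, `keyid_range_count()` yields 48 private KeyIDs, and KeyID 64 is not part of the range.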
- TDX module initialization
- -------------------------
- The kernel talks to the TDX module via the new SEAMCALL instruction. The
- TDX module implements SEAMCALL leaf functions to allow the kernel to
- initialize it.
- If the TDX module isn't loaded, the SEAMCALL instruction fails with a
- special error. In this case the kernel fails the module initialization
- and reports the module isn't loaded::
- [..] virt/tdx: module not loaded
- Initializing the TDX module consumes roughly 1/256th of system RAM, which
- is used as 'metadata' for the TDX memory. It also takes additional CPU
- time to initialize that metadata along with the TDX module itself. Neither
- cost is trivial, so the kernel initializes the TDX module at runtime, on
- demand.
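As a back-of-the-envelope check of the 1/256th figure, the sketch below estimates the metadata size from the amount of system RAM. It is an illustration only: `tdx_metadata_estimate_kib()` is a hypothetical helper, not a kernel function, and the real allocation depends on the memory layout and on the TDX module.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): estimate the TDX metadata
 * size as roughly 1/256th of system RAM. Both values are in KiB.
 */
static uint64_t tdx_metadata_estimate_kib(uint64_t ram_kib)
{
	return ram_kib / 256;
}

/* Example: an assumed 64 GiB host needs about 256 MiB of metadata. */
static void tdx_metadata_example(void)
{
	uint64_t ram_kib = 64ULL * 1024 * 1024;	/* 64 GiB in KiB */

	assert(tdx_metadata_estimate_kib(ram_kib) == 256 * 1024);
}
```

The estimate is a lower-order guide for capacity planning, not a value the kernel computes this way.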
- Besides initializing the TDX module, a per-cpu initialization SEAMCALL
- must be done on each cpu before any other SEAMCALL can be made on that
- cpu.
- The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
- allow the user of TDX to enable the TDX module and to enable TDX on the
- local cpu, respectively.
- Making a SEAMCALL requires that VMXON has been done on that CPU. Currently
- only KVM implements VMXON. For now, both tdx_enable() and tdx_cpu_enable()
- don't do VMXON internally (not trivial), but depend on the caller to
- guarantee that.
- To enable TDX, the caller of TDX should: 1) temporarily disable CPU
- hotplug; 2) do VMXON and tdx_cpu_enable() on all online cpus; 3) call
- tdx_enable(). For example::
-         cpus_read_lock();
-         /* vmxon_and_tdx_cpu_enable() is a caller-provided helper */
-         on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
-         ret = tdx_enable();
-         cpus_read_unlock();
-         if (ret)
-                 goto no_tdx;
-         // TDX is ready to use
- The caller of TDX must also guarantee that tdx_cpu_enable() has succeeded
- on a cpu before running any other SEAMCALL on that cpu. A typical approach
- is to do both VMXON and tdx_cpu_enable() in the CPU hotplug online
- callback, and to refuse to online the cpu if tdx_cpu_enable() fails.
- The user can consult dmesg to see whether the TDX module has been
- initialized. If the TDX module is initialized successfully, dmesg shows
- something like below::
- [..] virt/tdx: 262668 KBs allocated for PAMT
- [..] virt/tdx: module initialized
- If the TDX module fails to initialize, dmesg reports the failure
- instead::
- [..] virt/tdx: module initialization failed ...
- TDX Interaction with Other Kernel Components
- --------------------------------------------
- TDX Memory Policy
- ~~~~~~~~~~~~~~~~~
- TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
- kernel which memory is TDX compatible. The kernel needs to build a list
- of memory regions (out of the CMRs) as "TDX-usable" memory and pass those
- regions to the TDX module. Once this is done, those "TDX-usable" memory
- regions are fixed for the module's lifetime.
- To keep things simple, the kernel currently guarantees that all pages
- in the page allocator are TDX memory. Specifically, the kernel uses all
- system memory in the core-mm "at the time of TDX module initialization"
- as TDX memory and, at the same time, refuses to online any non-TDX memory
- during memory hotplug.
- Physical Memory Hotplug
- ~~~~~~~~~~~~~~~~~~~~~~~
- Note that TDX assumes convertible memory is always physically present
- during the machine's runtime. A non-buggy BIOS should never support
- hot-removal of any convertible memory. This implementation doesn't handle
- ACPI memory removal but depends on the BIOS to behave correctly.
- CPU Hotplug
- ~~~~~~~~~~~
- The TDX module requires that the per-cpu initialization SEAMCALL be done
- on a cpu before any other SEAMCALL can be made on that cpu. The kernel
- provides tdx_cpu_enable() to let the user of TDX do this when it wants to
- use a new cpu for a TDX task.
- TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
- TDX verifies all boot-time present logical CPUs are TDX compatible before
- enabling TDX. A non-buggy BIOS should never support hot-add/removal of
- physical CPUs. Currently the kernel doesn't handle physical CPU hotplug,
- but depends on the BIOS to behave correctly.
- Note that TDX works with logical CPU online/offline, so the kernel still
- allows a logical CPU to be offlined and onlined again.
- Kexec()
- ~~~~~~~
- TDX host support currently lacks the ability to handle kexec. For
- simplicity, only one of TDX host support and kexec can be enabled in
- Kconfig. This will be fixed in the future.
- Erratum
- ~~~~~~~
- The first few generations of TDX hardware have an erratum. A partial
- write to a TDX private memory cacheline will silently "poison" the
- line. Subsequent reads will consume the poison and generate a machine
- check.
- A partial write is a memory write where a write transaction of less than
- a cacheline lands at the memory controller. The CPU does these via
- non-temporal write instructions (like MOVNTI), or through UC/WC memory
- mappings. Devices can also do partial writes via DMA.
- Theoretically, a kernel bug could do a partial write to TDX private memory
- and trigger an unexpected machine check. What's more, the machine check
- code will present these as "Hardware error" when they were, in fact, a
- software-triggered issue. But in the end, this issue is hard to trigger.
- If the platform has this erratum, the kernel prints an additional message
- in the machine check handler to tell the user that the machine check may
- be caused by a kernel bug on TDX private memory.
- Interaction vs S3 and deeper states
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- TDX cannot survive S3 and deeper states. The hardware resets and
- disables TDX completely when the platform goes to S3 or deeper. Both TDX
- guests and the TDX module get destroyed permanently.
- The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
- hibernation. Currently, for simplicity, the kernel chooses to make TDX
- mutually exclusive with S3 and hibernation.
- The kernel disables TDX during early boot when hibernation support is
- available::
- [..] virt/tdx: initialization failed: Hibernation support is enabled
- Add 'nohibernate' to the kernel command line to disable hibernation in
- order to use TDX.
- ACPI S3 is disabled during kernel early boot if TDX is enabled. The user
- needs to turn off TDX in the BIOS in order to use S3.
- TDX Guest Support
- =================
- Since the host cannot directly access guest registers or memory, much
- normal functionality of a hypervisor must be moved into the guest. This is
- implemented using a Virtualization Exception (#VE) that is handled by the
- guest kernel. Some #VEs are handled entirely inside the guest kernel, but
- some require the hypervisor to be consulted.
- TDX includes new hypercall-like mechanisms for communicating from the
- guest to the hypervisor or the TDX module.
- New TDX Exceptions
- ------------------
- TDX guests behave differently from bare-metal and traditional VMX guests.
- In TDX guests, otherwise normal instructions or memory accesses can cause
- #VE or #GP exceptions.
- Instructions marked with an '*' conditionally cause exceptions. The
- details for these instructions are discussed below.
- Instruction-based #VE
- ~~~~~~~~~~~~~~~~~~~~~
- - Port I/O (INS, OUTS, IN, OUT)
- - HLT
- - MONITOR, MWAIT
- - WBINVD, INVD
- - VMCALL
- - RDMSR*,WRMSR*
- - CPUID*
- Instruction-based #GP
- ~~~~~~~~~~~~~~~~~~~~~
- - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
- VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- - ENCLS, ENCLU
- - GETSEC
- - RSM
- - ENQCMD
- - RDMSR*,WRMSR*
- RDMSR/WRMSR Behavior
- ~~~~~~~~~~~~~~~~~~~~
- MSR access behavior falls into three categories:
- - #GP generated
- - #VE generated
- - "Just works"
- In general, the #GP MSRs should not be used in guests. Their use likely
- indicates a bug in the guest. The guest may try to handle the #GP with a
- hypercall but it is unlikely to succeed.
- The #VE MSRs are typically able to be handled by the hypervisor. Guests
- can make a hypercall to the hypervisor to handle the #VE.
- The "just works" MSRs do not need any special guest handling. They might
- be implemented by directly passing through the MSR to the hardware or by
- trapping and handling in the TDX module. Other than possibly being slow,
- these MSRs appear to function just as they would on bare metal.
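The three MSR categories and the guest response described above can be summarized as a small decision table. This is a minimal sketch: the enum and function names are illustrative assumptions, and it deliberately does not map any concrete MSR to a category.

```c
#include <assert.h>

/* The three MSR access categories described above. */
enum msr_category {
	MSR_CAT_GP,		/* access raises #GP */
	MSR_CAT_VE,		/* access raises #VE */
	MSR_CAT_JUST_WORKS,	/* no guest handling needed */
};

/* What a guest would typically do for each category (illustrative). */
enum msr_action {
	MSR_ACT_NONE,		/* nothing special to do */
	MSR_ACT_HYPERCALL,	/* ask the hypervisor from the #VE handler */
	MSR_ACT_AVOID,		/* likely a guest bug; do not use the MSR */
};

static enum msr_action msr_action_for(enum msr_category cat)
{
	switch (cat) {
	case MSR_CAT_VE:
		return MSR_ACT_HYPERCALL;
	case MSR_CAT_JUST_WORKS:
		return MSR_ACT_NONE;
	case MSR_CAT_GP:
		return MSR_ACT_AVOID;
	}
	assert(!"unknown MSR category");
	return MSR_ACT_AVOID;
}
```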
- CPUID Behavior
- ~~~~~~~~~~~~~~
- For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
- return values (in guest EAX/EBX/ECX/EDX) are configurable by the
- hypervisor. For such cases, the Intel TDX module architecture defines two
- virtualization types:
- - Bit fields for which the hypervisor controls the value seen by the guest
- TD.
- - Bit fields for which the hypervisor configures the value such that the
- guest TD either sees their native value or a value of 0. For these bit
- fields, the hypervisor can mask off the native values, but it can not
- turn *on* values.
- A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
- not know how to handle. The guest kernel may ask the hypervisor for the
- value with a hypercall.
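The second virtualization type above ("native value or 0") behaves like a bitwise AND of the native CPUID value with a hypervisor-chosen mask. The sketch below models that behavior. The function and parameter names are assumptions made for illustration and do not correspond to a TDX module interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a "native or 0" CPUID bit field: the guest sees a native
 * bit only if the hypervisor leaves it enabled in allow_mask.
 * Both parameter names are illustrative assumptions.
 */
static uint32_t cpuid_native_or_zero(uint32_t native, uint32_t allow_mask)
{
	uint32_t guest = native & allow_mask;

	/* The hypervisor can clear bits but can never turn one on. */
	assert((guest & ~native) == 0);
	return guest;
}
```

For example, with a native value of `0x0F` and a mask of `0xF3`, the guest sees `0x03`: the mask bits that are not set natively have no effect.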
- #VE on Memory Accesses
- ----------------------
- There are essentially two classes of TDX memory: private and shared.
- Private memory receives full TDX protections. Its content is protected
- against access from the hypervisor. Shared memory is expected to be
- shared between guest and hypervisor and does not receive full TDX
- protections.
- A TD guest is in control of whether its memory accesses are treated as
- private or shared. It selects the behavior with a bit in its page table
- entries. This helps ensure that a guest does not place sensitive
- information in shared memory, exposing it to the untrusted hypervisor.
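The private/shared selection can be modeled as setting or clearing one bit in a page table entry. The sketch below assumes the shared bit is the top bit of the guest physical address width (`gpa_width`). That position, and all helper names here, are assumptions for illustration rather than kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative model only. The shared bit is assumed here to be the
 * top bit of the guest physical address width (gpa_width).
 */
static uint64_t tdx_shared_mask(unsigned int gpa_width)
{
	assert(gpa_width >= 1 && gpa_width <= 64);
	return 1ULL << (gpa_width - 1);
}

/* Mark a page table entry as shared with the hypervisor. */
static uint64_t pte_mk_shared(uint64_t pte, uint64_t shared_mask)
{
	return pte | shared_mask;
}

/* Mark a page table entry as private (the default for guest memory). */
static uint64_t pte_mk_private(uint64_t pte, uint64_t shared_mask)
{
	return pte & ~shared_mask;
}

static bool pte_is_shared(uint64_t pte, uint64_t shared_mask)
{
	return (pte & shared_mask) != 0;
}
```

Because the guest owns its page tables, the hypervisor cannot flip this bit on the guest's behalf. Sharing is always an explicit guest decision.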
- #VE on Shared Memory
- ~~~~~~~~~~~~~~~~~~~~
- Access to shared mappings can cause a #VE. The hypervisor ultimately
- controls whether a shared memory access causes a #VE, so the guest must be
- careful to only reference shared pages for which it can safely handle a
- #VE. For instance, the guest should be careful not to access shared memory
- in the #VE handler before it reads the #VE info structure
- (TDG.VP.VEINFO.GET).
- Shared mapping content is entirely controlled by the hypervisor. The guest
- should only use shared mappings for communicating with the hypervisor.
- Shared mappings must never be used for sensitive memory content like kernel
- stacks. A good rule of thumb is that hypervisor-shared memory should be
- treated the same as memory mapped to userspace. Both the hypervisor and
- userspace are completely untrusted.
- MMIO for virtual devices is implemented as shared memory. The guest must
- be careful not to access device MMIO regions unless it is also prepared to
- handle a #VE.
- #VE on Private Pages
- ~~~~~~~~~~~~~~~~~~~~
- An access to private mappings can also cause a #VE. Since all kernel
- memory is also private memory, the kernel might theoretically need to
- handle a #VE on arbitrary kernel memory accesses. This is not feasible, so
- TDX guests ensure that all guest memory has been "accepted" before memory
- is used by the kernel.
- A modest amount of memory (typically 512M) is pre-accepted by the firmware
- before the kernel runs to ensure that the kernel can start up without
- being subjected to a #VE.
- The hypervisor is permitted to unilaterally move accepted pages to a
- "blocked" state. However, if it does this, page access will not generate a
- #VE. It will, instead, cause a "TD Exit" where the hypervisor is required
- to handle the exception.
- Linux #VE handler
- -----------------
- Just like page faults or #GP's, #VE exceptions can be either handled or be
- fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
- An unhandled kernel #VE results in an oops.
- Handling nested exceptions on x86 is typically nasty business. A #VE
- could be interrupted by an NMI which triggers another #VE and hilarity
- ensues. The TDX #VE architecture anticipated this scenario and includes a
- feature to make it slightly less nasty.
- During #VE handling, the TDX module ensures that all interrupts (including
- NMIs) are blocked. The block remains in place until the guest makes a
- TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts
- or a new #VE can be delivered.
- However, the guest kernel must still be careful to avoid potential
- #VE-triggering actions (discussed above) while this block is in place.
- While the block is in place, any #VE is elevated to a double fault (#DF)
- which is not recoverable.
- MMIO handling
- -------------
- In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
- mapping which will cause a VMEXIT on access, and then the hypervisor
- emulates the access. That is not possible in TDX guests because VMEXIT
- will expose the register state to the host. TDX guests don't trust the host
- and can't have their state exposed to the host.
- In TDX, MMIO regions typically trigger a #VE exception in the guest. The
- guest #VE handler then emulates the MMIO instruction inside the guest and
- converts it into a controlled TDCALL to the host, rather than exposing
- guest state to the host.
- MMIO addresses on x86 are just special physical addresses. They can
- theoretically be accessed with any instruction that accesses memory.
- However, the kernel instruction decoding method is limited. It is only
- designed to decode instructions like those generated by io.h macros.
- MMIO access via other means (like structure overlays) may result in an
- oops.
- Shared Memory Conversions
- -------------------------
- All TDX guest memory starts out as private at boot. This memory can not
- be accessed by the hypervisor. However, some kernel users like device
- drivers might have a need to share data with the hypervisor. To do this,
- memory must be converted between shared and private. This can be
- accomplished using some existing memory encryption helpers:
- * set_memory_decrypted() converts a range of pages to shared.
- * set_memory_encrypted() converts memory back to private.
- Device drivers are the primary user of shared memory, but there's no need
- to touch every driver. DMA buffers and ioremap() do the conversions
- automatically.
- TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
- converted to shared on boot.
- For coherent DMA allocations, the DMA buffer gets converted at
- allocation time. Check force_dma_unencrypted() for details.
- Attestation
- ===========
- Attestation is used to verify the TDX guest's trustworthiness to other
- entities before provisioning secrets to the guest. For example, a key
- server may want to use attestation to verify that the guest is the
- desired one before releasing the encryption keys to mount the encrypted
- rootfs or a secondary drive.
- The TDX module records the state of the TDX guest in various stages of
- the guest boot process using the build time measurement register (MRTD)
- and runtime measurement registers (RTMR). Measurements related to the
- guest initial configuration and firmware image are recorded in the MRTD
- register. Measurements related to initial state, kernel image, firmware
- image, command line options, initrd, ACPI tables, etc. are recorded in
- RTMR registers. For more details, refer for example to the TDX Virtual
- Firmware design specification, section titled "TD Measurement".
- At TDX guest runtime, the attestation process is used to attest to these
- measurements.
- The attestation process consists of two steps: TDREPORT generation and
- Quote generation.
- The TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT
- (TDREPORT_STRUCT) from the TDX module. TDREPORT is a fixed-size data
- structure generated by the TDX module which contains guest-specific
- information (such as build and boot measurements), the platform security
- version, and the MAC that protects the integrity of the TDREPORT. A
- user-provided 64-byte REPORTDATA is used as input and included in the
- TDREPORT. Typically it is a nonce provided by the attestation service so
- that the TDREPORT can be verified uniquely. More details about the
- TDREPORT can be found in the Intel TDX Module specification, section
- titled "TDG.MR.REPORT Leaf".
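Since REPORTDATA is a fixed 64-byte input, a shorter nonce has to be placed into a buffer of that size. The sketch below shows one way to do this with zero padding. `fill_reportdata()` and `REPORTDATA_LEN` are hypothetical names used for illustration, and the padding scheme is an assumption: the attestation protocol in use decides how the 64 bytes are actually laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REPORTDATA_LEN 64	/* 64-byte REPORTDATA, per the text above */

/*
 * Hypothetical helper: copy a nonce into a zero-padded REPORTDATA
 * buffer. Returns 0 on success, -1 if the nonce does not fit.
 */
static int fill_reportdata(uint8_t out[REPORTDATA_LEN],
			   const uint8_t *nonce, size_t nonce_len)
{
	assert(out != NULL);

	if (nonce_len > REPORTDATA_LEN || (nonce == NULL && nonce_len != 0))
		return -1;

	memset(out, 0, REPORTDATA_LEN);
	if (nonce_len)
		memcpy(out, nonce, nonce_len);
	return 0;
}
```

The filled buffer would then be passed as the REPORTDATA input when requesting the TDREPORT.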
- After getting the TDREPORT, the second step of the attestation process
- is to send it to the Quoting Enclave (QE) to generate the Quote. By
- design, a TDREPORT can only be verified on the local platform, as the MAC
- key is bound to the platform. To support remote verification of the
- TDREPORT, TDX leverages the Intel SGX Quoting Enclave to verify the
- TDREPORT locally and convert it to a remotely verifiable Quote. The method
- of sending the TDREPORT to the QE is implementation specific. Attestation
- software can choose whatever communication channel is available (e.g.
- vsock or TCP/IP) to send the TDREPORT to the QE and receive the Quote.
- References
- ==========
- TDX reference material is collected here:
- https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html