.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=========================
In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor. This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor. The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA). Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs. A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware. See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux. But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms. Consequently, vPCI devices on Hyper-V
have a dual identity. They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices. The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a
bare-metal system. Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare metal. Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables. vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time. In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host. That
connection has a single VMBus channel. The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux. Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be
interrupted in the guest, or if the vPCI device is removed
from the VM while the VM is running. The ongoing operation of
the device happens directly between the Linux device driver
for the device and the hardware, with VMBus and the VMBus
channel playing no role.

PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows. Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge. The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device. The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions. The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.
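
As a rough, hedged illustration of the derivation, the standalone
sketch below builds a candidate domain ID from two bytes of a GUID and
falls back to a placeholder value on a collision. The byte positions
and the fallback are assumptions made for the sketch only; the
authoritative algorithm is in hv_pci_probe()::

  #include <stdint.h>
  #include <stdio.h>

  /*
   * Sketch only: derive a candidate PCI domain ID from two bytes of a
   * VMBus instance GUID, falling back to another value if the candidate
   * collides with a domain that is already in use.  The real collision
   * resolution in hv_pci_probe() is more involved.
   */
  static int domain_in_use(uint16_t dom)
  {
          /* Stand-in for a lookup of the PCI domains already in use. */
          return dom == 0;        /* pretend domain 0 is always taken */
  }

  static uint16_t derive_pci_domain(const uint8_t guid[16])
  {
          uint16_t dom = (uint16_t)(guid[5] << 8) | guid[4];

          if (domain_in_use(dom))
                  dom = 0xffff;   /* placeholder fallback, not the real scheme */
          return dom;
  }

  int main(void)
  {
          const uint8_t instance_guid[16] = {
                  0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0,
                  0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88,
          };

          printf("candidate PCI domain: 0x%04x\n",
                 derive_pci_domain(instance_guid));
          return 0;
  }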

hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device. This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter d0. See
hv_pci_enter_d0(). When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs. That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.

Finally, hv_pci_probe() creates the root PCI bus. At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM. The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device. Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device. When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed. At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device. Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel. The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest. If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference. Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.

After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message. If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.
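
The handshake can be summarized by the standalone sketch below. The
function names are placeholders for work done by the guest driver and
by the host; only the ordering of the messages and the 60-second host
deadline come from the description above::

  #include <stdio.h>

  /* 60-second deadline the host gives the guest to complete the eject. */
  #define EJECT_DEADLINE_SECONDS  60

  /* Placeholder for the guest-side work: async removal of the PCI device. */
  static int guest_handles_eject(void)
  {
          printf("guest: Eject received; scheduling asynchronous removal\n");
          printf("guest: PCI subsystem shuts down and removes the device\n");
          printf("guest: sending Ejection Complete\n");
          return 1;       /* completed within the deadline */
  }

  int main(void)
  {
          if (!guest_handles_eject())
                  printf("host: no Ejection Complete within %d seconds; "
                         "forcing the rescind\n", EJECT_DEADLINE_SECONDS);
          printf("host: sending VMBus rescind; guest drops the VMBus identity\n");
          return 0;
  }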

Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky. Ejection has been
observed even before a newly offered vPCI device has been
fully set up. The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times. Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X. Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces. For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU. Finally, hv_irq_unmask() is called (on x86) or the
GICD registers are set (on arm64) to specify the real vCPU
again. Each of these three calls interacts with Hyper-V, which
must decide which physical CPU should receive the interrupt
before it is forwarded to the guest VM. Unfortunately, the
Hyper-V decision-making process is a bit limited, and can
result in concentrating the physical interrupts on a single
CPU, causing a performance bottleneck. See details about how
this is resolved in the extensive comment above the function
hv_compose_msi_req_get_cpu().
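
The idea behind spreading the proposed interrupt targets, rather than
always naming the same vCPU, can be modeled with the standalone sketch
below. The 32-bit mask, the per-device counter, and the helper are
illustrative assumptions; the real policy is in
hv_compose_msi_req_get_cpu()::

  #include <stdint.h>
  #include <stdio.h>

  /*
   * Sketch only: rotate through the CPUs allowed by an affinity mask so
   * that successive interrupts propose different target CPUs instead of
   * piling onto one CPU.
   */
  static unsigned int pick_target_cpu(uint32_t cpu_mask, unsigned int *counter)
  {
          unsigned int allowed = __builtin_popcount(cpu_mask);
          unsigned int nth;

          if (allowed == 0)
                  return 0;       /* empty mask: fall back to CPU 0 */

          nth = (*counter)++ % allowed;
          for (unsigned int cpu = 0; cpu < 32; cpu++) {
                  if (!(cpu_mask & (1u << cpu)))
                          continue;
                  if (nth-- == 0)
                          return cpu;
          }
          return 0;               /* not reached for a non-empty mask */
  }

  int main(void)
  {
          unsigned int counter = 0;
          uint32_t mask = 0x0000000f;     /* CPUs 0-3 allowed */

          for (int i = 0; i < 6; i++)
                  printf("interrupt %d -> CPU %u\n", i,
                         pick_target_cpu(mask, &counter));
          return 0;
  }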

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message. Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt. Instead, hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message. As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well. See comments in the code regarding this
very tricky area.
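
A standalone model of this send-then-poll pattern is sketched below. In
the real driver the two flags are set from channel-callback context and
the loop also services pending channel messages; here they are plain
atomics set inline so the sketch stays runnable::

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Set from "interrupt" context in the real driver; set inline here. */
  static atomic_bool reply_received;
  static atomic_bool device_rescinded;

  /*
   * Sketch only: spin (no sleeping, since IRQ locks may be held) until
   * either the host's reply arrives or the device is rescinded.
   */
  static bool wait_for_reply(void)
  {
          for (;;) {
                  if (atomic_load(&reply_received))
                          return true;    /* use the host's reply */
                  if (atomic_load(&device_rescinded))
                          return false;   /* device went away mid-request */
          }
  }

  int main(void)
  {
          /* send_vmbus_message(...);  -- hypothetical send step */
          atomic_store(&reply_received, true);    /* simulate the reply */
          printf("%s\n", wait_for_reply() ? "got reply" : "rescinded");
          return 0;
  }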

Most of the code in the Hyper-V virtual PCI driver
(pci-hyperv.c) applies to Hyper-V and Linux guests running on
x86 and on arm64 architectures. But there are differences in
how interrupt assignments are managed. On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt. This
hypercall is made by hv_arch_irq_unmask(). On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt. The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86. Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs. If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting and
everything works properly. However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break. Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen. Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.
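
For example, retargeting an unmanaged IRQ from user space looks like
the sketch below; the IRQ number and the CPU mask are placeholders.
The write ends up in the Hyper-V virtual PCI driver, which asks the
host to retarget the interrupt::

  #include <stdio.h>

  int main(void)
  {
          /* IRQ 24 and the mask value are placeholders for illustration. */
          FILE *f = fopen("/proc/irq/24/smp_affinity", "w");

          if (!f) {
                  perror("fopen");
                  return 1;
          }
          fputs("4\n", f);        /* hex CPU bitmask: deliver to CPU 2 */
          fclose(f);
          return 0;
  }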

DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory. Hence
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers. The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host. From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA. When running on x86, this behavior is
required by the architecture. When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT. But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a
VMBus device and as a PCI device). See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.

vPCI protocol versions
----------------------
As previously described, during vPCI device setup and teardown,
messages are passed over a VMBus channel between the Hyper-V
host and the Hyper-V vPCI driver in the Linux guest. Some
messages have been revised in newer versions of Hyper-V, so
the guest and host must agree on the vPCI protocol version to
be used. The version is negotiated when communication over
the VMBus channel is first established. See
hv_pci_protocol_negotiation(). Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in
the underlying hardware.
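
The negotiation is a newest-first proposal loop, roughly as in the
standalone sketch below. The version numbers and the host-side check
are made-up placeholders; see hv_pci_protocol_negotiation() and the
protocol version definitions in the driver for the real values::

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Placeholder: pretend the host accepts nothing newer than this. */
  static bool host_accepts(uint32_t version)
  {
          return version <= 0x00010002;
  }

  int main(void)
  {
          /* Made-up version numbers, listed newest first. */
          const uint32_t guest_versions[] = {
                  0x00010004, 0x00010003, 0x00010002, 0x00010001,
          };
          size_t n = sizeof(guest_versions) / sizeof(guest_versions[0]);

          for (size_t i = 0; i < n; i++) {
                  if (host_accepts(guest_versions[i])) {
                          printf("negotiated version 0x%08x\n",
                                 guest_versions[i]);
                          return 0;
                  }
          }
          printf("no common version; device setup fails\n");
          return 1;
  }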

Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver. See
hv_pci_assign_numa_node(). If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0. But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options. If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately it is not possible to distinguish the two cases
from the guest side.
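
The defaulting behavior, and the ambiguity of a reported value of 0,
can be modeled with the sketch below; the names are illustrative, and
the real logic is in hv_pci_assign_numa_node()::

  #include <stdbool.h>
  #include <stdio.h>

  static int assign_numa_node(bool protocol_has_numa_info, int reported_node)
  {
          if (!protocol_has_numa_info)
                  return 0;       /* old protocol: default to node 0 */

          /*
           * Ambiguity: a reported 0 may mean "NUMA node 0" or "the host
           * had no information to report"; the guest cannot tell which.
           */
          return reported_node;
  }

  int main(void)
  {
          printf("old protocol: node %d\n", assign_numa_node(false, -1));
          printf("new protocol, host reports 0: node %d\n",
                 assign_numa_node(true, 0));
          return 0;
  }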

PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver. In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must invoke
hypercalls with explicit arguments describing the access to be
made.
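
The difference between the two paths amounts to the branch sketched
below. Both helpers are stand-ins: the MMIO path models a plain load
from the mapped config window that the host intercepts, and the CoCo
path models a hypercall that names the offset (and, in reality, the
device and size) explicitly::

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t mmio_config_read(unsigned int offset)
  {
          printf("MMIO read at offset 0x%x; host intercepts the access\n",
                 offset);
          return 0;
  }

  static uint32_t hypercall_config_read(unsigned int offset)
  {
          printf("hypercall read at offset 0x%x; no instruction emulation\n",
                 offset);
          return 0;
  }

  static uint32_t read_config(bool coco_vm, unsigned int offset)
  {
          return coco_vm ? hypercall_config_read(offset)
                         : mmio_config_read(offset);
  }

  int main(void)
  {
          read_config(false, 0x10);       /* normal VM */
          read_config(true, 0x10);        /* CoCo VM */
          return 0;
  }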

Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest. The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel. As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
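
An in-kernel consumer would use the read side roughly as sketched
below. The prototype shown for hyperv_read_cfg_blk() and the block ID
are assumptions for this sketch; check include/linux/hyperv.h for the
authoritative declarations::

  #include <linux/hyperv.h>
  #include <linux/pci.h>

  /* Made-up block ID, purely for illustration. */
  #define DIAG_BLOCK_ID  1

  static int read_diag_block(struct pci_dev *pdev, void *buf, unsigned int len)
  {
          unsigned int bytes_returned = 0;
          int ret;

          /* Assumed prototype: (pdev, buffer, length, block ID, out-length). */
          ret = hyperv_read_cfg_blk(pdev, buf, len, DIAG_BLOCK_ID,
                                    &bytes_returned);
          if (ret)
                  return ret;

          return bytes_returned;
  }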