Coherent Accelerator Interface (CXL)
====================================

Introduction
============

The coherent accelerator interface is designed to allow the
coherent connection of accelerators (FPGAs and other devices) to a
POWER system. These devices need to adhere to the Coherent
Accelerator Interface Architecture (CAIA).

IBM refers to this as the Coherent Accelerator Processor Interface
or CAPI. In the kernel it's referred to by the name CXL to avoid
confusion with the ISDN CAPI subsystem.

Coherent in this context means that the accelerator and CPUs can
both access system memory directly and with the same effective
addresses.

Hardware overview
=================

         POWER8/9             FPGA
       +----------+        +---------+
       |          |        |         |
       |   CPU    |        |   AFU   |
       |          |        |         |
       |          |        |         |
       |          |        |         |
       +----------+        +---------+
       |   PHB    |        |         |
       |   +------+        |   PSL   |
       |   | CAPP |<------>|         |
       +---+------+  PCIE  +---------+

The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
unit which is part of the PCIe Host Bridge (PHB). Linux manages it
through calls into OPAL and does not program the CAPP directly.

The FPGA (or coherently attached device) consists of two parts:
the POWER Service Layer (PSL) and the Accelerator Function Unit
(AFU). The AFU is used to implement specific functionality behind
the PSL. The PSL, among other things, provides memory address
translation services to allow each AFU direct access to userspace
memory.

The AFU is the core part of the accelerator (e.g. the compression
or crypto function). The kernel has no knowledge of the function
of the AFU. Only userspace interacts directly with the AFU.

The PSL provides the translation and interrupt services that the
AFU needs. This is what the kernel interacts with. For example, if
the AFU needs to read a particular effective address, it sends
that address to the PSL; the PSL then translates it, fetches the
data from memory and returns it to the AFU. If the PSL has a
translation miss, it interrupts the kernel and the kernel services
the fault. The context in which this fault is serviced depends on
who owns that acceleration function.

    POWER8 <-----> PSL Version 8 is compliant with CAIA Version 1.0.
    POWER9 <-----> PSL Version 9 is compliant with CAIA Version 2.0.

PSL Version 9 provides new features such as:

    * Interaction with the nest MMU on the P9 chip.
    * Native DMA support.
    * Support for sending ASB_Notify messages for host thread wakeup.
    * Support for atomic operations.
    * ....

Cards with a PSL9 won't work on a POWER8 system and cards with a
PSL8 won't work on a POWER9 system.

AFU Modes
=========

There are two programming modes supported by the AFU: dedicated
and AFU directed. An AFU may support one or both modes.

When using dedicated mode only one MMU context is supported. In
this mode, only one userspace process can use the accelerator at
a time.

When using AFU directed mode, up to 16K simultaneous contexts can
be supported. This means up to 16K simultaneous userspace
applications may use the accelerator (although specific AFUs may
support fewer). In this mode, the AFU sends a 16 bit context ID
with each of its requests. This tells the PSL which context is
associated with each operation. If the PSL can't translate an
operation, the ID can also be accessed by the kernel so it can
determine the userspace context associated with an operation.

MMIO space
==========

A portion of the accelerator MMIO space can be directly mapped
from the AFU to userspace. Either the whole space can be mapped or
just a per context portion. The hardware is self describing, hence
the kernel can determine the offset and size of the per context
portion.

Interrupts
==========

AFUs may generate interrupts that are destined for userspace. These
are received by the kernel as hardware interrupts and passed on to
userspace through the read syscall documented below.

Data storage faults and error interrupts are handled by the kernel
driver.

Work Element Descriptor (WED)
=============================

The WED is a 64-bit parameter passed to the AFU when a context is
started. Its format is up to the AFU, so the kernel has no
knowledge of what it represents. Typically it will be the
effective address of a work queue or status block where the AFU
and userspace can share control and status information.

User API
========

1. AFU character devices

For AFUs operating in AFU directed mode, two character device
files will be created. /dev/cxl/afu0.0m will correspond to a
master context and /dev/cxl/afu0.0s will correspond to a slave
context. Master contexts have access to the full MMIO space an
AFU provides. Slave contexts have access to only the per process
MMIO space an AFU provides.

For AFUs operating in dedicated process mode, the driver will
only create a single character device per AFU, called
/dev/cxl/afu0.0d. This will have access to the entire MMIO space
that the AFU provides (like master contexts in AFU directed mode).

The types described below are defined in include/uapi/misc/cxl.h

The following file operations are supported on both slave and
master devices.

A userspace library, libcxl, is available here:

    https://github.com/ibm-capi/libcxl

This provides a C interface to this kernel API.

open
----

Opens the device and allocates a file descriptor to be used with
the rest of the API.

A dedicated mode AFU only has one context and only allows the
device to be opened once.

An AFU directed mode AFU can have many contexts; the device can be
opened once for each context that is available.

When all available contexts are allocated the open call will fail
and return -ENOSPC.

Note: IRQs need to be allocated for each context, which may limit
      the number of contexts that can be created, and therefore
      how many times the device can be opened. The POWER8 CAPP
      supports 2040 IRQs and 3 are used by the kernel, so 2037 are
      left. If 1 IRQ is needed per context, then only 2037
      contexts can be allocated. If 4 IRQs are needed per context,
      then only 2037/4 = 509 contexts can be allocated.
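
The arithmetic in the note can be written down as a small helper. This
is only a sketch: the function name is hypothetical, and the two
constants are simply the POWER8 figures quoted above, not values read
from the hardware or from the driver.

```c
#include <assert.h>

/* Figures quoted in the note above for the POWER8 CAPP. */
#define CAPP_TOTAL_IRQS   2040u
#define CAPP_KERNEL_IRQS  3u

/*
 * Hypothetical helper: upper bound on the number of contexts that the
 * IRQ budget allows when every context needs irqs_per_context IRQs.
 * Integer division rounds down, matching 2037/4 = 509 in the note.
 */
static unsigned int cxl_max_contexts_for_irqs(unsigned int irqs_per_context)
{
    assert(irqs_per_context > 0);
    return (CAPP_TOTAL_IRQS - CAPP_KERNEL_IRQS) / irqs_per_context;
}
```

The result is only the IRQ-imposed bound; the AFU itself may support
fewer contexts than this.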

ioctl
-----

CXL_IOCTL_START_WORK:

    Starts the AFU context and associates it with the current
    process. Once this ioctl is successfully executed, all memory
    mapped into this process is accessible to this AFU context
    using the same effective addresses. No additional calls are
    required to map/unmap memory. The AFU memory context will be
    updated as userspace allocates and frees memory. This ioctl
    returns once the AFU context is started.

    Takes a pointer to a struct cxl_ioctl_start_work:

        struct cxl_ioctl_start_work {
                __u64 flags;
                __u64 work_element_descriptor;
                __u64 amr;
                __s16 num_interrupts;
                __s16 reserved1;
                __s32 reserved2;
                __u64 reserved3;
                __u64 reserved4;
                __u64 reserved5;
                __u64 reserved6;
        };

    flags:
        Indicates which optional fields in the structure are
        valid.

    work_element_descriptor:
        The Work Element Descriptor (WED) is a 64-bit argument
        defined by the AFU. Typically this is an effective
        address pointing to an AFU specific structure
        describing what work to perform.

    amr:
        Authority Mask Register (AMR), same as the powerpc
        AMR. This field is only used by the kernel when the
        corresponding CXL_START_WORK_AMR value is specified in
        flags. If not specified the kernel will use a default
        value of 0.

    num_interrupts:
        Number of userspace interrupts to request. This field
        is only used by the kernel when the corresponding
        CXL_START_WORK_NUM_IRQS value is specified in flags.
        If not specified the minimum number required by the
        AFU will be allocated. The min and max number can be
        obtained from sysfs.

    reserved fields:
        For ABI padding and future extensions
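
The way flags gates the optional fields can be sketched as follows.
This is not the driver's code: the struct is a local mirror of the one
shown above (so the sketch builds without the kernel header), the
helper name is hypothetical, and the two flag values are an assumption
believed to match include/uapi/misc/cxl.h. Real code should include
<misc/cxl.h> and use its CXL_START_WORK_* macros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_ioctl_start_work as listed above. */
struct start_work_sketch {
    uint64_t flags;
    uint64_t work_element_descriptor;
    uint64_t amr;
    int16_t  num_interrupts;
    int16_t  reserved1;
    int32_t  reserved2;
    uint64_t reserved3;
    uint64_t reserved4;
    uint64_t reserved5;
    uint64_t reserved6;
};

/* ASSUMPTION: bit values believed to match the uapi header. */
#define SKETCH_START_WORK_AMR       0x1ULL
#define SKETCH_START_WORK_NUM_IRQS  0x2ULL

/*
 * Hypothetical helper: build a request whose WED is the effective
 * address of a userspace structure. A negative num_irqs leaves the
 * NUM_IRQS flag clear, so the kernel would pick the AFU's minimum.
 */
static struct start_work_sketch make_start_work(const void *wed,
                                                int16_t num_irqs)
{
    struct start_work_sketch w;

    memset(&w, 0, sizeof(w));   /* reserved fields stay zero */
    w.work_element_descriptor = (uint64_t)(uintptr_t)wed;
    if (num_irqs >= 0) {
        w.flags |= SKETCH_START_WORK_NUM_IRQS;
        w.num_interrupts = num_irqs;
    }
    return w;
}
```

With the real header, the equivalent struct cxl_ioctl_start_work would
be passed by pointer to ioctl(fd, CXL_IOCTL_START_WORK, &work) on a
file descriptor obtained from open.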

CXL_IOCTL_GET_PROCESS_ELEMENT:

    Get the current context id, also known as the process element.
    The value is returned from the kernel as a __u32.

mmap
----

An AFU may have an MMIO space to facilitate communication with the
AFU. If it does, the MMIO space can be accessed via mmap. The size
and contents of this area are specific to the particular AFU. The
size can be discovered via sysfs.

In AFU directed mode, master contexts are allowed to map all of
the MMIO space and slave contexts are allowed to only map the per
process MMIO space associated with the context. In dedicated
process mode the entire MMIO space can always be mapped.

This mmap call must be done after the START_WORK ioctl.

Care should be taken when accessing MMIO space. Only 32-bit and
64-bit accesses are supported by POWER8. Also, the AFU will be
designed with a specific endianness, so all MMIO accesses should
take endianness into account (the endian(3) helpers such as
le64toh() and be64toh() are recommended). These endian issues
apply equally to shared memory queues the WED may describe.
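
A minimal sketch of endian-aware 64-bit register reads, assuming a
glibc-style <endian.h>. The helper names are hypothetical, and which
of the two applies depends on how the particular AFU was designed.

```c
#define _DEFAULT_SOURCE   /* expose le64toh()/be64toh() in <endian.h> */
#include <assert.h>
#include <endian.h>
#include <stdint.h>

/*
 * Read one naturally aligned 64-bit register with a single access and
 * convert it from the AFU's byte order to the host's. reg would
 * normally point into the area returned by mmap().
 */
static uint64_t mmio_read64_le(const volatile uint64_t *reg)
{
    assert(((uintptr_t)reg & 7) == 0);
    return le64toh(*reg);   /* AFU register is little endian */
}

static uint64_t mmio_read64_be(const volatile uint64_t *reg)
{
    assert(((uintptr_t)reg & 7) == 0);
    return be64toh(*reg);   /* AFU register is big endian */
}
```

The volatile qualifier keeps the compiler from merging or dropping the
access; it does not add any ordering barrier, which a real driver for
an AFU may still need.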

read
----

Reads events from the AFU. Blocks if no events are pending
(unless O_NONBLOCK is supplied). Returns -EIO in the case of an
unrecoverable error or if the card is removed.

read() will always return an integral number of events.

The buffer passed to read() must be at least 4K bytes.

The result of the read will be a buffer of one or more events.
Each event is a struct cxl_event, of varying size.

    struct cxl_event {
            struct cxl_event_header header;
            union {
                    struct cxl_event_afu_interrupt irq;
                    struct cxl_event_data_storage fault;
                    struct cxl_event_afu_error afu_error;
            };
    };

The struct cxl_event_header is defined as:

    struct cxl_event_header {
            __u16 type;
            __u16 size;
            __u16 process_element;
            __u16 reserved1;
    };

    type:
        This defines the type of event. The type determines how
        the rest of the event is structured. These types are
        described below and defined by enum cxl_event_type.

    size:
        This is the size of the event in bytes including the
        struct cxl_event_header. The start of the next event can
        be found at this offset from the start of the current
        event.

    process_element:
        Context ID of the event.

    reserved field:
        For future extensions and padding.

If the event type is CXL_EVENT_AFU_INTERRUPT then the event
structure is defined as:

    struct cxl_event_afu_interrupt {
            __u16 flags;
            __u16 irq; /* Raised AFU interrupt number */
            __u32 reserved1;
    };

    flags:
        These flags indicate which optional fields are present
        in this struct. Currently all fields are mandatory.

    irq:
        The IRQ number sent by the AFU.

    reserved field:
        For future extensions and padding.

If the event type is CXL_EVENT_DATA_STORAGE then the event
structure is defined as:

    struct cxl_event_data_storage {
            __u16 flags;
            __u16 reserved1;
            __u32 reserved2;
            __u64 addr;
            __u64 dsisr;
            __u64 reserved3;
    };

    flags:
        These flags indicate which optional fields are present in
        this struct. Currently all fields are mandatory.

    addr:
        The address that the AFU unsuccessfully attempted to
        access. Valid accesses will be handled transparently by the
        kernel but invalid accesses will generate this event.

    dsisr:
        This field gives information on the type of fault. It is a
        copy of the DSISR from the PSL hardware when the address
        fault occurred. The form of the DSISR is as defined in the
        CAIA.

    reserved fields:
        For future extensions

If the event type is CXL_EVENT_AFU_ERROR then the event structure
is defined as:

    struct cxl_event_afu_error {
            __u16 flags;
            __u16 reserved1;
            __u32 reserved2;
            __u64 error;
    };

    flags:
        These flags indicate which optional fields are present in
        this struct. Currently all fields are mandatory.

    error:
        Error status from the AFU. Defined by the AFU.

    reserved fields:
        For future extensions and padding
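
Because events vary in size, a buffer returned by read() has to be
walked using the size field of each header. The sketch below shows one
way to do that. It uses a local mirror of struct cxl_event_header, a
hypothetical function name, and event type numbers that are an
assumption believed to match enum cxl_event_type; real code should
take both from <misc/cxl.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_event_header as listed above. */
struct event_header_sketch {
    uint16_t type;
    uint16_t size;
    uint16_t process_element;
    uint16_t reserved1;
};

/* ASSUMPTION: numeric values believed to match enum cxl_event_type. */
enum {
    EV_AFU_INTERRUPT = 1,
    EV_DATA_STORAGE  = 2,
    EV_AFU_ERROR     = 3,
    EV_TYPE_MAX      = 4
};

/*
 * Walk len bytes returned by read() and count events per type.
 * Returns the number of events, or -1 if a header is truncated or
 * its size field is inconsistent with the buffer.
 */
static int count_events(const unsigned char *buf, size_t len,
                        unsigned int counts[EV_TYPE_MAX])
{
    size_t off = 0;
    int n = 0;

    memset(counts, 0, EV_TYPE_MAX * sizeof(counts[0]));
    while (off < len) {
        struct event_header_sketch h;

        if (len - off < sizeof(h))
            return -1;
        memcpy(&h, buf + off, sizeof(h));   /* avoids alignment traps */
        if (h.size < sizeof(h) || h.size > len - off)
            return -1;
        if (h.type < EV_TYPE_MAX)
            counts[h.type]++;
        off += h.size;                      /* start of the next event */
        n++;
    }
    return n;
}
```

A caller would typically read() into a buffer of at least 4K bytes, as
required above, and pass the returned byte count as len.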

2. Card character device (PowerVM guest only)

In a PowerVM guest, an extra character device is created for the
card. The device is only used to write (flash) a new image on the
FPGA accelerator. Once the image is written and verified, the
device tree is updated and the card is reset to reload the updated
image.

open
----

Opens the device and allocates a file descriptor to be used with
the rest of the API. The device can only be opened once.

ioctl
-----

CXL_IOCTL_DOWNLOAD_IMAGE:
CXL_IOCTL_VALIDATE_IMAGE:

    Starts and controls flashing a new FPGA image. Partial
    reconfiguration is not supported (yet), so the image must contain
    a copy of the PSL and AFU(s). Since an image can be quite large,
    the caller may have to iterate, splitting the image into smaller
    chunks.

    Takes a pointer to a struct cxl_adapter_image:

        struct cxl_adapter_image {
                __u64 flags;
                __u64 data;
                __u64 len_data;
                __u64 len_image;
                __u64 reserved1;
                __u64 reserved2;
                __u64 reserved3;
                __u64 reserved4;
        };

    flags:
        These flags indicate which optional fields are present in
        this struct. Currently all fields are mandatory.

    data:
        Pointer to a buffer with part of the image to write to the
        card.

    len_data:
        Size of the buffer pointed to by data.

    len_image:
        Full size of the image.
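
Splitting an image into chunks could look like the sketch below. It
only fills in the fields described above for each chunk. The struct is
a local mirror of struct cxl_adapter_image, the helper name and the
chunk size are hypothetical, and the ioctl to issue for each chunk is
deliberately left out because the text above does not spell out that
sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_adapter_image as listed above. */
struct adapter_image_sketch {
    uint64_t flags;
    uint64_t data;
    uint64_t len_data;
    uint64_t len_image;
    uint64_t reserved1;
    uint64_t reserved2;
    uint64_t reserved3;
    uint64_t reserved4;
};

/*
 * Hypothetical helper: describe the chunk that starts at offset.
 * Returns 1 if a chunk was prepared, 0 once the whole image has been
 * covered. len_image always carries the full image size; len_data is
 * the size of this chunk only.
 */
static int fill_chunk(struct adapter_image_sketch *req, const void *image,
                      uint64_t len_image, uint64_t offset,
                      uint64_t max_chunk)
{
    uint64_t left;

    assert(max_chunk > 0);
    if (offset >= len_image)
        return 0;
    left = len_image - offset;
    memset(req, 0, sizeof(*req));           /* reserved fields stay zero */
    req->data = (uint64_t)(uintptr_t)((const unsigned char *)image + offset);
    req->len_data = left < max_chunk ? left : max_chunk;
    req->len_image = len_image;
    return 1;
}
```

A caller would loop, advancing offset by req.len_data after each chunk
has been handed to the kernel, until fill_chunk() returns 0.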

Sysfs Class
===========

A cxl sysfs class is added under /sys/class/cxl to facilitate
enumeration and tuning of the accelerators. Its layout is
described in Documentation/ABI/testing/sysfs-class-cxl

Udev rules
==========

The following udev rules could be used to create a symlink to the
most logical chardev to use in any programming mode (afuX.Yd for
dedicated, afuX.Ys for AFU directed), since the API is virtually
identical for each:

    SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
    SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
              KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"