Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used at the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc...

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!). In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick. These devices commonly
output 4 and 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel. In general, this is the Field Replaceable Unit (FRU) which
gets replaced, in the case of excessive errors. Most often it is also
called DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" on several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group
of DIMMs. Each channel has its own independent control (command) and
data bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy on a Fully-Buffered
DIMM memory controller. Typically, it contains two channels. Two
channels on the same branch can be used in single mode or in lockstep
mode. When lockstep is enabled, the cacheline is doubled, but it
generally brings some performance penalty. Also, it is generally not
possible to point to just one memory stick when an error occurs, as the
error correction code is calculated using two DIMMs instead of one. Due
to that, lockstep mode is capable of correcting more errors than single
mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g. if each DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows for single channel are 64 bits, for
dual channel 128 bits. It may not be visible to the memory controller,
as some DIMM types have a memory buffer that can hide direct access to
it from the memory controller.

* Single-Ranked stick

A single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices. The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row. A single socket
set has two chip-select rows, and if double-sided sticks are used these
will occupy those chip-select rows.

* Bank

This term is avoided because it is ambiguous when one needs to
distinguish between chip-select rows and socket sets.

* High Bandwidth Memory (HBM)

HBM is a new memory type with low power consumption and ultra-wide
communication lanes. It uses vertically stacked memory chips (DRAM dies)
interconnected by microscopic wires called "through-silicon vias", or
TSVs.

Several stacks of HBM chips connect to the CPU or GPU through an
ultra-fast interconnect called the "interposer". Therefore, HBM's
characteristics are nearly indistinguishable from on-chip integrated RAM.

Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Drivers allocate a memory controller descriptor with :c:func:`edac_mc_alloc`.
It internally uses the struct ``mem_ctl_info`` to describe the memory
controllers, which is an opaque struct for the EDAC drivers. Only the EDAC
core is allowed to touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It will use the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.
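
A hedged sketch of the usual call sequence follows; the ``example_pci`` names
are placeholders and error handling is reduced to the minimum::

  struct edac_pci_ctl_info *pci;

  /* No private data (the first argument is the private-area size) */
  pci = edac_pci_alloc_ctl_info(0, "example_pci");
  if (!pci)
          return -ENOMEM;

  pci->mod_name = "example_pci_edac";
  pci->ctl_name = "example_pci";

  /* edac_pci_alloc_index() hands out the next free controller index */
  if (edac_pci_add_device(pci, edac_pci_alloc_index())) {
          edac_pci_free_ctl_info(pci);
          return -ENODEV;
  }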

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation at sysfs.

This set of structures, and the code that implements their APIs, provides for
registering EDAC-type devices which are NOT standard memory or PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for errors, etc.

It allows for a two-level hierarchy. For example, a cache could be composed
of L1, L2 and L3 levels of cache. Each CPU core would have its own L1 cache,
while sharing L2 and maybe L3 caches. In such a case, those can be represented
via the following sysfs nodes::

  /sys/devices/system/edac/..

  pci/            <existing pci directory (if available)>
  mc/             <existing memory device directory>
  cpu/cpu0/..     <L1 and L2 block directory>
      /L1-cache/ce_count
               /ue_count
      /L2-cache/ce_count
               /ue_count
  cpu/cpu1/..     <L1 and L2 block directory>
      /L1-cache/ce_count
               /ue_count
      /L2-cache/ce_count
               /ue_count
  ...

The L1 and L2 directories would be "edac_device_block"s.
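
A hedged sketch of how a driver might create a per-CPU cache device like the
one above follows. The names are illustrative, and the exact argument list of
:c:func:`edac_device_alloc_ctl_info` has changed across kernel versions; the
long-standing nine-argument form is assumed here::

  struct edac_device_ctl_info *edac_dev;

  /* A single "cpu" instance (cpu0) with two blocks named "L1" and "L2" */
  edac_dev = edac_device_alloc_ctl_info(0, "cpu", 1,	/* 1 instance */
                                        "L", 2, 1,	/* blocks L1, L2 */
                                        NULL, 0,	/* no extra sysfs attributes */
                                        edac_device_alloc_index());
  if (!edac_dev)
          return -ENOMEM;

  edac_dev->mod_name = "example_cache_edac";
  edac_dev->ctl_name = "cpu_cache";

  if (edac_device_add_device(edac_dev)) {
          edac_device_free_ctl_info(edac_dev);
          return -ENODEV;
  }

  /* Later, report a corrected error in instance 0, block 1 (the L2 cache) */
  edac_device_handle_ce(edac_dev, 0, 1, "L2 cache ECC corrected");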

.. kernel-doc:: drivers/edac/edac_device.h

Heterogeneous system support
----------------------------

An AMD heterogeneous system is built by connecting the data fabrics of
both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
GPU nodes can be accessed the same way as the data fabric on CPU nodes.

The MI200 accelerators are data center GPUs. They have 2 data fabrics,
and each GPU data fabric contains four Unified Memory Controllers (UMC).
Each UMC contains eight channels. Each UMC channel controls one 128-bit
HBM2e (2 GB) channel (equivalent to 8 x 2 GB ranks). This creates a total
DRAM data bus of 4096 bits per data fabric (4 UMCs x 8 channels x 128 bits).

While each UMC interfaces a 16 GB (8-high x 2 GB DRAM) HBM stack, each UMC
channel interfaces 2 GB of DRAM (represented as a rank).

Memory controllers on AMD GPU nodes can be represented in EDAC as follows::

  GPU DF / GPU Node -> EDAC MC
  GPU UMC           -> EDAC CSROW
  GPU UMC channel   -> EDAC CHANNEL

For example: a heterogeneous system with 1 AMD CPU is connected to
4 MI200 (Aldebaran) GPUs using xGMI.

Some more heterogeneous hardware details:

- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
  They have chip selects (csrows) and channels. However, the layouts are
  different for performance, physical layout, or other reasons.
- CPU UMCs use 1 channel. In this case UMC = EDAC channel. This follows the
  marketing speak: "the CPU has X memory channels", etc.
- CPU UMCs use up to 4 chip selects, so UMC chip select = EDAC CSROW.
- GPU UMCs use 1 chip select, so UMC = EDAC CSROW.
- GPU UMCs use 8 channels, so UMC channel = EDAC channel (see the layer
  sketch after this list).
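
In terms of the EDAC memory-controller layers, the GPU-side mapping above can
be sketched as below. This is illustrative only (the real driver code differs
in detail); the layer sizes come from the MI200 description earlier::

  /* One MI200 GPU node: 4 UMCs -> csrows, 8 channels per UMC -> channels */
  struct edac_mc_layer gpu_layers[2];
  struct mem_ctl_info *gpu_mci;

  gpu_layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
  gpu_layers[0].size = 4;               /* 4 UMCs per GPU node */
  gpu_layers[0].is_virt_csrow = true;
  gpu_layers[1].type = EDAC_MC_LAYER_CHANNEL;
  gpu_layers[1].size = 8;               /* 8 HBM2e channels per UMC */
  gpu_layers[1].is_virt_csrow = false;

  /* The mc number follows the CPU nodes, e.g. mc1 for the first GPU node */
  gpu_mci = edac_mc_alloc(1, ARRAY_SIZE(gpu_layers), gpu_layers, 0);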

The EDAC subsystem provides a mechanism to handle AMD heterogeneous
systems by calling system-specific ops for both CPUs and GPUs.

AMD GPU nodes are enumerated in sequential order based on the PCI
hierarchy, and the first GPU node is assumed to have a Node ID value
following those of the CPU nodes after the latter are fully populated::

  $ ls /sys/devices/system/edac/mc/
        mc0   - CPU MC node 0
        mc1  |
        mc2  |- GPU card[0] => node 0(mc1), node 1(mc2)
        mc3  |
        mc4  |- GPU card[1] => node 0(mc3), node 1(mc4)
        mc5  |
        mc6  |- GPU card[2] => node 0(mc5), node 1(mc6)
        mc7  |
        mc8  |- GPU card[3] => node 0(mc7), node 1(mc8)

For example, a heterogeneous system with one AMD CPU is connected to
four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
via the following sysfs entries::

  /sys/devices/system/edac/mc/..

  CPU                       # CPU node
  ├── mc 0

  GPU Nodes are enumerated sequentially after CPU nodes have been populated

  GPU card 1                # Each MI200 GPU has 2 nodes/mcs
  ├── mc 1                  # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
  │   ├── csrow 0           # UMC 0
  │   │   ├── channel 0     # Each UMC has 8 channels
  │   │   ├── channel 1     # size of each channel is 2 GB, so each UMC has 16 GB
  │   │   ├── channel 2
  │   │   ├── channel 3
  │   │   ├── channel 4
  │   │   ├── channel 5
  │   │   ├── channel 6
  │   │   ├── channel 7
  │   ├── csrow 1           # UMC 1
  │   │   ├── channel 0
  │   │   ├── ..
  │   │   ├── channel 7
  │   ├── .. ..
  │   ├── csrow 3           # UMC 3
  │   │   ├── channel 0
  │   │   ├── ..
  │   │   ├── channel 7
  │   ├── rank 0
  │   ├── .. ..
  │   ├── rank 31           # total 32 ranks/dimms from 4 UMCs
  ├── mc 2                  # GPU node 1 == mc2
  │   ├── ..                # each GPU has total 64 GB

  GPU card 2
  ├── mc 3
  │   ├── ..
  ├── mc 4
  │   ├── ..

  GPU card 3
  ├── mc 5
  │   ├── ..
  ├── mc 6
  │   ├── ..

  GPU card 4
  ├── mc 7
  │   ├── ..
  ├── mc 8
  │   ├── ..