  1. .. include:: <isonum.txt>
  2. ============================================
  3. Reliability, Availability and Serviceability
  4. ============================================
  5. RAS concepts
  6. ************
  7. Reliability, Availability and Serviceability (RAS) is a concept used on
  8. servers meant to measure their robustness.
  9. Reliability
  10. is the probability that a system will produce correct outputs.
  11. * Generally measured as Mean Time Between Failures (MTBF)
  12. * Enhanced by features that help to avoid, detect and repair hardware faults
  13. Availability
  14. is the probability that a system is operational at a given time
  15. * Generally measured as a percentage of downtime per a period of time
  16. * Often uses mechanisms to detect and correct hardware faults in
  17. runtime;
  18. Serviceability (or maintainability)
  19. is the simplicity and speed with which a system can be repaired or
  20. maintained
  21. * Generally measured on Mean Time Between Repair (MTBR)
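The availability and MTBF metrics mentioned above can be illustrated with a small calculation. This is only a sketch: the function names and the sample figures (one year of operation, 3 failures, 8.76 hours of downtime) are hypothetical and not taken from any real system.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observation period during which the system was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the observation period must be positive")
    return uptime_hours / total


def mtbf(operating_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return operating_hours / failures


# Hypothetical figures: one year of operation, 3 failures, 8.76 h of downtime.
hours_per_year = 365 * 24
down = 8.76
up = hours_per_year - down
print(f"availability: {availability(up, down):.4%}")
print(f"MTBF: {mtbf(up, 3):.1f} h")
```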
  22. Improving RAS
  23. -------------
To reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, so that the system
administrator can be warned to replace a component before it causes data
loss or system downtime.
  29. Among the monitoring measures, the most usual ones include:
  30. * CPU – detect errors at instruction execution and at L1/L2/L3 caches;
  31. * Memory – add error correction logic (ECC) to detect and correct errors;
  32. * I/O – add CRC checksums for transferred data;
  33. * Storage – RAID, journal file systems, checksums,
  34. Self-Monitoring, Analysis and Reporting Technology (SMART).
By monitoring how often errors are detected, it is possible to identify
whether the probability of hardware errors is increasing and, in such a
case, to do preventive maintenance and replace a degraded component while
those errors are still correctable.
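One possible way to decide that the error rate is increasing is to compare the rate of correctable errors in the most recent sampling interval against the rate over the earlier history. The sketch below assumes periodic samples of a cumulative error counter; the function name and the ``factor`` threshold are illustrative choices, not a policy defined by the kernel.

```python
def ce_rate_increasing(samples, factor=2.0):
    """Tell whether the correctable-error rate is growing.

    samples: list of (timestamp_seconds, cumulative_ce_count), oldest first.
    Returns True when the rate over the latest interval is at least
    `factor` times the rate over all earlier intervals.
    """
    if len(samples) < 3:
        return False  # not enough history to compare two intervals
    (t0, c0), (t1, c1), (t2, c2) = samples[0], samples[-2], samples[-1]
    if t1 <= t0 or t2 <= t1:
        raise ValueError("timestamps must be strictly increasing")
    baseline = (c1 - c0) / (t1 - t0)   # errors per second, older history
    recent = (c2 - c1) / (t2 - t1)     # errors per second, latest interval
    if baseline == 0:
        return recent > 0              # first errors ever seen
    return recent >= factor * baseline


# Hypothetical hourly samples of a cumulative CE counter.
history = [(0, 0), (3600, 2), (7200, 10)]
print(ce_rate_increasing(history))
```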
  39. Types of errors
  40. ---------------
Most mechanisms used on modern systems use technologies like Hamming
codes, which allow error correction when the number of errors in a bit packet
is below a threshold. If the number of errors is above it, those mechanisms
can indicate with a high degree of confidence that an error happened, but
they can't correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.
  48. That defines some categories of errors:
* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors exceeded the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.
  Also, when an error happens on a userspace process, it is possible to
  kill such process and let userspace restart it.
  63. The mechanism for handling non-fatal errors is usually complex and may
  64. require the help of some userspace application, in order to apply the
  65. policy desired by the system administrator.
  66. Identifying a bad hardware component
  67. ------------------------------------
Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
translate the error message to the silkscreen or component label for
the MRU.

This is typically very complex for memory, as modern CPUs interleave memory
from different memory modules in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

	Memory Device
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 16384 MB
		Form Factor: SODIMM
		Set: None
		Locator: ChannelA-DIMM0
		Bank Locator: BANK 0
		Type: DDR4
		Type Detail: Synchronous
		Speed: 2133 MHz
		Rank: 2
		Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory slot labeled "BANK 0", as given by the *bank locator* field.
Please notice that, on such a system, the *total width* is equal to the
*data width*. It means that such a memory module doesn't have error
detection/correction mechanisms.
Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

	Memory Device
		Array Handle: 0x1000
		Error Information Handle: Not Provided
		Total Width: 72 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: 1
		Locator: DIMM_A1
		Bank Locator: Not Specified
		Type: DDR3
		Type Detail: Synchronous Registered (Buffered)
		Speed: 1600 MHz
		Rank: 2
		Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory slot
labeled "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).
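The width comparison described above can be automated. The following is a minimal sketch that parses ``dmidecode``-style text, assuming the "Memory Device" record layout shown in the examples; the helper names are made up for this illustration, and real ``dmidecode`` output may contain additional fields or record types.

```python
import re


def parse_memory_devices(text):
    """Split dmidecode-style text into one dict per "Memory Device" record."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None            # a blank line ends the current record
        elif line == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return devices


def has_ecc_bits(device):
    """True if Total Width exceeds Data Width, None if a width is unknown."""
    def bits(field):
        m = re.match(r"(\d+) bits", device.get(field, ""))
        return int(m.group(1)) if m else None

    total, data = bits("Total Width"), bits("Data Width")
    if total is None or data is None:
        return None
    return total > data


sample = """\
Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Locator: DIMM_A1
    Bank Locator: Not Specified
"""
for dev in parse_memory_devices(sample):
    print(dev.get("Locator"), has_ecc_bits(dev))
```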
To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.
  121. ECC memory
  122. ----------
As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. On 64-bit systems, a memory module
has 64 bits of *data width* and 72 bits of *total width*, so there are
8 extra bits to be used for the error detection and correction
mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.
  133. At read, the *total width* bits code is converted back, using the same
  134. ECC code used on write, producing a word with *data width* and a *syndrome*.
  135. The word with *data width* is sent to the CPU, even when errors happen.
  136. The memory controller also looks at the *syndrome* in order to check if
  137. there was an error, and if the ECC code was able to fix such error.
  138. If the error was corrected, a Corrected Error (CE) happened. If not, an
  139. Uncorrected Error (UE) happened.
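The write/read cycle described above can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming code with 4 data bits and 4 check bits, which is far smaller than the 64/72-bit words used by real memory controllers; it is only meant to show how a CE differs from a UE and is not the code any particular controller implements.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status), where status is "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2             # 1 means an odd number of bit flips
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        code[syndrome] ^= 1             # single flip: syndrome is its position
        status = "CE"
    else:
        status = "UE"                   # double flip: detected, not corrected
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                            # inject one bit flip
print(decode(word))
word[6] ^= 1                            # inject a second bit flip
print(decode(word)[1])
```

As in the description above, the data word is returned even in the UE case, where it may be corrupted; the status is what tells the caller whether the word can be trusted.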
The information about the CE/UE errors is stored in some special registers
at the memory controller and can be accessed by reading such registers,
either by the BIOS, by some special CPUs or by the Linux EDAC driver. On
x86 64 bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.
.. [#f1] Please notice that several memory controllers allow operation in a
  mode called "Lock-Step", where they group two memory modules together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.
  151. .. [#f2] Some memory controllers also allow using memory in mirror mode.
  152. On such mode, the same data is written to two memory modules. At read,
  153. the system checks both memory modules, in order to check if both provide
  154. identical data. On such configuration, when an error happens, there's no
  155. way to know what memory module is to blame. So, it has to blame both
  156. memory modules (or 4 memory modules, if the system is also on Lock-step
  157. mode).
  158. .. [#f3] For more details about the Machine Check Architecture (MCA),
  159. please read Documentation/x86/x86_64/machinecheck at the Kernel tree.
  160. EDAC - Error Detection And Correction
  161. *************************************
.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was first pushed upstream, in Kernel 2.6.16,
   it was renamed to ``EDAC``.
  169. Purpose
  170. -------
The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.
  173. Memory
  174. ------
  175. Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
  176. primary errors being harvested. These types of errors are harvested by
  177. the ``edac_mc`` device.
Detecting CE events, then harvesting those events and reporting them,
**can**, but need not necessarily, be a predictor of future UE events. With
CE events only, the system can and will continue to operate, as no data
has been damaged yet.
  182. However, preventive maintenance and proactive part replacement of memory
  183. modules exhibiting CEs can reduce the likelihood of the dreaded UE events
  184. and system panics.
  185. Other hardware elements
  186. -----------------------
  187. A new feature for EDAC, the ``edac_device`` class of device, was added in
  188. the 2.6.23 version of the kernel.
  189. This new device type allows for non-memory type of ECC hardware detectors
  190. to have their states harvested and presented to userspace via the sysfs
  191. interface.
Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.
  197. PCI bus scanning
  198. ----------------
  199. In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
  200. in order to determine if errors are occurring during data transfers.
The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.
  207. There is a PCI device attribute located in sysfs that is checked by
  208. the EDAC PCI scanning code. If that attribute is set, PCI parity/error
  209. scanning is skipped for that device. The attribute is::
  210. broken_parity_status
  211. and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
  212. PCI devices.
  213. Versioning
  214. ----------
  215. EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
  216. Controller (MC) driver modules. On a given system, the CORE is loaded
  217. and one MC driver will be loaded. Both the CORE and the MC driver (or
  218. ``edac_device`` driver) have individual versions that reflect current
  219. release level of their respective modules.
  220. Thus, to "report" on what version a system is running, one must report
  221. both the CORE's and the MC driver's versions.
  222. Loading
  223. -------
  224. If ``edac`` was statically linked with the kernel then no loading
  225. is necessary. If ``edac`` was built as modules then simply modprobe
  226. the ``edac`` pieces that you need. You should be able to modprobe
  227. hardware-specific modules and have the dependencies load the necessary
  228. core modules.
  229. Example::
  230. $ modprobe amd76x_edac
  231. loads both the ``amd76x_edac.ko`` memory controller module and the
  232. ``edac_mc.ko`` core module.
  233. Sysfs interface
  234. ---------------
  235. EDAC presents a ``sysfs`` interface for control and reporting purposes. It
  236. lives in the /sys/devices/system/edac directory.
  237. Within this directory there currently reside 2 components:
	======= ==============================
	mc      memory controller(s) system
	pci     PCI control and status system
	======= ==============================
  242. Memory Controller (mc) Model
  243. ----------------------------
  244. Each ``mc`` device controls a set of memory modules [#f4]_. These modules
  245. are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
  246. There can be multiple csrows and multiple channels.
  247. .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  248. used to refer to a memory module, although there are other memory
  249. packaging alternatives, like SO-DIMM, SIMM, etc. Along this document,
  250. and inside the EDAC system, the term "dimm" is used for all memory
  251. modules, even when they use a different kind of packaging.
  252. Memory controllers allow for several csrows, with 8 csrows being a
  253. typical value. Yet, the actual number of csrows depends on the layout of
  254. a given motherboard, memory controller and memory module characteristics.
Dual channels allow for dual data length (e.g. 128 bits, on 64-bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets, like
Fully Buffered DIMM (FB-DIMM) memory controllers, allow for more than
2 channels. The following example assumes 2 channels:
	+------------+-----------------------+
	| CS Rows    |       Channels        |
	+------------+-----------+-----------+
	|            |  ``ch0``  |  ``ch1``  |
	+============+===========+===========+
	| ``csrow0`` |  DIMM_A0  |  DIMM_B0  |
	+------------+           |           |
	| ``csrow1`` |           |           |
	+------------+-----------+-----------+
	| ``csrow2`` |  DIMM_A1  |  DIMM_B1  |
	+------------+           |           |
	| ``csrow3`` |           |           |
	+------------+-----------+-----------+
  272. In the above example, there are 4 physical slots on the motherboard
  273. for memory DIMMs:
	+---------+---------+
	| DIMM_A0 | DIMM_B0 |
	+---------+---------+
	| DIMM_A1 | DIMM_B1 |
	+---------+---------+
  279. Labels for these slots are usually silk-screened on the motherboard.
  280. Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
  281. channel 1. Notice that there are two csrows possible on a physical DIMM.
  282. These csrows are allocated their csrow assignment based on the slot into
  283. which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
  284. Channel, the csrows cross both DIMMs.
  285. Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
  286. Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
  287. will have just one csrow (csrow0). csrow1 will be empty. On the other
  288. hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0
  289. and csrow1 will be populated. The pattern repeats itself for csrow2 and
  290. csrow3.
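The relation between DIMM ranks and populated csrows in this example can be expressed as a small function. This is only a sketch of the layout assumed above (two csrows reserved per slot row, with each slot row holding one DIMM per channel); the function name and parameters are hypothetical, and real motherboards may map csrows differently.

```python
def populated_csrows(ranks_per_slot_row, csrows_per_slot_row=2):
    """List the csrows populated for the given DIMM ranks.

    ranks_per_slot_row[i] is the rank count of the DIMMs placed in slot
    row i (e.g. DIMM_A0/DIMM_B0 for i == 0), or 0 if the row is empty.
    """
    rows = []
    for slot_row, ranks in enumerate(ranks_per_slot_row):
        if not 0 <= ranks <= csrows_per_slot_row:
            raise ValueError(f"unsupported rank count: {ranks}")
        base = slot_row * csrows_per_slot_row
        rows.extend(f"csrow{base + r}" for r in range(ranks))
    return rows


# Single ranked DIMMs in DIMM_A0/DIMM_B0, dual ranked in DIMM_A1/DIMM_B1.
print(populated_csrows([1, 2]))
```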
  291. The representation of the above is reflected in the directory
  292. tree in EDAC's sysfs interface. Starting in directory
  293. ``/sys/devices/system/edac/mc``, each memory controller will be
  294. represented by its own ``mcX`` directory, where ``X`` is the
  295. index of the MC::
  296. ..../edac/mc/
  297. |
  298. |->mc0
  299. |->mc1
  300. |->mc2
  301. ....
  302. Under each ``mcX`` directory each ``csrowX`` is again represented by a
  303. ``csrowX``, where ``X`` is the csrow index::
  304. .../mc/mc0/
  305. |
  306. |->csrow0
  307. |->csrow2
  308. |->csrow3
  309. ....
Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply to both channels, in
order for dual-channel mode to be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.
  315. Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
  316. control and attribute files.
  317. ``mcX`` directories
  318. -------------------
  319. In ``mcX`` directories are EDAC control and attribute files for
  320. this ``X`` instance of the memory controllers.
  321. For a description of the sysfs API, please see:
  322. Documentation/ABI/testing/sysfs-devices-edac
  323. ``dimmX`` or ``rankX`` directories
  324. ----------------------------------
  325. The recommended way to use the EDAC subsystem is to look at the information
  326. provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
  327. A typical EDAC system has the following structure under
  328. ``/sys/devices/system/edac/``\ [#f6]_::
  329. /sys/devices/system/edac/
  330. ├── mc
  331. │   ├── mc0
  332. │   │   ├── ce_count
  333. │   │   ├── ce_noinfo_count
  334. │   │   ├── dimm0
  335. │   │   │   ├── dimm_ce_count
  336. │   │   │   ├── dimm_dev_type
  337. │   │   │   ├── dimm_edac_mode
  338. │   │   │   ├── dimm_label
  339. │   │   │   ├── dimm_location
  340. │   │   │   ├── dimm_mem_type
  341. │   │   │   ├── dimm_ue_count
  342. │   │   │   ├── size
  343. │   │   │   └── uevent
  344. │   │   ├── max_location
  345. │   │   ├── mc_name
  346. │   │   ├── reset_counters
  347. │   │   ├── seconds_since_reset
  348. │   │   ├── size_mb
  349. │   │   ├── ue_count
  350. │   │   ├── ue_noinfo_count
  351. │   │   └── uevent
  352. │   ├── mc1
  353. │   │   ├── ce_count
  354. │   │   ├── ce_noinfo_count
  355. │   │   ├── dimm0
  356. │   │   │   ├── dimm_ce_count
  357. │   │   │   ├── dimm_dev_type
  358. │   │   │   ├── dimm_edac_mode
  359. │   │   │   ├── dimm_label
  360. │   │   │   ├── dimm_location
  361. │   │   │   ├── dimm_mem_type
  362. │   │   │   ├── dimm_ue_count
  363. │   │   │   ├── size
  364. │   │   │   └── uevent
  365. │   │   ├── max_location
  366. │   │   ├── mc_name
  367. │   │   ├── reset_counters
  368. │   │   ├── seconds_since_reset
  369. │   │   ├── size_mb
  370. │   │   ├── ue_count
  371. │   │   ├── ue_noinfo_count
  372. │   │   └── uevent
  373. │   └── uevent
  374. └── uevent
  375. In the ``dimmX`` directories are EDAC control and attribute files for
  376. this ``X`` memory module:
  377. - ``size`` - Total memory managed by this csrow attribute file
  378. This attribute file displays, in count of megabytes, the memory
  379. that this csrow contains.
  380. - ``dimm_ue_count`` - Uncorrectable Errors count attribute file
  381. This attribute file displays the total count of uncorrectable
  382. errors that have occurred on this DIMM. If panic_on_ue is set
  383. this counter will not have a chance to increment, since EDAC
  384. will panic the system.
- ``dimm_ce_count`` - Correctable Errors count attribute file

	This attribute file displays the total count of correctable
	errors that have occurred on this DIMM. This count is very
	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and any such values should be
	reported to the system administrator.
  392. - ``dimm_dev_type`` - Device type attribute file
  393. This attribute file will display what type of DRAM device is
  394. being utilized on this DIMM.
  395. Examples:
  396. - x1
  397. - x2
  398. - x4
  399. - x8
  400. - ``dimm_edac_mode`` - EDAC Mode of operation attribute file
  401. This attribute file will display what type of Error detection
  402. and correction is being utilized.
  403. - ``dimm_label`` - memory module label control file
  404. This control file allows this DIMM to have a label assigned
  405. to it. With this label in the module, when errors occur
  406. the output can provide the DIMM label in the system log.
  407. This becomes vital for panic events to isolate the
  408. cause of the UE event.
  409. DIMM Labels must be assigned after booting, with information
  410. that correctly identifies the physical slot with its
  411. silk screen label. This information is currently very
  412. motherboard specific and determination of this information
  413. must occur in userland at this time.
  414. - ``dimm_location`` - location of the memory module
  415. The location can have up to 3 levels, and describe how the
  416. memory controller identifies the location of a memory module.
  417. Depending on the type of memory and memory controller, it
  418. can be:
  419. - *csrow* and *channel* - used when the memory controller
  420. doesn't identify a single DIMM - e. g. in ``rankX`` dir;
  421. - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
  422. controllers;
  423. - *channel*, *slot* - used on Nehalem and newer Intel drivers.
  424. - ``dimm_mem_type`` - Memory Type attribute file
  425. This attribute file will display what type of memory is currently
  426. on this csrow. Normally, either buffered or unbuffered memory.
  427. Examples:
  428. - Registered-DDR
  429. - Unbuffered-DDR
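The attribute files above can be collected with a short script. The sketch below walks the ``dimmX`` directories under the EDAC sysfs root and reads the label and error counters; the function names are illustrative, and the ``edac_root`` parameter exists only so that the code can be pointed at a different tree for testing.

```python
from pathlib import Path


def _read(path):
    """Return the stripped content of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_dimm_counters(edac_root="/sys/devices/system/edac"):
    """Collect the label and CE/UE counters of every dimmX directory."""
    root = Path(edac_root)
    if not root.is_dir():
        return []                  # EDAC not loaded, or not a Linux system
    rows = []
    for dimm in sorted(root.glob("mc/mc*/dimm*")):
        ce = _read(dimm / "dimm_ce_count")
        ue = _read(dimm / "dimm_ue_count")
        rows.append({
            "mc": dimm.parent.name,
            "dimm": dimm.name,
            "label": _read(dimm / "dimm_label"),
            "ce": int(ce) if ce is not None else None,
            "ue": int(ue) if ue is not None else None,
        })
    return rows


# Report only the memory modules that have logged errors.
for row in read_dimm_counters():
    if row["ce"] or row["ue"]:
        print(f'{row["mc"]}/{row["dimm"]} ({row["label"]}): '
              f'CE={row["ce"]} UE={row["ue"]}')
```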
.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is called
  ``rankX`` and works in a similar way to the ``csrowX`` directories.
  On modern Intel memory controllers, the memory controller identifies the
  memory modules directly. On such systems, the directory is called ``dimmX``.
  434. .. [#f6] There are also some ``power`` directories and ``subsystem``
  435. symlinks inside the sysfs mapping that are automatically created by
  436. the sysfs subsystem. Currently, they serve no purpose.
  437. ``csrowX`` directories
  438. ----------------------
  439. When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
  440. directories. As this API doesn't work properly for Rambus, FB-DIMMs and
  441. modern Intel Memory Controllers, this is being deprecated in favor of
  442. ``dimmX`` directories.
  443. In the ``csrowX`` directories are EDAC control and attribute files for
  444. this ``X`` instance of csrow:
  445. - ``ue_count`` - Total Uncorrectable Errors count attribute file
  446. This attribute file displays the total count of uncorrectable
  447. errors that have occurred on this csrow. If panic_on_ue is set
  448. this counter will not have a chance to increment, since EDAC
  449. will panic the system.
- ``ce_count`` - Total Correctable Errors count attribute file

	This attribute file displays the total count of correctable
	errors that have occurred on this csrow. This count is very
	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and any such values should be
	reported to the system administrator.
  457. - ``size_mb`` - Total memory managed by this csrow attribute file
  458. This attribute file displays, in count of megabytes, the memory
  459. that this csrow contains.
  460. - ``mem_type`` - Memory Type attribute file
  461. This attribute file will display what type of memory is currently
  462. on this csrow. Normally, either buffered or unbuffered memory.
  463. Examples:
  464. - Registered-DDR
  465. - Unbuffered-DDR
  466. - ``edac_mode`` - EDAC Mode of operation attribute file
  467. This attribute file will display what type of Error detection
  468. and correction is being utilized.
  469. - ``dev_type`` - Device type attribute file
  470. This attribute file will display what type of DRAM device is
  471. being utilized on this DIMM.
  472. Examples:
  473. - x1
  474. - x2
  475. - x4
  476. - x8
  477. - ``ch0_ce_count`` - Channel 0 CE Count attribute file
  478. This attribute file will display the count of CEs on this
  479. DIMM located in channel 0.
  480. - ``ch0_ue_count`` - Channel 0 UE Count attribute file
  481. This attribute file will display the count of UEs on this
  482. DIMM located in channel 0.
  483. - ``ch0_dimm_label`` - Channel 0 DIMM Label control file
  484. This control file allows this DIMM to have a label assigned
  485. to it. With this label in the module, when errors occur
  486. the output can provide the DIMM label in the system log.
  487. This becomes vital for panic events to isolate the
  488. cause of the UE event.
  489. DIMM Labels must be assigned after booting, with information
  490. that correctly identifies the physical slot with its
  491. silk screen label. This information is currently very
  492. motherboard specific and determination of this information
  493. must occur in userland at this time.
  494. - ``ch1_ce_count`` - Channel 1 CE Count attribute file
  495. This attribute file will display the count of CEs on this
  496. DIMM located in channel 1.
- ``ch1_ue_count`` - Channel 1 UE Count attribute file

	This attribute file will display the count of UEs on this
	DIMM located in channel 1.
  500. - ``ch1_dimm_label`` - Channel 1 DIMM Label control file
  501. This control file allows this DIMM to have a label assigned
  502. to it. With this label in the module, when errors occur
  503. the output can provide the DIMM label in the system log.
  504. This becomes vital for panic events to isolate the
  505. cause of the UE event.
  506. DIMM Labels must be assigned after booting, with information
  507. that correctly identifies the physical slot with its
  508. silk screen label. This information is currently very
  509. motherboard specific and determination of this information
  510. must occur in userland at this time.
  511. System Logging
  512. --------------
  513. If logging for UEs and CEs is enabled, then system logs will contain
  514. information indicating that errors have been detected::
  515. EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
  516. EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
  517. The structure of the message is:
	+---------------------------------------+-------------+
	| Content                               | Example     |
	+=======================================+=============+
	| The memory controller                 | MC0         |
	+---------------------------------------+-------------+
	| Error type                            | CE          |
	+---------------------------------------+-------------+
	| Memory page                           | 0x283       |
	+---------------------------------------+-------------+
	| Offset in the page                    | 0xce0       |
	+---------------------------------------+-------------+
	| The byte granularity                  | grain 8     |
	| or resolution of the error            |             |
	+---------------------------------------+-------------+
	| The error syndrome                    | 0x6ec3      |
	+---------------------------------------+-------------+
	| Memory row                            | row 0       |
	+---------------------------------------+-------------+
	| Memory channel                        | channel 1   |
	+---------------------------------------+-------------+
	| DIMM label, if set prior              | DIMM_B1     |
	+---------------------------------------+-------------+
	| And then an optional, driver-specific |             |
	| message that may have additional      |             |
	| information.                          |             |
	+---------------------------------------+-------------+
  544. Both UEs and CEs with no info will lack all but memory controller, error
  545. type, a notice of "no info" and then an optional, driver-specific error
  546. message.
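The message structure above can be picked apart mechanically. The following is a sketch only: it assumes the exact field order of the sample CE lines shown in this section, and the function name ``parse_edac_line`` is hypothetical. Lines without full information ("no info") do not match and produce no output.

```shell
#!/bin/sh
# Hypothetical helper: split one EDAC CE/UE log line into name=value pairs.
# Assumes the field order of the sample messages in this section; any
# syslog prefix before "EDAC" is ignored, as is the trailing driver message.
parse_edac_line() {
    printf '%s\n' "$1" | sed -n -E \
        's/^.*EDAC (MC[0-9]+): (CE|UE) page (0x[0-9a-f]+), offset (0x[0-9a-f]+), grain ([0-9]+), syndrome (0x[0-9a-f]+), row ([0-9]+), channel ([0-9]+) "([^"]*)": .*$/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
}
```

Feeding it the first sample line yields ``mc=MC0 type=CE page=0x283 offset=0xce0 grain=8 syndrome=0x6ec3 row=0 channel=1 label=DIMM_B1``.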
  547. PCI Bus Parity Detection
  548. ------------------------
On Header Type 00 devices, the primary status register is checked for any
parity error, regardless of whether parity is enabled on the device.
(The spec indicates that parity is generated in some cases.) On Header
Type 01 bridges, the secondary status register is also checked to see
whether a parity error occurred on the bus on the other side of the bridge.
  554. Sysfs configuration
  555. -------------------
  556. Under ``/sys/devices/system/edac/pci`` are control and attribute files as
  557. follows:
  558. - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
  559. This control file enables or disables the PCI Bus Parity scanning
  560. operation. Writing a 1 to this file enables the scanning. Writing
  561. a 0 to this file disables the scanning.
  562. Enable::
  563. echo "1" >/sys/devices/system/edac/pci/check_pci_parity
  564. Disable::
  565. echo "0" >/sys/devices/system/edac/pci/check_pci_parity
  566. - ``pci_parity_count`` - Parity Count
  567. This attribute file will display the number of parity errors that
  568. have been detected.
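The two files can be read together to get a quick status line. This is a small sketch; the function name ``pci_parity_report`` and its optional directory argument are hypothetical, and it assumes ``check_pci_parity`` reads back as ``0`` when scanning is disabled.

```shell
#!/bin/sh
# Hypothetical helper: summarise the PCI parity scanning state.
# The directory argument exists only so the function can be tried on a
# scratch copy; it defaults to the sysfs location named above.
pci_parity_report() {
    dir=${1:-/sys/devices/system/edac/pci}
    enabled=$(cat "$dir/check_pci_parity") || return 1
    count=$(cat "$dir/pci_parity_count") || return 1
    # Assumption: any value other than 0 means scanning is enabled.
    if [ "$enabled" != "0" ]; then state=enabled; else state=disabled; fi
    echo "PCI parity checking $state, $count error(s) detected"
}
```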
  569. Module parameters
  570. -----------------
  571. - ``edac_mc_panic_on_ue`` - Panic on UE control file
  572. An uncorrectable error will cause a machine panic. This is usually
  573. desirable. It is a bad idea to continue when an uncorrectable error
  574. occurs - it is indeterminate what was uncorrected and the operating
  575. system context might be so mangled that continuing will lead to further
  576. corruption. If the kernel has MCE configured, then EDAC will never
  577. notice the UE.
  578. LOAD TIME::
  579. module/kernel parameter: edac_mc_panic_on_ue=[0|1]
  580. RUN TIME::
  581. echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
  582. - ``edac_mc_log_ue`` - Log UE control file
  583. Generate kernel messages describing uncorrectable errors. These errors
  584. are reported through the system message log system. UE statistics
  585. will be accumulated even when UE logging is disabled.
  586. LOAD TIME::
  587. module/kernel parameter: edac_mc_log_ue=[0|1]
  588. RUN TIME::
  589. echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
  590. - ``edac_mc_log_ce`` - Log CE control file
  591. Generate kernel messages describing correctable errors. These
  592. errors are reported through the system message log system.
  593. CE statistics will be accumulated even when CE logging is disabled.
  594. LOAD TIME::
  595. module/kernel parameter: edac_mc_log_ce=[0|1]
  596. RUN TIME::
  597. echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
  598. - ``edac_mc_poll_msec`` - Polling period control file
  599. The time period, in milliseconds, for polling for error information.
  600. Too small a value wastes resources. Too large a value might delay
necessary handling of errors and might lose valuable information for
locating the error. 1000 milliseconds (once each second) is the current
default. Systems which require all the bandwidth they can get may
increase this.
  605. LOAD TIME::
module/kernel parameter: edac_mc_poll_msec=<milliseconds>
  607. RUN TIME::
  608. echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
  609. - ``panic_on_pci_parity`` - Panic on PCI PARITY Error
  610. This control file enables or disables panicking when a parity
  611. error has been detected.
  612. module/kernel parameter::
  613. edac_panic_on_pci_pe=[0|1]
  614. Enable::
  615. echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
  616. Disable::
  617. echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
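To review the run-time values of all the parameters listed above in one pass, something like the following can be used. It is a sketch under the assumption that the parameters are exposed under ``/sys/module/edac_core/parameters`` as shown in the examples. The function name ``show_edac_params`` and its directory argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of each EDAC core parameter
# named in this section. A parameter that is not exposed is reported as
# unavailable instead of aborting the listing.
show_edac_params() {
    dir=${1:-/sys/module/edac_core/parameters}
    for p in edac_mc_panic_on_ue edac_mc_log_ue edac_mc_log_ce \
             edac_mc_poll_msec edac_panic_on_pci_pe; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            printf '%s=<unavailable>\n' "$p"
        fi
    done
}
```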
  618. EDAC device type
  619. ----------------
  620. In the header file, edac_pci.h, there is a series of edac_device structures
  621. and APIs for the EDAC_DEVICE.
  622. User space access to an edac_device is through the sysfs interface.
  623. At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
  624. will appear.
There is a three-level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::
  628. /sys/devices/system/edac/test-instance
In this directory are various controls, a symlink and one or more ``instance``
directories.
  631. The standard default controls are:
==============  =======================================================
log_ce          boolean to log CE events
log_ue          boolean to log UE events
panic_on_ue     boolean to ``panic`` the system if a UE is encountered
                (default off, can be set true via startup script)
poll_msec       time period between POLL cycles for events
==============  =======================================================
The ``test_device_edac`` device adds at least one custom control of its own:
==============  ==================================================
test_bits       which in the current test driver does nothing but
                show how it is installed. A ported driver can
                add one or more such controls and/or attributes
                for specific uses.
                One out-of-tree driver uses controls here to allow
                for ERROR INJECTION operations to hardware
                injection registers
==============  ==================================================
  649. The symlink points to the 'struct dev' that is registered for this edac_device.
  650. Instances
  651. ---------
  652. One or more instance directories are present. For the ``test_device_edac``
  653. case:
  654. +----------------+
  655. | test-instance0 |
  656. +----------------+
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
==============  ====================================
ce_count        total of CE events of subdirectories
ue_count        total of UE events of subdirectories
==============  ====================================
  663. Blocks
  664. ------
  665. At the lowest directory level is the ``block`` directory. There can be 0, 1
  666. or more blocks specified in each instance:
  667. +-------------+
  668. | test-block0 |
  669. +-------------+
  670. In this directory the default attributes are:
==============  ================================================
ce_count        counter of CE events for this ``block``
                of hardware being monitored
ue_count        counter of UE events for this ``block``
                of hardware being monitored
==============  ================================================
  677. The ``test_device_edac`` device adds 4 attributes and 1 control:
==================  ====================================================
test-block-bits-0   for every POLL cycle this counter
                    is incremented
test-block-bits-1   every 10 cycles, this counter is bumped once,
                    and test-block-bits-0 is set to 0
test-block-bits-2   every 100 cycles, this counter is bumped once,
                    and test-block-bits-1 is set to 0
test-block-bits-3   every 1000 cycles, this counter is bumped once,
                    and test-block-bits-2 is set to 0
==================  ====================================================

==================  ====================================================
reset-counters      writing ANY thing to this control will
                    reset all the above counters.
==================  ====================================================
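Since the instance-level ``ce_count`` and ``ue_count`` are described as totals of the block-level counters, the relationship can be checked from the shell. The sketch below assumes each block is a direct subdirectory of the instance directory, as in the ``test_device_edac`` layout. The function name ``sum_block_counts`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: add up one counter (ce_count or ue_count) across all
# block subdirectories of an edac_device instance directory.
sum_block_counts() {
    instance_dir=$1 name=$2
    total=0
    for f in "$instance_dir"/*/"$name"; do
        # An unmatched glob stays literal and is skipped here.
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}
```

For instance, ``sum_block_counts /sys/devices/system/edac/test-instance/test-instance0 ce_count`` would be expected to print the same number as the instance's own ``ce_count`` file.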
Use of the ``test_device_edac`` driver should enable others to create their own
unique drivers for their hardware systems.
  694. The ``test_device_edac`` sample driver is located at the
  695. http://bluesmoke.sourceforge.net project site for EDAC.
  696. Usage of EDAC APIs on Nehalem and newer Intel CPUs
  697. --------------------------------------------------
  698. On older Intel architectures, the memory controller was part of the North
Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
  700. newer Intel architectures integrated an enhanced version of the memory
  701. controller (MC) inside the CPUs.
This chapter covers the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``skx_edac``.
  705. .. note::
The Xeon E7 processor families use a separate chip for the memory
controller, called Intel Scalable Memory Buffer. This section does not
apply to such families.
1) There is one Memory Controller per Quick Path Interconnect
(QPI). At the driver, the term "socket" means one QPI. This is
associated with a physical CPU socket.
Each MC has 3 physical read channels, 3 physical write channels and
3 logical channels. The driver currently sees it as just 3 channels.
Each channel can have up to 3 DIMMs.
The minimum known unit is the DIMM. There is no information about csrows.
As the EDAC API treats the csrow as its minimum unit, the driver sequentially
maps channel/DIMM pairs into different csrows.
  718. For example, supposing the following layout::
  719. Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
  720. dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
  721. dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
  722. Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
  723. dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
  724. Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
  725. dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
  726. The driver will map it as::
  727. csrow0: channel 0, dimm0
  728. csrow1: channel 0, dimm1
  729. csrow2: channel 1, dimm0
  730. csrow3: channel 2, dimm0
That is, the driver exports one DIMM per csrow.
  732. Each QPI is exported as a different memory controller.
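The sequential channel/DIMM to csrow numbering described above can be illustrated with a few lines of shell. This is only a model of the numbering rule, not driver code. The function name ``map_csrows`` and its input format (one ``channel dimm`` pair per line, in discovery order) are hypothetical.

```shell
#!/bin/sh
# Hypothetical model of the mapping rule: every populated channel/DIMM
# pair, taken in order, is given the next free csrow number.
map_csrows() {
    n=0
    while read -r channel dimm; do
        [ -n "$channel" ] || continue    # skip blank lines
        echo "csrow$n: channel $channel, dimm$dimm"
        n=$((n + 1))
    done
}
```

Running ``printf '0 0\n0 1\n1 0\n2 0\n' | map_csrows`` reproduces the four-line mapping shown for the example layout.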
  733. 2) The MC has the ability to inject errors to test drivers. The drivers
  734. implement this functionality via some error injection nodes:
  735. For injecting a memory error, there are some sysfs nodes, under
  736. ``/sys/devices/system/edac/mc/mc?/``:
  737. - ``inject_addrmatch/*``:
  738. Controls the error injection mask register. It is possible to specify
  739. several characteristics of the address to match an error code::
  740. dimm = the affected dimm. Numbers are relative to a channel;
  741. rank = the memory rank;
  742. channel = the channel that will generate an error;
  743. bank = the affected bank;
  744. page = the page address;
  745. column (or col) = the address column.
Each of the above values can be set to "any" to match any valid value.
At driver init, all values are set to "any".
  748. For example, to generate an error at rank 1 of dimm 2, for any channel,
  749. any bank, any page, any column::
  750. echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
  751. echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
  752. To return to the default behaviour of matching any, you can do::
  753. echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
  754. echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
  755. - ``inject_eccmask``:
specifies which bits will have errors injected,
  757. - ``inject_section``:
  758. specifies what ECC cache section will get the error::
  759. 3 for both
  760. 2 for the highest
  761. 1 for the lowest
  762. - ``inject_type``:
  763. specifies the type of error, being a combination of the following bits::
  764. bit 0 - repeat
  765. bit 1 - ecc
  766. bit 2 - parity
  767. - ``inject_enable``:
starts the error generation when a value other than 0 is written.
All inject variables can be read. Root permission is needed to write them.
The datasheet states that the error will only be generated after a write to an
address that matches inject_addrmatch. It seems, however, that reading will
also produce an error.
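Because ``inject_type`` is a combination of bits, the value to write can be computed from the flag names rather than by hand. The helper below is a sketch using the bit positions listed above. The function name ``inject_type_value`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn flag names into the numeric inject_type value,
# using the bit assignments documented above
# (bit 0 = repeat, bit 1 = ecc, bit 2 = parity).
inject_type_value() {
    value=0
    for flag in "$@"; do
        case $flag in
            repeat) value=$((value | 1)) ;;
            ecc)    value=$((value | 2)) ;;
            parity) value=$((value | 4)) ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$value"
}
```

``inject_type_value ecc`` prints ``2``, and ``inject_type_value repeat ecc`` prints ``3``. The result can then be written to ``inject_type`` with ``echo``.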
  773. For example, the following code will generate an error for any write access
  774. at socket 0, on any DIMM/address on channel 2::
  775. echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
  776. echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
  777. echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
  778. echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
  779. echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
  780. dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
For socket 1, replace "mc0" with "mc1" in the above
commands.
  783. The generated error message will look like::
  784. EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
  785. 3) Corrected Error memory register counters
These newer MCs have some registers to count memory errors. The driver
uses those registers to report Corrected Errors on devices with Registered
DIMMs.
However, those counters don't work with Unregistered DIMMs. As the chipset
offers some counters that also work with UDIMMs (but with a worse level of
granularity than the default ones), the driver exposes those registers for
UDIMM memories.
  793. They can be read by looking at the contents of ``all_channel_counts/``::
  794. $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
  795. /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
  796. 0
  797. /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
  798. 0
  799. /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
  800. 0
  801. What happens here is that errors on different csrows, but at the same
  802. dimm number will increment the same counter.
  803. So, in this memory mapping::
  804. csrow0: channel 0, dimm0
  805. csrow1: channel 0, dimm1
  806. csrow2: channel 1, dimm0
  807. csrow3: channel 2, dimm0
The hardware will increment udimm0 for an error at the first dimm of any
channel (csrow0, csrow2 or csrow3);
The hardware will increment udimm1 for an error at the second dimm of any
channel (csrow1);
The hardware will increment udimm2 for an error at the third dimm of any
channel (no such dimm is populated in this layout).
  814. 4) Standard error counters
The standard error counters are generated when an mcelog error is received
by the driver. Since, with UDIMMs, this is counted by software, it is
possible that some errors could be lost. With RDIMMs, they display the
contents of the registers.
  819. Reference documents used on ``amd64_edac``
  820. ------------------------------------------
The ``amd64_edac`` module is based on the following documents
  822. (available from http://support.amd.com/en-us/search/tech-docs):
  823. 1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
  824. Opteron Processors
  825. :AMD publication #: 26094
  826. :Revision: 3.26
  827. :Link: http://support.amd.com/TechDocs/26094.PDF
  828. 2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
  829. Processors
  830. :AMD publication #: 32559
  831. :Revision: 3.00
  832. :Issue Date: May 2006
  833. :Link: http://support.amd.com/TechDocs/32559.pdf
  834. 3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
  835. Processors
  836. :AMD publication #: 31116
  837. :Revision: 3.00
  838. :Issue Date: September 07, 2007
  839. :Link: http://support.amd.com/TechDocs/31116.pdf
  840. 4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
  841. Models 30h-3Fh Processors
  842. :AMD publication #: 49125
  843. :Revision: 3.06
  844. :Issue Date: 2/12/2015 (latest release)
  845. :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
  846. 5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
  847. Models 60h-6Fh Processors
  848. :AMD publication #: 50742
  849. :Revision: 3.01
  850. :Issue Date: 7/23/2015 (latest release)
  851. :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
  852. 6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
  853. Models 00h-0Fh Processors
  854. :AMD publication #: 48751
  855. :Revision: 3.03
  856. :Issue Date: 2/23/2015 (latest release)
  857. :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf
  858. Credits
  859. =======
  860. * Written by Doug Thompson <dougthompson@xmission.com>
  861. - 7 Dec 2005
  862. - 17 Jul 2007 Updated
  863. * |copy| Mauro Carvalho Chehab
  864. - 05 Aug 2009 Nehalem interface
  865. - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
  866. * EDAC authors/maintainers:
  867. - Doug Thompson, Dave Jiang, Dave Peterson et al,
  868. - Mauro Carvalho Chehab
  869. - Borislav Petkov
  870. - original author: Thayne Harbaugh