.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache mechanism that maintains the
coherence of a cache line stored in multiple CPUs' caches; the
academic definition is in [1]_. Consider a struct with a
refcount and a string::


    struct foo {
        refcount_t refcount;
        ...
        char name[16];
    } ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) *share* one cache line like below::

                +-----------+                     +-----------+
                |   CPU 0   |                     |   CPU 1   |
                +-----------+                     +-----------+
               /                                        |
              /                                         |
             V                                          V
         +----------------------+             +----------------------+
         | A      B             | Cache 0     | A       B            | Cache 1
         +----------------------+             +----------------------+
                             |                  |
  ---------------------------+------------------+-----------------------------
                             |                  |
                           +----------------------+
                           |                      |
                           +----------------------+
              Main Memory  | A       B            |
                           +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified. When many CPUs access 'foo' at
the same time, with one CPU frequently bumping 'refcount' and the
other CPUs only reading 'name', each write invalidates the copies of
the line held by the readers. All those reading CPUs have to reload
the whole cache line over and over due to the 'sharing', even though
'name' is never changed.

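
The layout problem can be checked with ``offsetof()``. The following is
a minimal userspace sketch, not kernel code: it assumes a 64-byte cache
line, uses a C11 ``atomic_int`` in place of ``refcount_t``, and replaces
the elided members with a hypothetical ``other`` array. The helper names
are made up for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed cache line size; 64 bytes is common but not universal. */
#define CACHE_LINE_SIZE 64

/* Userspace stand-in for 'struct foo'; not the kernel definition. */
struct foo {
    /* Aligning the first member aligns the whole struct. */
    alignas(CACHE_LINE_SIZE) atomic_int refcount;  /* written often */
    long other[2];  /* hypothetical filler for the elided members */
    char name[16];  /* set once at creation, then only read */
};

/* Index of the cache line, counted from the struct start, for an offset. */
static inline size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE_SIZE;
}

/* Returns 1 when 'refcount' and all of 'name' sit in one cache line. */
static inline int foo_members_share_line(void)
{
    size_t first = offsetof(struct foo, name);
    size_t last = first + sizeof(((struct foo *)0)->name) - 1;

    return cache_line_of(offsetof(struct foo, refcount)) ==
           cache_line_of(first) &&
           cache_line_of(first) == cache_line_of(last);
}
```

With this layout ``foo_members_share_line()`` returns 1, so every
update of 'refcount' also invalidates the line that holds 'name'.
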
There are many real-world cases of performance regressions caused by
false sharing. One of these is the rw_semaphore 'mmap_lock' inside
struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs
* Among the concurrent accesses to the data, at least one is a write
  operation: the write/write or write/read cases.

The sharing could be from totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
Back when a platform had only one or a few CPUs, hot data
members could be purposely put in the same cache line to keep them
cache hot and save cache lines and TLB entries, like a lock and the
data protected by it. But on recent large systems with hundreds of
CPUs, this may not work when the lock is heavily contended, as the
lock owner CPU could write to the data while other CPUs are busy
spinning on the lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* A lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line.
* Global data being put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* Data members of a big data structure randomly sitting together
  without being noticed (a cache line is usually 64 bytes or more),
  like the 'mem_cgroup' struct.

The following 'mitigation' section provides real-world examples.

False sharing can easily happen unless data layouts are intentionally
checked, and it is valuable to run specific tools on
performance-critical workloads to detect cases where false sharing
affects performance and optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning, and
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
be further used to detect and pinpoint the possible false sharing
data structures. 'addr2line' is also good at decoding instruction
pointers when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (file and line number) accessing each cache line,
and the in-line offset of the data. Simple commands are::

    $ perf c2c record -ag sleep 3
    $ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case,
perf reports something like::

    Total records                     :    1658231
    Locked Load/Store Operations      :      89439
    Load Operations                   :     623219
    Load Local HITM                   :      92117
    Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is [3]_.

'pahole' decodes data structure layouts delimited at cache line
granularity. Users can match the offset in perf-c2c output with
pahole's decoding to locate the exact data members. For global
data, users can search for the data address in System.map.

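
The offset matching described above can be sketched in a few lines of
userspace C. Everything here is hypothetical and for illustration only:
``struct demo_stats`` is not a kernel type, the helper names are made
up, and the object is assumed to start exactly at the cache line address
from the sample report (0xff1100088366d880). In practice pahole's output
plays the role of the member table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure, for illustration only (not a kernel type). */
struct demo_stats {
    uint64_t flags;   /* offset 0x00 */
    uint64_t hits;    /* offset 0x08 */
    uint64_t misses;  /* offset 0x10 */
    char label[40];   /* offset 0x18 */
};

struct member_info {
    const char *name;
    size_t offset;
    size_t size;
};

#define MEMBER(type, m) { #m, offsetof(type, m), sizeof(((type *)0)->m) }

/* Hand-written equivalent of what pahole reports for the struct. */
static const struct member_info demo_members[] = {
    MEMBER(struct demo_stats, flags),
    MEMBER(struct demo_stats, hits),
    MEMBER(struct demo_stats, misses),
    MEMBER(struct demo_stats, label),
};

/* Convert (cache line address, in-line offset) to an object offset. */
static inline uint64_t offset_in_object(uint64_t obj_base,
                                        uint64_t line_addr,
                                        uint64_t in_line_off)
{
    return line_addr + in_line_off - obj_base;
}

/* Name of the member covering 'off', or NULL if none does. */
static inline const char *member_at(uint64_t off)
{
    size_t n = sizeof(demo_members) / sizeof(demo_members[0]);

    for (size_t i = 0; i < n; i++) {
        const struct member_info *m = &demo_members[i];

        if (off >= m->offset && off < m->offset + m->size)
            return m->name;
    }
    return NULL;
}
```

Under these assumptions, the in-line offsets 0x8 and 0x10 from the
report map to 'hits' and 'misses', the two members that would be worth
separating.
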

Possible Mitigations
====================
False sharing does not always need to be mitigated. False sharing
mitigations should balance performance gains with complexity and
space consumption. Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
a cold data path.

Cases of false sharing hurting performance are seen more frequently
as core counts increase. Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged. Some common mitigations
(with examples) are:

* Separate hot global data into its own dedicated cache line, even if it
  is just a 'short' type. The downside is more consumption of memory,
  cache lines and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure, separating the interfering members
  into different cache lines. One downside is that it may introduce
  new false sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace 'write' with 'read' when possible, especially in loops.
  For some global variables, use compare(read)-then-write instead
  of an unconditional write. For example, use::

    if (!test_bit(XXX))
            set_bit(XXX);

  instead of directly "set_bit(XXX);", similarly for atomic_t data::

    if (atomic_read(XXX) == AAA)
            atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data to 'per-cpu data + global data' when possible,
  or reasonably increase the threshold for syncing per-cpu data to
  global data, to reduce or postpone the 'write' to that global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

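
The last two mitigations can be illustrated with a small userspace
sketch using C11 atomics. It is not kernel code: ``set_flag_if_clear()``
and ``batched_add()`` are hypothetical helpers, a thread-local variable
stands in for per-cpu data, and ``SYNC_BATCH`` is an arbitrary
threshold chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared flag word; a userspace stand-in for a kernel bitmap. */
static atomic_ulong shared_flags;

/*
 * Read before writing: when the bit is already set, only a load is
 * issued, so the cache line is not pulled into exclusive state.
 * Returns true if a write was issued. As with the kernel idiom, the
 * check and the write are two separate steps.
 */
static inline bool set_flag_if_clear(atomic_ulong *flags, unsigned int bit)
{
    unsigned long mask = 1UL << bit;

    if (atomic_load_explicit(flags, memory_order_relaxed) & mask)
        return false;
    atomic_fetch_or_explicit(flags, mask, memory_order_relaxed);
    return true;
}

/* Hypothetical threshold for folding local deltas into the global. */
#define SYNC_BATCH 32

static atomic_long global_count;
/* Thread-local delta, standing in for a per-cpu variable. */
static _Thread_local long local_delta;

/* Accumulate locally; touch the shared counter once per batch. */
static inline void batched_add(long n)
{
    local_delta += n;
    if (local_delta >= SYNC_BATCH || local_delta <= -SYNC_BATCH) {
        atomic_fetch_add_explicit(&global_count, local_delta,
                                  memory_order_relaxed);
        local_delta = 0;
    }
}
```

The trade-off of batching is accuracy: a reader of ``global_count`` may
miss up to ``SYNC_BATCH - 1`` pending updates per thread, which is why
the threshold should only be increased within reason.
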
Of course, all mitigations should be carefully verified to not cause
side effects. To avoid introducing false sharing when coding, it's
better to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields on
  different cache lines.

and it is better to add a comment stating the false sharing
consideration.

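
A minimal sketch of these guidelines, in userspace C11 and assuming a
64-byte cache line, is shown below. ``struct conn_table`` and its
members are hypothetical. Kernel code would normally use annotations
such as ``____cacheline_aligned_in_smp`` rather than ``alignas()``.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed cache line size; check the target architecture. */
#define CACHE_LINE_SIZE 64

/* Hypothetical structure, for illustration only (not a kernel type). */
struct conn_table {
    /* Read-mostly fields: set at init, read on every lookup. */
    alignas(CACHE_LINE_SIZE) unsigned int nr_buckets;
    unsigned int hash_seed;
    void *buckets;

    /*
     * Write-hot fields, updated together on every insert/remove.
     * Kept on their own cache line to avoid false sharing with the
     * read-mostly fields above.
     */
    alignas(CACHE_LINE_SIZE) atomic_long nr_entries;
    atomic_long nr_inserts;
};

/* Fail the build if a later change merges the two groups again. */
static_assert(offsetof(struct conn_table, nr_entries) / CACHE_LINE_SIZE !=
              offsetof(struct conn_table, buckets) / CACHE_LINE_SIZE,
              "read-mostly and write-hot fields must not share a cache line");
```

The compile-time check documents the intent next to the layout, so a
later reordering of members cannot silently reintroduce the sharing.
The cost is space: the structure grows to two full cache lines.
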
Note that sometimes, even after a severe case of false sharing is
detected and solved, the performance may still show no obvious
improvement as the hotspot switches to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes how data members
share cache lines.

.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/