<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<p>August 8, 2017</p>
<p>This article was contributed by Paul E.&nbsp;McKenney</p>
<h3>Introduction</h3>
<p>This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.
<ol>
<li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
<li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
Tree RCU Grace Period Memory Ordering Building Blocks</a>
<li> <a href="#Tree RCU Grace Period Memory Ordering Components">
Tree RCU Grace Period Memory Ordering Components</a>
<li> <a href="#Putting It All Together">Putting It All Together</a>
</ol>
<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>
<p>RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed to see the effects of all accesses following the end
of that grace period that are within RCU read-side critical sections.
<p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>,
for which RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
<tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids.
<p>RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
the other of which is executed after the grace period.
In the most common use case, phase one removes an element from
a linked RCU-protected data structure, and phase two frees that element.
For this to work, any readers that have witnessed state prior to the
phase-one update (in the common case, removal) must not witness state
following the phase-two update (in the common case, freeing).
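<p>For example, a minimal sketch of this two-phase pattern might look
as follows. (The <tt>struct foo</tt>, <tt>foo_list</tt>, and
<tt>foo_lock</tt> are hypothetical, invented purely for illustration.)
<pre>
struct foo {
	struct list_head list;
	struct rcu_head rh;	/* Used by call_rcu()-based variants. */
	int key;
	int data;
};

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

void foo_del(struct foo *fp)
{
	spin_lock(&amp;foo_lock);
	list_del_rcu(&amp;fp-&gt;list);	/* Phase one: remove. */
	spin_unlock(&amp;foo_lock);
	synchronize_rcu();		/* Wait for pre-existing readers. */
	kfree(fp);			/* Phase two: free. */
}
</pre>
<p>Readers traversing <tt>foo_list</tt> under <tt>rcu_read_lock()</tt>
either see the element or they do not, but either way the grace period
guarantees that no reader still holding a reference to the element can
witness the <tt>kfree()</tt>.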
<p>The RCU implementation provides this guarantee using a network
of lock-based critical sections, memory barriers, and per-CPU
processing, as is described in the following sections.
<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>
<p>The workhorse for RCU's grace-period memory ordering is the
critical section for the <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt>.
These critical sections use helper functions for lock acquisition, including
<tt>raw_spin_lock_rcu_node()</tt>,
<tt>raw_spin_lock_irq_rcu_node()</tt>, and
<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
Their lock-release counterparts are
<tt>raw_spin_unlock_rcu_node()</tt>,
<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
respectively.
For completeness, a
<tt>raw_spin_trylock_rcu_node()</tt>
is also provided.
The key point is that the lock-acquisition functions, including
<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
acquisition of the lock.
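<p>For example, <tt>raw_spin_lock_rcu_node()</tt> might be structured
roughly as follows. (This is a sketch rather than the exact mainline
macro, but it captures the key point: a full memory barrier immediately
after acquisition.)
<pre>
#define raw_spin_lock_rcu_node(p)				\
do {								\
	raw_spin_lock(&amp;(p)-&gt;lock);				\
	smp_mb__after_unlock_lock(); /* Full ordering. */	\
} while (0)
</pre>
<p>On architectures whose lock acquisition already provides full
ordering (for example, x86), <tt>smp_mb__after_unlock_lock()</tt>
compiles to nothing.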
<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different <tt>rcu_node</tt> structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.
<p>The following litmus test exhibits the ordering effects of these
lock-acquisition and lock-release functions:
<pre>
int x, y, z;

void task0(void)
{
	raw_spin_lock_rcu_node(rnp);
	WRITE_ONCE(x, 1);
	r1 = READ_ONCE(y);
	raw_spin_unlock_rcu_node(rnp);
}

void task1(void)
{
	raw_spin_lock_rcu_node(rnp);
	WRITE_ONCE(y, 1);
	r2 = READ_ONCE(z);
	raw_spin_unlock_rcu_node(rnp);
}

void task2(void)
{
	WRITE_ONCE(z, 1);
	smp_mb();
	r3 = READ_ONCE(x);
}

WARN_ON(r1 == 0 &amp;&amp; r2 == 0 &amp;&amp; r3 == 0);
</pre>
<p>The <tt>WARN_ON()</tt> is evaluated at &ldquo;the end of time&rdquo;,
after all changes have propagated throughout the system.
Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
acquisition functions, this <tt>WARN_ON()</tt> could trigger, for
example, on PowerPC.
The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
<tt>WARN_ON()</tt> from triggering.
<p>This approach must be extended to include idle CPUs, which need
RCU's grace-period memory ordering guarantee to extend to any
RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
<tt>atomic_add_return()</tt> read-modify-write atomic operation that
is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
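<p>A rough sketch of the idle-entry side of this handshake follows.
(The exact counter encoding in mainline differs and has changed over
time; the key point is the full ordering supplied by
<tt>atomic_add_return()</tt>.)
<pre>
static void rcu_dynticks_eqs_enter(void)
{
	struct rcu_dynticks *rdtp = this_cpu_ptr(&amp;rcu_dynticks);
	int snap;

	/*
	 * Fully ordered increment: all accesses preceding idle entry
	 * are ordered before any later sampling of -&gt;dynticks.
	 * In this sketch, an even value means the CPU is idle.
	 */
	snap = atomic_add_return(1, &amp;rdtp-&gt;dynticks);
	WARN_ON_ONCE(snap &amp; 0x1);
}
</pre>
<p>On the sampling side, <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs_since()</tt> obtain the same full ordering by
invoking <tt>atomic_add_return(0, ...)</tt>, which reads the counter
without modifying it.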
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
But what about CPUs that remain offline for the entire
grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
Such CPUs will be offline at the beginning of the grace period,
so the grace period won't expect quiescent states from them.
Races between grace-period start and CPU-hotplug operations
are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt> as described above.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
<p>The approach must be extended to handle one final case, that
of waking a task blocked in <tt>synchronize_rcu()</tt>.
This task might be affinitied to a CPU that is not yet aware that
the grace period has ended, and thus might not yet be subject to
the grace period's memory ordering.
Therefore, there is an <tt>smp_mb()</tt> after the return from
<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
code path.
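<p>In other words, the tail end of <tt>synchronize_rcu()</tt> is
intended to follow roughly this pattern (a sketch based on the
<tt>wait_rcu_gp()</tt> machinery, with initialization details omitted):
<pre>
struct rcu_synchronize rs;

init_completion(&amp;rs.completion);
call_rcu(&amp;rs.head, wakeme_after_rcu); /* complete()s at GP end. */
wait_for_completion(&amp;rs.completion);
smp_mb(); /* Order post-wakeup accesses after the grace period. */
</pre>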
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
What? Where???
I don't see any <tt>smp_mb()</tt> after the return from
<tt>wait_for_completion()</tt>!!!
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
That would be because I spotted the need for that
<tt>smp_mb()</tt> during the creation of this documentation,
and it is therefore unlikely to hit mainline before v4.14.
Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
Jonathan Cameron for asking questions that sensitized me
to the rather elaborate sequence of events that demonstrate
the need for this memory barrier.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
<p>Tree RCU's grace-period memory-ordering guarantees rely most
heavily on the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
field, so much so that it is necessary to abbreviate this pattern
in the diagrams in the next section.
For example, consider the <tt>rcu_prepare_for_idle()</tt> function
shown below, which is one of several functions that enforce ordering
of newly arrived RCU callbacks against future grace periods:
<pre>
 1 static void rcu_prepare_for_idle(void)
 2 {
 3   bool needwake;
 4   struct rcu_data *rdp;
 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&amp;rcu_dynticks);
 6   struct rcu_node *rnp;
 7   struct rcu_state *rsp;
 8   int tne;
 9
10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
11       rcu_is_nocb_cpu(smp_processor_id()))
12     return;
13   tne = READ_ONCE(tick_nohz_active);
14   if (tne != rdtp-&gt;tick_nohz_enabled_snap) {
15     if (rcu_cpu_has_callbacks(NULL))
16       invoke_rcu_core();
17     rdtp-&gt;tick_nohz_enabled_snap = tne;
18     return;
19   }
20   if (!tne)
21     return;
22   if (rdtp-&gt;all_lazy &amp;&amp;
23       rdtp-&gt;nonlazy_posted != rdtp-&gt;nonlazy_posted_snap) {
24     rdtp-&gt;all_lazy = false;
25     rdtp-&gt;nonlazy_posted_snap = rdtp-&gt;nonlazy_posted;
26     invoke_rcu_core();
27     return;
28   }
29   if (rdtp-&gt;last_accelerate == jiffies)
30     return;
31   rdtp-&gt;last_accelerate = jiffies;
32   for_each_rcu_flavor(rsp) {
33     rdp = this_cpu_ptr(rsp-&gt;rda);
34     if (!rcu_segcblist_pend_cbs(&amp;rdp-&gt;cblist))
35       continue;
36     rnp = rdp-&gt;mynode;
37     raw_spin_lock_rcu_node(rnp);
38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
39     raw_spin_unlock_rcu_node(rnp);
40     if (needwake)
41       rcu_gp_kthread_wake(rsp);
42   }
43 }
</pre>
<p>But the only parts of <tt>rcu_prepare_for_idle()</tt> that really
matter for this discussion are lines&nbsp;37&ndash;39.
We will therefore abbreviate this function as follows:
</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">
<p>The box represents the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
critical section, with the double line on top representing the additional
<tt>smp_mb__after_unlock_lock()</tt>.
<h3><a name="Tree RCU Grace Period Memory Ordering Components">
Tree RCU Grace Period Memory Ordering Components</a></h3>
<p>Tree RCU's grace-period memory-ordering guarantee is provided by
a number of RCU components:
<ol>
<li> <a href="#Callback Registry">Callback Registry</a>
<li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a>
<li> <a href="#Self-Reported Quiescent States">
Self-Reported Quiescent States</a>
<li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
<li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
<li> <a href="#Forcing Quiescent States">Forcing Quiescent States</a>
<li> <a href="#Grace-Period Cleanup">Grace-Period Cleanup</a>
<li> <a href="#Callback Invocation">Callback Invocation</a>
</ol>
<p>Each of the following sections looks at the corresponding component
in detail.
  251. <h4><a name="Callback Registry">Callback Registry</a></h4>
  252. <p>If RCU's grace-period guarantee is to mean anything at all, any
  253. access that happens before a given invocation of <tt>call_rcu()</tt>
  254. must also happen before the corresponding grace period.
  255. The implementation of this portion of RCU's grace period guarantee
  256. is shown in the following figure:
  257. </p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">
  258. <p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
  259. it provides no ordering guarantees, either for itself or for
  260. phase one of the update (which again will usually be removal of
  261. an element from an RCU-protected data structure).
  262. It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
  263. which cannot become associated with a grace period until a later
  264. call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
  265. <p>One set of code paths shown on the left invokes
  266. <tt>rcu_accelerate_cbs()</tt> via
  267. <tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
  268. the current CPU is inundated with queued <tt>rcu_head</tt> structures)
  269. or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
  270. Another code path in the middle is taken only in kernels built with
  271. <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
  272. <tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
  273. The final code path on the right is taken only in kernels built with
  274. <tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
  275. <tt>rcu_accelerate_cbs()</tt> via
  276. <tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>,
  277. <tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
  278. which in turn is invoked on a surviving CPU after the outgoing
  279. CPU has been completely offlined.
  280. <p>There are a few other code paths within grace-period processing
  281. that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
  282. However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
  283. structures are associated with a future grace-period number under
  284. the protection of the CPU's lead <tt>rcu_node</tt> structure's
  285. <tt>-&gt;lock</tt>.
  286. In all cases, there is full ordering against any prior critical section
  287. for that same <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>, and
  288. also full ordering against any of the current task's or CPU's prior critical
  289. sections for any <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.
  290. <p>The next section will show how this ordering ensures that any
  291. accesses prior to the <tt>call_rcu()</tt> (particularly including phase
  292. one of the update)
  293. happen before the start of the corresponding grace period.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
But what about <tt>synchronize_rcu()</tt>?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
The <tt>synchronize_rcu()</tt> function passes <tt>call_rcu()</tt>
to <tt>wait_rcu_gp()</tt>, which invokes it.
So either way, it eventually comes down to <tt>call_rcu()</tt>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
  308. <h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>
  309. <p>Grace-period initialization is carried out by
  310. the grace-period kernel thread, which makes several passes over the
  311. <tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
  312. This means that showing the full flow of ordering through the
  313. grace-period computation will require duplicating this tree.
  314. If you find this confusing, please note that the state of the
  315. <tt>rcu_node</tt> changes over time, just like Heraclitus's river.
  316. However, to keep the <tt>rcu_node</tt> river tractable, the
  317. grace-period kernel thread's traversals are presented in multiple
  318. parts, starting in this section with the various phases of
  319. grace-period initialization.
  320. <p>The first ordering-related grace-period initialization action is to
  321. advance the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt>
  322. grace-period-number counter, as shown below:
  323. </p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
  324. <p>The actual increment is carried out using <tt>smp_store_release()</tt>,
  325. which helps reject false-positive RCU CPU stall detection.
  326. Note that only the root <tt>rcu_node</tt> structure is touched.
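<p>In other words, the start of the grace period is recorded by
something resembling the following single release store (a sketch;
the mainline helper also encodes grace-period phase in the low-order
bits of <tt>-&gt;gp_seq</tt>):
<pre>
/* Advance the grace-period sequence number with release semantics. */
smp_store_release(&amp;rsp-&gt;gp_seq, rsp-&gt;gp_seq + 1);
</pre>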
<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
based on CPUs having come online or gone offline since the start of
the previous grace period.
In the common case where the number of online CPUs for this <tt>rcu_node</tt>
structure has not transitioned to or from zero,
this pass will scan only the leaf <tt>rcu_node</tt> structures.
However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned from zero,
<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned to zero,
<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
The diagram below shows the path of ordering if the leftmost
<tt>rcu_node</tt> structure onlines its first CPU and if the next
<tt>rcu_node</tt> structure has no online CPUs
(or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines
its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).
</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%">
<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
<tt>-&gt;gp_seq</tt> field to the newly advanced value from the
<tt>rcu_state</tt> structure, as shown in the following diagram.
</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%">
<p>This change will also cause each CPU's next call to
<tt>__note_gp_changes()</tt>
to notice that a new grace period has started, as described in the next
section.
But because the grace-period kthread started the grace period at the
root (with the advancing of the <tt>rcu_state</tt> structure's
<tt>-&gt;gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
structure's <tt>-&gt;gp_seq</tt> field, each CPU's observation of
the start of the grace period will happen after the actual start
of the grace period.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
But what about the CPU that started the grace period?
Why wouldn't it see the start of the grace period right when
it started that grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
In some deep philosophical and overly anthropomorphized
sense, yes, the CPU starting the grace period is immediately
aware of having done so.
However, if we instead assume that RCU is not self-aware,
then even the CPU starting the grace period does not really
become aware of the start of this grace period until its
first call to <tt>__note_gp_changes()</tt>.
On the other hand, this CPU potentially gets early notification
because it invokes <tt>__note_gp_changes()</tt> during its
last <tt>rcu_gp_init()</tt> pass through its leaf
<tt>rcu_node</tt> structure.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
  384. <h4><a name="Self-Reported Quiescent States">
  385. Self-Reported Quiescent States</a></h4>
  386. <p>When all entities that might block the grace period have reported
  387. quiescent states (or as described in a later section, had quiescent
  388. states reported on their behalf), the grace period can end.
  389. Online non-idle CPUs report their own quiescent states, as shown
  390. in the following diagram:
  391. </p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">
  392. <p>This is for the last CPU to report a quiescent state, which signals
  393. the end of the grace period.
  394. Earlier quiescent states would push up the <tt>rcu_node</tt> tree
  395. only until they encountered an <tt>rcu_node</tt> structure that
  396. is waiting for additional quiescent states.
  397. However, ordering is nevertheless preserved because some later quiescent
  398. state will acquire that <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.
  399. <p>Any number of events can lead up to a CPU invoking
  400. <tt>note_gp_changes</tt> (or alternatively, directly invoking
  401. <tt>__note_gp_changes()</tt>), at which point that CPU will notice
  402. the start of a new grace period while holding its leaf
  403. <tt>rcu_node</tt> lock.
  404. Therefore, all execution shown in this diagram happens after the
  405. start of the grace period.
  406. In addition, this CPU will consider any RCU read-side critical
  407. section that started before the invocation of <tt>__note_gp_changes()</tt>
  408. to have started before the grace period, and thus a critical
  409. section that the grace period must wait on.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
But an RCU read-side critical section might have started
after the beginning of the grace period
(the advancing of <tt>-&gt;gp_seq</tt> from earlier), so why should
the grace period wait on such a critical section?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
It is indeed not necessary for the grace period to wait on such
a critical section.
However, it is permissible to wait on it.
And it is furthermore important to wait on it, as this
lazy approach is far more scalable than a &ldquo;big bang&rdquo;
all-at-once grace-period start could possibly be.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
<p>If the CPU does a context switch, a quiescent state will be
noted by <tt>rcu_note_context_switch()</tt> on the left.
On the other hand, if the CPU takes a scheduler-clock interrupt
while executing in usermode, a quiescent state will be noted by
<tt>rcu_check_callbacks()</tt> on the right.
Either way, the passage through a quiescent state will be noted
in a per-CPU variable.
<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
this CPU (for example, after the next scheduler-clock
interrupt), <tt>__rcu_process_callbacks()</tt> will invoke
<tt>rcu_check_quiescent_state()</tt>, which will notice the
recorded quiescent state, and invoke
<tt>rcu_report_qs_rdp()</tt>.
If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
really does apply to the current grace period, it invokes
<tt>rcu_report_qs_rnp()</tt>, which traverses up the <tt>rcu_node</tt>
tree as shown at the bottom of the diagram, clearing bits from
each <tt>rcu_node</tt> structure's <tt>-&gt;qsmask</tt> field,
and propagating up the tree when the result is zero.
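<p>The core of this upward propagation might be sketched as follows.
(Mainline's <tt>rcu_report_qs_rnp()</tt> also checks grace-period
numbers, blocked readers, and more; here <tt>mask</tt> initially
holds the reporting CPU's bit in its leaf <tt>rcu_node</tt> structure.)
<pre>
for (;;) {
	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	rnp-&gt;qsmask &amp;= ~mask;
	if (rnp-&gt;qsmask != 0 || rnp-&gt;parent == NULL) {
		/* Still waiting at this level, or this is the root. */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		break;
	}
	mask = rnp-&gt;grpmask; /* This subtree's bit in its parent. */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	rnp = rnp-&gt;parent;
}
</pre>
<p>Note that each level's bit is cleared while holding that level's
<tt>-&gt;lock</tt>, which is what hands the ordering off from one
reporting CPU to the next.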
<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
structure only if the current CPU is reporting the last quiescent
state for the subtree headed by that <tt>rcu_node</tt> structure.
A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
structure, then there will be a later traversal by another CPU
(or perhaps the same one) that proceeds upwards
from that point, and the <tt>rcu_node</tt> <tt>-&gt;lock</tt>
guarantees that the first CPU's quiescent state happens before the
remainder of the second CPU's traversal.
Applying this line of thought repeatedly shows that all CPUs'
quiescent states happen before the last CPU traverses through
the root <tt>rcu_node</tt> structure, the &ldquo;last CPU&rdquo;
being the one that clears the last bit in the root <tt>rcu_node</tt>
structure's <tt>-&gt;qsmask</tt> field.
  463. <h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>
  464. <p>Due to energy-efficiency considerations, RCU is forbidden from
  465. disturbing idle CPUs.
  466. CPUs are therefore required to notify RCU when entering or leaving idle
  467. state, which they do via fully ordered value-returning atomic operations
  468. on a per-CPU variable.
  469. The ordering effects are as shown below:
  470. </p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">
  471. <p>The RCU grace-period kernel thread samples the per-CPU idleness
  472. variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
  473. structure's <tt>-&gt;lock</tt>.
  474. This means that any RCU read-side critical sections that precede the
  475. idle period (the oval near the top of the diagram above) will happen
  476. before the end of the current grace period.
  477. Similarly, the beginning of the current grace period will happen before
  478. any RCU read-side critical sections that follow the
  479. idle period (the oval near the bottom of the diagram above).
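<p>A sketch of this sampling follows; it roughly resembles the
mainline <tt>dyntick_save_progress_counter()</tt> and
<tt>rcu_implicit_dynticks_qs()</tt> code paths, which run on behalf of
the grace-period kthread:
<pre>
/* First scan: snapshot the CPU's -&gt;dynticks counter. */
rdp-&gt;dynticks_snap = rcu_dynticks_snap(rdp-&gt;dynticks);

/* Later scan: has that CPU passed through idle in the meantime? */
if (rcu_dynticks_in_eqs_since(rdp-&gt;dynticks, rdp-&gt;dynticks_snap))
	return 1; /* Report a quiescent state on its behalf. */
</pre>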
<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.
<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>
<p>RCU is also forbidden from disturbing offline CPUs, which might well
be powered off and removed from the system completely.
CPUs are therefore required to notify RCU of their comings and goings
as part of the corresponding CPU hotplug operations.
The ordering effects are shown below:
</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">
<p>Because CPU hotplug operations are much less frequent than idle transitions,
they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
structure's <tt>-&gt;lock</tt> and update this structure's
<tt>-&gt;qsmaskinitnext</tt>.
The RCU grace-period kernel thread samples this mask to detect CPUs
having gone offline since the beginning of this grace period.
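<p>For example, the incoming CPU's arrival might be recorded by
something like the following sketch (the outgoing path clears the
same bit in the same way):
<pre>
mask = rdp-&gt;grpmask; /* This CPU's bit in its leaf rcu_node. */
raw_spin_lock_irqsave_rcu_node(rnp, flags);
rnp-&gt;qsmaskinitnext |= mask; /* Online for future grace periods. */
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
</pre>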
<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.
<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>
<p>As noted above, idle and offline CPUs cannot report their own
quiescent states, and therefore the grace-period kernel thread
must do the reporting on their behalf.
This process is called &ldquo;forcing quiescent states&rdquo;; it is
repeated every few jiffies, and its ordering effects are shown below:
</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">
<p>Each pass of quiescent-state forcing is guaranteed to traverse the
leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
states due to recently idled and/or offlined CPUs, then only the
leaves are traversed.
However, if there is a newly offlined CPU as illustrated on the left
or a newly idled CPU as illustrated on the right, the corresponding
quiescent state will be driven up towards the root.
As with self-reported quiescent states, the upwards driving stops
once it reaches an <tt>rcu_node</tt> structure that has quiescent
states outstanding from other CPUs.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
The leftmost drive to root stopped before it reached
the root <tt>rcu_node</tt> structure, which means that
there are still CPUs subordinate to that structure on
which the current grace period is waiting.
Given that, how is it possible that the rightmost drive
to root ended the grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
Good analysis!
It is in fact impossible in the absence of bugs in RCU.
But this diagram is complex enough as it is, so simplicity
overrode accuracy.
You can think of it as poetic license, or you can think of
it as misdirection that is resolved in the
<a href="#Putting It All Together">stitched-together diagram</a>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
  537. <h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>
  538. <p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
  539. breadth-first advancing all the <tt>-&gt;gp_seq</tt> fields, then it
  540. advances the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt> field.
  541. The ordering effects are shown below:
  542. </p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">
  543. <p>As indicated by the oval at the bottom of the diagram, once
  544. grace-period cleanup is complete, the next grace period can begin.
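<p>The cleanup scan might be sketched as follows (simplified; mainline
does considerably more work under each <tt>-&gt;lock</tt>, and
<tt>new_gp_seq</tt> here stands for the already-computed
end-of-grace-period value):
<pre>
rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock_irq_rcu_node(rnp);
	WRITE_ONCE(rnp-&gt;gp_seq, new_gp_seq);
	raw_spin_unlock_irq_rcu_node(rnp);
}
rcu_seq_end(&amp;rsp-&gt;gp_seq); /* Root-level counter advances last. */
</pre>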
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
But when precisely does the grace period end?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
There is no useful single point at which the grace period
can be said to end.
The earliest reasonable candidate is as soon as the last
CPU has reported its quiescent state, but it may be some
milliseconds before RCU becomes aware of this.
The latest reasonable candidate is once the <tt>rcu_state</tt>
structure's <tt>-&gt;gp_seq</tt> field has been updated,
but it is quite possible that some CPUs have already completed
phase two of their updates by that time.
In short, if you are going to work with RCU, you need to
learn to embrace uncertainty.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
  567. <h4><a name="Callback Invocation">Callback Invocation</a></h4>
  568. <p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
  569. <tt>-&gt;gp_seq</tt> field has been updated, that CPU can begin
  570. invoking its RCU callbacks that were waiting for this grace period
  571. to end.
  572. These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
  573. which is usually invoked by <tt>__note_gp_changes()</tt>.
  574. As shown in the diagram below, this invocation can be triggered by
  575. the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on
  576. the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
  577. the right, but only for kernels build with
  578. <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
  579. Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
  580. <tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
  581. allows those callbacks to carry out (either directly or indirectly
  582. via wakeup) the needed phase-two processing for each update.
  583. </p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">
  584. <p>Please note that callback invocation can also be prompted by any
  585. number of corner-case code paths, for example, when a CPU notes that
  586. it has excessive numbers of callbacks queued.
  587. In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
  588. <tt>-&gt;lock</tt> before invoking callbacks, which preserves the
  589. required ordering against the newly completed grace period.
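<p>Continuing the earlier removal/freeing example, the asynchronous
variant's phase two might be sketched as follows (again,
<tt>struct foo</tt> and friends are hypothetical):
<pre>
static void foo_reclaim(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rh);

	kfree(fp); /* Phase two: free, only after the grace period. */
}

void foo_del_async(struct foo *fp)
{
	spin_lock(&amp;foo_lock);
	list_del_rcu(&amp;fp-&gt;list); /* Phase one: remove. */
	spin_unlock(&amp;foo_lock);
	call_rcu(&amp;fp-&gt;rh, foo_reclaim); /* Defer phase two. */
}
</pre>
<p>Here <tt>foo_reclaim()</tt> does no cross-CPU communication, so the
ordering supplied by the leaf <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt> suffices.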
<p>However, if the callback function communicates to other CPUs,
for example, doing a wakeup, then it is that function's responsibility
to maintain ordering.
For example, if the callback function wakes up a task that runs on
some other CPU, proper ordering must be in place in both the callback
function and the task being awakened.
To see why this is important, consider the top half of the
<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
The callback might be running on a CPU corresponding to the leftmost
leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
and the grace-period kernel thread might not yet have reached the
rightmost leaf.
In this case, the grace period's memory ordering might not yet have
reached that CPU, so again the callback function and the awakened
task must supply proper ordering.
  606. <h3><a name="Putting It All Together">Putting It All Together</a></h3>
  607. <p>A stitched-together diagram is
  608. <a href="Tree-RCU-Diagram.html">here</a>.
  609. <h3><a name="Legal Statement">
  610. Legal Statement</a></h3>
  611. <p>This work represents the view of the author and does not necessarily
  612. represent the view of IBM.
  613. </p><p>Linux is a registered trademark of Linus Torvalds.
  614. </p><p>Other company, product, and service names may be trademarks or
  615. service marks of others.
  616. </body></html>