|
@@ -0,0 +1,626 @@
|
|
|
|
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
|
|
|
+ "http://www.w3.org/TR/html4/loose.dtd">
|
|
|
|
+ <html>
|
|
|
|
+ <head><title>A Tour Through TREE_RCU's Expedited Grace Periods</title>
|
|
|
|
+ <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
+</head><body>
|
|
|
|
+
|
|
|
|
+<h2>Introduction</h2>
|
|
|
|
+
|
|
|
|
+This document describes RCU's expedited grace periods.
|
|
|
|
+Unlike RCU's normal grace periods, which accept long latencies to attain
|
|
|
|
+high efficiency and minimal disturbance, expedited grace periods accept
|
|
|
|
+lower efficiency and significant disturbance to attain shorter latencies.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+There are three flavors of RCU (RCU-bh, RCU-preempt, and RCU-sched),
|
|
|
|
+but only two flavors of expedited grace periods because the RCU-bh
|
|
|
|
+expedited grace period maps onto the RCU-sched expedited grace period.
|
|
|
|
+Each of the remaining two implementations is covered in its own section.
|
|
|
|
+
|
|
|
|
+<ol>
|
|
|
|
+<li> <a href="#Expedited Grace Period Design">
|
|
|
|
+ Expedited Grace Period Design</a>
|
|
|
|
+<li> <a href="#RCU-preempt Expedited Grace Periods">
|
|
|
|
+ RCU-preempt Expedited Grace Periods</a>
|
|
|
|
+<li> <a href="#RCU-sched Expedited Grace Periods">
|
|
|
|
+ RCU-sched Expedited Grace Periods</a>
|
|
|
|
+<li> <a href="#Expedited Grace Period and CPU Hotplug">
|
|
|
|
+ Expedited Grace Period and CPU Hotplug</a>
|
|
|
|
+<li> <a href="#Expedited Grace Period Refinements">
|
|
|
|
+ Expedited Grace Period Refinements</a>
|
|
|
|
+</ol>
|
|
|
|
+
|
|
|
|
+<h2><a name="Expedited Grace Period Design">
|
|
|
|
+Expedited Grace Period Design</a></h2>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The expedited RCU grace periods cannot be accused of being subtle,
|
|
|
|
+given that they, for all intents and purposes, hammer every CPU that
|
|
|
|
+has not yet provided a quiescent state for the current expedited
|
|
|
|
+grace period.
|
|
|
|
+The one saving grace is that the hammer has grown a bit smaller
|
|
|
|
+over time: The old call to <tt>try_stop_cpus()</tt> has been
|
|
|
|
+replaced with a set of calls to <tt>smp_call_function_single()</tt>,
|
|
|
|
+each of which results in an IPI to the target CPU.
|
|
|
|
+The corresponding handler function checks the CPU's state, motivates
|
|
|
|
+a faster quiescent state where possible, and triggers a report
|
|
|
|
+of that quiescent state.
|
|
|
|
+As always for RCU, once everything has spent some time in a quiescent
|
|
|
|
+state, the expedited grace period has completed.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The details of the <tt>smp_call_function_single()</tt> handler's
|
|
|
|
+operation depend on the RCU flavor, as described in the following
|
|
|
|
+sections.
|
|
|
|
+
|
|
|
|
+<h2><a name="RCU-preempt Expedited Grace Periods">
|
|
|
|
+RCU-preempt Expedited Grace Periods</a></h2>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The overall flow of the handling of a given CPU by an RCU-preempt
|
|
|
|
+expedited grace period is shown in the following diagram:
|
|
|
|
+
|
|
|
|
+<p><img src="ExpRCUFlow.svg" alt="ExpRCUFlow.svg" width="55%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The solid arrows denote direct action, for example, a function call.
|
|
|
|
+The dotted arrows denote indirect action, for example, an IPI
|
|
|
|
+or a state that is reached after some time.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+If a given CPU is offline or idle, <tt>synchronize_rcu_expedited()</tt>
|
|
|
|
+will ignore it because idle and offline CPUs are already residing
|
|
|
|
+in quiescent states.
|
|
|
|
+Otherwise, the expedited grace period will use
|
|
|
|
+<tt>smp_call_function_single()</tt> to send the CPU an IPI, which
|
|
|
|
+is handled by <tt>sync_rcu_exp_handler()</tt>.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+However, because this is preemptible RCU, <tt>sync_rcu_exp_handler()</tt>
|
|
|
|
+can check to see if the CPU is currently running in an RCU read-side
|
|
|
|
+critical section.
|
|
|
|
+If not, the handler can immediately report a quiescent state.
|
|
|
|
+Otherwise, it sets flags so that the outermost <tt>rcu_read_unlock()</tt>
|
|
|
|
+invocation will provide the needed quiescent-state report.
|
|
|
|
+This flag-setting avoids the previous forced preemption of all
|
|
|
|
+CPUs that might have RCU read-side critical sections.
|
|
|
|
+In addition, this flag-setting is done so as to avoid increasing
|
|
|
|
+the overhead of the common-case fastpath through the scheduler.
|
|
|
|
+
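<p>
This decision can be sketched in userspace C. The following is a
simplified model, not the kernel's actual code: the <tt>cpu_state</tt>
structure and its fields stand in for the real per-CPU <tt>rcu_data</tt>
state, and all locking and memory ordering are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-CPU state; the kernel's rcu_data structure carries
 * far more than this. */
struct cpu_state {
	int nesting;        /* depth of rcu_read_lock() nesting */
	bool deferred_qs;   /* report QS at outermost rcu_read_unlock() */
	bool qs_reported;   /* quiescent state reported to the GP */
};

/* Sketch of the sync_rcu_exp_handler() decision: report the quiescent
 * state immediately if the CPU is not in an RCU read-side critical
 * section, otherwise flag the outermost rcu_read_unlock() to do so. */
static void exp_handler(struct cpu_state *cs)
{
	if (cs->nesting == 0)
		cs->qs_reported = true;   /* immediate report */
	else
		cs->deferred_qs = true;   /* defer to rcu_read_unlock() */
}

/* Sketch of the rcu_read_unlock() slowpath: the deferred report
 * happens only at the outermost unlock. */
static void read_unlock(struct cpu_state *cs)
{
	if (--cs->nesting == 0 && cs->deferred_qs) {
		cs->deferred_qs = false;
		cs->qs_reported = true;
	}
}
```

Note that only the outermost <tt>rcu_read_unlock()</tt> reports the
quiescent state, which is why nested critical sections continue to
block the expedited grace period until the outermost one completes.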
|
|
|
|
+<p>
|
|
|
|
+Again because this is preemptible RCU, an RCU read-side critical section
|
|
|
|
+can be preempted.
|
|
|
|
+When that happens, RCU will enqueue the task, which will then continue to
|
|
|
|
+block the current expedited grace period until it resumes and finds its
|
|
|
|
+outermost <tt>rcu_read_unlock()</tt>.
|
|
|
|
+The CPU will report a quiescent state just after enqueuing the task because
|
|
|
|
+the CPU is no longer blocking the grace period.
|
|
|
|
+It is instead the preempted task doing the blocking.
|
|
|
|
+The list of blocked tasks is managed by <tt>rcu_preempt_ctxt_queue()</tt>,
|
|
|
|
+which is called from <tt>rcu_preempt_note_context_switch()</tt>, which
|
|
|
|
+in turn is called from <tt>rcu_note_context_switch()</tt>, which in
|
|
|
|
+turn is called from the scheduler.
|
|
|
|
+
|
|
|
|
+<table>
|
|
|
|
+<tr><th> </th></tr>
|
|
|
|
+<tr><th align="left">Quick Quiz:</th></tr>
|
|
|
|
+<tr><td>
|
|
|
|
+ Why not just have the expedited grace period check the
|
|
|
|
+ state of all the CPUs?
|
|
|
|
+ After all, that would avoid all those real-time-unfriendly IPIs.
|
|
|
|
+</td></tr>
|
|
|
|
+<tr><th align="left">Answer:</th></tr>
|
|
|
|
+<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
|
|
+ Because we want the RCU read-side critical sections to run fast,
|
|
|
|
+ which means no memory barriers.
|
|
|
|
+ Therefore, it is not possible to safely check the state from some
|
|
|
|
+ other CPU.
|
|
|
|
+	And even if it were possible to safely check the state, it would
|
|
|
|
+ still be necessary to IPI the CPU to safely interact with the
|
|
|
|
+ upcoming <tt>rcu_read_unlock()</tt> invocation, which means that
|
|
|
|
+ the remote state testing would not help the worst-case
|
|
|
|
+ latency that real-time applications care about.
|
|
|
|
+
|
|
|
|
+ <p><font color="ffffff">One way to prevent your real-time
|
|
|
|
+ application from getting hit with these IPIs is to
|
|
|
|
+ build your kernel with <tt>CONFIG_NO_HZ_FULL=y</tt>.
|
|
|
|
+ RCU would then perceive the CPU running your application
|
|
|
|
+ as being idle, and it would be able to safely detect that
|
|
|
|
+ state without needing to IPI the CPU.
|
|
|
|
+</font></td></tr>
|
|
|
|
+<tr><td> </td></tr>
|
|
|
|
+</table>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Please note that this is just the overall flow:
|
|
|
|
+Additional complications can arise due to races with CPUs going idle
|
|
|
|
+or offline, among other things.
|
|
|
|
+
|
|
|
|
+<h2><a name="RCU-sched Expedited Grace Periods">
|
|
|
|
+RCU-sched Expedited Grace Periods</a></h2>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The overall flow of the handling of a given CPU by an RCU-sched
|
|
|
|
+expedited grace period is shown in the following diagram:
|
|
|
|
+
|
|
|
|
+<p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+As with RCU-preempt's <tt>synchronize_rcu_expedited()</tt>,
|
|
|
|
+<tt>synchronize_sched_expedited()</tt> ignores offline and
|
|
|
|
+idle CPUs, again because they are in remotely detectable
|
|
|
|
+quiescent states.
|
|
|
|
+However, the <tt>synchronize_sched_expedited()</tt> handler
|
|
|
|
+is <tt>sync_sched_exp_handler()</tt>, and because the
|
|
|
|
+<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt>
|
|
|
|
+leave no trace of their invocation, in general it is not possible to tell
|
|
|
|
+whether or not the current CPU is in an RCU read-side critical section.
|
|
|
|
+The best that <tt>sync_sched_exp_handler()</tt> can do is to check
|
|
|
|
+for idle, on the off-chance that the CPU went idle while the IPI
|
|
|
|
+was in flight.
|
|
|
|
+If the CPU is idle, then <tt>sync_sched_exp_handler()</tt> reports
|
|
|
|
+the quiescent state.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Otherwise, the handler invokes <tt>resched_cpu()</tt>, which forces
|
|
|
|
+a future context switch.
|
|
|
|
+At the time of the context switch, the CPU reports the quiescent state.
|
|
|
|
+Should the CPU go offline first, it will report the quiescent state
|
|
|
|
+at that time.
|
|
|
|
+
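<p>
The two RCU-sched outcomes described above can be modeled the same way.
Again, this is an illustrative sketch rather than the kernel's code:
the <tt>sched_cpu</tt> structure is an assumption standing in for the
real per-CPU state, and the context switch is modeled as a direct call.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-CPU state for the RCU-sched case. */
struct sched_cpu {
	bool idle;              /* IPI interrupted the idle loop? */
	bool qs_reported;       /* quiescent state reported to the GP */
	bool resched_requested; /* resched_cpu() was invoked */
};

/* Sketch of sync_sched_exp_handler(): the only state it can usefully
 * check is idleness; otherwise it forces a future context switch. */
static void sched_exp_handler(struct sched_cpu *cs)
{
	if (cs->idle)
		cs->qs_reported = true;       /* idle is a quiescent state */
	else
		cs->resched_requested = true; /* resched_cpu() */
}

/* The context switch forced by resched_cpu() reports the QS. */
static void context_switch(struct sched_cpu *cs)
{
	if (cs->resched_requested) {
		cs->resched_requested = false;
		cs->qs_reported = true;
	}
}
```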
|
|
|
|
+<h2><a name="Expedited Grace Period and CPU Hotplug">
|
|
|
|
+Expedited Grace Period and CPU Hotplug</a></h2>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The expedited nature of expedited grace periods requires a much tighter
|
|
|
|
+interaction with CPU hotplug operations than is required for normal
|
|
|
|
+grace periods.
|
|
|
|
+In addition, attempting to IPI offline CPUs will result in splats, but
|
|
|
|
+failing to IPI online CPUs can result in too-short grace periods.
|
|
|
|
+Neither option is acceptable in production kernels.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The interaction between expedited grace periods and CPU hotplug operations
|
|
|
|
+is carried out at several levels:
|
|
|
|
+
|
|
|
|
+<ol>
|
|
|
|
+<li> The number of CPUs that have ever been online is tracked
|
|
|
|
+ by the <tt>rcu_state</tt> structure's <tt>->ncpus</tt>
|
|
|
|
+ field.
|
|
|
|
+ The <tt>rcu_state</tt> structure's <tt>->ncpus_snap</tt>
|
|
|
|
+ field tracks the number of CPUs that have ever been online
|
|
|
|
+ at the beginning of an RCU expedited grace period.
|
|
|
|
+ Note that this number never decreases, at least in the absence
|
|
|
|
+ of a time machine.
|
|
|
|
+<li>	The identities of the CPUs that have ever been online are
|
|
|
|
+ tracked by the <tt>rcu_node</tt> structure's
|
|
|
|
+ <tt>->expmaskinitnext</tt> field.
|
|
|
|
+ The <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
|
|
|
|
+ field tracks the identities of the CPUs that were online
|
|
|
|
+ at least once at the beginning of the most recent RCU
|
|
|
|
+ expedited grace period.
|
|
|
|
+ The <tt>rcu_state</tt> structure's <tt>->ncpus</tt> and
|
|
|
|
+ <tt>->ncpus_snap</tt> fields are used to detect when
|
|
|
|
+ new CPUs have come online for the first time, that is,
|
|
|
|
+ when the <tt>rcu_node</tt> structure's <tt>->expmaskinitnext</tt>
|
|
|
|
+ field has changed since the beginning of the last RCU
|
|
|
|
+ expedited grace period, which triggers an update of each
|
|
|
|
+ <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
|
|
|
|
+ field from its <tt>->expmaskinitnext</tt> field.
|
|
|
|
+<li> Each <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
|
|
|
|
+ field is used to initialize that structure's
|
|
|
|
+ <tt>->expmask</tt> at the beginning of each RCU
|
|
|
|
+ expedited grace period.
|
|
|
|
+ This means that only those CPUs that have been online at least
|
|
|
|
+ once will be considered for a given grace period.
|
|
|
|
+<li> Any CPU that goes offline will clear its bit in its leaf
|
|
|
|
+ <tt>rcu_node</tt> structure's <tt>->qsmaskinitnext</tt>
|
|
|
|
+ field, so any CPU with that bit clear can safely be ignored.
|
|
|
|
+ However, it is possible for a CPU coming online or going offline
|
|
|
|
+ to have this bit set for some time while <tt>cpu_online</tt>
|
|
|
|
+ returns <tt>false</tt>.
|
|
|
|
+<li> For each non-idle CPU that RCU believes is currently online, the grace
|
|
|
|
+ period invokes <tt>smp_call_function_single()</tt>.
|
|
|
|
+ If this succeeds, the CPU was fully online.
|
|
|
|
+ Failure indicates that the CPU is in the process of coming online
|
|
|
|
+ or going offline, in which case it is necessary to wait for a
|
|
|
|
+ short time period and try again.
|
|
|
|
+ The purpose of this wait (or series of waits, as the case may be)
|
|
|
|
+ is to permit a concurrent CPU-hotplug operation to complete.
|
|
|
|
+<li> In the case of RCU-sched, one of the last acts of an outgoing CPU
|
|
|
|
+ is to invoke <tt>rcu_report_dead()</tt>, which
|
|
|
|
+ reports a quiescent state for that CPU.
|
|
|
|
+ However, this is likely paranoia-induced redundancy. <!-- @@@ -->
|
|
|
|
+</ol>
|
|
|
|
+
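<p>
The interplay of <tt>->ncpus</tt>/<tt>->ncpus_snap</tt> and
<tt>->expmaskinit</tt>/<tt>->expmaskinitnext</tt> can be modeled with
plain integers and bitmasks. This is a sketch under simplifying
assumptions: the real code propagates these masks through the
<tt>rcu_node</tt> tree under the appropriate locks, whereas this model
collapses the tree to a single level.

```c
#include <assert.h>

/* Simplified model of the lazy CPU-hotplug accounting. */
struct exp_state {
	int ncpus;                     /* CPUs that have ever been online */
	int ncpus_snap;                /* ->ncpus at last expedited-GP start */
	unsigned long expmaskinitnext; /* identities, updated at online time */
	unsigned long expmaskinit;     /* identities, updated at GP start */
};

/* A CPU coming online for the first time is recorded lazily: only
 * the counter and the "next" mask are touched. */
static void cpu_first_online(struct exp_state *s, int cpu)
{
	s->ncpus++;
	s->expmaskinitnext |= 1UL << cpu;
}

/* At expedited-GP start: propagate newly onlined CPUs into the
 * initialization mask only if the counter changed, and return the
 * mask used to initialize ->expmask for this grace period. */
static unsigned long exp_gp_init(struct exp_state *s)
{
	if (s->ncpus != s->ncpus_snap) {
		s->ncpus_snap = s->ncpus;
		s->expmaskinit = s->expmaskinitnext;
	}
	return s->expmaskinit;
}
```

In the common case of no hotplug activity, <tt>exp_gp_init()</tt> does
no mask copying at all, which is the payoff of the lazy scheme.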
|
|
|
|
+<table>
|
|
|
|
+<tr><th> </th></tr>
|
|
|
|
+<tr><th align="left">Quick Quiz:</th></tr>
|
|
|
|
+<tr><td>
|
|
|
|
+ Why all the dancing around with multiple counters and masks
|
|
|
|
+ tracking CPUs that were once online?
|
|
|
|
+ Why not just have a single set of masks tracking the currently
|
|
|
|
+ online CPUs and be done with it?
|
|
|
|
+</td></tr>
|
|
|
|
+<tr><th align="left">Answer:</th></tr>
|
|
|
|
+<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
|
|
+	Maintaining a single set of masks tracking the online CPUs <i>sounds</i>
|
|
|
|
+ easier, at least until you try working out all the race conditions
|
|
|
|
+ between grace-period initialization and CPU-hotplug operations.
|
|
|
|
+ For example, suppose initialization is progressing down the
|
|
|
|
+ tree while a CPU-offline operation is progressing up the tree.
|
|
|
|
+ This situation can result in bits set at the top of the tree
|
|
|
|
+ that have no counterparts at the bottom of the tree.
|
|
|
|
+ Those bits will never be cleared, which will result in
|
|
|
|
+ grace-period hangs.
|
|
|
|
+ In short, that way lies madness, to say nothing of a great many
|
|
|
|
+ bugs, hangs, and deadlocks.
|
|
|
|
+
|
|
|
|
+ <p><font color="ffffff">
|
|
|
|
+ In contrast, the current multi-mask multi-counter scheme ensures
|
|
|
|
+ that grace-period initialization will always see consistent masks
|
|
|
|
+ up and down the tree, which brings significant simplifications
|
|
|
|
+ over the single-mask method.
|
|
|
|
+
|
|
|
|
+ <p><font color="ffffff">
|
|
|
|
+ This is an instance of
|
|
|
|
+ <a href="http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz"><font color="ffffff">
|
|
|
|
+ deferring work in order to avoid synchronization</a>.
|
|
|
|
+ Lazily recording CPU-hotplug events at the beginning of the next
|
|
|
|
+ grace period greatly simplifies maintenance of the CPU-tracking
|
|
|
|
+ bitmasks in the <tt>rcu_node</tt> tree.
|
|
|
|
+</font></td></tr>
|
|
|
|
+<tr><td> </td></tr>
|
|
|
|
+</table>
|
|
|
|
+
|
|
|
|
+<h2><a name="Expedited Grace Period Refinements">
|
|
|
|
+Expedited Grace Period Refinements</a></h2>
|
|
|
|
+
|
|
|
|
+<ol>
|
|
|
|
+<li> <a href="#Idle-CPU Checks">Idle-CPU checks</a>.
|
|
|
|
+<li> <a href="#Batching via Sequence Counter">
|
|
|
|
+ Batching via sequence counter</a>.
|
|
|
|
+<li> <a href="#Funnel Locking and Wait/Wakeup">
|
|
|
|
+ Funnel locking and wait/wakeup</a>.
|
|
|
|
+<li> <a href="#Use of Workqueues">Use of Workqueues</a>.
|
|
|
|
+<li> <a href="#Stall Warnings">Stall warnings</a>.
|
|
|
|
+</ol>
|
|
|
|
+
|
|
|
|
+<h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Each expedited grace period checks for idle CPUs when initially forming
|
|
|
|
+the mask of CPUs to be IPIed and again just before IPIing a CPU
|
|
|
|
+(both checks are carried out by <tt>sync_rcu_exp_select_cpus()</tt>).
|
|
|
|
+If the CPU is idle at any time between those two times, the CPU will
|
|
|
|
+not be IPIed.
|
|
|
|
+Instead, the task pushing the grace period forward will include the
|
|
|
|
+idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>.
|
|
|
|
+
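<p>
The effect of these two checks can be sketched as a mask computation.
The function name below is an assumption for illustration; the real
<tt>sync_rcu_exp_select_cpus()</tt> works per <tt>rcu_node</tt>
structure and consults the dyntick-idle counters rather than a simple
idle bitmask.

```c
#include <assert.h>

/* Sketch: CPUs seen idle are excluded from the IPI mask and instead
 * reported as already quiescent (as rcu_report_exp_cpu_mult() would). */
static unsigned long build_ipi_mask(unsigned long online, unsigned long idle,
				    unsigned long *quiescent)
{
	*quiescent = online & idle;   /* idle CPUs: report directly */
	return online & ~idle;        /* remaining CPUs must be IPIed */
}
```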
|
|
|
|
+<p>
|
|
|
|
+For RCU-sched, there is an additional check for idle in the IPI
|
|
|
|
+handler, <tt>sync_sched_exp_handler()</tt>.
|
|
|
|
+If the IPI has interrupted the idle loop, then
|
|
|
|
+<tt>sync_sched_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt>
|
|
|
|
+to report the corresponding quiescent state.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+For RCU-preempt, there is no specific check for idle in the
|
|
|
|
+IPI handler (<tt>sync_rcu_exp_handler()</tt>), but because
|
|
|
|
+RCU read-side critical sections are not permitted within the
|
|
|
|
+idle loop, if <tt>sync_rcu_exp_handler()</tt> sees that the CPU is within
|
|
|
|
+an RCU read-side critical section, the CPU cannot possibly be idle.
|
|
|
|
+Otherwise, <tt>sync_rcu_exp_handler()</tt> invokes
|
|
|
|
+<tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent
|
|
|
|
+state, regardless of whether or not that quiescent state was due to
|
|
|
|
+the CPU being idle.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+In summary, RCU expedited grace periods check for idle when building
|
|
|
|
+the bitmask of CPUs that must be IPIed, just before sending each IPI,
|
|
|
|
+and (either explicitly or implicitly) within the IPI handler.
|
|
|
|
+
|
|
|
|
+<h3><a name="Batching via Sequence Counter">
|
|
|
|
+Batching via Sequence Counter</a></h3>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+If each grace-period request was carried out separately, expedited
|
|
|
|
+grace periods would have abysmal scalability and
|
|
|
|
+problematic high-load characteristics.
|
|
|
|
+Because each grace-period operation can serve an unlimited number of
|
|
|
|
+updates, it is important to <i>batch</i> requests, so that a single
|
|
|
|
+expedited grace-period operation will cover all requests in the
|
|
|
|
+corresponding batch.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+This batching is controlled by a sequence counter named
|
|
|
|
+<tt>->expedited_sequence</tt> in the <tt>rcu_state</tt> structure.
|
|
|
|
+This counter has an odd value when there is an expedited grace period
|
|
|
|
+in progress and an even value otherwise, so that dividing the counter
|
|
|
|
+value by two gives the number of completed grace periods.
|
|
|
|
+During any given update request, the counter must transition from
|
|
|
|
+even to odd and then back to even, thus indicating that a grace
|
|
|
|
+period has elapsed.
|
|
|
|
+Therefore, if the initial value of the counter is <tt>s</tt>,
|
|
|
|
+the updater must wait until the counter reaches at least the
|
|
|
|
+value <tt>(s+3)&~0x1</tt>.
|
|
|
|
+This counter is managed by the following access functions:
|
|
|
|
+
|
|
|
|
+<ol>
|
|
|
|
+<li> <tt>rcu_exp_gp_seq_start()</tt>, which marks the start of
|
|
|
|
+ an expedited grace period.
|
|
|
|
+<li> <tt>rcu_exp_gp_seq_end()</tt>, which marks the end of an
|
|
|
|
+ expedited grace period.
|
|
|
|
+<li> <tt>rcu_exp_gp_seq_snap()</tt>, which obtains a snapshot of
|
|
|
|
+ the counter.
|
|
|
|
+<li> <tt>rcu_exp_gp_seq_done()</tt>, which returns <tt>true</tt>
|
|
|
|
+ if a full expedited grace period has elapsed since the
|
|
|
|
+ corresponding call to <tt>rcu_exp_gp_seq_snap()</tt>.
|
|
|
|
+</ol>
|
|
|
|
+
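<p>
The even/odd protocol and the <tt>(s+3)&~0x1</tt> rule can be
demonstrated directly. The following is a sketch of the four access
functions listed above, with the names simplified and all the kernel's
memory ordering omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of ->expedited_sequence: odd while an expedited grace period
 * is in progress, even otherwise. */
static unsigned long exp_seq;

static void exp_gp_seq_start(void) { exp_seq++; /* even -> odd */ }
static void exp_gp_seq_end(void)   { exp_seq++; /* odd -> even */ }

/* Snapshot: the counter value that, once reached, guarantees a full
 * grace period has elapsed since the snapshot was taken. */
static unsigned long exp_gp_seq_snap(void)
{
	return (exp_seq + 3) & ~0x1UL;
}

static bool exp_gp_seq_done(unsigned long snap)
{
	return exp_seq >= snap;
}
```

Taking the snapshot while a grace period is already in progress (odd
counter) yields a larger target, because the in-flight grace period
may have started before the request and therefore cannot serve it.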
|
|
|
|
+<p>
|
|
|
|
+Again, only one request in a given batch need actually carry out
|
|
|
|
+a grace-period operation, which means there must be an efficient
|
|
|
|
+way to identify which of many concurrent requests will initiate
|
|
|
|
+the grace period, and that there be an efficient way for the
|
|
|
|
+remaining requests to wait for that grace period to complete.
|
|
|
|
+However, that is the topic of the next section.
|
|
|
|
+
|
|
|
|
+<h3><a name="Funnel Locking and Wait/Wakeup">
|
|
|
|
+Funnel Locking and Wait/Wakeup</a></h3>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The natural way to sort out which of a batch of updaters will initiate
|
|
|
|
+the expedited grace period is to use the <tt>rcu_node</tt> combining
|
|
|
|
+tree, as implemented by the <tt>exp_funnel_lock()</tt> function.
|
|
|
|
+The first updater corresponding to a given grace period arriving
|
|
|
|
+at a given <tt>rcu_node</tt> structure records its desired grace-period
|
|
|
|
+sequence number in the <tt>->exp_seq_rq</tt> field and moves up
|
|
|
|
+to the next level in the tree.
|
|
|
|
+Otherwise, if the <tt>->exp_seq_rq</tt> field already contains
|
|
|
|
+the sequence number for the desired grace period or some later one,
|
|
|
|
+the updater blocks on one of four wait queues in the
|
|
|
|
+<tt>->exp_wq[]</tt> array, using the second-from-bottom
|
|
|
|
+	and third-from-bottom bits as an index.
|
|
|
|
+An <tt>->exp_lock</tt> field in the <tt>rcu_node</tt> structure
|
|
|
|
+synchronizes access to these fields.
|
|
|
|
+
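<p>
The waitqueue index computation is small enough to show directly: drop
the bottom (grace-period-in-progress) bit and keep the low two bits of
the resulting grace-period number. This sketch matches the indexing
described above, but the function name is illustrative.

```c
#include <assert.h>

/* Sketch of the ->exp_wq[] index: the second-from-bottom and
 * third-from-bottom bits of the desired sequence number select one
 * of the four waitqueues. */
static int exp_wq_index(unsigned long seq)
{
	return (seq >> 1) & 0x3;
}
```

For a desired sequence number of two, this yields index one, which is
why tasks with that sequence number block on <tt>->exp_wq[1]</tt>.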
|
|
|
|
+<p>
|
|
|
|
+An empty <tt>rcu_node</tt> tree is shown in the following diagram,
|
|
|
|
+with the white cells representing the <tt>->exp_seq_rq</tt> field
|
|
|
|
+and the red cells representing the elements of the
|
|
|
|
+<tt>->exp_wq[]</tt> array.
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel0.svg" alt="Funnel0.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The next diagram shows the situation after the arrival of Task A
|
|
|
|
+and Task B at the leftmost and rightmost leaf <tt>rcu_node</tt>
|
|
|
|
+structures, respectively.
|
|
|
|
+The current value of the <tt>rcu_state</tt> structure's
|
|
|
|
+<tt>->expedited_sequence</tt> field is zero, so adding three and
|
|
|
|
+clearing the bottom bit results in the value two, which both tasks
|
|
|
|
+record in the <tt>->exp_seq_rq</tt> field of their respective
|
|
|
|
+<tt>rcu_node</tt> structures:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel1.svg" alt="Funnel1.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Each of Tasks A and B will move up to the root
|
|
|
|
+<tt>rcu_node</tt> structure.
|
|
|
|
+Suppose that Task A wins, recording its desired grace-period sequence
|
|
|
|
+number and resulting in the state shown below:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel2.svg" alt="Funnel2.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Task A now advances to initiate a new grace period, while Task B
|
|
|
|
+moves up to the root <tt>rcu_node</tt> structure, and, seeing that
|
|
|
|
+its desired sequence number is already recorded, blocks on
|
|
|
|
+<tt>->exp_wq[1]</tt>.
|
|
|
|
+
|
|
|
|
+<table>
|
|
|
|
+<tr><th> </th></tr>
|
|
|
|
+<tr><th align="left">Quick Quiz:</th></tr>
|
|
|
|
+<tr><td>
|
|
|
|
+ Why <tt>->exp_wq[1]</tt>?
|
|
|
|
+ Given that the value of these tasks' desired sequence number is
|
|
|
|
+	two, shouldn't they instead block on <tt>->exp_wq[2]</tt>?
|
|
|
|
+</td></tr>
|
|
|
|
+<tr><th align="left">Answer:</th></tr>
|
|
|
|
+<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
|
|
+ No.
|
|
|
|
+
|
|
|
|
+ <p><font color="ffffff">
|
|
|
|
+ Recall that the bottom bit of the desired sequence number indicates
|
|
|
|
+ whether or not a grace period is currently in progress.
|
|
|
|
+ It is therefore necessary to shift the sequence number right one
|
|
|
|
+ bit position to obtain the number of the grace period.
|
|
|
|
+ This results in <tt>->exp_wq[1]</tt>.
|
|
|
|
+</font></td></tr>
|
|
|
|
+<tr><td> </td></tr>
|
|
|
|
+</table>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+If Tasks C and D also arrive at this point, they will compute the
|
|
|
|
+same desired grace-period sequence number, and see that both leaf
|
|
|
|
+<tt>rcu_node</tt> structures already have that value recorded.
|
|
|
|
+They will therefore block on their respective <tt>rcu_node</tt>
|
|
|
|
+structures' <tt>->exp_wq[1]</tt> fields, as shown below:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel3.svg" alt="Funnel3.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Task A now acquires the <tt>rcu_state</tt> structure's
|
|
|
|
+<tt>->exp_mutex</tt> and initiates the grace period, which
|
|
|
|
+increments <tt>->expedited_sequence</tt>.
|
|
|
|
+Therefore, if Tasks E and F arrive, they will compute
|
|
|
|
+a desired sequence number of 4 and will record this value as
|
|
|
|
+shown below:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel4.svg" alt="Funnel4.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Tasks E and F will propagate up the <tt>rcu_node</tt>
|
|
|
|
+combining tree, with Task F blocking on the root <tt>rcu_node</tt>
|
|
|
|
+structure and Task E waiting for Task A to finish so that
|
|
|
|
+it can start the next grace period.
|
|
|
|
+The resulting state is as shown below:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel5.svg" alt="Funnel5.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Once the grace period completes, Task A
|
|
|
|
+starts waking up the tasks waiting for this grace period to complete,
|
|
|
|
+increments the <tt>->expedited_sequence</tt>,
|
|
|
|
+acquires the <tt>->exp_wake_mutex</tt> and then releases the
|
|
|
|
+<tt>->exp_mutex</tt>.
|
|
|
|
+This results in the following state:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel6.svg" alt="Funnel6.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Task E can then acquire <tt>->exp_mutex</tt> and increment
|
|
|
|
+<tt>->expedited_sequence</tt> to the value three.
|
|
|
|
+If new tasks G and H arrive and move up the combining tree at the
|
|
|
|
+same time, the state will be as follows:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel7.svg" alt="Funnel7.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Note that three of the root <tt>rcu_node</tt> structure's
|
|
|
|
+waitqueues are now occupied.
|
|
|
|
+However, at some point, Task A will wake up the
|
|
|
|
+tasks blocked on the <tt>->exp_wq</tt> waitqueues, resulting
|
|
|
|
+in the following state:
|
|
|
|
+
|
|
|
|
+<p><img src="Funnel8.svg" alt="Funnel8.svg" width="75%">
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Execution will continue with Tasks E and H completing
|
|
|
|
+their grace periods and carrying out their wakeups.
|
|
|
|
+
|
|
|
|
+<table>
|
|
|
|
+<tr><th> </th></tr>
|
|
|
|
+<tr><th align="left">Quick Quiz:</th></tr>
|
|
|
|
+<tr><td>
|
|
|
|
+ What happens if Task A takes so long to do its wakeups
|
|
|
|
+ that Task E's grace period completes?
|
|
|
|
+</td></tr>
|
|
|
|
+<tr><th align="left">Answer:</th></tr>
|
|
|
|
+<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
|
|
+ Then Task E will block on the <tt>->exp_wake_mutex</tt>,
|
|
|
|
+ which will also prevent it from releasing <tt>->exp_mutex</tt>,
|
|
|
|
+ which in turn will prevent the next grace period from starting.
|
|
|
|
+ This last is important in preventing overflow of the
|
|
|
|
+ <tt>->exp_wq[]</tt> array.
|
|
|
|
+</font></td></tr>
|
|
|
|
+<tr><td> </td></tr>
|
|
|
|
+</table>
|
|
|
|
+
|
|
|
|
+<h3><a name="Use of Workqueues">Use of Workqueues</a></h3>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+In earlier implementations, the task requesting the expedited
|
|
|
|
+grace period also drove it to completion.
|
|
|
|
+This straightforward approach had the disadvantage of needing to
|
|
|
|
+account for signals sent to user tasks,
|
|
|
|
+so more recent implementations use the Linux kernel's
|
|
|
|
+<a href="https://www.kernel.org/doc/Documentation/workqueue.txt">workqueues</a>.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+The requesting task still does counter snapshotting and funnel-lock
|
|
|
|
+processing, but the task reaching the top of the funnel lock
|
|
|
|
+does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>)
|
|
|
|
+so that a workqueue kthread does the actual grace-period processing.
|
|
|
|
+Because workqueue kthreads do not accept signals, grace-period-wait
|
|
|
|
+processing need not allow for signals.
|
|
|
|
+
|
|
|
|
+<p>
+In addition, this approach allows wakeups for the previous expedited
|
|
|
|
+grace period to be overlapped with processing for the next expedited
|
|
|
|
+grace period.
|
|
|
|
+Because there are only four sets of waitqueues, it is necessary to
|
|
|
|
+ensure that the previous grace period's wakeups complete before the
|
|
|
|
+next grace period's wakeups start.
|
|
|
|
+This is handled by having the <tt>->exp_mutex</tt>
|
|
|
|
+guard expedited grace-period processing and the
|
|
|
|
+<tt>->exp_wake_mutex</tt> guard wakeups.
|
|
|
|
+The key point is that the <tt>->exp_mutex</tt> is not released
|
|
|
|
+until the first wakeup is complete, which means that the
|
|
|
|
+<tt>->exp_wake_mutex</tt> has already been acquired at that point.
|
|
|
|
+This approach ensures that the previous grace period's wakeups can
|
|
|
|
+be carried out while the current grace period is in process, but
|
|
|
|
+that these wakeups will complete before the next grace period starts.
|
|
|
|
+This means that only three waitqueues are required, guaranteeing that
|
|
|
|
+the four that are provided are sufficient.
|
|
|
|
+
|
|
|
|
+<h3><a name="Stall Warnings">Stall Warnings</a></h3>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Expediting grace periods does nothing to speed things up when RCU
|
|
|
|
+readers take too long, and therefore expedited grace periods check
|
|
|
|
+for stalls just as normal grace periods do.
|
|
|
|
+
|
|
|
|
+<table>
|
|
|
|
+<tr><th> </th></tr>
|
|
|
|
+<tr><th align="left">Quick Quiz:</th></tr>
|
|
|
|
+<tr><td>
|
|
|
|
+ But why not just let the normal grace-period machinery
|
|
|
|
+ detect the stalls, given that a given reader must block
|
|
|
|
+ both normal and expedited grace periods?
|
|
|
|
+</td></tr>
|
|
|
|
+<tr><th align="left">Answer:</th></tr>
|
|
|
|
+<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
|
|
+ Because it is quite possible that at a given time there
|
|
|
|
+ is no normal grace period in progress, in which case the
|
|
|
|
+ normal grace period cannot emit a stall warning.
|
|
|
|
+</font></td></tr>
|
|
|
|
+<tr><td> </td></tr>
|
|
|
|
+</table>
|
|
|
|
+
|
|
|
|
+<p>
+The <tt>synchronize_sched_expedited_wait()</tt> function loops waiting
|
|
|
|
+for the expedited grace period to end, but with a timeout set to the
|
|
|
|
+current RCU CPU stall-warning time.
|
|
|
|
+If this time is exceeded, any CPUs or <tt>rcu_node</tt> structures
|
|
|
|
+blocking the current grace period are printed.
|
|
|
|
+Each stall warning results in another pass through the loop, but the
|
|
|
|
+second and subsequent passes use longer stall times.
|
|
|
|
+
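<p>
The shape of that loop can be sketched as follows. This is a model, not
the kernel's code: times are in arbitrary units, and the factor by which
later stall times grow is illustrative rather than the kernel's actual
policy.

```c
#include <assert.h>

/* Sketch of the stall-warning loop: wait with a timeout, emit a
 * warning on each expiry, and lengthen the timeout for subsequent
 * passes. Returns the number of warnings emitted before a grace
 * period of the given duration completes. */
static int stall_warnings(int gp_duration, int first_timeout)
{
	int waited = 0, timeout = first_timeout, warnings = 0;

	while (waited + timeout <= gp_duration) {
		waited += timeout; /* timeout expired before the GP ended */
		warnings++;        /* print blocking CPUs and rcu_node structures */
		timeout *= 3;      /* later passes use longer stall times */
	}
	return warnings;
}
```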
|
|
|
|
+<h3><a name="Summary">
|
|
|
|
+Summary</a></h3>
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Expedited grace periods use a sequence-number approach to promote
|
|
|
|
+batching, so that a single grace-period operation can serve numerous
|
|
|
|
+requests.
|
|
|
|
+A funnel lock is used to efficiently identify the one task out of
|
|
|
|
+a concurrent group that will request the grace period.
|
|
|
|
+All members of the group will block on waitqueues provided in
|
|
|
|
+the <tt>rcu_node</tt> structure.
|
|
|
|
+The actual grace-period processing is carried out by a workqueue.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+CPU-hotplug operations are noted lazily in order to prevent the need
|
|
|
|
+for tight synchronization between expedited grace periods and
|
|
|
|
+CPU-hotplug operations.
|
|
|
|
+The dyntick-idle counters are used to avoid sending IPIs to idle CPUs,
|
|
|
|
+at least in the common case.
|
|
|
|
+RCU-preempt and RCU-sched use different IPI handlers and different
|
|
|
|
+code to respond to the state changes carried out by those handlers,
|
|
|
|
+but otherwise use common code.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+Quiescent states are tracked using the <tt>rcu_node</tt> tree,
|
|
|
|
+and once all necessary quiescent states have been reported,
|
|
|
|
+all tasks waiting on this expedited grace period are awakened.
|
|
|
|
+A pair of mutexes are used to allow one grace period's wakeups
|
|
|
|
+to proceed concurrently with the next grace period's processing.
|
|
|
|
+
|
|
|
|
+<p>
|
|
|
|
+This combination of mechanisms allows expedited grace periods to
|
|
|
|
+run reasonably efficiently.
|
|
|
|
+However, for non-time-critical tasks, normal grace periods should be
|
|
|
|
+used instead because their longer duration permits much higher
|
|
|
|
+degrees of batching, and thus much lower per-request overheads.
|
|
|
|
+
|
|
|
|
+</body></html>
|