|
@@ -194,18 +194,22 @@ There are some minimal guarantees that may be expected of a CPU:
|
|
|
(*) On any given CPU, dependent memory accesses will be issued in order, with
|
|
|
respect to itself. This means that for:
|
|
|
|
|
|
- Q = P; D = *Q;
|
|
|
+ ACCESS_ONCE(Q) = P; smp_read_barrier_depends(); D = ACCESS_ONCE(*Q);
|
|
|
|
|
|
the CPU will issue the following memory operations:
|
|
|
|
|
|
Q = LOAD P, D = LOAD *Q
|
|
|
|
|
|
- and always in that order.
|
|
|
+ and always in that order. On most systems, smp_read_barrier_depends()
|
|
|
+ does nothing, but it is required for DEC Alpha. The ACCESS_ONCE()
|
|
|
+ is required to prevent compiler mischief. Please note that you
|
|
|
+ should normally use something like rcu_dereference() instead of
|
|
|
+ open-coding smp_read_barrier_depends().
|
|
|
|
|
|
(*) Overlapping loads and stores within a particular CPU will appear to be
|
|
|
ordered within that CPU. This means that for:
|
|
|
|
|
|
- a = *X; *X = b;
|
|
|
+ a = ACCESS_ONCE(*X); ACCESS_ONCE(*X) = b;
|
|
|
|
|
|
the CPU will only issue the following sequence of memory operations:
|
|
|
|
|
@@ -213,7 +217,7 @@ There are some minimal guarantees that may be expected of a CPU:
|
|
|
|
|
|
And for:
|
|
|
|
|
|
- *X = c; d = *X;
|
|
|
+ ACCESS_ONCE(*X) = c; d = ACCESS_ONCE(*X);
|
|
|
|
|
|
the CPU will only issue:
|
|
|
|
|
@@ -224,6 +228,41 @@ There are some minimal guarantees that may be expected of a CPU:
|
|
|
|
|
|
And there are a number of things that _must_ or _must_not_ be assumed:
|
|
|
|
|
|
+ (*) It _must_not_ be assumed that the compiler will do what you want with
|
|
|
+ memory references that are not protected by ACCESS_ONCE(). Without
|
|
|
+ ACCESS_ONCE(), the compiler is within its rights to do all sorts
|
|
|
+ of "creative" transformations:
|
|
|
+
|
|
|
+ (-) Repeat the load, possibly getting a different value on the second
|
|
|
+ and subsequent loads. This is especially prone to happen when
|
|
|
+ register pressure is high.
|
|
|
+
|
|
|
+ (-) Merge adjacent loads and stores to the same location. The most
|
|
|
+ familiar example is the transformation from:
|
|
|
+
|
|
|
+ while (a)
|
|
|
+ do_something();
|
|
|
+
|
|
|
+ to something like:
|
|
|
+
|
|
|
+ if (a)
|
|
|
+ for (;;)
|
|
|
+ do_something();
|
|
|
+
|
|
|
+ Using ACCESS_ONCE() as follows prevents this sort of optimization:
|
|
|
+
|
|
|
+ while (ACCESS_ONCE(a))
|
|
|
+ do_something();
|
|
|
+
|
|
|
+ (-) "Store tearing", where a single store in the source code is split
|
|
|
+ into smaller stores in the object code. Note that gcc really
|
|
|
+ will do this on some architectures when storing certain constants.
|
|
|
+ It can be cheaper to do a series of immediate stores than to
|
|
|
+ form the constant in a register and then to store that register.
|
|
|
+
|
|
|
+ (-) "Load tearing", which splits loads in a manner analogous to
|
|
|
+ store tearing.
|
|
|
+
|
|
|
(*) It _must_not_ be assumed that independent loads and stores will be issued
|
|
|
in the order given. This means that for:
|
|
|
|
|
@@ -450,14 +489,14 @@ The usage requirements of data dependency barriers are a little subtle, and
|
|
|
it's not always obvious that they're needed. To illustrate, consider the
|
|
|
following sequence of events:
|
|
|
|
|
|
- CPU 1 CPU 2
|
|
|
- =============== ===============
|
|
|
+ CPU 1 CPU 2
|
|
|
+ =============== ===============
|
|
|
{ A == 1, B == 2, C = 3, P == &A, Q == &C }
|
|
|
B = 4;
|
|
|
<write barrier>
|
|
|
- P = &B
|
|
|
- Q = P;
|
|
|
- D = *Q;
|
|
|
+ ACCESS_ONCE(P) = &B
|
|
|
+ Q = ACCESS_ONCE(P);
|
|
|
+ D = *Q;
|
|
|
|
|
|
There's a clear data dependency here, and it would seem that by the end of the
|
|
|
sequence, Q must be either &A or &B, and that:
|
|
@@ -477,15 +516,15 @@ Alpha).
|
|
|
To deal with this, a data dependency barrier or better must be inserted
|
|
|
between the address load and the data load:
|
|
|
|
|
|
- CPU 1 CPU 2
|
|
|
- =============== ===============
|
|
|
+ CPU 1 CPU 2
|
|
|
+ =============== ===============
|
|
|
{ A == 1, B == 2, C = 3, P == &A, Q == &C }
|
|
|
B = 4;
|
|
|
<write barrier>
|
|
|
- P = &B
|
|
|
- Q = P;
|
|
|
- <data dependency barrier>
|
|
|
- D = *Q;
|
|
|
+ ACCESS_ONCE(P) = &B
|
|
|
+ Q = ACCESS_ONCE(P);
|
|
|
+ <data dependency barrier>
|
|
|
+ D = *Q;
|
|
|
|
|
|
This enforces the occurrence of one of the two implications, and prevents the
|
|
|
third possibility from arising.
|
|
@@ -504,21 +543,22 @@ Another example of where data dependency barriers might be required is where a
|
|
|
number is read from memory and then used to calculate the index for an array
|
|
|
access:
|
|
|
|
|
|
- CPU 1 CPU 2
|
|
|
- =============== ===============
|
|
|
+ CPU 1 CPU 2
|
|
|
+ =============== ===============
|
|
|
{ M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
|
|
|
M[1] = 4;
|
|
|
<write barrier>
|
|
|
- P = 1
|
|
|
- Q = P;
|
|
|
- <data dependency barrier>
|
|
|
- D = M[Q];
|
|
|
+ ACCESS_ONCE(P) = 1
|
|
|
+ Q = ACCESS_ONCE(P);
|
|
|
+ <data dependency barrier>
|
|
|
+ D = M[Q];
|
|
|
|
|
|
|
|
|
-The data dependency barrier is very important to the RCU system, for example.
|
|
|
-See rcu_dereference() in include/linux/rcupdate.h. This permits the current
|
|
|
-target of an RCU'd pointer to be replaced with a new modified target, without
|
|
|
-the replacement target appearing to be incompletely initialised.
|
|
|
+The data dependency barrier is very important to the RCU system,
|
|
|
+for example. See rcu_assign_pointer() and rcu_dereference() in
|
|
|
+include/linux/rcupdate.h. This permits the current target of an RCU'd
|
|
|
+pointer to be replaced with a new modified target, without the replacement
|
|
|
+target appearing to be incompletely initialised.
|
|
|
|
|
|
See also the subsection on "Cache Coherency" for a more thorough example.
|
|
|
|
|
@@ -530,22 +570,23 @@ A control dependency requires a full read memory barrier, not simply a data
|
|
|
dependency barrier to make it work correctly. Consider the following bit of
|
|
|
code:
|
|
|
|
|
|
- q = &a;
|
|
|
+ q = ACCESS_ONCE(a);
|
|
|
if (p) {
|
|
|
<data dependency barrier>
|
|
|
- q = &b;
|
|
|
+ q = ACCESS_ONCE(b);
|
|
|
}
|
|
|
x = *q;
|
|
|
|
|
|
This will not have the desired effect because there is no actual data
|
|
|
-dependency, but rather a control dependency that the CPU may short-circuit by
|
|
|
-attempting to predict the outcome in advance. In such a case what's actually
|
|
|
-required is:
|
|
|
+dependency, but rather a control dependency that the CPU may short-circuit
|
|
|
+by attempting to predict the outcome in advance, so that other CPUs see
|
|
|
+the load from b as having happened before the load from a. In such a
|
|
|
+case what's actually required is:
|
|
|
|
|
|
- q = &a;
|
|
|
+ q = ACCESS_ONCE(a);
|
|
|
if (p) {
|
|
|
<read barrier>
|
|
|
- q = &b;
|
|
|
+ q = ACCESS_ONCE(b);
|
|
|
}
|
|
|
x = *q;
|
|
|
|
|
@@ -561,23 +602,23 @@ barrier, though a general barrier would also be viable. Similarly a read
|
|
|
barrier or a data dependency barrier should always be paired with at least an
|
|
|
write barrier, though, again, a general barrier is viable:
|
|
|
|
|
|
- CPU 1 CPU 2
|
|
|
- =============== ===============
|
|
|
- a = 1;
|
|
|
+ CPU 1 CPU 2
|
|
|
+ =============== ===============
|
|
|
+ ACCESS_ONCE(a) = 1;
|
|
|
<write barrier>
|
|
|
- b = 2; x = b;
|
|
|
- <read barrier>
|
|
|
- y = a;
|
|
|
+ ACCESS_ONCE(b) = 2; x = ACCESS_ONCE(b);
|
|
|
+ <read barrier>
|
|
|
+ y = ACCESS_ONCE(a);
|
|
|
|
|
|
Or:
|
|
|
|
|
|
- CPU 1 CPU 2
|
|
|
- =============== ===============================
|
|
|
+ CPU 1 CPU 2
|
|
|
+ =============== ===============================
|
|
|
a = 1;
|
|
|
<write barrier>
|
|
|
- b = &a; x = b;
|
|
|
- <data dependency barrier>
|
|
|
- y = *x;
|
|
|
+ ACCESS_ONCE(b) = &a; x = ACCESS_ONCE(b);
|
|
|
+ <data dependency barrier>
|
|
|
+ y = *x;
|
|
|
|
|
|
Basically, the read barrier always has to be there, even though it can be of
|
|
|
the "weaker" type.
|
|
@@ -586,13 +627,13 @@ the "weaker" type.
|
|
|
match the loads after the read barrier or the data dependency barrier, and vice
|
|
|
versa:
|
|
|
|
|
|
- CPU 1 CPU 2
|
|
|
- =============== ===============
|
|
|
- a = 1; }---- --->{ v = c
|
|
|
- b = 2; } \ / { w = d
|
|
|
- <write barrier> \ <read barrier>
|
|
|
- c = 3; } / \ { x = a;
|
|
|
- d = 4; }---- --->{ y = b;
|
|
|
+ CPU 1 CPU 2
|
|
|
+ =================== ===================
|
|
|
+ ACCESS_ONCE(a) = 1; }---- --->{ v = ACCESS_ONCE(c);
|
|
|
+ ACCESS_ONCE(b) = 2; } \ / { w = ACCESS_ONCE(d);
|
|
|
+ <write barrier> \ <read barrier>
|
|
|
+ ACCESS_ONCE(c) = 3; } / \ { x = ACCESS_ONCE(a);
|
|
|
+ ACCESS_ONCE(d) = 4; }---- --->{ y = ACCESS_ONCE(b);
|
|
|
|
|
|
|
|
|
EXAMPLES OF MEMORY BARRIER SEQUENCES
|
|
@@ -1435,12 +1476,12 @@ three CPUs; then should the following sequence of events occur:
|
|
|
|
|
|
CPU 1 CPU 2
|
|
|
=============================== ===============================
|
|
|
- *A = a; *E = e;
|
|
|
+ ACCESS_ONCE(*A) = a; ACCESS_ONCE(*E) = e;
|
|
|
LOCK M LOCK Q
|
|
|
- *B = b; *F = f;
|
|
|
- *C = c; *G = g;
|
|
|
+ ACCESS_ONCE(*B) = b; ACCESS_ONCE(*F) = f;
|
|
|
+ ACCESS_ONCE(*C) = c; ACCESS_ONCE(*G) = g;
|
|
|
UNLOCK M UNLOCK Q
|
|
|
- *D = d; *H = h;
|
|
|
+ ACCESS_ONCE(*D) = d; ACCESS_ONCE(*H) = h;
|
|
|
|
|
|
Then there is no guarantee as to what order CPU 3 will see the accesses to *A
|
|
|
through *H occur in, other than the constraints imposed by the separate locks
|
|
@@ -1460,17 +1501,17 @@ However, if the following occurs:
|
|
|
|
|
|
CPU 1 CPU 2
|
|
|
=============================== ===============================
|
|
|
- *A = a;
|
|
|
- LOCK M [1]
|
|
|
- *B = b;
|
|
|
- *C = c;
|
|
|
- UNLOCK M [1]
|
|
|
- *D = d; *E = e;
|
|
|
- LOCK M [2]
|
|
|
- *F = f;
|
|
|
- *G = g;
|
|
|
- UNLOCK M [2]
|
|
|
- *H = h;
|
|
|
+ ACCESS_ONCE(*A) = a;
|
|
|
+ LOCK M [1]
|
|
|
+ ACCESS_ONCE(*B) = b;
|
|
|
+ ACCESS_ONCE(*C) = c;
|
|
|
+ UNLOCK M [1]
|
|
|
+ ACCESS_ONCE(*D) = d; ACCESS_ONCE(*E) = e;
|
|
|
+ LOCK M [2]
|
|
|
+ ACCESS_ONCE(*F) = f;
|
|
|
+ ACCESS_ONCE(*G) = g;
|
|
|
+ UNLOCK M [2]
|
|
|
+ ACCESS_ONCE(*H) = h;
|
|
|
|
|
|
CPU 3 might see:
|
|
|
|
|
@@ -2177,11 +2218,11 @@ A programmer might take it for granted that the CPU will perform memory
|
|
|
operations in exactly the order specified, so that if the CPU is, for example,
|
|
|
given the following piece of code to execute:
|
|
|
|
|
|
- a = *A;
|
|
|
- *B = b;
|
|
|
- c = *C;
|
|
|
- d = *D;
|
|
|
- *E = e;
|
|
|
+ a = ACCESS_ONCE(*A);
|
|
|
+ ACCESS_ONCE(*B) = b;
|
|
|
+ c = ACCESS_ONCE(*C);
|
|
|
+ d = ACCESS_ONCE(*D);
|
|
|
+ ACCESS_ONCE(*E) = e;
|
|
|
|
|
|
they would then expect that the CPU will complete the memory operation for each
|
|
|
instruction before moving on to the next one, leading to a definite sequence of
|
|
@@ -2228,12 +2269,12 @@ However, it is guaranteed that a CPU will be self-consistent: it will see its
|
|
|
_own_ accesses appear to be correctly ordered, without the need for a memory
|
|
|
barrier. For instance with the following code:
|
|
|
|
|
|
- U = *A;
|
|
|
- *A = V;
|
|
|
- *A = W;
|
|
|
- X = *A;
|
|
|
- *A = Y;
|
|
|
- Z = *A;
|
|
|
+ U = ACCESS_ONCE(*A);
|
|
|
+ ACCESS_ONCE(*A) = V;
|
|
|
+ ACCESS_ONCE(*A) = W;
|
|
|
+ X = ACCESS_ONCE(*A);
|
|
|
+ ACCESS_ONCE(*A) = Y;
|
|
|
+ Z = ACCESS_ONCE(*A);
|
|
|
|
|
|
and assuming no intervention by an external influence, it can be assumed that
|
|
|
the final result will appear to be:
|
|
@@ -2250,7 +2291,12 @@ accesses:
|
|
|
|
|
|
in that order, but, without intervention, the sequence may have almost any
|
|
|
combination of elements combined or discarded, provided the program's view of
|
|
|
-the world remains consistent.
|
|
|
+the world remains consistent. Note that ACCESS_ONCE() is -not- optional
|
|
|
+in the above example, as there are architectures where a given CPU might
|
|
|
+interchange successive loads to the same location. On such architectures,
|
|
|
+ACCESS_ONCE() does whatever is necessary to prevent this, for example, on
|
|
|
+Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the
|
|
|
+special ld.acq and st.rel instructions that prevent such reordering.
|
|
|
|
|
|
The compiler may also combine, discard or defer elements of the sequence before
|
|
|
the CPU even sees them.
|
|
@@ -2264,13 +2310,13 @@ may be reduced to:
|
|
|
|
|
|
*A = W;
|
|
|
|
|
|
-since, without a write barrier, it can be assumed that the effect of the
|
|
|
-storage of V to *A is lost. Similarly:
|
|
|
+since, without either a write barrier or an ACCESS_ONCE(), it can be
|
|
|
+assumed that the effect of the storage of V to *A is lost. Similarly:
|
|
|
|
|
|
*A = Y;
|
|
|
Z = *A;
|
|
|
|
|
|
-may, without a memory barrier, be reduced to:
|
|
|
+may, without a memory barrier or an ACCESS_ONCE(), be reduced to:
|
|
|
|
|
|
*A = Y;
|
|
|
Z = Y;
|