9 years ago · 19d6f04cd3
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -232,7 +232,7 @@ And there are a number of things that _must_ or _must_not_ be assumed:
      with memory references that are not protected by READ_ONCE() and
      WRITE_ONCE().  Without them, the compiler is within its rights to
      do all sorts of "creative" transformations, which are covered in
-     the Compiler Barrier section.
+     the COMPILER BARRIER section.
 
  (*) It _must_not_ be assumed that independent loads and stores will be issued
      in the order given.  This means that for:
@@ -555,6 +555,30 @@ between the address load and the data load:
 This enforces the occurrence of one of the two implications, and prevents the
 third possibility from arising.
 
+A data-dependency barrier must also order against dependent writes:
+
+	CPU 1		      CPU 2
+	===============	      ===============
+	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
+	B = 4;
+	<write barrier>
+	WRITE_ONCE(P, &B);
+			      Q = READ_ONCE(P);
+			      <data dependency barrier>
+			      *Q = 5;
+
+The data-dependency barrier must order the read into Q with the store
+into *Q.  This prohibits this outcome:
+
+	(Q == B) && (B == 4)
+
+Please note that this pattern should be rare.  After all, the whole point
+of dependency ordering is to -prevent- writes to the data structure, along
+with the expensive cache misses associated with those writes.  This pattern
+can be used to record rare error conditions and the like, and the ordering
+prevents such records from being lost.
+
+
 [!] Note that this extremely counterintuitive situation arises most easily on
 machines with split caches, so that, for example, one cache bank processes
 even-numbered cache lines and the other bank processes odd-numbered cache
@@ -565,21 +589,6 @@ odd-numbered bank is idle, one can see the new value of the pointer P (&B),
 but the old value of the variable B (2).
 
 
-Another example of where data dependency barriers might be required is where a
-number is read from memory and then used to calculate the index for an array
-access:
-
-	CPU 1		      CPU 2
-	===============	      ===============
-	{ M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
-	M[1] = 4;
-	<write barrier>
-	WRITE_ONCE(P, 1);
-			      Q = READ_ONCE(P);
-			      <data dependency barrier>
-			      D = M[Q];
-
-
 The data dependency barrier is very important to the RCU system,
 for example.  See rcu_assign_pointer() and rcu_dereference() in
 include/linux/rcupdate.h.  This permits the current target of an RCU'd
@@ -800,9 +809,13 @@ In summary:
       use smp_rmb(), smp_wmb(), or, in the case of prior stores and
       later loads, smp_mb().
 
-  (*) If both legs of the "if" statement begin with identical stores
-      to the same variable, a barrier() statement is required at the
-      beginning of each leg of the "if" statement.
+  (*) If both legs of the "if" statement begin with identical stores to
+      the same variable, then those stores must be ordered, either by
+      preceding both of them with smp_mb() or by using smp_store_release()
+      to carry out the stores.  Please note that it is -not- sufficient
+      to use barrier() at beginning of each leg of the "if" statement,
+      as optimizing compilers do not necessarily respect barrier()
+      in this case.
 
   (*) Control dependencies require at least one run-time conditional
       between the prior load and the subsequent store, and this
@@ -814,7 +827,7 @@ In summary:
   (*) Control dependencies require that the compiler avoid reordering the
       dependency into nonexistence.  Careful use of READ_ONCE() or
       atomic{,64}_read() can help to preserve your control dependency.
-      Please see the Compiler Barrier section for more information.
+      Please see the COMPILER BARRIER section for more information.
 
   (*) Control dependencies pair normally with other types of barriers.
 
@@ -1257,7 +1270,7 @@ TRANSITIVITY
 
 Transitivity is a deeply intuitive notion about ordering that is not
 always provided by real computer systems.  The following example
-demonstrates transitivity (also called "cumulativity"):
+demonstrates transitivity:
 
 	CPU 1			CPU 2			CPU 3
 	=======================	=======================	=======================
@@ -1305,8 +1318,86 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes.
 General barriers are therefore required to ensure that all CPUs agree
 on the combined order of CPU 1's and CPU 2's accesses.
 
-To reiterate, if your code requires transitivity, use general barriers
-throughout.
+General barriers provide "global transitivity", so that all CPUs will
+agree on the order of operations.  In contrast, a chain of release-acquire
+pairs provides only "local transitivity", so that only those CPUs on
+the chain are guaranteed to agree on the combined order of the accesses.
+For example, switching to C code in deference to Herman Hollerith:
+
+	int u, v, x, y, z;
+
+	void cpu0(void)
+	{
+		r0 = smp_load_acquire(&x);
+		WRITE_ONCE(u, 1);
+		smp_store_release(&y, 1);
+	}
+
+	void cpu1(void)
+	{
+		r1 = smp_load_acquire(&y);
+		r4 = READ_ONCE(v);
+		r5 = READ_ONCE(u);
+		smp_store_release(&z, 1);
+	}
+
+	void cpu2(void)
+	{
+		r2 = smp_load_acquire(&z);
+		smp_store_release(&x, 1);
+	}
+
+	void cpu3(void)
+	{
+		WRITE_ONCE(v, 1);
+		smp_mb();
+		r3 = READ_ONCE(u);
+	}
+
+Because cpu0(), cpu1(), and cpu2() participate in a local transitive
+chain of smp_store_release()/smp_load_acquire() pairs, the following
+outcome is prohibited:
+
+	r0 == 1 && r1 == 1 && r2 == 1
+
+Furthermore, because of the release-acquire relationship between cpu0()
+and cpu1(), cpu1() must see cpu0()'s writes, so that the following
+outcome is prohibited:
+
+	r1 == 1 && r5 == 0
+
+However, the transitivity of release-acquire is local to the participating
+CPUs and does not apply to cpu3().  Therefore, the following outcome
+is possible:
+
+	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
+
+As an aside, the following outcome is also possible:
+
+	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
+
+Although cpu0(), cpu1(), and cpu2() will see their respective reads and
+writes in order, CPUs not involved in the release-acquire chain might
+well disagree on the order.  This disagreement stems from the fact that
+the weak memory-barrier instructions used to implement smp_load_acquire()
+and smp_store_release() are not required to order prior stores against
+subsequent loads in all cases.  This means that cpu3() can see cpu0()'s
+store to u as happening -after- cpu1()'s load from v, even though
+both cpu0() and cpu1() agree that these two operations occurred in the
+intended order.
+
+However, please keep in mind that smp_load_acquire() is not magic.
+In particular, it simply reads from its argument with ordering.  It does
+-not- ensure that any particular value will be read.  Therefore, the
+following outcome is possible:
+
+	r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
+
+Note that this outcome can happen even on a mythical sequentially
+consistent system where nothing is ever reordered.
+
+To reiterate, if your code requires global transitivity, use general
+barriers throughout.
 
 
 ========================
@@ -1459,7 +1550,7 @@ of optimizations:
      the following:
 
 	a = 0;
-	/* Code that does not store to variable a. */
+	... Code that does not store to variable a ...
 	a = 0;
 
      The compiler sees that the value of variable 'a' is already zero, so
@@ -1471,7 +1562,7 @@ of optimizations:
      wrong guess:
 
 	WRITE_ONCE(a, 0);
-	/* Code that does not store to variable a. */
+	... Code that does not store to variable a ...
 	WRITE_ONCE(a, 0);
 
  (*) The compiler is within its rights to reorder memory accesses unless
			
--- a/include/asm-generic/bitops/lock.h
+++ b/include/asm-generic/bitops/lock.h
@@ -29,16 +29,16 @@ do {					\
  * @nr: the bit to set
  * @addr: the address to start counting from
  *
- * This operation is like clear_bit_unlock, however it is not atomic.
- * It does provide release barrier semantics so it can be used to unlock
- * a bit lock, however it would only be used if no other CPU can modify
- * any bits in the memory until the lock is released (a good example is
- * if the bit lock itself protects access to the other bits in the word).
+ * A weaker form of clear_bit_unlock() as used by __bit_lock_unlock(). If all
+ * the bits in the word are protected by this lock some archs can use weaker
+ * ops to safely unlock.
+ *
+ * See for example x86's implementation.
  */
 #define __clear_bit_unlock(nr, addr)	\
 do {					\
-	smp_mb();			\
-	__clear_bit(nr, addr);		\
+	smp_mb__before_atomic();	\
+	clear_bit(nr, addr);		\
 } while (0)
 
 #endif /* _ASM_GENERIC_BITOPS_LOCK_H_ */