
atomics: update documentation

Some of the constraints on operand sizes have been relaxed, so adjust the
documentation.

Deprecate atomic_mb_read and atomic_mb_set; it is not really possible to
use them correctly because they do not interoperate with sequentially-consistent
RMW operations.

Finally, extend the memory barrier pairing section to cover acquire and
release semantics in general, roughly based on the KVM Forum 2016 talk,
"<atomic.h> weapons".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

+243 -182
docs/devel/atomics.rst
···
 The most basic tool is locking.  Mutexes, condition variables and
 semaphores are used in QEMU, and should be the default approach to
 synchronization.  Anything else is considerably harder, but it's
-also justified more often than one would like.  The two tools that
-are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.
+also justified more often than one would like;
+the most performance-critical parts of QEMU in particular require
+a very low level approach to concurrency, involving memory barriers
+and atomic operations.  The semantics of concurrent memory accesses are governed
+by the C11 memory model.

-Macros defined by ``qemu/atomic.h`` fall in three camps:
+QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
+provide better portability and a less verbose syntax.  ``qemu/atomic.h``
+provides macros that fall in three camps:

 - compiler barriers: ``barrier()``;

···

 - sequentially consistent atomic access: everything else.

+In general, use of ``qemu/atomic.h`` should be wrapped with more easily
+used data structures (e.g. the lock-free singly-linked list operations
+``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
+primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
+atomic operations and memory barriers should be limited to inter-thread
+checking of flags and documented thoroughly.
+
+

 Compiler memory barrier
 =======================

-``barrier()`` prevents the compiler from moving the memory accesses either
-side of it to the other side.  The compiler barrier has no direct effect
-on the CPU, which may then reorder things however it wishes.
+``barrier()`` prevents the compiler from moving the memory accesses on
+either side of it to the other side.  The compiler barrier has no direct
+effect on the CPU, which may then reorder things however it wishes.

 ``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
 architectures, CPU guarantees are strong enough that blocking compiler
···
     typeof(*ptr) atomic_cmpxchg(ptr, old, new)

 all of which return the old value of ``*ptr``.  These operations are
-polymorphic; they operate on any type that is as wide as a pointer.
+polymorphic; they operate on any type that is as wide as a pointer or
+smaller.

 Similar operations return the new value of ``*ptr``::

···
     typeof(*ptr) atomic_or_fetch(ptr, val)
     typeof(*ptr) atomic_xor_fetch(ptr, val)

-Sequentially consistent loads and stores can be done using::
-
-    atomic_fetch_add(ptr, 0) for loads
-    atomic_xchg(ptr, val) for stores
-
-However, they are quite expensive on some platforms, notably POWER and
-Arm.  Therefore, qemu/atomic.h provides two primitives with slightly
-weaker constraints::
+``qemu/atomic.h`` also provides loads and stores that cannot be reordered
+with each other::

     typeof(*ptr) atomic_mb_read(ptr)
     void atomic_mb_set(ptr, val)

-The semantics of these primitives map to Java volatile variables,
-and are strongly related to memory barriers as used in the Linux
-kernel (see below).
+However these do not provide sequential consistency and, in particular,
+they do not participate in the total ordering enforced by
+sequentially-consistent operations.  For this reason they are deprecated.
+They should instead be replaced with any of the following (ordered from
+easiest to hardest):

-As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
-be reordered with each other, and it is also not possible to reorder
-"normal" accesses around them.
+- accesses inside a mutex or spinlock

-However, and this is the important difference between
-atomic_mb_read/atomic_mb_set and sequential consistency, it is important
-for both threads to access the same volatile variable.  It is not the
-case that everything visible to thread A when it writes volatile field f
-becomes visible to thread B after it reads volatile field g.  The store
-and load have to "match" (i.e., be performed on the same volatile
-field) to achieve the right semantics.
+- lightweight synchronization primitives such as ``QemuEvent``

+- RCU operations (``atomic_rcu_read``, ``atomic_rcu_set``) when publishing
+  or accessing a new version of a data structure

-These operations operate on any type that is as wide as an int or smaller.
+- other atomic accesses: ``atomic_read`` and ``atomic_load_acquire`` for
+  loads, ``atomic_set`` and ``atomic_store_release`` for stores, ``smp_mb``
+  to forbid reordering subsequent loads before a store.


 Weak atomic access and manual memory barriers
···
 Compared to sequentially consistent atomic access, programming with
 weaker consistency models can be considerably more complicated.
-In general, if the algorithm you are writing includes both writes
-and reads on the same side, it is generally simpler to use sequentially
-consistent primitives.
+The only guarantees that you can rely upon in this case are:
+
+- atomic accesses will not cause data races (and hence undefined behavior);
+  ordinary accesses instead cause data races if they are concurrent with
+  other accesses of which at least one is a write.  In order to ensure this,
+  the compiler will not optimize accesses out of existence, create unsolicited
+  accesses, or perform other similar optimizations.
+
+- acquire operations will appear to happen, with respect to the other
+  components of the system, before all the LOAD or STORE operations
+  specified afterwards.
+
+- release operations will appear to happen, with respect to the other
+  components of the system, after all the LOAD or STORE operations
+  specified before.
+
+- release operations will *synchronize with* acquire operations;
+  see :ref:`acqrel` for a detailed explanation.

 When using this model, variables are accessed with:

···

 - ``atomic_store_release()``, which guarantees the STORE to appear to
   happen, with respect to the other components of the system,
-  after all the LOAD or STORE operations specified afterwards.
+  after all the LOAD or STORE operations specified before.
   Operations coming after ``atomic_store_release()`` can still be
-  reordered after it.
+  reordered before it.

 Restrictions to the ordering of accesses can also be specified
 using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
···
 dependency and a full read barrier or better is required.


-This is the set of barriers that is required *between* two ``atomic_read()``
-and ``atomic_set()`` operations to achieve sequential consistency:
+Memory barriers and ``atomic_load_acquire``/``atomic_store_release`` are
+mostly used when a data structure has one thread that is always a writer
+and one thread that is always a reader:

-  +----------------+-------------------------------------------------------+
-  |                | 2nd operation                                         |
-  |                +------------------+-----------------+------------------+
-  | 1st operation  | (after last)     | atomic_read     | atomic_set       |
-  +----------------+------------------+-----------------+------------------+
-  | (before first) | ..               | none            | smp_mb_release() |
-  +----------------+------------------+-----------------+------------------+
-  | atomic_read    | smp_mb_acquire() | smp_rmb() [1]_  | [2]_             |
-  +----------------+------------------+-----------------+------------------+
-  | atomic_set     | none             | smp_mb() [3]_   | smp_wmb()        |
-  +----------------+------------------+-----------------+------------------+
+  +----------------------------------+----------------------------------+
+  | thread 1                         | thread 2                         |
+  +==================================+==================================+
+  | ::                               | ::                               |
+  |                                  |                                  |
+  |   atomic_store_release(&a, x);   |   y = atomic_load_acquire(&b);   |
+  |   atomic_store_release(&b, y);   |   x = atomic_load_acquire(&a);   |
+  +----------------------------------+----------------------------------+
+
+In this case, correctness is easy to check for using the "pairing"
+trick that is explained below.

-.. [1] Or smp_read_barrier_depends().
+Sometimes, a thread is accessing many variables that are otherwise
+unrelated to each other (for example because, apart from the current
+thread, exactly one other thread will read or write each of these
+variables).  In this case, it is possible to "hoist" the barriers
+outside a loop.  For example:

-.. [2] This requires a load-store barrier.  This is achieved by
-       either smp_mb_acquire() or smp_mb_release().
+  +------------------------------------------+----------------------------------+
+  | before                                   | after                            |
+  +==========================================+==================================+
+  | ::                                       | ::                               |
+  |                                          |                                  |
+  |     n = 0;                               |     n = 0;                       |
+  |     for (i = 0; i < 10; i++)             |     for (i = 0; i < 10; i++)     |
+  |       n += atomic_load_acquire(&a[i]);   |       n += atomic_read(&a[i]);   |
+  |                                          |     smp_mb_acquire();            |
+  +------------------------------------------+----------------------------------+
+  | ::                                       | ::                               |
+  |                                          |                                  |
+  |                                          |     smp_mb_release();            |
+  |     for (i = 0; i < 10; i++)             |     for (i = 0; i < 10; i++)     |
+  |       atomic_store_release(&a[i], false);|       atomic_set(&a[i], false);  |
+  +------------------------------------------+----------------------------------+

-.. [3] This requires a store-load barrier.  On most machines, the only
-       way to achieve this is a full barrier.
+Splitting a loop can also be useful to reduce the number of barriers:

+  +------------------------------------------+----------------------------------+
+  | before                                   | after                            |
+  +==========================================+==================================+
+  | ::                                       | ::                               |
+  |                                          |                                  |
+  |     n = 0;                               |     smp_mb_release();            |
+  |     for (i = 0; i < 10; i++) {           |     for (i = 0; i < 10; i++)     |
+  |       atomic_store_release(&a[i], false);|       atomic_set(&a[i], false);  |
+  |       smp_mb();                          |     smp_mb();                    |
+  |       n += atomic_read(&b[i]);           |     n = 0;                       |
+  |     }                                    |     for (i = 0; i < 10; i++)     |
+  |                                          |       n += atomic_read(&b[i]);   |
+  +------------------------------------------+----------------------------------+

-You can see that the two possible definitions of ``atomic_mb_read()``
-and ``atomic_mb_set()`` are the following:
+In this case, a ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer
+as well) ``smp_wmb()``:

-  1) | atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
+  +------------------------------------------+----------------------------------+
+  | before                                   | after                            |
+  +==========================================+==================================+
+  | ::                                       | ::                               |
+  |                                          |                                  |
+  |                                          |     smp_mb_release();            |
+  |     for (i = 0; i < 10; i++) {           |     for (i = 0; i < 10; i++)     |
+  |       atomic_store_release(&a[i], false);|       atomic_set(&a[i], false);  |
+  |       atomic_store_release(&b[i], false);|     smp_wmb();                   |
+  |     }                                    |     for (i = 0; i < 10; i++)     |
+  |                                          |       atomic_set(&b[i], false);  |
+  +------------------------------------------+----------------------------------+

-  2) | atomic_mb_read(p)   = smp_mb() atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
-
-Usually the former is used, because ``smp_mb()`` is expensive and a program
-normally has more reads than writes.  Therefore it makes more sense to
-make ``atomic_mb_set()`` the more expensive operation.
+.. _acqrel:

-There are two common cases in which atomic_mb_read and atomic_mb_set
-generate too many memory barriers, and thus it can be useful to manually
-place barriers, or use atomic_load_acquire/atomic_store_release instead:
+Acquire/release pairing and the *synchronizes-with* relation
+------------------------------------------------------------

-- when a data structure has one thread that is always a writer
-  and one thread that is always a reader, manual placement of
-  memory barriers makes the write side faster.  Furthermore,
-  correctness is easy to check for in this case using the "pairing"
-  trick that is explained below:
+Atomic operations other than ``atomic_set()`` and ``atomic_read()`` have
+either *acquire* or *release* semantics [#rmw]_.  This has two effects:

-  +----------------------------------------------------------------------+
-  | thread 1                                                             |
-  +-----------------------------------+----------------------------------+
-  | before                            | after                            |
-  +===================================+==================================+
-  | ::                                | ::                               |
-  |                                   |                                  |
-  |   (other writes)                  |                                  |
-  |   atomic_mb_set(&a, x)            |   atomic_store_release(&a, x)    |
-  |   atomic_mb_set(&b, y)            |   atomic_store_release(&b, y)    |
-  +-----------------------------------+----------------------------------+
+.. [#rmw] Read-modify-write operations can have both---acquire applies to the
+          read part, and release to the write.

-  +----------------------------------------------------------------------+
-  | thread 2                                                             |
-  +-----------------------------------+----------------------------------+
-  | before                            | after                            |
-  +===================================+==================================+
-  | ::                                | ::                               |
-  |                                   |                                  |
-  |   y = atomic_mb_read(&b)          |   y = atomic_load_acquire(&b)    |
-  |   x = atomic_mb_read(&a)          |   x = atomic_load_acquire(&a)    |
-  |   (other reads)                   |                                  |
-  +-----------------------------------+----------------------------------+
+- within a thread, they are ordered either before subsequent operations
+  (for acquire) or after previous operations (for release).

-  Note that the barrier between the stores in thread 1, and between
-  the loads in thread 2, has been optimized here to a write or a
-  read memory barrier respectively.  On some architectures, notably
-  ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
-  smp_mb, but smp_rmb and/or smp_wmb are more efficient.
+- if a release operation in one thread *synchronizes with* an acquire operation
+  in another thread, the ordering constraints propagate from the first to the
+  second thread.  That is, everything before the release operation in the
+  first thread is guaranteed to *happen before* everything after the
+  acquire operation in the second thread.

-- sometimes, a thread is accessing many variables that are otherwise
-  unrelated to each other (for example because, apart from the current
-  thread, exactly one other thread will read or write each of these
-  variables).  In this case, it is possible to "hoist" the implicit
-  barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
-  a loop.  For example, the above definition ``atomic_mb_read()`` gives
-  the following transformation:
+The concept of acquire and release semantics is not exclusive to atomic
+operations; almost all higher-level synchronization primitives also have
+acquire or release semantics.  For example:

-  +-----------------------------------+----------------------------------+
-  | before                            | after                            |
-  +===================================+==================================+
-  | ::                                | ::                               |
-  |                                   |                                  |
-  |   n = 0;                          |   n = 0;                         |
-  |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-  |     n += atomic_mb_read(&a[i]);   |     n += atomic_read(&a[i]);     |
-  |                                   |   smp_mb_acquire();              |
-  +-----------------------------------+----------------------------------+
+- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
+  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
+  same mutex.

-Similarly, atomic_mb_set() can be transformed as follows:
+- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
+  ``pthread_cond_wait`` has both release semantics (synchronizing with
+  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
+  ``pthread_mutex_unlock`` and signaling of the condition variable).

-  +-----------------------------------+----------------------------------+
-  | before                            | after                            |
-  +===================================+==================================+
-  | ::                                | ::                               |
-  |                                   |                                  |
-  |                                   |   smp_mb_release();              |
-  |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-  |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
-  |                                   |   smp_mb();                      |
-  +-----------------------------------+----------------------------------+
+- ``pthread_create`` has release semantics and synchronizes with the start
+  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
+  with the exiting of the thread.

+- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
+  acquire semantics.

-The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.
+For example, in the following code there are no atomic accesses, but still
+thread 2 is relying on the *synchronizes-with* relation between ``pthread_exit``
+(release) and ``pthread_join`` (acquire):

-The two tricks can be combined.  In this case, splitting a loop in
-two lets you hoist the barriers out of the loops _and_ eliminate the
-expensive ``smp_mb()``:
+  +----------------------+-------------------------------+
+  | thread 1             | thread 2                      |
+  +======================+===============================+
+  | ::                   | ::                            |
+  |                      |                               |
+  |   *a = 1;            |                               |
+  |   pthread_exit(a);   |   pthread_join(thread1, &a);  |
+  |                      |   x = *a;                     |
+  +----------------------+-------------------------------+

-  +-----------------------------------+----------------------------------+
-  | before                            | after                            |
-  +===================================+==================================+
-  | ::                                | ::                               |
-  |                                   |                                  |
-  |                                   |   smp_mb_release();              |
-  |   for (i = 0; i < 10; i++) {      |   for (i = 0; i < 10; i++)       |
-  |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
-  |     atomic_mb_set(&b[i], false);  |   smb_wmb();                     |
-  |   }                               |   for (i = 0; i < 10; i++)       |
-  |                                   |     atomic_set(&a[i], false);    |
-  |                                   |   smp_mb();                      |
-  +-----------------------------------+----------------------------------+
+Synchronization between threads basically descends from this pairing of
+a release operation and an acquire operation.  Therefore, atomic operations
+other than ``atomic_set()`` and ``atomic_read()`` will almost always be
+paired with another operation of the opposite kind: an acquire operation
+will pair with a release operation and vice versa.  This rule of thumb is
+extremely useful; in the case of QEMU, however, note that the other
+operation may actually be in a driver that runs in the guest!

+``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
+``atomic_load_acquire()`` and ``atomic_rcu_read()`` all count
+as acquire operations.  ``smp_wmb()``, ``smp_mb_release()``,
+``atomic_store_release()`` and ``atomic_rcu_set()`` all count as release
+operations.  ``smp_mb()`` counts as both acquire and release, therefore
+it can pair with any other atomic operation.  Here is an example:

-Memory barrier pairing
-----------------------
+  +----------------------+------------------------------+
+  | thread 1             | thread 2                     |
+  +======================+==============================+
+  | ::                   | ::                           |
+  |                      |                              |
+  |   atomic_set(&a, 1); |                              |
+  |   smp_wmb();         |                              |
+  |   atomic_set(&b, 2); |   x = atomic_read(&b);       |
+  |                      |   smp_rmb();                 |
+  |                      |   y = atomic_read(&a);       |
+  +----------------------+------------------------------+

-A useful rule of thumb is that memory barriers should always, or almost
-always, be paired with another barrier.  In the case of QEMU, however,
-note that the other barrier may actually be in a driver that runs in
-the guest!
+Note that a load-store pair only counts if the two operations access the
+same variable: that is, a store-release on a variable ``x`` *synchronizes
+with* a load-acquire on a variable ``x``, while a release barrier
+synchronizes with any acquire operation.  The following example shows
+correct synchronization:

-For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
-both count as read barriers.  A read barrier shall pair with a write
-barrier or a full barrier; a write barrier shall pair with a read
-barrier or a full barrier.  A full barrier can pair with anything.
-For example:
+  +--------------------------------+--------------------------------+
+  | thread 1                       | thread 2                       |
+  +================================+================================+
+  | ::                             | ::                             |
+  |                                |                                |
+  |   atomic_set(&a, 1);           |                                |
+  |   atomic_store_release(&b, 2); |   x = atomic_load_acquire(&b); |
+  |                                |   y = atomic_read(&a);         |
+  +--------------------------------+--------------------------------+

-  +--------------------+------------------------------+
-  | thread 1           | thread 2                     |
-  +====================+==============================+
-  | ::                 | ::                           |
-  |                    |                              |
-  |   a = 1;           |                              |
-  |   smp_wmb();       |                              |
-  |   b = 2;           |   x = b;                     |
-  |                    |   smp_rmb();                 |
-  |                    |   y = a;                     |
-  +--------------------+------------------------------+
+Acquire and release semantics of higher-level primitives can also be
+relied upon for the purpose of establishing the *synchronizes with*
+relation.

 Note that the "writing" thread is accessing the variables in the
 opposite order as the "reading" thread.  This is expected: stores
-before the write barrier will normally match the loads after the
-read barrier, and vice versa.  The same is true for more than 2
-access and for data dependency barriers:
+before a release operation will normally match the loads after
+the acquire operation, and vice versa.  In fact, this happened already
+in the ``pthread_exit``/``pthread_join`` example above.
+
+Finally, this more complex example has more than two accesses and data
+dependency barriers.  It also does not use atomic accesses whenever there
+cannot be a data race:

   +----------------------+------------------------------+
   | thread 1             | thread 2                     |
···
   | smp_wmb();           |                              |
   | x->i = 2;            |                              |
   | smp_wmb();           |                              |
-  | a = x;               | x = a;                       |
+  | atomic_set(&a, x);   | x = atomic_read(&a);         |
   |                      | smp_read_barrier_depends();  |
   |                      | y = x->i;                    |
   |                      | smp_read_barrier_depends();  |
   |                      | z = b[y];                    |
   +----------------------+------------------------------+

-``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``.
-and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
-
-
-Comparison with Linux kernel memory barriers
-============================================
+Comparison with Linux kernel primitives
+=======================================

 Here is a list of differences between Linux kernel atomic operations
 and memory barriers, and the equivalents in QEMU:
···
   ``atomic_cmpxchg`` returns the old value of the variable
   ===================== =========================================

-  In QEMU, the second kind does not exist.  Currently Linux has
-  atomic_fetch_or only.  QEMU provides and, or, inc, dec, add, sub.
+  In QEMU, the second kind is named ``atomic_OP_fetch``.

 - different atomic read-modify-write operations in Linux imply
   a different set of memory barriers; in QEMU, all of them enforce
-  sequential consistency, which means they imply full memory barriers
-  before and after the operation.
+  sequential consistency.
+
+- in QEMU, ``atomic_read()`` and ``atomic_set()`` do not participate in
+  the total ordering enforced by sequentially-consistent operations.
+  This is because QEMU uses the C11 memory model.  The following example
+  is correct in Linux but not in QEMU:
+
+  +----------------------------------+--------------------------------+
+  | Linux (correct)                  | QEMU (incorrect)               |
+  +==================================+================================+
+  | ::                               | ::                             |
+  |                                  |                                |
+  |   a = atomic_fetch_add(&x, 2);   |   a = atomic_fetch_add(&x, 2); |
+  |   b = READ_ONCE(&y);             |   b = atomic_read(&y);         |
+  +----------------------------------+--------------------------------+
+
+  because the read of ``y`` can be moved (by either the processor or the
+  compiler) before the write of ``x``.

-- Linux does not have an equivalent of ``atomic_mb_set()``.  In particular,
-  note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
-  ``atomic_mb_read()`` compiles to the same instructions as Linux's
-  ``smp_load_acquire()``, but this should be treated as an implementation
-  detail.
+  Fixing this requires an ``smp_mb()`` memory barrier between the write
+  of ``x`` and the read of ``y``.  In the common case where only one thread
+  writes ``x``, it is also possible to write it like this:
+
+  +--------------------------------+
+  | QEMU (correct)                 |
+  +================================+
+  | ::                             |
+  |                                |
+  |   a = atomic_read(&x);         |
+  |   atomic_set(&x, a + 2);       |
+  |   smp_mb();                    |
+  |   b = atomic_read(&y);         |
+  +--------------------------------+

 Sources
 =======