Copyright (c) 2015-2016 Linaro Ltd.

This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.

Introduction
============

This document outlines the design for multi-threaded TCG system-mode
emulation. The current user-mode emulation mirrors the thread
structure of the translated executable. Some of the work will be
applicable to both system and linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This will be enabled by default for all FE/BE
combinations that have had the required work done to support this
safely.

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are involved in looking up the next translation block (TB) to
execute. They include:

  tb_jmp_cache (per-vCPU, cache of recent jumps)
  tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page this code
is critical to performance, as looking up the next TB to execute is
the most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention to do it.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
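To make the hot-path concrete, here is a minimal sketch of the lookup
it implies. This is not the actual QEMU code: tb_jmp_cache,
tb_jmp_cache_hash_func(), tb_htable_lookup() and qatomic_read/set()
are real QEMU names, but their exact signatures and the jump-cache
layout differ between versions, so treat the whole function as
illustrative:

    /* Illustrative sketch only, not the actual QEMU code: signatures
     * and the jump-cache layout are simplified. */
    static TranslationBlock *lookup_tb(CPUState *cpu, target_ulong pc,
                                       target_ulong cs_base,
                                       uint32_t flags, uint32_t cflags)
    {
        unsigned int hash = tb_jmp_cache_hash_func(pc);

        /* Racy but safe read: a stale or NULL entry just means a
         * slower lookup, never a wrong TB being executed. */
        TranslationBlock *tb = qatomic_read(&cpu->tb_jmp_cache[hash]);

        if (tb == NULL || tb->pc != pc ||
            tb->cs_base != cs_base || tb->flags != flags) {
            /* Fall back to the shared QHT hash, designed for lockless
             * concurrent lookup; locks are only needed if we have to
             * generate a new block. */
            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cflags);
            if (tb == NULL) {
                return NULL;    /* caller takes locks and translates */
            }
            qatomic_set(&cpu->tb_jmp_cache[hash], tb);
        }
        return tb;
    }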
Global TCG State
----------------

### User-mode emulation
We need to protect the entire code generation cycle including any
post-generation patching of the translated code. This also implies a
shared translation buffer which contains code running on all cores.
Any execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

### !User-mode emulation
Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and start
from scratch again. Some operations also force a full flush of
translations, including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due to a
change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now-invalid blocks need their jump patches reverted so they will
return to the C code.

There are a number of look-up caches that need to be properly updated,
including:

  - the jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
  - safely patch/revert direct jumps
  - remove central PageDesc lookup entries
  - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of
tb->jmp_lock, where tb is the destination block of a jump. Each origin
block keeps a pointer to its destinations so that the appropriate lock
can be acquired before iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.

The lookup caches are updated atomically and the lookup hash uses QHT,
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronisation point across threads, thereby ensuring that we
only keep track of a single TranslationBlock for each guest code block.
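The following is a minimal sketch of the async_safe_run_on_cpu()
pattern referred to above. The names schedule_full_flush() and
do_flush_work() are invented for illustration (in QEMU proper this
logic lives behind tb_flush()); async_safe_run_on_cpu(),
run_on_cpu_data and RUN_ON_CPU_NULL are the real interfaces:

    /* Hypothetical wrapper illustrating the "safe work" pattern. */
    static void do_flush_work(CPUState *cpu, run_on_cpu_data data)
    {
        /* Every vCPU has left its execution loop at this point, so
         * the shared code buffer, the QHT hash and the per-vCPU jump
         * caches can be reset without racing other threads. */
    }

    void schedule_full_flush(CPUState *cpu)
    {
        /* Queued as "safe work": it only runs once all vCPUs are
         * quiescent, and they restart afterwards. */
        async_safe_run_on_cpu(cpu, do_flush_work, RUN_ON_CPU_NULL);
    }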
Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will
allow a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - cross-vCPU TLB flush may need other vCPUs brought to a halt
    - change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state when
it next runs its work (a few instructions later).

A new set of operations (tlb_flush_*_all_cpus) take an additional flag
which, when set, will force synchronisation by setting the source
vCPU's work as "safe work" and exiting the cpu run loop. This ensures
that by the time execution restarts all flush operations have
completed.

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.
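As an example of the deferred, synchronised flush described above, the
sketch below shows how a hypothetical helper could invalidate a guest
page across all vCPUs. tlb_flush_page_all_cpus_synced() is a real QEMU
entry point, though its exact signature and the address type have
varied between releases; helper_invalidate_page() and the surrounding
details are illustrative:

    /* Illustrative helper: flush one page from every vCPU's TLB and
     * do not resume the calling vCPU until all flushes are done. */
    void helper_invalidate_page(CPUArchState *env, target_ulong page_addr)
    {
        CPUState *cs = env_cpu(env);

        /* Remote vCPUs get the flush queued as async work; the source
         * vCPU's part is queued as "safe work", so by the time it
         * executes its next instruction every TLB is consistent. */
        tlb_flush_page_all_cpus_synced(cs, page_addr);
    }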
Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is
automatically protected by the global iothread mutex, also known as
the BQL (Big QEMU Lock). Any IO region that doesn't use the global
mutex is expected to do its own locking.

However, IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model-specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or just to loads or stores.

The Linux kernel has an excellent write-up on the various forms of
memory barrier and the guarantees they can provide [1].

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core, non-SMP, strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example, all x86 loads/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
  - host systems with stronger implied guarantees can skip some barriers
  - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated:

  - target-i386
  - target-arm
  - target-aarch64
  - target-alpha
  - target-mips
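For example, a front end converted in this way emits something like
the following when translating a guest full-fence instruction.
tcg_gen_mb() and the TCG_MO_*/TCG_BAR_* flags are the real TCG
interfaces; the surrounding translator function is only a sketch:

    /* Sketch of a front-end translator emitting a full barrier. */
    static void gen_guest_full_fence(DisasContext *dc)
    {
        /* Order all earlier loads/stores against all later ones.  The
         * op is only emitted when generating parallel (MTTCG) code,
         * and strongly ordered backends may turn it into a NOP. */
        tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
    }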
Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which offer a
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair, where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions (see the sketch at the end of this
document). While they are susceptible to the ABA problem, so far
common guests have not implemented patterns where this may be a
problem - typically they present a locking ABI which assumes
cmpxchg-like semantics.

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.

==========

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
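To illustrate the point above about ldrex/strex emulation and the ABA
problem, here is a conceptual plain-C sketch (not the generated TCG
ops) of a store-exclusive built on a cmpxchg-style atomic. The
function name is invented and the exclusive_val parameter is modelled
loosely on target/arm's exclusive-access state; qatomic_cmpxchg() is
QEMU's real compare-and-swap wrapper:

    /* Conceptual sketch only: store-exclusive via compare-and-swap.
     * The store succeeds only if memory still holds the value seen at
     * the load-exclusive - which is exactly why an A->B->A change in
     * between (the ABA problem) would go unnoticed. */
    static bool emulate_store_exclusive(uint32_t *host_addr,
                                        uint32_t exclusive_val,
                                        uint32_t new_val)
    {
        /* qatomic_cmpxchg() returns the value previously in memory. */
        uint32_t seen = qatomic_cmpxchg(host_addr, exclusive_val, new_val);

        return seen == exclusive_val;  /* true => strex reports success */
    }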