Parallel Checkout Design Notes
==============================

The "Parallel Checkout" feature attempts to use multiple processes to
parallelize the work of uncompressing the blobs, applying in-core
filters, and writing the resulting contents to the working tree during a
checkout operation. It can be used by all checkout-related commands,
such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.

These commands share the following basic structure:

* Step 1: Read the current index file into memory.

* Step 2: Modify the in-memory index based upon the command, and
  temporarily mark all cache entries that need to be updated.

* Step 3: Populate the working tree to match the new candidate index.
  This includes iterating over all of the to-be-updated cache entries
  and deleting, creating, or overwriting the associated files in the
  working tree.

* Step 4: Write the new index to disk.

Step 3 is the focus of the "parallel checkout" effort described here.

Sequential Implementation
-------------------------

For the purposes of discussion here, the current sequential
implementation of Step 3 is divided into three parts, each one
implemented in its own function:

* Step 3a: `unpack-trees.c:check_updates()` contains a series of
  sequential loops iterating over the array of `cache_entry` structs.
  The main loop in this function calls the Step 3b function for each of
  the to-be-updated entries.

* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
  for file conflicts, collisions, and unsaved changes. It removes files
  and creates leading directories as necessary. It calls the Step 3c
  function for each entry to be written.

* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
  it if necessary, creates the file in the working tree, writes the
  smudged contents, calls `fstat()` or `lstat()`, and updates the
  associated `cache_entry` struct with the stat information gathered.

It wouldn't be safe to perform Step 3b in parallel, as there could be
race conditions between file creations and removals. Instead, the
parallel checkout framework lets the sequential code handle Step 3b,
and uses parallel workers to replace the sequential
`entry.c:write_entry()` calls from Step 3c.

Rejected Multi-Threaded Solution
--------------------------------

The most "straightforward" implementation would be to spread the set of
to-be-updated cache entries across multiple threads. But due to the
thread-unsafe functions in the object database code, we would have to
use locks to coordinate the parallel operation. An early prototype of
this solution showed that the multi-threaded checkout would bring
performance improvements over the sequential code, but there was still
too much lock contention. Profiling with `perf` indicated that around
20% of the runtime during a local Linux clone (on an SSD) was spent in
locking functions. For this reason, this approach was rejected in favor
of using multiple child processes, which led to better performance.

Multi-Process Solution
----------------------

Parallel checkout alters the aforementioned Step 3 to use multiple
`checkout--worker` background processes to distribute the work. The
long-running worker processes are controlled by the foreground Git
command using the existing run-command API.

Overview
~~~~~~~~

Step 3b is only slightly altered; for each entry to be checked out, the
main process performs the following steps:

* M1: Check whether there is any untracked or unclean file in the
  working tree which would be overwritten by this entry, and decide
  whether to proceed (removing the file(s)) or not.

* M2: Create the leading directories.

* M3: Load the conversion attributes for the entry's path.

* M4: Check, based on the entry's type and conversion attributes,
  whether the entry is eligible for parallel checkout (more on this
  later). If it is eligible, enqueue the entry and the loaded
  attributes to later write the entry in parallel. If not, write the
  entry right away, using the default sequential code.

Note: we save the conversion attributes associated with each entry
because the workers don't have access to the main process's index state,
so they can't load the attributes by themselves (and the attributes are
needed to properly smudge the entry). Additionally, this has a positive
impact on performance as (1) we don't need to load the attributes twice
and (2) the attributes machinery is optimized to handle paths in
sequential order.

After all entries have passed through the above steps, the main process
checks if the number of enqueued entries is sufficient to spread among
the workers. If not, it just writes them sequentially. Otherwise, it
spawns the workers and distributes the queued entries uniformly in
contiguous chunks. This aims to minimize the chances of two workers
writing to the same directory simultaneously, which could increase lock
contention in the kernel.

Then, for each assigned item, each worker:

* W1: Checks if there is any non-directory file in the leading part of
  the entry's path or if there already exists a file at the entry's
  path. If so, marks the entry with `PC_ITEM_COLLIDED` and skips it
  (more on this later).

* W2: Creates the file (with O_CREAT and O_EXCL).

* W3: Loads the blob into memory (inflating and delta reconstructing
  it).

* W4: Applies any required in-process filter, like end-of-line
  conversion and re-encoding.

* W5: Writes the result to the file descriptor opened at W2.

* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends
  the result back to the main process, together with the end status of
  the operation and the item's identification number.

Note that, when possible, steps W3 to W5 are delegated to the streaming
machinery, removing the need to keep the entire blob in memory.

If the worker fails to read the blob or to write it to the working tree,
it removes the created file to avoid leaving empty files behind. This is
the *only* time a worker is allowed to remove a file.

As mentioned earlier, it is the responsibility of the main process to
remove any file that blocks the checkout operation (or abort if the
removal(s) would cause data loss and the user didn't ask to `--force`).
This is crucial to avoid race conditions and also to properly detect
path collisions at Step W1.

After the workers finish writing the items and sending back the required
information, the main process handles the results in two steps:

- First, it updates the in-memory index with the `lstat()` information
  sent by the workers. (This must be done first as this information
  might be required in the following step.)

- Then it writes the items which collided on disk (i.e. items marked
  with `PC_ITEM_COLLIDED`). More on this below.

Path Collisions
---------------

Path collisions happen when two different paths correspond to the same
entry in the file system. E.g. the paths 'a' and 'A' would collide in a
case-insensitive file system.
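
As a rough illustration of the 'a' versus 'A' example, the sketch below
compares two paths while ignoring ASCII case. The function name
`pc_sketch_paths_collide_icase()` is hypothetical and not part of Git.
Real file systems may also fold non-ASCII characters or apply Unicode
normalization, which this sketch does not model.

```c
#include <assert.h>
#include <strings.h>

/*
 * Hypothetical helper, not Git code: returns 1 if the two paths would
 * name the same entry on a file system that ignores ASCII case, and 0
 * otherwise. Identical paths trivially compare equal as well.
 */
int pc_sketch_paths_collide_icase(const char *a, const char *b)
{
	assert(a && b);
	/* strcasecmp() is the POSIX case-insensitive string comparison. */
	return strcasecmp(a, b) == 0;
}
```

With this helper, 'a' and 'A' are reported as colliding, while 'a' and
'b' are not.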

The sequential checkout deals with collisions in the same way that it
deals with files that were already present in the working tree before
checkout. Basically, it checks if the path that it wants to write
already exists on disk, makes sure the existing file doesn't have
unsaved data, and then overwrites it. (To be more pedantic: it deletes
the existing file and creates the new one.) So, if there are multiple
colliding files to be checked out, the sequential code will write each
one of them but only the last will actually survive on disk.

Parallel checkout aims to reproduce the same behavior. However, we
cannot let the workers racily write to the same file on disk. Instead,
the workers detect when the entry that they want to check out would
collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
Later, the main process can sequentially feed these entries back to
`checkout_entry()` without the risk of race conditions. On clone, this
also has the effect of marking the colliding entries to later emit a
warning for the user, like the classic sequential checkout does.

The workers are able to detect both collisions among the entries being
concurrently written and collisions between a parallel-eligible entry
and an ineligible entry. The general idea for collision detection is
quite straightforward: for each parallel-eligible entry, the main
process must remove all files that prevent this entry from being written
(before enqueueing it). This includes any non-directory file in the
leading path of the entry. Later, when a worker gets assigned the entry,
it looks again for the non-directory files and for an already existing
file at the entry's path. If any of these checks finds something, the
worker knows that there was a path collision.
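
The worker-side check described above could be sketched roughly as
follows, using only POSIX calls. This is an illustration under stated
assumptions, not Git's implementation: the names `pc_sketch_collides()`,
`pc_sketch_create()` and `PC_SKETCH_PATH_MAX` are hypothetical, and
treating a missing leading directory as an error (rather than a
collision) is a choice made for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define PC_SKETCH_PATH_MAX 4096 /* arbitrary bound for this sketch */

/*
 * Hypothetical helper: returns 1 if something on disk collides with
 * `path`, 0 if the path is free, and -1 on unexpected errors.
 */
int pc_sketch_collides(const char *path)
{
	char prefix[PC_SKETCH_PATH_MAX];
	struct stat st;
	size_t i, len;

	assert(path);
	len = strlen(path);
	if (len == 0 || len >= sizeof(prefix))
		return -1;

	/* Every leading component must be a real directory (not a symlink). */
	for (i = 1; i < len; i++) {
		if (path[i] != '/')
			continue;
		memcpy(prefix, path, i);
		prefix[i] = '\0';
		if (lstat(prefix, &st) < 0)
			return -1; /* the main process should have created it */
		if (!S_ISDIR(st.st_mode))
			return 1;
	}

	/* Anything already present at the final path is also a collision. */
	if (lstat(path, &st) == 0)
		return 1;
	return errno == ENOENT ? 0 : -1;
}

/*
 * Hypothetical helper: creates the file exclusively. If another entry
 * created the same path in the meantime, open() fails with EEXIST,
 * which a caller could also treat as a collision.
 */
int pc_sketch_create(const char *path, mode_t mode)
{
	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
}
```

Because `lstat()` does not follow symbolic links, a symlink in the
leading path counts as a non-directory here. The exclusive `open()`
closes the window between the check and the file creation.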

Because parallel checkout can distinguish path collisions from the case
where the file was already present in the working tree before checkout,
we could alternatively choose to skip the checkout of colliding entries.
However, each entry that doesn't get written would have NULL `lstat()`
fields on the index. This could cause performance penalties for
subsequent commands that need to refresh the index, as they would have
to go to the file system to see if the entry is dirty. Thus, if we have
N entries in a colliding group and we decide to write and `lstat()` only
one of them, every subsequent `git-status` will have to read, convert,
and hash the written file N - 1 times. By checking out all colliding
entries (like the sequential code does), we only pay the overhead once,
during checkout.

Eligible Entries for Parallel Checkout
--------------------------------------

As previously mentioned, not all entries passed to `checkout_entry()`
will be considered eligible for parallel checkout. More specifically, we
exclude:

- Symbolic links, to avoid race conditions that, in combination with
  path collisions, could cause workers to write files at the wrong
  place. For example, if we were to concurrently check out a symlink
  'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
  we could potentially end up writing the file 'A/f' at 'a/f', due to a
  race condition.

- Regular files that require external filters (either "one shot" filters
  or long-running process filters). These filters are black boxes to Git
  and may have their own internal locking or non-concurrent assumptions.
  So it might not be safe to run multiple instances in parallel.
+
Besides, long-running filters may use the delayed checkout feature to
postpone the return of some filtered blobs.
The delayed checkout queue
and the parallel checkout queue are not compatible and should remain
separate.
+
Note: regular files that only require internal filters, like end-of-line
conversion and re-encoding, are eligible for parallel checkout.

Ineligible entries are checked out by the classic sequential codepath
*before* spawning workers.

Note: submodules' files are also eligible for parallel checkout (as
long as they don't fall into any of the excluding categories mentioned
above). But since each submodule is checked out in its own child
process, we don't mix the superproject's and the submodules' files in
the same parallel checkout process or queue.

The API
-------

The parallel checkout API was designed with the goal of minimizing
changes to the current users of the checkout machinery. This means that
they don't have to call a different function for sequential or parallel
checkout. As already mentioned, `checkout_entry()` will automatically
insert the given entry in the parallel checkout queue when this feature
is enabled and the entry is eligible; otherwise, it will just write the
entry right away, using the sequential code. In general, callers of the
parallel checkout API should look similar to this:

----------------------------------------------
int pc_workers, pc_threshold, err = 0;
struct checkout state;

get_parallel_checkout_configs(&pc_workers, &pc_threshold);

/*
 * This check is not strictly required, but it
 * should save some time in sequential mode.
 */
if (pc_workers > 1)
	init_parallel_checkout();

for (each cache_entry ce to-be-updated)
	err |= checkout_entry(ce, &state, NULL, NULL);

err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
----------------------------------------------
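
The snippet above hands `pc_workers` and `pc_threshold` to
`run_parallel_checkout()`. As a rough sketch of how such values could
drive the Overview's choice between sequential fallback and uniform
chunks, consider the following. The names `pc_sketch_plan()` and
`struct pc_sketch_chunk` are hypothetical, and the exact comparisons are
assumptions made for illustration; Git's own rules may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: a half-open run of queue items for one worker. */
struct pc_sketch_chunk {
	size_t start; /* index of the first queued item */
	size_t nr;    /* number of consecutive items */
};

/*
 * Hypothetical planner. Returns the number of workers to spawn, or 0
 * when the queue should simply be written sequentially. `chunks` must
 * have room for `nr_workers` elements.
 */
size_t pc_sketch_plan(size_t nr_items, size_t nr_workers,
		      size_t threshold, struct pc_sketch_chunk *chunks)
{
	size_t base, extra, i, start = 0;

	assert(chunks);

	/* Never use more workers than there are items. */
	if (nr_workers > nr_items)
		nr_workers = nr_items;

	/* Assumed rule: too few workers or items means no parallelism. */
	if (nr_workers <= 1 || nr_items < threshold)
		return 0;

	/* Spread items evenly; the first `extra` workers get one more. */
	base = nr_items / nr_workers;
	extra = nr_items % nr_workers;
	for (i = 0; i < nr_workers; i++) {
		chunks[i].start = start;
		chunks[i].nr = base + (i < extra ? 1 : 0);
		start += chunks[i].nr;
	}
	return nr_workers;
}
```

For example, 10 queued items, 4 workers, and a threshold of 5 would
yield chunks of 3, 3, 2, and 2 items. Keeping each chunk consecutive
follows the stated goal of keeping workers out of each other's
directories.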