Merge branch 'rj/doc-technical-fixes' · freshlybakedca.ke/git@411903c

+1

Documentation/Makefile

···
123 TECH_DOCS += technical/commit-graph
124 TECH_DOCS += technical/directory-rename-detection
125 TECH_DOCS += technical/hash-function-transition
126 TECH_DOCS += technical/long-running-process-protocol
127 TECH_DOCS += technical/multi-pack-index
128 TECH_DOCS += technical/packfile-uri

···
123 TECH_DOCS += technical/commit-graph
124 TECH_DOCS += technical/directory-rename-detection
125 TECH_DOCS += technical/hash-function-transition
126 + TECH_DOCS += technical/large-object-promisors
127 TECH_DOCS += technical/long-running-process-protocol
128 TECH_DOCS += technical/multi-pack-index
129 TECH_DOCS += technical/packfile-uri

+19 -10

Documentation/technical/commit-graph.adoc

··· 39 Values 1-4 satisfy the requirements of parse_commit_gently(). 40 41 There are two definitions of generation number: 42 1. Corrected committer dates (generation number v2) 43 2. Topological levels (generation number v1) 44 ··· 158 we enable fast writes of new commit data without rewriting the entire commit 159 history -- at least, most of the time. 160 161 - ## File Layout 162 163 A commit-graph chain uses multiple files, and we use a fixed naming convention 164 to organize these files. Each commit-graph file has a name ··· 170 171 For example, if the `commit-graph-chain` file contains the lines 172 173 - ``` 174 {hash0} 175 {hash1} 176 {hash2} 177 - ``` 178 179 then the commit-graph chain looks like the following diagram: 180 ··· 213 `graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains 214 `{hash0}` and `{hash1}`. 215 216 - ## Merging commit-graph files 217 218 If we only added a new commit-graph file on every write, we would run into a 219 linear search problem through many commit-graph files. Instead, we use a merge ··· 225 the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` 226 file. 227 228 +---------------------+ 229 | | 230 | (new commits) | ··· 250 | | 251 | | 252 +-----------------------+ 253 254 During this process, the commits to write are combined, sorted and we write the 255 contents to a temporary file, all while holding a `commit-graph-chain.lock` ··· 257 according to the computed `{hash3}`. Finally, we write the new chain data to 258 `commit-graph-chain.lock`: 259 260 - ``` 261 {hash3} 262 {hash0} 263 - ``` 264 265 We then close the lock-file. 266 267 - ## Merge Strategy 268 269 When writing a set of commits that do not exist in the commit-graph stack of 270 height N, we default to creating a new file at level N + 1. We then decide to ··· 289 number of commits) could be extracted into config settings for full 290 flexibility. 
291 292 - ## Handling Mixed Generation Number Chains 293 294 With the introduction of generation number v2 and generation data chunk, the 295 following scenario is possible: ··· 318 rewriting split commit-graph as a single file (`--split=replace`) creates a 319 single layer with corrected commit dates. 320 321 - ## Deleting graph-{hash} files 322 323 After a new tip file is written, some `graph-{hash}` files may no longer 324 be part of a chain. It is important to remove these files from disk, eventually. ··· 333 defaults to zero, but can be changed using command-line arguments or a config 334 setting. 335 336 - ## Chains across multiple object directories 337 338 In a repo with alternates, we look for the `commit-graph-chain` file starting 339 in the local object directory and then in each alternate. The first file that

··· 39 Values 1-4 satisfy the requirements of parse_commit_gently(). 40 41 There are two definitions of generation number: 42 + 43 1. Corrected committer dates (generation number v2) 44 2. Topological levels (generation number v1) 45 ··· 159 we enable fast writes of new commit data without rewriting the entire commit 160 history -- at least, most of the time. 161 162 + File Layout 163 + ~~~~~~~~~~~ 164 165 A commit-graph chain uses multiple files, and we use a fixed naming convention 166 to organize these files. Each commit-graph file has a name ··· 172 173 For example, if the `commit-graph-chain` file contains the lines 174 175 + ---- 176 {hash0} 177 {hash1} 178 {hash2} 179 + ---- 180 181 then the commit-graph chain looks like the following diagram: 182 ··· 215 `graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains 216 `{hash0}` and `{hash1}`. 217 218 + Merging commit-graph files 219 + ~~~~~~~~~~~~~~~~~~~~~~~~~~ 220 221 If we only added a new commit-graph file on every write, we would run into a 222 linear search problem through many commit-graph files. Instead, we use a merge ··· 228 the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` 229 file. 230 231 + .... 232 +---------------------+ 233 | | 234 | (new commits) | ··· 254 | | 255 | | 256 +-----------------------+ 257 + .... 258 259 During this process, the commits to write are combined, sorted and we write the 260 contents to a temporary file, all while holding a `commit-graph-chain.lock` ··· 262 according to the computed `{hash3}`. Finally, we write the new chain data to 263 `commit-graph-chain.lock`: 264 265 + ---- 266 {hash3} 267 {hash0} 268 + ---- 269 270 We then close the lock-file. 271 272 + Merge Strategy 273 + ~~~~~~~~~~~~~~ 274 275 When writing a set of commits that do not exist in the commit-graph stack of 276 height N, we default to creating a new file at level N + 1. 
We then decide to ··· 295 number of commits) could be extracted into config settings for full 296 flexibility. 297 298 + Handling Mixed Generation Number Chains 299 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 300 301 With the introduction of generation number v2 and generation data chunk, the 302 following scenario is possible: ··· 325 rewriting split commit-graph as a single file (`--split=replace`) creates a 326 single layer with corrected commit dates. 327 328 + Deleting graph-\{hash\} files 329 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 330 331 After a new tip file is written, some `graph-{hash}` files may no longer 332 be part of a chain. It is important to remove these files from disk, eventually. ··· 341 defaults to zero, but can be changed using command-line arguments or a config 342 setting. 343 344 + Chains across multiple object directories 345 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 346 347 In a repo with alternates, we look for the `commit-graph-chain` file starting 348 in the local object directory and then in each alternate. The first file that
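To make the naming convention quoted in this hunk concrete, here is a minimal Python sketch that maps the lines of a `commit-graph-chain` file to the layer file names the text describes. The function names `chain_layers` and `base_hashes` are illustrative assumptions, not part of Git; the only facts taken from the document are the one-hash-per-line chain file, the `graph-{hash}.graph` naming, and the statement that each layer records the hashes of the layers listed before it.

```python
from typing import List


def _hashes(chain_text: str) -> List[str]:
    """Return the non-empty lines of a commit-graph-chain file, in order."""
    return [line.strip() for line in chain_text.splitlines() if line.strip()]


def chain_layers(chain_text: str) -> List[str]:
    """Map each hash in the chain file to the name of its layer file."""
    return [f"graph-{h}.graph" for h in _hashes(chain_text)]


def base_hashes(chain_text: str, index: int) -> List[str]:
    """Hashes that the layer at position `index` is expected to record,
    i.e. every hash listed before it in the chain file."""
    return _hashes(chain_text)[:index]


if __name__ == "__main__":
    # Placeholder hashes, written the same way the document writes them.
    chain = "{hash0}\n{hash1}\n{hash2}\n"
    print(chain_layers(chain))
    print(base_hashes(chain, 2))
```

With the three-line chain from the example, the layer for `{hash2}` would record `{hash0}` and `{hash1}`, which matches the description in the hunk.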

+32 -32

Documentation/technical/large-object-promisors.adoc

··· 34 35 https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/ 36 37 - 0) Non goals 38 - ------------ 39 40 - We will not discuss those client side improvements here, as they 41 would require changes in different parts of Git than this effort. ··· 90 even more to host content with larger blobs or more large blobs 91 than currently. 92 93 - I) Issues with the current situation 94 - ------------------------------------ 95 96 - Some statistics made on GitLab repos have shown that more than 75% 97 of the disk space is used by blobs that are larger than 1MB and ··· 138 complaining that these tools require significant effort to set up, 139 learn and use correctly. 140 141 - II) Main features of the "Large Object Promisors" solution 142 - ---------------------------------------------------------- 143 144 The main features below should give a rough overview of how the 145 solution may work. Details about needed elements can be found in ··· 166 other objects. 167 168 Note 1 169 - ++++++ 170 171 To clarify, a LOP is a normal promisor remote, except that: 172 ··· 178 itself. 179 180 Note 2 181 - ++++++ 182 183 Git already makes it possible for a main remote to also be a promisor 184 remote storing both regular objects and large blobs for a client that ··· 186 to avoid that. 187 188 Rationale 189 - +++++++++ 190 191 LOPs aim to be good at handling large blobs while main remotes are 192 already good at handling other objects. 193 194 Implementation 195 - ++++++++++++++ 196 197 Git already has support for multiple promisor remotes, see 198 link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation]. ··· 213 underlying object storage appear like a remote to Git. 214 215 Note 216 - ++++ 217 218 A LOP can be a promisor remote accessed using a remote helper by 219 both some clients and the main remote. 220 221 Rationale 222 - +++++++++ 223 224 This looks like the simplest way to create LOPs that can cheaply 225 handle many large blobs. 
226 227 Implementation 228 - ++++++++++++++ 229 230 Remote helpers are quite easy to write as shell scripts, but it might 231 be more efficient and maintainable to write them using other languages ··· 247 storage for large files handled by Git LFS. 248 249 Rationale 250 - +++++++++ 251 252 This would simplify the server side if it wants to both use a LOP and 253 act as a Git LFS server. ··· 259 LOP all its blobs with a size over a configurable threshold. 260 261 Rationale 262 - +++++++++ 263 264 This makes it easy to set things up and to clean things up. For 265 example, an admin could use this to manually convert a repo not using ··· 268 to regularly make sure the large blobs are moved to the LOP. 269 270 Implementation 271 - ++++++++++++++ 272 273 Using something based on `git repack --filter=...` to separate the 274 blobs we want to offload from the other Git objects could be a good ··· 284 perhaps pushed, into it. 285 286 Rationale 287 - +++++++++ 288 289 A main remote containing many oversize blobs would defeat the purpose 290 of LOPs. 291 292 Implementation 293 - ++++++++++++++ 294 295 The way to offload to a LOP discussed in 4) above can be used to 296 regularly offload oversize blobs. About preventing oversize blobs from ··· 326 fetch those blobs from the LOP to be able to serve the client. 327 328 Note 329 - ++++ 330 331 For fetches instead of clones, a protocol negotiation might not always 332 happen, see the "What about fetches?" FAQ entry below for details. 333 334 Rationale 335 - +++++++++ 336 337 Security, configurability and efficiency of setting things up. 338 339 Implementation 340 - ++++++++++++++ 341 342 A "promisor-remote" protocol v2 capability looks like a good way to 343 implement this. The way the client and server use this capability ··· 356 but might not need anymore, to the LOP. 
357 358 Note 359 - ++++ 360 361 It might depend on the context if it should be OK or not for clients 362 to offload large blobs they have created, instead of fetched, directly ··· 367 implementing this feature. 368 369 Rationale 370 - +++++++++ 371 372 On the client, the easiest way to deal with unneeded large blobs is to 373 offload them. 374 375 Implementation 376 - ++++++++++++++ 377 378 This is very similar to what 4) above is about, except on the client 379 side instead of the server side. So a good solution to 4) could likely ··· 385 a LOP, it is likely, and can easily be confirmed, that the LOP still 386 has them, so that they can just be removed from the client. 387 388 - III) Benefits of using LOPs 389 - --------------------------- 390 391 Many benefits are related to the issues discussed in "I) Issues with 392 the current situation" above: ··· 406 407 - Reduced storage needs on the client side. 408 409 - IV) FAQ 410 - ------- 411 412 What about using multiple LOPs on the server and client side? 413 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ··· 533 on a promisor remote. 534 535 Regular fetch 536 - +++++++++++++ 537 538 In a regular fetch, the client will contact the main remote and a 539 protocol negotiation will happen between them. It's a good thing that ··· 551 using, or not using, the same LOP(s) as last time. 552 553 "Backfill" or "lazy" fetch 554 - ++++++++++++++++++++++++++ 555 556 When there is a backfill fetch, the client doesn't necessarily contact 557 the main remote first. It will try to fetch from its promisor remotes ··· 576 token when performing a protocol negotiation with the main remote (see 577 section II.6 above). 578 579 - V) Future improvements 580 - ---------------------- 581 582 It is expected that at the beginning using LOPs will be mostly worth 583 it either in a corporate context where the Git version that clients

··· 34 35 https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/ 36 37 + Non goals 38 + --------- 39 40 - We will not discuss those client side improvements here, as they 41 would require changes in different parts of Git than this effort. ··· 90 even more to host content with larger blobs or more large blobs 91 than currently. 92 93 + I Issues with the current situation 94 + ----------------------------------- 95 96 - Some statistics made on GitLab repos have shown that more than 75% 97 of the disk space is used by blobs that are larger than 1MB and ··· 138 complaining that these tools require significant effort to set up, 139 learn and use correctly. 140 141 + II Main features of the "Large Object Promisors" solution 142 + --------------------------------------------------------- 143 144 The main features below should give a rough overview of how the 145 solution may work. Details about needed elements can be found in ··· 166 other objects. 167 168 Note 1 169 + ^^^^^^ 170 171 To clarify, a LOP is a normal promisor remote, except that: 172 ··· 178 itself. 179 180 Note 2 181 + ^^^^^^ 182 183 Git already makes it possible for a main remote to also be a promisor 184 remote storing both regular objects and large blobs for a client that ··· 186 to avoid that. 187 188 Rationale 189 + ^^^^^^^^^ 190 191 LOPs aim to be good at handling large blobs while main remotes are 192 already good at handling other objects. 193 194 Implementation 195 + ^^^^^^^^^^^^^^ 196 197 Git already has support for multiple promisor remotes, see 198 link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation]. ··· 213 underlying object storage appear like a remote to Git. 214 215 Note 216 + ^^^^ 217 218 A LOP can be a promisor remote accessed using a remote helper by 219 both some clients and the main remote. 220 221 Rationale 222 + ^^^^^^^^^ 223 224 This looks like the simplest way to create LOPs that can cheaply 225 handle many large blobs. 
226 227 Implementation 228 + ^^^^^^^^^^^^^^ 229 230 Remote helpers are quite easy to write as shell scripts, but it might 231 be more efficient and maintainable to write them using other languages ··· 247 storage for large files handled by Git LFS. 248 249 Rationale 250 + ^^^^^^^^^ 251 252 This would simplify the server side if it wants to both use a LOP and 253 act as a Git LFS server. ··· 259 LOP all its blobs with a size over a configurable threshold. 260 261 Rationale 262 + ^^^^^^^^^ 263 264 This makes it easy to set things up and to clean things up. For 265 example, an admin could use this to manually convert a repo not using ··· 268 to regularly make sure the large blobs are moved to the LOP. 269 270 Implementation 271 + ^^^^^^^^^^^^^^ 272 273 Using something based on `git repack --filter=...` to separate the 274 blobs we want to offload from the other Git objects could be a good ··· 284 perhaps pushed, into it. 285 286 Rationale 287 + ^^^^^^^^^ 288 289 A main remote containing many oversize blobs would defeat the purpose 290 of LOPs. 291 292 Implementation 293 + ^^^^^^^^^^^^^^ 294 295 The way to offload to a LOP discussed in 4) above can be used to 296 regularly offload oversize blobs. About preventing oversize blobs from ··· 326 fetch those blobs from the LOP to be able to serve the client. 327 328 Note 329 + ^^^^ 330 331 For fetches instead of clones, a protocol negotiation might not always 332 happen, see the "What about fetches?" FAQ entry below for details. 333 334 Rationale 335 + ^^^^^^^^^ 336 337 Security, configurability and efficiency of setting things up. 338 339 Implementation 340 + ^^^^^^^^^^^^^^ 341 342 A "promisor-remote" protocol v2 capability looks like a good way to 343 implement this. The way the client and server use this capability ··· 356 but might not need anymore, to the LOP. 
357 358 Note 359 + ^^^^ 360 361 It might depend on the context if it should be OK or not for clients 362 to offload large blobs they have created, instead of fetched, directly ··· 367 implementing this feature. 368 369 Rationale 370 + ^^^^^^^^^ 371 372 On the client, the easiest way to deal with unneeded large blobs is to 373 offload them. 374 375 Implementation 376 + ^^^^^^^^^^^^^^ 377 378 This is very similar to what 4) above is about, except on the client 379 side instead of the server side. So a good solution to 4) could likely ··· 385 a LOP, it is likely, and can easily be confirmed, that the LOP still 386 has them, so that they can just be removed from the client. 387 388 + III Benefits of using LOPs 389 + -------------------------- 390 391 Many benefits are related to the issues discussed in "I) Issues with 392 the current situation" above: ··· 406 407 - Reduced storage needs on the client side. 408 409 + IV FAQ 410 + ------ 411 412 What about using multiple LOPs on the server and client side? 413 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ··· 533 on a promisor remote. 534 535 Regular fetch 536 + ^^^^^^^^^^^^^ 537 538 In a regular fetch, the client will contact the main remote and a 539 protocol negotiation will happen between them. It's a good thing that ··· 551 using, or not using, the same LOP(s) as last time. 552 553 "Backfill" or "lazy" fetch 554 + ^^^^^^^^^^^^^^^^^^^^^^^^^^ 555 556 When there is a backfill fetch, the client doesn't necessarily contact 557 the main remote first. It will try to fetch from its promisor remotes ··· 576 token when performing a protocol negotiation with the main remote (see 577 section II.6 above). 578 579 + V Future improvements 580 + --------------------- 581 582 It is expected that at the beginning using LOPs will be mostly worth 583 it either in a corporate context where the Git version that clients
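The offloading feature described in this hunk moves "all its blobs with a size over a configurable threshold" to the LOP. The selection step can be sketched in Python as below. This is only an illustration of the threshold rule: the function name `split_by_threshold`, the object ids, and the sizes are hypothetical, and the document itself suggests building the real mechanism on something like `git repack --filter=...` rather than on code of this kind.

```python
from typing import Dict, List, Tuple


def split_by_threshold(
    blob_sizes: Dict[str, int], threshold: int
) -> Tuple[List[str], List[str]]:
    """Partition blobs into (keep, offload) by size.

    A blob is offloaded only when its size is strictly over the threshold,
    mirroring the wording "a size over a configurable threshold".
    """
    keep: List[str] = []
    offload: List[str] = []
    for oid, size in sorted(blob_sizes.items()):
        if size > threshold:
            offload.append(oid)
        else:
            keep.append(oid)
    return keep, offload


if __name__ == "__main__":
    # Hypothetical blob ids and sizes in bytes; 1 MiB threshold.
    sizes = {"blob-a": 512, "blob-b": 5 * 1024 * 1024, "blob-c": 1024 * 1024}
    keep, offload = split_by_threshold(sizes, 1024 * 1024)
    print("keep on main remote:", keep)
    print("offload to LOP:", offload)
```

Whether a blob exactly at the threshold should stay or move is a design choice the document does not settle; the sketch keeps it on the main remote.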

+1

Documentation/technical/meson.build

···
13 'commit-graph.adoc',
14 'directory-rename-detection.adoc',
15 'hash-function-transition.adoc',
16 'long-running-process-protocol.adoc',
17 'multi-pack-index.adoc',
18 'packfile-uri.adoc',

···
13 'commit-graph.adoc',
14 'directory-rename-detection.adoc',
15 'hash-function-transition.adoc',
16 + 'large-object-promisors.adoc',
17 'long-running-process-protocol.adoc',
18 'multi-pack-index.adoc',
19 'packfile-uri.adoc',

+78 -42

Documentation/technical/remembering-renames.adoc

··· 10 11 Outline: 12 13 - 0. Assumptions 14 15 - 1. How rebasing and cherry-picking work 16 17 - 2. Why the renames on MERGE_SIDE1 in any given pick are *always* a 18 superset of the renames on MERGE_SIDE1 for the next pick. 19 20 - 3. Why any rename on MERGE_SIDE1 in any given pick is _almost_ always also 21 a rename on MERGE_SIDE1 for the next pick 22 23 - 4. A detailed description of the counter-examples to #3. 24 25 - 5. Why the special cases in #4 are still fully reasonable to use to pair 26 up files for three-way content merging in the merge machinery, and why 27 they do not affect the correctness of the merge. 28 29 - 6. Interaction with skipping of "irrelevant" renames 30 31 - 7. Additional items that need to be cached 32 33 - 8. How directory rename detection interacts with the above and why this 34 optimization is still safe even if merge.directoryRenames is set to 35 "true". 36 37 38 - === 0. Assumptions === 39 40 There are two assumptions that will hold throughout this document: 41 ··· 44 45 * All merges are fully automatic 46 47 - and a third that will hold in sections 2-5 for simplicity, that I'll later 48 - address in section 8: 49 50 * No directory renames occur 51 ··· 77 stored on disk, and thus is thrown away as soon as the rebase or cherry 78 pick stops for the user to resolve the operation. 79 80 - The third assumption makes sections 2-5 simpler, and allows people to 81 understand the basics of why this optimization is safe and effective, and 82 - then I can go back and address the specifics in section 8. It is probably 83 also worth noting that if directory renames do occur, then the default of 84 merge.directoryRenames being set to "conflict" means that the operation 85 will stop for users to resolve the conflicts and the cache will be thrown ··· 88 users will have set merge.directoryRenames to "true" to allow the merges to 89 continue to proceed automatically. 
The optimization is still safe with 90 this config setting, but we have to discuss a few more cases to show why; 91 - this discussion is deferred until section 8. 92 93 94 - === 1. How rebasing and cherry-picking work === 95 96 Consider the following setup (from the git-rebase manpage): 97 98 A---B---C topic 99 / 100 D---E---F---G main 101 102 After rebasing or cherry-picking topic onto main, this will appear as: 103 104 A'--B'--C' topic 105 / 106 D---E---F---G main 107 108 The way the commits A', B', and C' are created is through a series of 109 merges, where rebase or cherry-pick sequentially uses each of the three ··· 111 in the merge operation as MERGE_BASE, MERGE_SIDE1, and MERGE_SIDE2. For 112 this picture, the three commits for each of the three merges would be: 113 114 To create A': 115 MERGE_BASE: E 116 MERGE_SIDE1: G ··· 125 MERGE_BASE: B 126 MERGE_SIDE1: B' 127 MERGE_SIDE2: C 128 129 Sometimes, folks are surprised that these three-way merges are done. It 130 can be useful in understanding these three-way merges to view them in a ··· 138 B, B', and C, at least the parts before you decide to record a commit. 139 140 141 - === 2. Why the renames on MERGE_SIDE1 in any given pick are always a === 142 - === superset of the renames on MERGE_SIDE1 for the next pick. === 143 144 The merge machinery uses the filenames it is fed from MERGE_BASE, 145 MERGE_SIDE1, and MERGE_SIDE2. It will only move content to a different ··· 156 First, let's remember what commits are involved in the first and second 157 picks of the cherry-pick or rebase sequence: 158 159 To create A': 160 MERGE_BASE: E 161 MERGE_SIDE1: G ··· 165 MERGE_BASE: A 166 MERGE_SIDE1: A' 167 MERGE_SIDE2: B 168 169 So, in particular, we need to show that the renames between E and G are a 170 superset of those between A and A'. ··· 181 and G are a superset of those between A and A'. 182 183 184 - === 3. 
Why any rename on MERGE_SIDE1 in any given pick is _almost_ === 185 - === always also a rename on MERGE_SIDE1 for the next pick. === 186 187 Let's again look at the first two picks: 188 189 To create A': 190 MERGE_BASE: E 191 MERGE_SIDE1: G ··· 195 MERGE_BASE: A 196 MERGE_SIDE1: A' 197 MERGE_SIDE2: B 198 199 Now let's look at any given rename from MERGE_SIDE1 of the first pick, i.e. 200 any given rename from E to G. Let's use the filenames 'oldfile' and 201 'newfile' for demonstration purposes. That first pick will function as 202 follows; when the rename is detected, the merge machinery will do a 203 three-way content merge of the following: 204 E:oldfile 205 G:newfile 206 A:oldfile 207 and produce a new result: 208 A':newfile 209 210 Note above that I've assumed that E->A did not rename oldfile. If that 211 side did rename, then we most likely have a rename/rename(1to2) conflict ··· 254 detectable as renames almost always. 255 256 257 - === 4. A detailed description of the counter-examples to #3. === 258 259 - We already noted in section 3 that rename/rename(1to1) (i.e. both sides 260 renaming a file the same way) was one counter-example. The more 261 interesting bit, though, is why did we need to use the "almost" qualifier 262 when stating that A:oldfile and A':newfile are "almost" always detectable 263 as renames? 264 265 - Let's repeat an earlier point that section 3 made: 266 267 A':newfile was created by applying the changes between E:oldfile and 268 G:newfile to A:oldfile. The changes between E:oldfile and G:newfile were 269 <50% of the size of E:oldfile. 270 271 If those changes that were <50% of the size of E:oldfile are also <50% of 272 the size of A:oldfile, then A:oldfile and A':newfile will be detectable as ··· 276 detect A:oldfile and A':newfile as renames. 
277 278 Here's an example where that can happen: 279 * E:oldfile had 20 lines 280 * G:newfile added 10 new lines at the beginning of the file 281 * A:oldfile kept the first 3 lines of the file, and deleted all the rest 282 then 283 => A':newfile would have 13 lines, 3 of which matches those in A:oldfile. 284 - E:oldfile -> G:newfile would be detected as a rename, but A:oldfile and 285 - A':newfile would not be. 286 287 288 - === 5. Why the special cases in #4 are still fully reasonable to use to === 289 - === pair up files for three-way content merging in the merge machinery, === 290 - === and why they do not affect the correctness of the merge. === 291 292 In the rename/rename(1to1) case, A:newfile and A':newfile are not renames 293 since they use the *same* filename. However, files with the same filename ··· 295 machinery has never employed break detection). The interesting 296 counter-example case is thus not the rename/rename(1to1) case, but the case 297 where A did not rename oldfile. That was the case that we spent most of 298 - the time discussing in sections 3 and 4. The remainder of this section 299 will be devoted to that case as well. 300 301 So, even if A:oldfile and A':newfile aren't detectable as renames, why is 302 it still reasonable to pair them up for three-way content merging in the 303 merge machinery? There are multiple reasons: 304 305 - * As noted in sections 3 and 4, the diff between A:oldfile and A':newfile 306 is *exactly* the same as the diff between E:oldfile and G:newfile. The 307 latter pair were detected as renames, so it seems unlikely to surprise 308 users for us to treat A:oldfile and A':newfile as renames. ··· 394 optimization than without. 395 396 397 - === 6. Interaction with skipping of "irrelevant" renames === 398 399 Previous optimizations involved skipping rename detection for paths 400 considered to be "irrelevant". See for example the following commits: ··· 421 already detected renames. 422 423 424 - === 7. 
Additional items that need to be cached === 425 426 It turns out we have to cache more than just renames; we also cache: 427 428 A) non-renames (i.e. unpaired deletes) 429 B) counts of renames within directories 430 C) sources that were marked as RELEVANT_LOCATION, but which were 431 downgraded to RELEVANT_NO_MORE 432 D) the toplevel trees involved in the merge 433 434 These are all stored in struct rename_info, and respectively appear in 435 * cached_pairs (along side actual renames, just with a value of NULL) 436 * dir_rename_counts 437 * cached_irrelevant 438 * merge_trees 439 440 - The reason for (A) comes from the irrelevant renames skipping 441 - optimization discussed in section 6. The fact that irrelevant renames 442 are skipped means we only get a subset of the potential renames 443 detected and subsequent commits may need to run rename detection on 444 the upstream side on a subset of the remaining renames (to get the ··· 447 repeatedly check that those paths remain unpaired on the upstream side 448 with every commit we are transplanting. 449 450 - The reason for (B) is that diffcore_rename_extended() is what 451 generates the counts of renames by directory which is needed in 452 directory rename detection, and if we don't run 453 diffcore_rename_extended() again then we need to have the output from 454 it, including dir_rename_counts, from the previous run. 455 456 - The reason for (C) is that merge-ort's tree traversal will again think 457 those paths are relevant (marking them as RELEVANT_LOCATION), but the 458 fact that they were downgraded to RELEVANT_NO_MORE means that 459 dir_rename_counts already has the information we need for directory 460 rename detection. (A path which becomes RELEVANT_CONTENT in a 461 subsequent commit will be removed from cached_irrelevant.) 462 463 - The reason for (D) is that is how we determine whether the remember 464 renames optimization can be used. 
In particular, remembering that our 465 sequence of merges looks like: 466 467 Merge 1: 468 MERGE_BASE: E 469 MERGE_SIDE1: G ··· 475 MERGE_SIDE1: A' 476 MERGE_SIDE2: B 477 => Creates B' 478 479 It is the fact that the trees A and A' appear both in Merge 1 and in 480 Merge 2, with A as a parent of A' that allows this optimization. So ··· 482 time. 483 484 485 - === 8. How directory rename detection interacts with the above and === 486 - === why this optimization is still safe even if === 487 - === merge.directoryRenames is set to "true". === 488 489 As noted in the assumptions section: 490 491 """ 492 ...if directory renames do occur, then the default of 493 merge.directoryRenames being set to "conflict" means that the operation ··· 497 is that some users will have set merge.directoryRenames to "true" to 498 allow the merges to continue to proceed automatically. 499 """ 500 501 Let's remember that we need to look at how any given pick affects the next 502 one. So let's again use the first two picks from the diagram in section 503 one: 504 505 First pick does this three-way merge: 506 MERGE_BASE: E 507 MERGE_SIDE1: G ··· 513 MERGE_SIDE1: A' 514 MERGE_SIDE2: B 515 => creates B' 516 517 Now, directory rename detection exists so that if one side of history 518 renames a directory, and the other side adds a new file to the old ··· 545 concerned; see the assumptions section). Two interesting sub-notes 546 about these counts: 547 548 - * If we need to perform rename-detection again on the given side (e.g. 549 some paths are relevant for rename detection that weren't before), 550 then we clear dir_rename_counts and recompute it, making use of 551 cached_pairs. The reason it is important to do this is optimizations ··· 556 easiest way to "fix up" dir_rename_counts in such cases is to just 557 recompute it. 
558 559 - * If we prune rename/rename(1to1) entries from the cache, then we also 560 need to update dir_rename_counts to decrement the counts for the 561 involved directory and any relevant parent directories (to undo what 562 update_dir_rename_counts() in diffcore-rename.c incremented when the ··· 578 579 Case 1: MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to old dir 580 581 This case looks like this: 582 583 MERGE_BASE: E, Has olddir/ ··· 595 * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile 596 Given the cached rename noted above, the second merge can proceed as 597 expected without needing to perform rename detection from A -> A'. 598 599 Case 2: MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames file into old dir 600 601 This case looks like this: 602 MERGE_BASE: E oldfile, olddir/ 603 MERGE_SIDE1: G oldfile, olddir/ -> newdir/ 604 MERGE_SIDE2: A oldfile -> olddir/newfile ··· 617 618 Given the cached rename noted above, the second merge can proceed as 619 expected without needing to perform rename detection from A -> A'. 620 621 Case 3: MERGE_SIDE1 adds new file to old dir, MERGE_SIDE2 renames old dir 622 623 This case looks like this: 624 625 MERGE_BASE: E, Has olddir/ ··· 635 In this case, with the optimization, note that after the first commit there 636 were no renames on MERGE_SIDE1, and any renames on MERGE_SIDE2 are tossed. 637 But the second merge didn't need any renames so this is fine. 638 639 Case 4: MERGE_SIDE1 renames file into old dir, MERGE_SIDE2 renames old dir 640 641 This case looks like this: 642 643 MERGE_BASE: E, Has olddir/ ··· 658 659 Given the cached rename noted above, the second merge can proceed as 660 expected without needing to perform rename detection from A -> A'. 661 662 Finally, I'll just note here that interactions with the 663 skip-irrelevant-renames optimization means we sometimes don't detect
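The sequence of three-way merges that the document lists for creating A', B', and C' follows a simple pattern: each pick uses the picked commit's parent as MERGE_BASE, the previous result as MERGE_SIDE1, and the picked commit as MERGE_SIDE2. The Python sketch below reproduces that table. The names `Pick` and `pick_sequence` are assumptions made for illustration and do not correspond to anything in Git's source.

```python
from typing import List, NamedTuple


class Pick(NamedTuple):
    merge_base: str
    merge_side1: str
    merge_side2: str
    result: str


def pick_sequence(onto: str, fork_point: str, commits: List[str]) -> List[Pick]:
    """List the three-way merges used to transplant `commits` onto `onto`.

    `fork_point` is the parent of the first commit being picked (E in the
    document's diagram) and `onto` is the new base (G).
    """
    picks: List[Pick] = []
    base, side1 = fork_point, onto
    for commit in commits:
        result = commit + "'"
        picks.append(Pick(base, side1, commit, result))
        # The next pick is based on this commit, with this result as side 1.
        base, side1 = commit, result
    return picks


if __name__ == "__main__":
    for p in pick_sequence("G", "E", ["A", "B", "C"]):
        print(f"To create {p.result}: MERGE_BASE={p.merge_base} "
              f"MERGE_SIDE1={p.merge_side1} MERGE_SIDE2={p.merge_side2}")
```

The loop makes the property the document relies on visible: the MERGE_BASE and MERGE_SIDE1 of one pick are exactly the MERGE_SIDE2 and result of the previous pick, which is why renames found on MERGE_SIDE1 can be carried forward.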

··· 10 11 Outline: 12 13 + 1. Assumptions 14 15 + 2. How rebasing and cherry-picking work 16 17 + 3. Why the renames on MERGE_SIDE1 in any given pick are *always* a 18 superset of the renames on MERGE_SIDE1 for the next pick. 19 20 + 4. Why any rename on MERGE_SIDE1 in any given pick is _almost_ always also 21 a rename on MERGE_SIDE1 for the next pick 22 23 + 5. A detailed description of the counter-examples to #4. 24 25 + 6. Why the special cases in #5 are still fully reasonable to use to pair 26 up files for three-way content merging in the merge machinery, and why 27 they do not affect the correctness of the merge. 28 29 + 7. Interaction with skipping of "irrelevant" renames 30 31 + 8. Additional items that need to be cached 32 33 + 9. How directory rename detection interacts with the above and why this 34 optimization is still safe even if merge.directoryRenames is set to 35 "true". 36 37 38 + == 1. Assumptions == 39 40 There are two assumptions that will hold throughout this document: 41 ··· 44 45 * All merges are fully automatic 46 47 + and a third that will hold in sections 3-6 for simplicity, that I'll later 48 + address in section 9: 49 50 * No directory renames occur 51 ··· 77 stored on disk, and thus is thrown away as soon as the rebase or cherry 78 pick stops for the user to resolve the operation. 79 80 + The third assumption makes sections 3-6 simpler, and allows people to 81 understand the basics of why this optimization is safe and effective, and 82 + then I can go back and address the specifics in section 9. It is probably 83 also worth noting that if directory renames do occur, then the default of 84 merge.directoryRenames being set to "conflict" means that the operation 85 will stop for users to resolve the conflicts and the cache will be thrown ··· 88 users will have set merge.directoryRenames to "true" to allow the merges to 89 continue to proceed automatically. 
The optimization is still safe with 90 this config setting, but we have to discuss a few more cases to show why; 91 + this discussion is deferred until section 9. 92 93 94 + == 2. How rebasing and cherry-picking work == 95 96 Consider the following setup (from the git-rebase manpage): 97 98 + ------------ 99 A---B---C topic 100 / 101 D---E---F---G main 102 + ------------ 103 104 After rebasing or cherry-picking topic onto main, this will appear as: 105 106 + ------------ 107 A'--B'--C' topic 108 / 109 D---E---F---G main 110 + ------------ 111 112 The way the commits A', B', and C' are created is through a series of 113 merges, where rebase or cherry-pick sequentially uses each of the three ··· 115 in the merge operation as MERGE_BASE, MERGE_SIDE1, and MERGE_SIDE2. For 116 this picture, the three commits for each of the three merges would be: 117 118 + .... 119 To create A': 120 MERGE_BASE: E 121 MERGE_SIDE1: G ··· 130 MERGE_BASE: B 131 MERGE_SIDE1: B' 132 MERGE_SIDE2: C 133 + .... 134 135 Sometimes, folks are surprised that these three-way merges are done. It 136 can be useful in understanding these three-way merges to view them in a ··· 144 B, B', and C, at least the parts before you decide to record a commit. 145 146 147 + == 3. Why the renames on MERGE_SIDE1 in any given pick are always a superset of the renames on MERGE_SIDE1 for the next pick. == 148 149 The merge machinery uses the filenames it is fed from MERGE_BASE, 150 MERGE_SIDE1, and MERGE_SIDE2. It will only move content to a different ··· 161 First, let's remember what commits are involved in the first and second 162 picks of the cherry-pick or rebase sequence: 163 164 + .... 165 To create A': 166 MERGE_BASE: E 167 MERGE_SIDE1: G ··· 171 MERGE_BASE: A 172 MERGE_SIDE1: A' 173 MERGE_SIDE2: B 174 + .... 175 176 So, in particular, we need to show that the renames between E and G are a 177 superset of those between A and A'. ··· 188 and G are a superset of those between A and A'. 189 190 191 + == 4. 
Why any rename on MERGE_SIDE1 in any given pick is _almost_ always also a rename on MERGE_SIDE1 for the next pick. == 192 193 Let's again look at the first two picks: 194 195 + .... 196 To create A': 197 MERGE_BASE: E 198 MERGE_SIDE1: G ··· 202 MERGE_BASE: A 203 MERGE_SIDE1: A' 204 MERGE_SIDE2: B 205 + .... 206 207 Now let's look at any given rename from MERGE_SIDE1 of the first pick, i.e. 208 any given rename from E to G. Let's use the filenames 'oldfile' and 209 'newfile' for demonstration purposes. That first pick will function as 210 follows; when the rename is detected, the merge machinery will do a 211 three-way content merge of the following: 212 + 213 + .... 214 E:oldfile 215 G:newfile 216 A:oldfile 217 + .... 218 + 219 and produce a new result: 220 + 221 + .... 222 A':newfile 223 + .... 224 225 Note above that I've assumed that E->A did not rename oldfile. If that 226 side did rename, then we most likely have a rename/rename(1to2) conflict ··· 269 detectable as renames almost always. 270 271 272 + == 5. A detailed description of the counter-examples to #4. == 273 274 + We already noted in section 4 that rename/rename(1to1) (i.e. both sides 275 renaming a file the same way) was one counter-example. The more 276 interesting bit, though, is why did we need to use the "almost" qualifier 277 when stating that A:oldfile and A':newfile are "almost" always detectable 278 as renames? 279 280 + Let's repeat an earlier point that section 4 made: 281 282 + .... 283 A':newfile was created by applying the changes between E:oldfile and 284 G:newfile to A:oldfile. The changes between E:oldfile and G:newfile were 285 <50% of the size of E:oldfile. 286 + .... 287 288 If those changes that were <50% of the size of E:oldfile are also <50% of 289 the size of A:oldfile, then A:oldfile and A':newfile will be detectable as ··· 293 detect A:oldfile and A':newfile as renames. 
294 295 Here's an example where that can happen: 296 + 297 * E:oldfile had 20 lines 298 * G:newfile added 10 new lines at the beginning of the file 299 * A:oldfile kept the first 3 lines of the file, and deleted all the rest 300 + 301 then 302 + 303 + .... 304 => A':newfile would have 13 lines, 3 of which matches those in A:oldfile. 305 + E:oldfile -> G:newfile would be detected as a rename, but A:oldfile and 306 + A':newfile would not be. 307 + .... 308 309 310 + == 6. Why the special cases in #5 are still fully reasonable to use to pair up files for three-way content merging in the merge machinery, and why they do not affect the correctness of the merge. == 311 312 In the rename/rename(1to1) case, A:newfile and A':newfile are not renames 313 since they use the *same* filename. However, files with the same filename ··· 315 machinery has never employed break detection). The interesting 316 counter-example case is thus not the rename/rename(1to1) case, but the case 317 where A did not rename oldfile. That was the case that we spent most of 318 + the time discussing in sections 4 and 5. The remainder of this section 319 will be devoted to that case as well. 320 321 So, even if A:oldfile and A':newfile aren't detectable as renames, why is 322 it still reasonable to pair them up for three-way content merging in the 323 merge machinery? There are multiple reasons: 324 325 + * As noted in sections 4 and 5, the diff between A:oldfile and A':newfile 326 is *exactly* the same as the diff between E:oldfile and G:newfile. The 327 latter pair were detected as renames, so it seems unlikely to surprise 328 users for us to treat A:oldfile and A':newfile as renames. ··· 414 optimization than without. 415 416 417 + == 7. Interaction with skipping of "irrelevant" renames == 418 419 Previous optimizations involved skipping rename detection for paths 420 considered to be "irrelevant". See for example the following commits: ··· 441 already detected renames. 442 443 444 + == 8. 
Additional items that need to be cached == 445 446 It turns out we have to cache more than just renames; we also cache: 447 448 + .... 449 A) non-renames (i.e. unpaired deletes) 450 B) counts of renames within directories 451 C) sources that were marked as RELEVANT_LOCATION, but which were 452 downgraded to RELEVANT_NO_MORE 453 D) the toplevel trees involved in the merge 454 + .... 455 456 These are all stored in struct rename_info, and respectively appear in 457 + 458 * cached_pairs (along side actual renames, just with a value of NULL) 459 * dir_rename_counts 460 * cached_irrelevant 461 * merge_trees 462 463 + The reason for `(A)` comes from the irrelevant renames skipping 464 + optimization discussed in section 7. The fact that irrelevant renames 465 are skipped means we only get a subset of the potential renames 466 detected and subsequent commits may need to run rename detection on 467 the upstream side on a subset of the remaining renames (to get the ··· 470 repeatedly check that those paths remain unpaired on the upstream side 471 with every commit we are transplanting. 472 473 + The reason for `(B)` is that diffcore_rename_extended() is what 474 generates the counts of renames by directory which is needed in 475 directory rename detection, and if we don't run 476 diffcore_rename_extended() again then we need to have the output from 477 it, including dir_rename_counts, from the previous run. 478 479 + The reason for `(C)` is that merge-ort's tree traversal will again think 480 those paths are relevant (marking them as RELEVANT_LOCATION), but the 481 fact that they were downgraded to RELEVANT_NO_MORE means that 482 dir_rename_counts already has the information we need for directory 483 rename detection. (A path which becomes RELEVANT_CONTENT in a 484 subsequent commit will be removed from cached_irrelevant.) 485 486 + The reason for `(D)` is that is how we determine whether the remember 487 renames optimization can be used. 
In particular, remembering that our 488 sequence of merges looks like: 489 490 + .... 491 Merge 1: 492 MERGE_BASE: E 493 MERGE_SIDE1: G ··· 499 MERGE_SIDE1: A' 500 MERGE_SIDE2: B 501 => Creates B' 502 + .... 503 504 It is the fact that the trees A and A' appear both in Merge 1 and in 505 Merge 2, with A as a parent of A' that allows this optimization. So ··· 507 time. 508 509 510 + == 9. How directory rename detection interacts with the above and why this optimization is still safe even if merge.directoryRenames is set to "true". == 511 512 As noted in the assumptions section: 513 514 + .... 515 """ 516 ...if directory renames do occur, then the default of 517 merge.directoryRenames being set to "conflict" means that the operation ··· 521 is that some users will have set merge.directoryRenames to "true" to 522 allow the merges to continue to proceed automatically. 523 """ 524 + .... 525 526 Let's remember that we need to look at how any given pick affects the next 527 one. So let's again use the first two picks from the diagram in section 528 one: 529 530 + .... 531 First pick does this three-way merge: 532 MERGE_BASE: E 533 MERGE_SIDE1: G ··· 539 MERGE_SIDE1: A' 540 MERGE_SIDE2: B 541 => creates B' 542 + .... 543 544 Now, directory rename detection exists so that if one side of history 545 renames a directory, and the other side adds a new file to the old ··· 572 concerned; see the assumptions section). Two interesting sub-notes 573 about these counts: 574 575 + ** If we need to perform rename-detection again on the given side (e.g. 576 some paths are relevant for rename detection that weren't before), 577 then we clear dir_rename_counts and recompute it, making use of 578 cached_pairs. The reason it is important to do this is optimizations ··· 583 easiest way to "fix up" dir_rename_counts in such cases is to just 584 recompute it. 
585 586 + ** If we prune rename/rename(1to1) entries from the cache, then we also 587 need to update dir_rename_counts to decrement the counts for the 588 involved directory and any relevant parent directories (to undo what 589 update_dir_rename_counts() in diffcore-rename.c incremented when the ··· 605 606 Case 1: MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to old dir 607 608 + .... 609 This case looks like this: 610 611 MERGE_BASE: E, Has olddir/ ··· 623 * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile 624 Given the cached rename noted above, the second merge can proceed as 625 expected without needing to perform rename detection from A -> A'. 626 + .... 627 628 Case 2: MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames file into old dir 629 630 + .... 631 This case looks like this: 632 + 633 MERGE_BASE: E oldfile, olddir/ 634 MERGE_SIDE1: G oldfile, olddir/ -> newdir/ 635 MERGE_SIDE2: A oldfile -> olddir/newfile ··· 648 649 Given the cached rename noted above, the second merge can proceed as 650 expected without needing to perform rename detection from A -> A'. 651 + .... 652 653 Case 3: MERGE_SIDE1 adds new file to old dir, MERGE_SIDE2 renames old dir 654 655 + .... 656 This case looks like this: 657 658 MERGE_BASE: E, Has olddir/ ··· 668 In this case, with the optimization, note that after the first commit there 669 were no renames on MERGE_SIDE1, and any renames on MERGE_SIDE2 are tossed. 670 But the second merge didn't need any renames so this is fine. 671 + .... 672 673 Case 4: MERGE_SIDE1 renames file into old dir, MERGE_SIDE2 renames old dir 674 675 + .... 676 This case looks like this: 677 678 MERGE_BASE: E, Has olddir/ ··· 693 694 Given the cached rename noted above, the second merge can proceed as 695 expected without needing to perform rename detection from A -> A'. 696 + .... 697 698 Finally, I'll just note here that interactions with the 699 skip-irrelevant-renames optimization means we sometimes don't detect
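Note on the section 5 counter-example in the remembering-renames diff above (E:oldfile has 20 lines, G:newfile adds 10 lines at the beginning, A:oldfile keeps only the first 3 lines): the arithmetic can be checked with a small sketch. This is only an illustration under simplifying assumptions. The `similarity` helper and the `RENAME_THRESHOLD` name are hypothetical; the helper is a line-count approximation (shared lines divided by the size of the larger file), not Git's actual similarity computation. The 50% figure is the threshold the document itself refers to.

```python
from collections import Counter

# Hypothetical name for the 50% threshold mentioned in the document.
RENAME_THRESHOLD = 0.5


def similarity(src, dst):
    """Simplified, line-based similarity (an assumption for illustration):
    number of lines the two files share, relative to the larger file."""
    if not src and not dst:
        return 1.0
    shared = sum((Counter(src) & Counter(dst)).values())
    return shared / max(len(src), len(dst))


# E:oldfile had 20 lines.
e_oldfile = [f"line {i}" for i in range(1, 21)]
# G:newfile added 10 new lines at the beginning of the file.
added = [f"new {i}" for i in range(1, 11)]
g_newfile = added + e_oldfile
# A:oldfile kept the first 3 lines of the file and deleted all the rest.
a_oldfile = e_oldfile[:3]
# A':newfile is A:oldfile with the E:oldfile -> G:newfile change applied,
# i.e. the same 10 lines added at the beginning.
a_prime_newfile = added + a_oldfile

e_to_g = similarity(e_oldfile, g_newfile)
a_to_a_prime = similarity(a_oldfile, a_prime_newfile)

print(len(a_prime_newfile))              # number of lines in A':newfile
print(e_to_g >= RENAME_THRESHOLD)        # E:oldfile -> G:newfile detectable?
print(a_to_a_prime >= RENAME_THRESHOLD)  # A:oldfile -> A':newfile detectable?
```

Under this approximation, E:oldfile and G:newfile share 20 of 30 lines and clear the threshold, while A:oldfile and A':newfile share only 3 of 13 lines and do not, which is the situation the cached rename covers.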

+362 -314

Documentation/technical/sparse-checkout.adoc

··· 14 * Reference Emails 15 16 17 - === Terminology === 18 19 - cone mode: one of two modes for specifying the desired subset of files 20 in a sparse-checkout. In cone-mode, the user specifies 21 directories (getting both everything under that directory as 22 well as everything in leading directories), while in non-cone 23 mode, the user specifies gitignore-style patterns. Controlled 24 by the --[no-]cone option to sparse-checkout init|set. 25 26 - SKIP_WORKTREE: When tracked files do not match the sparse specification and 27 are removed from the working tree, the file in the index is marked 28 with a SKIP_WORKTREE bit. Note that if a tracked file has the 29 SKIP_WORKTREE bit set but the file is later written by the user to 30 the working tree anyway, the SKIP_WORKTREE bit will be cleared at 31 the beginning of any subsequent Git operation. 32 - 33 - Most sparse checkout users are unaware of this implementation 34 - detail, and the term should generally be avoided in user-facing 35 - descriptions and command flags. Unfortunately, prior to the 36 - `sparse-checkout` subcommand this low-level detail was exposed, 37 - and as of time of writing, is still exposed in various places. 38 39 - sparse-checkout: a subcommand in git used to reduce the files present in 40 the working tree to a subset of all tracked files. Also, the 41 name of the file in the $GIT_DIR/info directory used to track 42 the sparsity patterns corresponding to the user's desired 43 subset. 44 45 - sparse cone: see cone mode 46 47 - sparse directory: An entry in the index corresponding to a directory, which 48 appears in the index instead of all the files under that directory 49 that would normally appear. See also sparse-index. Something that 50 can cause confusion is that the "sparse directory" does NOT match ··· 52 working tree. May be renamed in the future (e.g. to "skipped 53 directory"). 
54 55 - sparse index: A special mode for sparse-checkout that also makes the 56 index sparse by recording a directory entry in lieu of all the 57 files underneath that directory (thus making that a "skipped 58 directory" which unfortunately has also been called a "sparse ··· 60 directories. Controlled by the --[no-]sparse-index option to 61 init|set|reapply. 62 63 - sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to 64 define the set of files of interest. A warning: It is easy to 65 over-use this term (or the shortened "patterns" term), for two 66 reasons: (1) users in cone mode specify directories rather than ··· 70 transiently differ in the working tree or index from the sparsity 71 patterns (see "Sparse specification vs. sparsity patterns"). 72 73 - sparse specification: The set of paths in the user's area of focus. This 74 is typically just the tracked files that match the sparsity 75 patterns, but the sparse specification can temporarily differ and 76 include additional files. (See also "Sparse specification ··· 87 * If working with the index and the working copy, the sparse 88 specification is the union of the paths from above. 89 90 - vivifying: When a command restores a tracked file to the working tree (and 91 hopefully also clears the SKIP_WORKTREE bit in the index for that 92 file), this is referred to as "vivifying" the file. 93 94 95 - === Purpose of sparse-checkouts === 96 97 sparse-checkouts exist to allow users to work with a subset of their 98 files. ··· 120 half dozen different ways. 
Let's start by considering the high level 121 usecases: 122 123 - A) Users are _only_ interested in the sparse portion of the repo 124 - 125 - A*) Users are _only_ interested in the sparse portion of the repo 126 - that they have downloaded so far 127 - 128 - B) Users want a sparse working tree, but are working in a larger whole 129 - 130 - C) sparse-checkout is a behind-the-scenes implementation detail allowing 131 Git to work with a specially crafted in-house virtual file system; 132 users are actually working with a "full" working tree that is 133 lazily populated, and sparse-checkout helps with the lazy population ··· 136 It may be worth explaining each of these in a bit more detail: 137 138 139 - (Behavior A) Users are _only_ interested in the sparse portion of the repo 140 141 These folks might know there are other things in the repository, but 142 don't care. They are uninterested in other parts of the repository, and ··· 163 after a merge or pull) can lead to worries about local repository size 164 growing unnecessarily[10]. 165 166 - (Behavior A*) Users are _only_ interested in the sparse portion of the repo 167 - that they have downloaded so far (a variant on the first usecase) 168 169 This variant is driven by folks who using partial clones together with 170 sparse checkouts and do disconnected development (so far sounding like a ··· 173 through history within their sparse specification may be too much, so they 174 only download some. They would still like operations to succeed without 175 network connectivity, though, so things like `git log -S${SEARCH_TERM} -p` 176 - or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide 177 partial results that depend on what happens to have been downloaded. 178 179 This variant could be viewed as Behavior A with the sparse specification 180 for history querying operations modified from "sparsity patterns" to 181 "sparsity patterns limited to the blobs we have already downloaded". 
182 183 - (Behavior B) Users want a sparse working tree, but are working in a 184 - larger whole 185 186 Stolee described this usecase this way[11]: 187 ··· 229 prefer getting "unrelated" results from their history queries over having 230 slow commands. 231 232 - (Behavior C) sparse-checkout is an implementational detail supporting a 233 - special VFS. 234 235 This usecase goes slightly against the traditional definition of 236 sparse-checkout in that it actually tries to present a full or dense ··· 255 all files are present. 256 257 258 - === Usecases of primary concern === 259 260 Most of the rest of this document will focus on Behavior A and Behavior 261 B. Some notes about the other two cases and why we are not focusing on 262 them: 263 264 - (Behavior A*) 265 266 Supporting this usecase is estimated to be difficult and a lot of work. 267 There are no plans to implement it currently, but it may be a potential ··· 275 sparse specification to restrict it to already-downloaded blobs. The hard 276 part is in making commands capable of respecting that modified definition. 277 278 - (Behavior C) 279 280 This usecase violates some of the early sparse-checkout documented 281 assumptions (since files marked as SKIP_WORKTREE will be displayed to users ··· 300 patches that break things for the real Behavior B folks. 301 302 303 - === Oversimplified mental models === 304 305 An oversimplification of the differences in the above behaviors is: 306 307 - Behavior A: Restrict worktree and history operations to sparse specification 308 - Behavior B: Restrict worktree operations to sparse specification; have any 309 - history operations work across all files 310 - Behavior C: Do not restrict either worktree or history operations to the 311 - sparse specification...with the exception of branch checkouts or 312 - switches which avoid writing files that will match the index so 313 - they can later lazily be populated instead. 
314 315 316 - === Desired behavior === 317 318 As noted previously, despite the simple idea of just working with a subset 319 of files, there are a range of different behavioral changes that need to be ··· 326 327 * Commands behaving the same regardless of high-level use-case 328 329 - * commands that only look at files within the sparsity specification 330 331 - * diff (without --cached or REVISION arguments) 332 - * grep (without --cached or REVISION arguments) 333 - * diff-files 334 335 - * commands that restore files to the working tree that match sparsity 336 patterns, and remove unmodified files that don't match those 337 patterns: 338 339 - * switch 340 - * checkout (the switch-like half) 341 - * read-tree 342 - * reset --hard 343 344 - * commands that write conflicted files to the working tree, but otherwise 345 will omit writing files to the working tree that do not match the 346 sparsity patterns: 347 348 - * merge 349 - * rebase 350 - * cherry-pick 351 - * revert 352 353 - * `am` and `apply --cached` should probably be in this section but 354 are buggy (see the "Known bugs" section below) 355 356 The behavior for these commands somewhat depends upon the merge 357 strategy being used: 358 - * `ort` behaves as described above 359 - * `octopus` and `resolve` will always vivify any file changed in the merge 360 relative to the first parent, which is rather suboptimal. 361 362 It is also important to note that these commands WILL update the index ··· 372 specification and the sparsity patterns (much like the commands in the 373 previous section). 
374 375 - * commands that always ignore sparsity since commits must be full-tree 376 377 - * archive 378 - * bundle 379 - * commit 380 - * format-patch 381 - * fast-export 382 - * fast-import 383 - * commit-tree 384 385 - * commands that write any modified file to the working tree (conflicted 386 or not, and whether those paths match sparsity patterns or not): 387 388 - * stash 389 - * apply (without `--index` or `--cached`) 390 391 * Commands that may slightly differ for behavior A vs. behavior B: 392 ··· 394 behaviors, but may differ in verbosity and types of warning and error 395 messages. 396 397 - * commands that make modifications to which files are tracked: 398 - * add 399 - * rm 400 - * mv 401 - * update-index 402 403 The fact that files can move between the 'tracked' and 'untracked' 404 categories means some commands will have to treat untracked files 405 differently. But if we have to treat untracked files differently, 406 then additional commands may also need changes: 407 408 - * status 409 - * clean 410 411 In particular, `status` may need to report any untracked files outside 412 the sparsity specification as an erroneous condition (especially to ··· 420 may need to ignore the sparse specification by its nature. Also, its 421 current --[no-]ignore-skip-worktree-entries default is totally bogus. 422 423 - * commands for manually tweaking paths in both the index and the working tree 424 - * `restore` 425 - * the restore-like half of `checkout` 426 427 These commands should be similar to add/rm/mv in that they should 428 only operate on the sparse specification by default, and require a ··· 433 434 * Commands that significantly differ for behavior A vs. 
behavior B: 435 436 - * commands that query history 437 - * diff (with --cached or REVISION arguments) 438 - * grep (with --cached or REVISION arguments) 439 - * show (when given commit arguments) 440 - * blame (only matters when one or more -C flags are passed) 441 - * and annotate 442 - * log 443 - * whatchanged (may not exist anymore) 444 - * ls-files 445 - * diff-index 446 - * diff-tree 447 - * ls-tree 448 449 Note: for log and whatchanged, revision walking logic is unaffected 450 but displaying of patches is affected by scoping the command to the ··· 458 459 * Commands I don't know how to classify 460 461 - * range-diff 462 463 Is this like `log` or `format-patch`? 464 465 - * cherry 466 467 See range-diff 468 469 * Commands unaffected by sparse-checkouts 470 471 - * shortlog 472 - * show-branch 473 - * rev-list 474 - * bisect 475 476 - * branch 477 - * describe 478 - * fetch 479 - * gc 480 - * init 481 - * maintenance 482 - * notes 483 - * pull (merge & rebase have the necessary changes) 484 - * push 485 - * submodule 486 - * tag 487 488 - * config 489 - * filter-branch (works in separate checkout without sparse-checkout setup) 490 - * pack-refs 491 - * prune 492 - * remote 493 - * repack 494 - * replace 495 496 - * bugreport 497 - * count-objects 498 - * fsck 499 - * gitweb 500 - * help 501 - * instaweb 502 - * merge-tree (doesn't touch worktree or index, and merges always compute full-tree) 503 - * rerere 504 - * verify-commit 505 - * verify-tag 506 507 - * commit-graph 508 - * hash-object 509 - * index-pack 510 - * mktag 511 - * mktree 512 - * multi-pack-index 513 - * pack-objects 514 - * prune-packed 515 - * symbolic-ref 516 - * unpack-objects 517 - * update-ref 518 - * write-tree (operates on index, possibly optimized to use sparse dir entries) 519 520 - * for-each-ref 521 - * get-tar-commit-id 522 - * ls-remote 523 - * merge-base (merges are computed full tree, so merge base should be too) 524 - * name-rev 525 - * pack-redundant 526 - * rev-parse 527 - 
* show-index 528 - * show-ref 529 - * unpack-file 530 - * var 531 - * verify-pack 532 533 - * <Everything under 'Interacting with Others' in 'git help --all'> 534 - * <Everything under 'Low-level...Syncing' in 'git help --all'> 535 - * <Everything under 'Low-level...Internal Helpers' in 'git help --all'> 536 - * <Everything under 'External commands' in 'git help --all'> 537 538 * Commands that might be affected, but who cares? 539 540 - * merge-file 541 - * merge-index 542 - * gitk? 543 544 545 - === Behavior classes === 546 547 From the above there are a few classes of behavior: 548 ··· 573 574 Commands in this class generally behave like the "restrict" class, 575 except that: 576 - (1) they will ignore the sparse specification and write files with 577 - conflicts to the working tree (thus temporarily expanding the 578 - sparse specification to include such files.) 579 - (2) they are grouped with commands which move to a new commit, since 580 - they often create a commit and then move to it, even though we 581 - know there are many exceptions to moving to the new commit. (For 582 - example, the user may rebase a commit that becomes empty, or have 583 - a cherry-pick which conflicts, or a user could run `merge 584 - --no-commit`, and we also view `apply --index` kind of like `am 585 - --no-commit`.) As such, these commands can make changes to index 586 - files outside the sparse specification, though they'll mark such 587 - files with SKIP_WORKTREE. 588 589 * "restrict also specially applied to untracked files" 590 ··· 609 specification. 
610 611 612 - === Subcommand-dependent defaults === 613 614 Note that we have different defaults depending on the command for the 615 desired behavior : 616 617 * Commands defaulting to "restrict": 618 - * diff-files 619 - * diff (without --cached or REVISION arguments) 620 - * grep (without --cached or REVISION arguments) 621 - * switch 622 - * checkout (the switch-like half) 623 - * reset (<commit>) 624 625 - * restore 626 - * checkout (the restore-like half) 627 - * checkout-index 628 - * reset (with pathspec) 629 630 This behavior makes sense; these interact with the working tree. 631 632 * Commands defaulting to "restrict modulo conflicts": 633 - * merge 634 - * rebase 635 - * cherry-pick 636 - * revert 637 638 - * am 639 - * apply --index (which is kind of like an `am --no-commit`) 640 641 - * read-tree (especially with -m or -u; is kind of like a --no-commit merge) 642 - * reset (<tree-ish>, due to similarity to read-tree) 643 644 These also interact with the working tree, but require slightly 645 different behavior either so that (a) conflicts can be resolved or (b) ··· 648 (See also the "Known bugs" section below regarding `am` and `apply`) 649 650 * Commands defaulting to "no restrict": 651 - * archive 652 - * bundle 653 - * commit 654 - * format-patch 655 - * fast-export 656 - * fast-import 657 - * commit-tree 658 659 - * stash 660 - * apply (without `--index`) 661 662 These have completely different defaults and perhaps deserve the most 663 detailed explanation: ··· 679 sparse specification then we'll lose changes from the user. 680 681 * Commands defaulting to "restrict also specially applied to untracked files": 682 - * add 683 - * rm 684 - * mv 685 - * update-index 686 - * status 687 - * clean (?) 
688 689 - Our original implementation for the first three of these commands was 690 - "no restrict", but it had some severe usability issues: 691 - * `git add <somefile>` if honored and outside the sparse 692 - specification, can result in the file randomly disappearing later 693 - when some subsequent command is run (since various commands 694 - automatically clean up unmodified files outside the sparse 695 - specification). 696 - * `git rm '*.jpg'` could very negatively surprise users if it deletes 697 - files outside the range of the user's interest. 698 - * `git mv` has similar surprises when moving into or out of the cone, 699 - so best to restrict by default 700 701 - So, we switched `add` and `rm` to default to "restrict", which made 702 - usability problems much less severe and less frequent, but we still got 703 - complaints because commands like: 704 - git add <file-outside-sparse-specification> 705 - git rm <file-outside-sparse-specification> 706 - would silently do nothing. We should instead print an error in those 707 - cases to get usability right. 708 709 - update-index needs to be updated to match, and status and maybe clean 710 - also need to be updated to specially handle untracked paths. 711 712 - There may be a difference in here between behavior A and behavior B in 713 - terms of verboseness of errors or additional warnings. 714 715 * Commands falling under "restrict or no restrict dependent upon behavior 716 A vs. behavior B" 717 718 - * diff (with --cached or REVISION arguments) 719 - * grep (with --cached or REVISION arguments) 720 - * show (when given commit arguments) 721 - * blame (only matters when one or more -C flags passed) 722 - * and annotate 723 - * log 724 - * and variants: shortlog, gitk, show-branch, whatchanged, rev-list 725 - * ls-files 726 - * diff-index 727 - * diff-tree 728 - * ls-tree 729 730 For now, we default to behavior B for these, which want a default of 731 "no restrict". ··· 749 implemented. 
750 751 752 - === Sparse specification vs. sparsity patterns === 753 754 In a well-behaved situation, the sparse specification is given directly 755 by the $GIT_DIR/info/sparse-checkout file. However, it can transiently ··· 821 operate full-tree. 822 823 824 - === Implementation Questions === 825 826 - * Do the options --scope={sparse,all} sound good to others? Are there better 827 - options? 828 - * Names in use, or appearing in patches, or previously suggested: 829 - * --sparse/--dense 830 - * --ignore-skip-worktree-bits 831 - * --ignore-skip-worktree-entries 832 - * --ignore-sparsity 833 - * --[no-]restrict-to-sparse-paths 834 - * --full-tree/--sparse-tree 835 - * --[no-]restrict 836 - * --scope={sparse,all} 837 - * --focus/--unfocus 838 - * --limit/--unlimited 839 - * Rationale making me lean slightly towards --scope={sparse,all}: 840 - * We want a name that works for many commands, so we need a name that 841 does not conflict 842 - * We know that we have more than two possible usecases, so it is best 843 to avoid a flag that appears to be binary. 844 - * --scope={sparse,all} isn't overly long and seems relatively 845 explanatory 846 - * `--sparse`, as used in add/rm/mv, is totally backwards for 847 grep/log/etc. Changing the meaning of `--sparse` for these 848 commands would fix the backwardness, but possibly break existing 849 scripts. Using a new name pairing would allow us to treat 850 `--sparse` in these commands as a deprecated alias. 
851 - * There is a different `--sparse`/`--dense` pair for commands using 852 revision machinery, so using that naming might cause confusion 853 - * There is also a `--sparse` in both pack-objects and show-branch, which 854 don't conflict but do suggest that `--sparse` is overloaded 855 - * The name --ignore-skip-worktree-bits is a double negative, is 856 quite a mouthful, refers to an implementation detail that many 857 users may not be familiar with, and we'd need a negation for it 858 which would probably be even more ridiculously long. (But we 859 can make --ignore-skip-worktree-bits a deprecated alias for 860 --no-restrict.) 861 862 - * If a config option is added (sparse.scope?) what should the values and 863 description be? "sparse" (behavior A), "worktree-sparse-history-dense" 864 (behavior B), "dense" (behavior C)? There's a risk of confusion, 865 because even for Behaviors A and B we want some commands to be ··· 868 the primary difference we are focusing is just the history-querying 869 commands (log/diff/grep). Previous config suggestion here: [13] 870 871 - * Is `--no-expand` a good alias for ls-files's `--sparse` option? 872 (`--sparse` does not map to either `--scope=sparse` or `--scope=all`, 873 because in non-cone mode it does nothing and in cone-mode it shows the 874 sparse directory entries which are technically outside the sparse 875 specification) 876 877 - * Under Behavior A: 878 - * Does ls-files' `--no-expand` override the default `--scope=all`, or 879 does it need an extra flag? 880 - * Does ls-files' `-t` option imply `--scope=all`? 881 - * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`? 882 883 - * sparse-checkout: once behavior A is fully implemented, should we take 884 an interim measure to ease people into switching the default? Namely, 885 if folks are not already in a sparse checkout, then require 886 `sparse-checkout init/set` to take a ··· 892 is seamless for them. 
893 894 895 - === Implementation Goals/Plans === 896 897 * Get buy-in on this document in general. 898 ··· 910 request that they not trigger this bug." flag 911 912 * Flags & Config 913 - * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all` 914 - * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore 915 a deprecated aliases for `--scope=all` 916 - * Create config option (sparse.scope?), tie it to the "Cliff notes" 917 overview 918 919 - * Add --scope=sparse (and --scope=all) flag to each of the history querying 920 commands. IMPORTANT: make sure diff machinery changes don't mess with 921 format-patch, fast-export, etc. 922 923 - === Known bugs === 924 925 This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've 926 been working on it. 927 928 - 0. Behavior A is not well supported in Git. (Behavior B didn't used to 929 be either, but was the easier of the two to implement.) 930 931 - 1. am and apply: 932 933 apply, without `--index` or `--cached`, relies on files being present 934 in the working copy, and also writes to them unconditionally. As ··· 948 files and then complain that those vivified files would be 949 overwritten by merge. 950 951 - 2. reset --hard: 952 953 reset --hard provides confusing error message (works correctly, but 954 misleads the user into believing it didn't): ··· 971 `git reset --hard` DID remove addme from the index and the working tree, contrary 972 to the error message, but in line with how reset --hard should behave. 973 974 - 3. read-tree 975 976 `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the 977 entries it reads into the index, resulting in all your files suddenly 978 appearing to be "deleted". 979 980 - 4. Checkout, restore: 981 982 These command do not handle path & revision arguments appropriately: 983 ··· 1030 S tracked 1031 H tracked-but-maybe-skipped 1032 1033 - 5. 
checkout and restore --staged, continued: 1034 1035 These commands do not correctly scope operations to the sparse 1036 specification, and make it worse by not setting important SKIP_WORKTREE ··· 1046 the sparse specification, but then it will be important to set the 1047 SKIP_WORKTREE bits appropriately. 1048 1049 - 6. Performance issues; see: 1050 - https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/ 1051 1052 1053 - === Reference Emails === 1054 1055 Emails that detail various bugs we've had in sparse-checkout: 1056 1057 - [1] (Original descriptions of behavior A & behavior B) 1058 - https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/ 1059 - [2] (Fix stash applications in sparse checkouts; bugs from behavioral differences) 1060 - https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/ 1061 - [3] (Present-despite-skipped entries) 1062 - https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/ 1063 - [4] (Clone --no-checkout interaction) 1064 - https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout) 1065 - [5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`) 1066 - https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/ 1067 - [6] (SKIP_WORKTREE is advisory, not mandatory) 1068 - https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/ 1069 - [7] (`worktree add` should copy sparsity settings from current worktree) 1070 - https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/ 1071 - [8] (Avoid negative surprises in add, rm, and mv) 1072 - https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/ 1073 - 
https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/ 1074 - [9] (Move from out-of-cone to in-cone) 1075 - https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/ 1076 - https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/ 1077 - [10] (Unnecessarily downloading objects outside sparse specification) 1078 - https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/ 1079 1080 - [11] (Stolee's comments on high-level usecases) 1081 - https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/ 1082 1083 [12] Others commenting on eventually switching default to behavior A: 1084 * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/ 1085 * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/ 1086 * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/ 1087 1088 - [13] Previous config name suggestion and description 1089 - * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/ 1090 1091 [14] Tangential issue: switch to cone mode as default sparse specification mechanism: 1092 - https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/ 1093 1094 [15] Lengthy email on grep behavior, covering what should be searched: 1095 - * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/ 1096 1097 [16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations, 1098 search for the parenthetical comment starting "We do not check". 1099 - https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/ 1100 1101 [17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

··· 14 * Reference Emails 15 16 17 + == Terminology == 18 19 + *`cone mode`*:: 20 + one of two modes for specifying the desired subset of files 21 in a sparse-checkout. In cone-mode, the user specifies 22 directories (getting both everything under that directory as 23 well as everything in leading directories), while in non-cone 24 mode, the user specifies gitignore-style patterns. Controlled 25 by the --[no-]cone option to sparse-checkout init|set. 26 27 + *`SKIP_WORKTREE`*:: 28 + When tracked files do not match the sparse specification and 29 are removed from the working tree, the file in the index is marked 30 with a SKIP_WORKTREE bit. Note that if a tracked file has the 31 SKIP_WORKTREE bit set but the file is later written by the user to 32 the working tree anyway, the SKIP_WORKTREE bit will be cleared at 33 the beginning of any subsequent Git operation. 34 + + 35 + Most sparse checkout users are unaware of this implementation 36 + detail, and the term should generally be avoided in user-facing 37 + descriptions and command flags. Unfortunately, prior to the 38 + `sparse-checkout` subcommand this low-level detail was exposed, 39 + and as of time of writing, is still exposed in various places. 40 41 + *`sparse-checkout`*:: 42 + a subcommand in git used to reduce the files present in 43 the working tree to a subset of all tracked files. Also, the 44 name of the file in the $GIT_DIR/info directory used to track 45 the sparsity patterns corresponding to the user's desired 46 subset. 47 48 + *`sparse cone`*:: see cone mode 49 50 + *`sparse directory`*:: 51 + An entry in the index corresponding to a directory, which 52 appears in the index instead of all the files under that directory 53 that would normally appear. See also sparse-index. Something that 54 can cause confusion is that the "sparse directory" does NOT match ··· 56 working tree. May be renamed in the future (e.g. to "skipped 57 directory"). 
58 59 + *`sparse index`*:: 60 + A special mode for sparse-checkout that also makes the 61 index sparse by recording a directory entry in lieu of all the 62 files underneath that directory (thus making that a "skipped 63 directory" which unfortunately has also been called a "sparse ··· 65 directories. Controlled by the --[no-]sparse-index option to 66 init|set|reapply. 67 68 + *`sparsity patterns`*:: 69 + patterns from $GIT_DIR/info/sparse-checkout used to 70 define the set of files of interest. A warning: It is easy to 71 over-use this term (or the shortened "patterns" term), for two 72 reasons: (1) users in cone mode specify directories rather than ··· 76 transiently differ in the working tree or index from the sparsity 77 patterns (see "Sparse specification vs. sparsity patterns"). 78 79 + *`sparse specification`*:: 80 + The set of paths in the user's area of focus. This 81 is typically just the tracked files that match the sparsity 82 patterns, but the sparse specification can temporarily differ and 83 include additional files. (See also "Sparse specification ··· 94 * If working with the index and the working copy, the sparse 95 specification is the union of the paths from above. 96 97 + *`vivifying`*:: 98 + When a command restores a tracked file to the working tree (and 99 hopefully also clears the SKIP_WORKTREE bit in the index for that 100 file), this is referred to as "vivifying" the file. 101 102 103 + == Purpose of sparse-checkouts == 104 105 sparse-checkouts exist to allow users to work with a subset of their 106 files. ··· 128 half dozen different ways. 
Let's start by considering the high level 129 usecases: 130 131 + [horizontal] 132 + A):: Users are _only_ interested in the sparse portion of the repo 133 + A*):: Users are _only_ interested in the sparse portion of the repo 134 + that they have downloaded so far 135 + B):: Users want a sparse working tree, but are working in a larger whole 136 + C):: sparse-checkout is a behind-the-scenes implementation detail allowing 137 Git to work with a specially crafted in-house virtual file system; 138 users are actually working with a "full" working tree that is 139 lazily populated, and sparse-checkout helps with the lazy population ··· 142 It may be worth explaining each of these in a bit more detail: 143 144 145 + === (Behavior A) Users are _only_ interested in the sparse portion of the repo 146 147 These folks might know there are other things in the repository, but 148 don't care. They are uninterested in other parts of the repository, and ··· 169 after a merge or pull) can lead to worries about local repository size 170 growing unnecessarily[10]. 171 172 + === (Behavior A*) Users are _only_ interested in the sparse portion of the repo that they have downloaded so far (a variant on the first usecase) 173 174 This variant is driven by folks who using partial clones together with 175 sparse checkouts and do disconnected development (so far sounding like a ··· 178 through history within their sparse specification may be too much, so they 179 only download some. They would still like operations to succeed without 180 network connectivity, though, so things like `git log -S${SEARCH_TERM} -p` 181 + or `git grep ${SEARCH_TERM} OLDREV` would need to be prepared to provide 182 partial results that depend on what happens to have been downloaded. 183 184 This variant could be viewed as Behavior A with the sparse specification 185 for history querying operations modified from "sparsity patterns" to 186 "sparsity patterns limited to the blobs we have already downloaded". 
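As a rough illustration of the modified definition described above for Behavior A*, the sketch below treats the sparse specification for history queries as the intersection of the paths matching the sparsity patterns with the paths whose blobs are already available locally. The function name, the use of directory prefixes as stand-ins for sparsity patterns, and the `downloaded` set are all assumptions made for illustration; none of this reflects how Git stores or computes either set.

```python
from typing import Iterable, Set


def history_sparse_spec(paths: Iterable[str],
                        sparse_dirs: Iterable[str],
                        downloaded: Set[str]) -> Set[str]:
    """Return the paths a Behavior A* history query might consider.

    Hypothetical model: `sparse_dirs` are directory prefixes standing in
    for sparsity patterns, and `downloaded` is the set of paths whose
    blobs are already present in the local object store.
    """
    prefixes = tuple(d.rstrip("/") + "/" for d in sparse_dirs)
    in_patterns = {p for p in paths if p.startswith(prefixes)}
    # Behavior A would stop at `in_patterns`; A* additionally drops any
    # path that would require a network fetch to inspect.
    return in_patterns & downloaded
```

Under this model a disconnected query returns partial results by construction: a path that matches the patterns but has not been downloaded is left out of the result without raising an error.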
187 188 + === (Behavior B) Users want a sparse working tree, but are working in a larger whole 189 190 Stolee described this usecase this way[11]: 191 ··· 233 prefer getting "unrelated" results from their history queries over having 234 slow commands. 235 236 + === (Behavior C) sparse-checkout is an implementational detail supporting a special VFS. 237 238 This usecase goes slightly against the traditional definition of 239 sparse-checkout in that it actually tries to present a full or dense ··· 258 all files are present. 259 260 261 + == Usecases of primary concern == 262 263 Most of the rest of this document will focus on Behavior A and Behavior 264 B. Some notes about the other two cases and why we are not focusing on 265 them: 266 267 + === (Behavior A*) 268 269 Supporting this usecase is estimated to be difficult and a lot of work. 270 There are no plans to implement it currently, but it may be a potential ··· 278 sparse specification to restrict it to already-downloaded blobs. The hard 279 part is in making commands capable of respecting that modified definition. 280 281 + === (Behavior C) 282 283 This usecase violates some of the early sparse-checkout documented 284 assumptions (since files marked as SKIP_WORKTREE will be displayed to users ··· 303 patches that break things for the real Behavior B folks. 304 305 306 + == Oversimplified mental models == 307 308 An oversimplification of the differences in the above behaviors is: 309 310 + (Behavior A):: Restrict worktree and history operations to sparse specification 311 + (Behavior B):: Restrict worktree operations to sparse specification; have any 312 + history operations work across all files 313 + (Behavior C):: Do not restrict either worktree or history operations to the 314 + sparse specification...with the exception of branch checkouts or 315 + switches which avoid writing files that will match the index so 316 + they can later lazily be populated instead. 
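The oversimplified mental models above can be written down as a small lookup table. This is only a sketch: the `Scope` type, the `MENTAL_MODEL` mapping, and the "worktree"/"history" labels are names invented here, and the Behavior C exception for branch checkouts and switches is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    restrict_worktree: bool  # limit worktree operations to the sparse specification
    restrict_history: bool   # limit history operations to the sparse specification


# Oversimplified on purpose, mirroring the three behaviors listed above.
MENTAL_MODEL = {
    "A": Scope(restrict_worktree=True, restrict_history=True),
    "B": Scope(restrict_worktree=True, restrict_history=False),
    # The checkout/switch exception for Behavior C is not represented.
    "C": Scope(restrict_worktree=False, restrict_history=False),
}


def is_restricted(behavior: str, operation_kind: str) -> bool:
    """Report whether an operation kind is limited to the sparse specification.

    `operation_kind` is "worktree" or "history" (labels assumed here).
    """
    scope = MENTAL_MODEL[behavior]
    if operation_kind == "worktree":
        return scope.restrict_worktree
    if operation_kind == "history":
        return scope.restrict_history
    raise ValueError(f"unknown operation kind: {operation_kind!r}")
```

Read this way, Behaviors A and B differ only in the `restrict_history` column.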
317 318 319 + == Desired behavior == 320 321 As noted previously, despite the simple idea of just working with a subset 322 of files, there are a range of different behavioral changes that need to be ··· 329 330 * Commands behaving the same regardless of high-level use-case 331 332 + ** commands that only look at files within the sparsity specification 333 334 + *** diff (without --cached or REVISION arguments) 335 + *** grep (without --cached or REVISION arguments) 336 + *** diff-files 337 338 + ** commands that restore files to the working tree that match sparsity 339 patterns, and remove unmodified files that don't match those 340 patterns: 341 342 + *** switch 343 + *** checkout (the switch-like half) 344 + *** read-tree 345 + *** reset --hard 346 347 + ** commands that write conflicted files to the working tree, but otherwise 348 will omit writing files to the working tree that do not match the 349 sparsity patterns: 350 351 + *** merge 352 + *** rebase 353 + *** cherry-pick 354 + *** revert 355 356 + *** `am` and `apply --cached` should probably be in this section but 357 are buggy (see the "Known bugs" section below) 358 359 The behavior for these commands somewhat depends upon the merge 360 strategy being used: 361 + 362 + *** `ort` behaves as described above 363 + *** `octopus` and `resolve` will always vivify any file changed in the merge 364 relative to the first parent, which is rather suboptimal. 365 366 It is also important to note that these commands WILL update the index ··· 376 specification and the sparsity patterns (much like the commands in the 377 previous section). 
378 379 + ** commands that always ignore sparsity since commits must be full-tree 380 381 + *** archive 382 + *** bundle 383 + *** commit 384 + *** format-patch 385 + *** fast-export 386 + *** fast-import 387 + *** commit-tree 388 389 + ** commands that write any modified file to the working tree (conflicted 390 or not, and whether those paths match sparsity patterns or not): 391 392 + *** stash 393 + *** apply (without `--index` or `--cached`) 394 395 * Commands that may slightly differ for behavior A vs. behavior B: 396 ··· 398 behaviors, but may differ in verbosity and types of warning and error 399 messages. 400 401 + ** commands that make modifications to which files are tracked: 402 + 403 + *** add 404 + *** rm 405 + *** mv 406 + *** update-index 407 408 The fact that files can move between the 'tracked' and 'untracked' 409 categories means some commands will have to treat untracked files 410 differently. But if we have to treat untracked files differently, 411 then additional commands may also need changes: 412 413 + *** status 414 + *** clean 415 416 In particular, `status` may need to report any untracked files outside 417 the sparsity specification as an erroneous condition (especially to ··· 425 may need to ignore the sparse specification by its nature. Also, its 426 current --[no-]ignore-skip-worktree-entries default is totally bogus. 427 428 + ** commands for manually tweaking paths in both the index and the working tree 429 + 430 + *** `restore` 431 + *** the restore-like half of `checkout` 432 433 These commands should be similar to add/rm/mv in that they should 434 only operate on the sparse specification by default, and require a ··· 439 440 * Commands that significantly differ for behavior A vs. 
behavior B: 441 442 + ** commands that query history 443 + 444 + *** diff (with --cached or REVISION arguments) 445 + *** grep (with --cached or REVISION arguments) 446 + *** show (when given commit arguments) 447 + *** blame (only matters when one or more -C flags are passed) 448 + **** and annotate 449 + *** log 450 + *** whatchanged (may not exist anymore) 451 + *** ls-files 452 + *** diff-index 453 + *** diff-tree 454 + *** ls-tree 455 456 Note: for log and whatchanged, revision walking logic is unaffected 457 but displaying of patches is affected by scoping the command to the ··· 465 466 * Commands I don't know how to classify 467 468 + ** range-diff 469 470 Is this like `log` or `format-patch`? 471 472 + ** cherry 473 474 See range-diff 475 476 * Commands unaffected by sparse-checkouts 477 478 + ** shortlog 479 + ** show-branch 480 + ** rev-list 481 + ** bisect 482 483 + ** branch 484 + ** describe 485 + ** fetch 486 + ** gc 487 + ** init 488 + ** maintenance 489 + ** notes 490 + ** pull (merge & rebase have the necessary changes) 491 + ** push 492 + ** submodule 493 + ** tag 494 495 + ** config 496 + ** filter-branch (works in separate checkout without sparse-checkout setup) 497 + ** pack-refs 498 + ** prune 499 + ** remote 500 + ** repack 501 + ** replace 502 503 + ** bugreport 504 + ** count-objects 505 + ** fsck 506 + ** gitweb 507 + ** help 508 + ** instaweb 509 + ** merge-tree (doesn't touch worktree or index, and merges always compute full-tree) 510 + ** rerere 511 + ** verify-commit 512 + ** verify-tag 513 514 + ** commit-graph 515 + ** hash-object 516 + ** index-pack 517 + ** mktag 518 + ** mktree 519 + ** multi-pack-index 520 + ** pack-objects 521 + ** prune-packed 522 + ** symbolic-ref 523 + ** unpack-objects 524 + ** update-ref 525 + ** write-tree (operates on index, possibly optimized to use sparse dir entries) 526 527 + ** for-each-ref 528 + ** get-tar-commit-id 529 + ** ls-remote 530 + ** merge-base (merges are computed full tree, so merge base 
should be too) 531 + ** name-rev 532 + ** pack-redundant 533 + ** rev-parse 534 + ** show-index 535 + ** show-ref 536 + ** unpack-file 537 + ** var 538 + ** verify-pack 539 540 + ** <Everything under 'Interacting with Others' in 'git help --all'> 541 + ** <Everything under 'Low-level...Syncing' in 'git help --all'> 542 + ** <Everything under 'Low-level...Internal Helpers' in 'git help --all'> 543 + ** <Everything under 'External commands' in 'git help --all'> 544 545 * Commands that might be affected, but who cares? 546 547 + ** merge-file 548 + ** merge-index 549 + ** gitk? 550 551 552 + == Behavior classes == 553 554 From the above there are a few classes of behavior: 555 ··· 580 581 Commands in this class generally behave like the "restrict" class, 582 except that: 583 + 584 + (1) they will ignore the sparse specification and write files with 585 + conflicts to the working tree (thus temporarily expanding the 586 + sparse specification to include such files.) 587 + (2) they are grouped with commands which move to a new commit, since 588 + they often create a commit and then move to it, even though we 589 + know there are many exceptions to moving to the new commit. (For 590 + example, the user may rebase a commit that becomes empty, or have 591 + a cherry-pick which conflicts, or a user could run `merge 592 + --no-commit`, and we also view `apply --index` kind of like `am 593 + --no-commit`.) As such, these commands can make changes to index 594 + files outside the sparse specification, though they'll mark such 595 + files with SKIP_WORKTREE. 596 597 * "restrict also specially applied to untracked files" 598 ··· 617 specification. 
618 619 620 + == Subcommand-dependent defaults == 621 622 Note that we have different defaults depending on the command for the 623 desired behavior : 624 625 * Commands defaulting to "restrict": 626 + 627 + ** diff-files 628 + ** diff (without --cached or REVISION arguments) 629 + ** grep (without --cached or REVISION arguments) 630 + ** switch 631 + ** checkout (the switch-like half) 632 + ** reset (<commit>) 633 634 + ** restore 635 + ** checkout (the restore-like half) 636 + ** checkout-index 637 + ** reset (with pathspec) 638 639 This behavior makes sense; these interact with the working tree. 640 641 * Commands defaulting to "restrict modulo conflicts": 642 + 643 + ** merge 644 + ** rebase 645 + ** cherry-pick 646 + ** revert 647 648 + ** am 649 + ** apply --index (which is kind of like an `am --no-commit`) 650 651 + ** read-tree (especially with -m or -u; is kind of like a --no-commit merge) 652 + ** reset (<tree-ish>, due to similarity to read-tree) 653 654 These also interact with the working tree, but require slightly 655 different behavior either so that (a) conflicts can be resolved or (b) ··· 658 (See also the "Known bugs" section below regarding `am` and `apply`) 659 660 * Commands defaulting to "no restrict": 661 662 + ** archive 663 + ** bundle 664 + ** commit 665 + ** format-patch 666 + ** fast-export 667 + ** fast-import 668 + ** commit-tree 669 + 670 + ** stash 671 + ** apply (without `--index`) 672 673 These have completely different defaults and perhaps deserve the most 674 detailed explanation: ··· 690 sparse specification then we'll lose changes from the user. 691 692 * Commands defaulting to "restrict also specially applied to untracked files": 693 + 694 + ** add 695 + ** rm 696 + ** mv 697 + ** update-index 698 + ** status 699 + ** clean (?) 700 + 701 + .... 
702 + Our original implementation for the first three of these commands was 703 + "no restrict", but it had some severe usability issues: 704 705 + * `git add <somefile>` if honored and outside the sparse 706 + specification, can result in the file randomly disappearing later 707 + when some subsequent command is run (since various commands 708 + automatically clean up unmodified files outside the sparse 709 + specification). 710 + * `git rm '*.jpg'` could very negatively surprise users if it deletes 711 + files outside the range of the user's interest. 712 + * `git mv` has similar surprises when moving into or out of the cone, 713 + so best to restrict by default 714 715 + So, we switched `add` and `rm` to default to "restrict", which made 716 + usability problems much less severe and less frequent, but we still got 717 + complaints because commands like: 718 719 + git add <file-outside-sparse-specification> 720 + git rm <file-outside-sparse-specification> 721 722 + would silently do nothing. We should instead print an error in those 723 + cases to get usability right. 724 + 725 + update-index needs to be updated to match, and status and maybe clean 726 + also need to be updated to specially handle untracked paths. 727 + 728 + There may be a difference in here between behavior A and behavior B in 729 + terms of verboseness of errors or additional warnings. 730 + .... 731 732 * Commands falling under "restrict or no restrict dependent upon behavior 733 A vs. behavior B" 734 735 + ** diff (with --cached or REVISION arguments) 736 + ** grep (with --cached or REVISION arguments) 737 + ** show (when given commit arguments) 738 + ** blame (only matters when one or more -C flags passed) 739 + *** and annotate 740 + ** log 741 + *** and variants: shortlog, gitk, show-branch, whatchanged, rev-list 742 + ** ls-files 743 + ** diff-index 744 + ** diff-tree 745 + ** ls-tree 746 747 For now, we default to behavior B for these, which want a default of 748 "no restrict". 
··· 766 implemented. 767 768 769 + == Sparse specification vs. sparsity patterns == 770 771 In a well-behaved situation, the sparse specification is given directly 772 by the $GIT_DIR/info/sparse-checkout file. However, it can transiently ··· 838 operate full-tree. 839 840 841 + == Implementation Questions == 842 843 + * Do the options --scope={sparse,all} sound good to others? Are there better options? 844 + 845 + ** Names in use, or appearing in patches, or previously suggested: 846 + 847 + *** --sparse/--dense 848 + *** --ignore-skip-worktree-bits 849 + *** --ignore-skip-worktree-entries 850 + *** --ignore-sparsity 851 + *** --[no-]restrict-to-sparse-paths 852 + *** --full-tree/--sparse-tree 853 + *** --[no-]restrict 854 + *** --scope={sparse,all} 855 + *** --focus/--unfocus 856 + *** --limit/--unlimited 857 + 858 + ** Rationale making me lean slightly towards --scope={sparse,all}: 859 + 860 + *** We want a name that works for many commands, so we need a name that 861 does not conflict 862 + *** We know that we have more than two possible usecases, so it is best 863 to avoid a flag that appears to be binary. 864 + *** --scope={sparse,all} isn't overly long and seems relatively 865 explanatory 866 + *** `--sparse`, as used in add/rm/mv, is totally backwards for 867 grep/log/etc. Changing the meaning of `--sparse` for these 868 commands would fix the backwardness, but possibly break existing 869 scripts. Using a new name pairing would allow us to treat 870 `--sparse` in these commands as a deprecated alias. 
871 + *** There is a different `--sparse`/`--dense` pair for commands using 872 revision machinery, so using that naming might cause confusion 873 + *** There is also a `--sparse` in both pack-objects and show-branch, which 874 don't conflict but do suggest that `--sparse` is overloaded 875 + *** The name --ignore-skip-worktree-bits is a double negative, is 876 quite a mouthful, refers to an implementation detail that many 877 users may not be familiar with, and we'd need a negation for it 878 which would probably be even more ridiculously long. (But we 879 can make --ignore-skip-worktree-bits a deprecated alias for 880 --no-restrict.) 881 882 + ** If a config option is added (sparse.scope?) what should the values and 883 description be? "sparse" (behavior A), "worktree-sparse-history-dense" 884 (behavior B), "dense" (behavior C)? There's a risk of confusion, 885 because even for Behaviors A and B we want some commands to be ··· 888 the primary difference we are focusing is just the history-querying 889 commands (log/diff/grep). Previous config suggestion here: [13] 890 891 + ** Is `--no-expand` a good alias for ls-files's `--sparse` option? 892 (`--sparse` does not map to either `--scope=sparse` or `--scope=all`, 893 because in non-cone mode it does nothing and in cone-mode it shows the 894 sparse directory entries which are technically outside the sparse 895 specification) 896 897 + ** Under Behavior A: 898 + 899 + *** Does ls-files' `--no-expand` override the default `--scope=all`, or 900 does it need an extra flag? 901 + *** Does ls-files' `-t` option imply `--scope=all`? 902 + *** Does update-index's `--[no-]skip-worktree` option imply `--scope=all`? 903 904 + ** sparse-checkout: once behavior A is fully implemented, should we take 905 an interim measure to ease people into switching the default? Namely, 906 if folks are not already in a sparse checkout, then require 907 `sparse-checkout init/set` to take a ··· 913 is seamless for them. 
914 915 916 + == Implementation Goals/Plans == 917 918 * Get buy-in on this document in general. 919 ··· 931 request that they not trigger this bug." flag 932 933 * Flags & Config 934 + 935 + ** Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all` 936 + ** Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore 937 a deprecated aliases for `--scope=all` 938 + ** Create config option (sparse.scope?), tie it to the "Cliff notes" 939 overview 940 941 + ** Add --scope=sparse (and --scope=all) flag to each of the history querying 942 commands. IMPORTANT: make sure diff machinery changes don't mess with 943 format-patch, fast-export, etc. 944 945 + == Known bugs == 946 947 This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've 948 been working on it. 949 950 + 1. Behavior A is not well supported in Git. (Behavior B didn't used to 951 be either, but was the easier of the two to implement.) 952 953 + 2. am and apply: 954 955 apply, without `--index` or `--cached`, relies on files being present 956 in the working copy, and also writes to them unconditionally. As ··· 970 files and then complain that those vivified files would be 971 overwritten by merge. 972 973 + 3. reset --hard: 974 975 reset --hard provides confusing error message (works correctly, but 976 misleads the user into believing it didn't): ··· 993 `git reset --hard` DID remove addme from the index and the working tree, contrary 994 to the error message, but in line with how reset --hard should behave. 995 996 + 4. read-tree 997 998 `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the 999 entries it reads into the index, resulting in all your files suddenly 1000 appearing to be "deleted". 1001 1002 + 5. Checkout, restore: 1003 1004 These command do not handle path & revision arguments appropriately: 1005 ··· 1052 S tracked 1053 H tracked-but-maybe-skipped 1054 1055 + 6. 
checkout and restore --staged, continued: 1056 1057 These commands do not correctly scope operations to the sparse 1058 specification, and make it worse by not setting important SKIP_WORKTREE ··· 1068 the sparse specification, but then it will be important to set the 1069 SKIP_WORKTREE bits appropriately. 1070 1071 + 7. Performance issues; see: 1072 1073 + https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/ 1074 1075 + 1076 + == Reference Emails == 1077 1078 Emails that detail various bugs we've had in sparse-checkout: 1079 1080 + [1] (Original descriptions of behavior A & behavior B): 1081 + 1082 + https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/ 1083 + 1084 + [2] (Fix stash applications in sparse checkouts; bugs from behavioral differences): 1085 + 1086 + https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/ 1087 + 1088 + [3] (Present-despite-skipped entries): 1089 + 1090 + https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/ 1091 1092 + [4] (Clone --no-checkout interaction): 1093 + 1094 + https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout) 1095 + 1096 + [5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`): 1097 + 1098 + https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/ 1099 + 1100 + [6] (SKIP_WORKTREE is advisory, not mandatory): 1101 + 1102 + https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/ 1103 + 1104 + [7] (`worktree add` should copy sparsity settings from current worktree): 1105 + 1106 + https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/ 1107 + 1108 + [8] (Avoid negative surprises in add, rm, and mv): 1109 + 1110 + * 
https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/ 1111 + * https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/ 1112 + 1113 + [9] (Move from out-of-cone to in-cone): 1114 + 1115 + * https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/ 1116 + * https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/ 1117 + 1118 + [10] (Unnecessarily downloading objects outside sparse specification): 1119 + 1120 + https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/ 1121 + 1122 + [11] (Stolee's comments on high-level usecases): 1123 + 1124 + https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/ 1125 1126 [12] Others commenting on eventually switching default to behavior A: 1127 + 1128 * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/ 1129 * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/ 1130 * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/ 1131 1132 + [13] Previous config name suggestion and description: 1133 + 1134 + https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/ 1135 1136 [14] Tangential issue: switch to cone mode as default sparse specification mechanism: 1137 + 1138 + https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/ 1139 1140 [15] Lengthy email on grep behavior, covering what should be searched: 1141 + 1142 + https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/ 1143 1144 [16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations, 1145 search for the parenthetical comment starting "We do not check". 1146 + 1147 + https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/ 1148 1149 [17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/