
Merge branch 'en/sparse-checkout-design'

Design doc.

* en/sparse-checkout-design:
sparse-checkout.txt: new document with sparse-checkout directions

+1103
Documentation/technical/sparse-checkout.txt
Table of contents:

  * Terminology
  * Purpose of sparse-checkouts
  * Usecases of primary concern
  * Oversimplified mental models ("Cliff Notes" for this document!)
  * Desired behavior
  * Behavior classes
  * Subcommand-dependent defaults
  * Sparse specification vs. sparsity patterns
  * Implementation Questions
  * Implementation Goals/Plans
  * Known bugs
  * Reference Emails


=== Terminology ===

cone mode: one of two modes for specifying the desired subset of files
        in a sparse-checkout. In cone mode, the user specifies
        directories (getting both everything under that directory as
        well as everything in leading directories), while in non-cone
        mode, the user specifies gitignore-style patterns. Controlled
        by the --[no-]cone option to sparse-checkout init|set.

SKIP_WORKTREE: When tracked files do not match the sparse specification and
        are removed from the working tree, the file in the index is marked
        with a SKIP_WORKTREE bit. Note that if a tracked file has the
        SKIP_WORKTREE bit set but the file is later written by the user to
        the working tree anyway, the SKIP_WORKTREE bit will be cleared at
        the beginning of any subsequent Git operation.

        Most sparse checkout users are unaware of this implementation
        detail, and the term should generally be avoided in user-facing
        descriptions and command flags. Unfortunately, prior to the
        `sparse-checkout` subcommand this low-level detail was exposed,
        and, as of the time of writing, it is still exposed in various
        places.

sparse-checkout: a subcommand in git used to reduce the files present in
        the working tree to a subset of all tracked files. Also, the
        name of the file in the $GIT_DIR/info directory used to track
        the sparsity patterns corresponding to the user's desired
        subset.

sparse cone: see cone mode

sparse directory: An entry in the index corresponding to a directory, which
        appears in the index instead of all the files under that directory
        that would normally appear. See also sparse index. Something that
        can cause confusion is that the "sparse directory" does NOT match
        the sparse specification, i.e. the directory is NOT present in the
        working tree. May be renamed in the future (e.g. to "skipped
        directory").

sparse index: A special mode for sparse-checkout that also makes the
        index sparse by recording a directory entry in lieu of all the
        files underneath that directory (thus making that a "skipped
        directory" which unfortunately has also been called a "sparse
        directory"), and does this for potentially multiple
        directories. Controlled by the --[no-]sparse-index option to
        init|set|reapply.

sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
        define the set of files of interest. A warning: It is easy to
        over-use this term (or the shortened "patterns" term), for two
        reasons: (1) users in cone mode specify directories rather than
        patterns (their directories are transformed into patterns, but
        users may think you are talking about non-cone mode if you use the
        word "patterns"), and (2) the sparse specification might
        transiently differ in the working tree or index from the sparsity
        patterns (see "Sparse specification vs. sparsity patterns").

sparse specification: The set of paths in the user's area of focus. This
        is typically just the tracked files that match the sparsity
        patterns, but the sparse specification can temporarily differ and
        include additional files. (See also "Sparse specification
        vs. sparsity patterns")

        * When working with history, the sparse specification is exactly
          the set of files matching the sparsity patterns.
        * When interacting with the working tree, the sparse specification
          is the set of tracked files with a clear SKIP_WORKTREE bit or
          tracked files present in the working copy.
        * When modifying or showing results from the index, the sparse
          specification is the set of files with a clear SKIP_WORKTREE bit
          or that differ in the index from HEAD.
        * If working with the index and the working copy, the sparse
          specification is the union of the paths from above.

vivifying: When a command restores a tracked file to the working tree (and
        hopefully also clears the SKIP_WORKTREE bit in the index for that
        file), this is referred to as "vivifying" the file.


=== Purpose of sparse-checkouts ===

sparse-checkouts exist to allow users to work with a subset of their
files.

You can think of sparse-checkouts as subdividing "tracked" files into two
categories -- a sparse subset, and all the rest. Implementationally, we
mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
out of the working tree. The SKIP_WORKTREE files are still tracked, just
not present in the working tree.

In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
is missing from the working tree but pretend the file contents match HEAD".
That was not only bogus (it actually meant the file missing from the
working tree matched the index rather than HEAD), but it was also a
low-level detail which only provided decent behavior for a few commands.
There were a surprising number of ways in which that guiding principle gave
command results that violated user expectations, and as such was a bad
mental model. However, it persisted for many years and may still be found
in some corners of the code base.
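
As a rough illustration of the four context-dependent definitions of the
sparse specification given under Terminology, the following Python sketch
models them as set filters. The `Entry` record, the `Context` enum, and the
example paths are hypothetical names invented for this sketch; none of them
are Git data structures, and real Git computes this from the index and the
working tree rather than from precomputed flags.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Context(Enum):
    HISTORY = auto()
    WORKTREE = auto()
    INDEX = auto()
    INDEX_AND_WORKTREE = auto()


@dataclass(frozen=True)
class Entry:
    """Hypothetical per-path state for a tracked file (not a Git structure)."""
    path: str
    matches_patterns: bool         # path matches the sparsity patterns
    skip_worktree: bool            # SKIP_WORKTREE bit is set in the index
    in_worktree: bool              # file is present in the working tree
    index_differs_from_head: bool  # index entry differs from HEAD


def _in_worktree_spec(entry):
    # Clear SKIP_WORKTREE bit, or present in the working copy.
    return (not entry.skip_worktree) or entry.in_worktree


def _in_index_spec(entry):
    # Clear SKIP_WORKTREE bit, or differs in the index from HEAD.
    return (not entry.skip_worktree) or entry.index_differs_from_head


def sparse_specification(entries, context):
    """Return the set of paths in the sparse specification for a context."""
    if context is Context.HISTORY:
        keep = lambda e: e.matches_patterns
    elif context is Context.WORKTREE:
        keep = _in_worktree_spec
    elif context is Context.INDEX:
        keep = _in_index_spec
    else:
        # Index and working copy together: union of the two sets above.
        keep = lambda e: _in_worktree_spec(e) or _in_index_spec(e)
    return {e.path for e in entries if keep(e)}


# Made-up example: docs/b.txt is outside the patterns but was changed in the
# index (for instance by a merge), so it transiently joins the specification
# for index operations only.
EXAMPLE = [
    Entry("src/a.c", True, False, True, False),
    Entry("docs/b.txt", False, True, False, True),
    Entry("lib/c.c", False, True, False, False),
]
```

With the example data, history queries see only `src/a.c`, while index
operations also see `docs/b.txt`.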

Anyway, the idea of "working with a subset of files" is simple enough, but
there are multiple different high-level usecases which affect how some Git
subcommands should behave. Further, even if we only considered one of
those usecases, sparse-checkouts can modify different subcommands in over a
half dozen different ways. Let's start by considering the high-level
usecases:

  A) Users are _only_ interested in the sparse portion of the repo

  A*) Users are _only_ interested in the sparse portion of the repo
      that they have downloaded so far

  B) Users want a sparse working tree, but are working in a larger whole

  C) sparse-checkout is a behind-the-scenes implementation detail allowing
     Git to work with a specially crafted in-house virtual file system;
     users are actually working with a "full" working tree that is
     lazily populated, and sparse-checkout helps with the lazy population
     piece.

It may be worth explaining each of these in a bit more detail:


  (Behavior A) Users are _only_ interested in the sparse portion of the repo

These folks might know there are other things in the repository, but
don't care. They are uninterested in other parts of the repository, and
only want to know about changes within their area of interest. Showing
them other files from history (e.g. from diff/log/grep/etc.) is a
usability annoyance, potentially a huge one since other changes in
history may dwarf the changes they are interested in.

Some of these users also arrive at this usecase from wanting to use partial
clones together with sparse checkouts (in a way where they have downloaded
blobs within the sparse specification) and do disconnected development.
Not only do these users generally not care about other parts of the
repository, but they consider it a blocker for Git commands to try to
operate on those. If commands attempt to access paths in history outside
the sparse specification, then the partial clone will attempt to download
additional blobs on demand, fail, and then fail the user's command. (This
may be unavoidable in some cases, e.g. when `git merge` has non-trivial
changes to reconcile outside the sparse specification, but we should limit
how often users are forced to connect to the network.)

Also, even for users using partial clones that do not mind being
always connected to the network, the need to download blobs as
side-effects of various other commands (such as the printed diffstat
after a merge or pull) can lead to worries about local repository size
growing unnecessarily[10].

  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
      that they have downloaded so far (a variant on the first usecase)

This variant is driven by folks who are using partial clones together with
sparse checkouts and doing disconnected development (so far sounding like a
subset of behavior A users), and doing so on very large repositories. The
reason for yet another variant is that downloading even just the blobs
through history within their sparse specification may be too much, so they
only download some. They would still like operations to succeed without
network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
or `git grep ${SEARCH_TERM} OLDREV` would need to be prepared to provide
partial results that depend on what happens to have been downloaded.

This variant could be viewed as Behavior A with the sparse specification
for history querying operations modified from "sparsity patterns" to
"sparsity patterns limited to the blobs we have already downloaded".

  (Behavior B) Users want a sparse working tree, but are working in a
      larger whole

Stolee described this usecase this way[11]:

"I'm also focused on users that know that they are a part of a larger
whole. They know they are operating on a large repository but focus on
what they need to contribute their part. I expect multiple "roles" to
use very different, almost disjoint parts of the codebase. Some other
"architect" users operate across the entire tree or hop between different
sections of the codebase as necessary. In this situation, I'm wary of
scoping too many features to the sparse-checkout definition, especially
"git log," as it can be too confusing to have their view of the codebase
depend on your "point of view."

People might also end up wanting behavior B due to complex inter-project
dependencies. The initial attempts to use sparse-checkouts usually involve
the directories you are directly interested in plus what those directories
depend upon within your repository. But there's a monkey wrench here: if
you have integration tests, they invert the hierarchy: to run integration
tests, you need not only what you are interested in and its in-tree
dependencies, you also need everything that depends upon what you are
interested in or that depends upon one of your dependencies...AND you need
all the in-tree dependencies of that expanded group. That can easily
change your sparse-checkout into a nearly dense one.

Naturally, that tends to kill the benefits of sparse-checkouts.
There are a couple of solutions to this conundrum: either avoid grabbing
in-repo dependencies (maybe have built versions of your in-repo
dependencies pulled from a CI cache somewhere), or say that users
shouldn't run integration tests directly and instead do it on the CI
server when they submit a code review. Or do both. Regardless of whether
you stub out your in-repo dependencies or stub out the things that depend
upon you, there is certainly a reason to want to query and be aware of
those other stubbed-out parts of the repository, particularly when the
dependencies are complex or change relatively frequently. Thus, for such
uses, sparse-checkouts can be used to limit what you directly build and
modify, but these users do not necessarily want their sparse checkout
paths to limit their queries of versions in history.

Some people may also be interested in behavior B over behavior A simply as
a performance workaround: if they are using non-cone mode, then they have
to deal with its inherent quadratic performance problems. In that mode,
every operation that checks whether paths match the sparse specification
can be expensive. As such, these users may only be willing to pay for
those expensive checks when interacting with the working copy, and may
prefer getting "unrelated" results from their history queries over having
slow commands.

  (Behavior C) sparse-checkout is an implementational detail supporting a
      special VFS.

This usecase goes slightly against the traditional definition of
sparse-checkout in that it actually tries to present a full or dense
checkout to the user. However, this usecase uses the same technical
underpinnings in a new way which does provide some performance
advantages to users.
The basic idea is that a company can have an in-house
Git-aware Virtual File System which pretends all files are present in the
working tree, by intercepting all file system accesses and using those to
fetch and write accessed files on demand via partial clones. The VFS uses
sparse-checkout to prevent Git from writing or paying attention to many
files, and manually updates the sparse checkout patterns itself based on
user access and modification of files in the working tree. See commit
ecc7c8841d ("repo_read_index: add config to expect files outside sparse
patterns", 2022-02-25) and the link at [17] for a more detailed description
of such a VFS.

The biggest difference here is that users are completely unaware that the
sparse-checkout machinery is even in use. The sparse patterns are not
specified by the user but rather are under the complete control of the VFS
(and the patterns are updated frequently and dynamically by it). The user
will perceive the checkout as dense, and commands should thus behave as if
all files are present.


=== Usecases of primary concern ===

Most of the rest of this document will focus on Behavior A and Behavior
B. Some notes about the other two cases and why we are not focusing on
them:

  (Behavior A*)

Supporting this usecase is estimated to be difficult and a lot of work.
There are no plans to implement it currently, but it may be a potential
future alternative. Knowing about the existence of additional alternatives
may affect our choice of command line flags (e.g. if we need tri-state or
quad-state flags rather than just binary flags), so it was still important
to at least note.

Further, I believe the descriptions below for Behavior A are probably still
valid for this usecase, with the only exception being that it redefines the
sparse specification to restrict it to already-downloaded blobs. The hard
part is in making commands capable of respecting that modified definition.

  (Behavior C)

This usecase violates some of the early sparse-checkout documented
assumptions (since files marked as SKIP_WORKTREE will be displayed to users
as present in the working tree). That violation may mean various
sparse-checkout related behaviors are not well suited to this usecase and
we may need tweaks -- to both documentation and code -- to handle it.
However, this usecase is also perhaps the simplest model to support in that
everything behaves like a dense checkout with a few exceptions (e.g. branch
checkouts and switches write fewer things, knowing the VFS will lazily
write the rest on an as-needed basis).

Since there is no publicly available VFS-related code for folks to try,
the number of folks who can test such a usecase is limited.

The primary reason to note the Behavior C usecase is that as we fix things
to better support Behaviors A and B, there may be additional places where
we need to make tweaks allowing folks in this usecase to get the original
non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
add config to expect files outside sparse patterns", 2022-02-25). The
secondary reason to note Behavior C is so that folks taking advantage of
Behavior C do not assume they are part of the Behavior B camp and propose
patches that break things for the real Behavior B folks.


=== Oversimplified mental models ===

An oversimplification of the differences in the above behaviors is:

  Behavior A: Restrict worktree and history operations to sparse specification
  Behavior B: Restrict worktree operations to sparse specification; have any
              history operations work across all files
  Behavior C: Do not restrict either worktree or history operations to the
              sparse specification...with the exception of branch checkouts or
              switches which avoid writing files that will match the index so
              they can later lazily be populated instead.


=== Desired behavior ===

As noted previously, despite the simple idea of just working with a subset
of files, there are a range of different behavioral changes that need to be
made to different subcommands to work well with such a feature. See
[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
that mere composition of other commands that individually worked correctly
in a sparse-checkout context did not imply that the higher level command
would work correctly; it sometimes requires further tweaks. So,
understanding these differences can be beneficial.
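
The oversimplified mental models for Behaviors A, B, and C can be written
down as a small decision function. This is only a sketch of the summary
given in that section: the `Behavior` and `Operation` enums and the function
name are made up for illustration, and the Behavior C exception for branch
checkouts and switches is deliberately left out.

```python
from enum import Enum


class Behavior(Enum):
    A = "A"
    B = "B"
    C = "C"


class Operation(Enum):
    WORKTREE = "worktree"
    HISTORY = "history"


def restricts_to_sparse_specification(behavior, operation):
    """Return True if the operation is limited to the sparse specification."""
    if behavior is Behavior.A:
        # Behavior A: both worktree and history operations are restricted.
        return True
    if behavior is Behavior.B:
        # Behavior B: only worktree operations are restricted.
        return operation is Operation.WORKTREE
    # Behavior C: nothing is restricted (checkout/switch exception ignored).
    return False
```

The only cell in which A and B disagree is the history one, which is why the
commands that query history are the ones that differ significantly between
the two.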

* Commands behaving the same regardless of high-level usecase

  * commands that only look at files within the sparse specification

    * diff (without --cached or REVISION arguments)
    * grep (without --cached or REVISION arguments)
    * diff-files

  * commands that restore files to the working tree that match sparsity
    patterns, and remove unmodified files that don't match those
    patterns:

    * switch
    * checkout (the switch-like half)
    * read-tree
    * reset --hard

  * commands that write conflicted files to the working tree, but otherwise
    will omit writing files to the working tree that do not match the
    sparsity patterns:

    * merge
    * rebase
    * cherry-pick
    * revert

    * `am` and `apply --cached` should probably be in this section but
      are buggy (see the "Known bugs" section below)

    The behavior for these commands somewhat depends upon the merge
    strategy being used:
      * `ort` behaves as described above
      * `recursive` tries to not vivify files unnecessarily, but does sometimes
        vivify files without conflicts.
      * `octopus` and `resolve` will always vivify any file changed in the merge
        relative to the first parent, which is rather suboptimal.

    It is also important to note that these commands WILL update the index
    outside the sparse specification relative to when the operation began,
    BUT these commands often make a commit just before or after such that
    by the end of the operation there is no change to the index outside the
    sparse specification. Of course, if the operation hits conflicts or
    does not make a commit, then these operations clearly can modify the
    index outside the sparse specification.

    Finally, it is important to note that at least the first four of these
    commands also try to remove differences between the sparse
    specification and the sparsity patterns (much like the commands in the
    previous section).

  * commands that always ignore sparsity since commits must be full-tree

    * archive
    * bundle
    * commit
    * format-patch
    * fast-export
    * fast-import
    * commit-tree

  * commands that write any modified file to the working tree (conflicted
    or not, and whether those paths match sparsity patterns or not):

    * stash
    * apply (without `--index` or `--cached`)

* Commands that may slightly differ for behavior A vs. behavior B:

  Commands in this category behave mostly the same between the two
  behaviors, but may differ in verbosity and types of warning and error
  messages.

  * commands that make modifications to which files are tracked:
    * add
    * rm
    * mv
    * update-index

    The fact that files can move between the 'tracked' and 'untracked'
    categories means some commands will have to treat untracked files
    differently. But if we have to treat untracked files differently,
    then additional commands may also need changes:

    * status
    * clean

    In particular, `status` may need to report any untracked files outside
    the sparse specification as an erroneous condition (especially to
    avoid the user trying to `git add` them, forcing `git add` to display
    an error).

    It's not clear to me exactly how (or even if) `clean` would change,
    but it's the other command that also affects untracked files.

    `update-index` may be slightly special. Its --[no-]skip-worktree flag
    may need to ignore the sparse specification by its nature.
    Also, its current --[no-]ignore-skip-worktree-entries default is
    totally bogus.

  * commands for manually tweaking paths in both the index and the working tree
    * `restore`
    * the restore-like half of `checkout`

    These commands should be similar to add/rm/mv in that they should
    only operate on the sparse specification by default, and require a
    special flag to operate on all files.

    Also, note that these commands currently have a number of issues (see
    the "Known bugs" section below)

* Commands that significantly differ for behavior A vs. behavior B:

  * commands that query history
    * diff (with --cached or REVISION arguments)
    * grep (with --cached or REVISION arguments)
    * show (when given commit arguments)
    * blame (only matters when one or more -C flags are passed)
      * and annotate
    * log
    * whatchanged
    * ls-files
    * diff-index
    * diff-tree
    * ls-tree

    Note: for log and whatchanged, revision walking logic is unaffected
    but displaying of patches is affected by scoping the command to the
    sparse-checkout. (The fact that revision walking is unaffected is
    why rev-list, shortlog, show-branch, and bisect are not in this
    list.)

    ls-files may be slightly special in that e.g. `git ls-files -t` is
    often used to see what is sparse and what is not. Perhaps -t should
    always work on the full tree?

* Commands I don't know how to classify

  * range-diff

    Is this like `log` or `format-patch`?

  * cherry

    See range-diff

* Commands unaffected by sparse-checkouts

  * shortlog
  * show-branch
  * rev-list
  * bisect

  * branch
  * describe
  * fetch
  * gc
  * init
  * maintenance
  * notes
  * pull (merge & rebase have the necessary changes)
  * push
  * submodule
  * tag

  * config
  * filter-branch (works in separate checkout without sparse-checkout setup)
  * pack-refs
  * prune
  * remote
  * repack
  * replace

  * bugreport
  * count-objects
  * fsck
  * gitweb
  * help
  * instaweb
  * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
  * rerere
  * verify-commit
  * verify-tag

  * commit-graph
  * hash-object
  * index-pack
  * mktag
  * mktree
  * multi-pack-index
  * pack-objects
  * prune-packed
  * symbolic-ref
  * unpack-objects
  * update-ref
  * write-tree (operates on index, possibly optimized to use sparse dir entries)

  * for-each-ref
  * get-tar-commit-id
  * ls-remote
  * merge-base (merges are computed full tree, so merge base should be too)
  * name-rev
  * pack-redundant
  * rev-parse
  * show-index
  * show-ref
  * unpack-file
  * var
  * verify-pack

  * <Everything under 'Interacting with Others' in 'git help --all'>
  * <Everything under 'Low-level...Syncing' in 'git help --all'>
  * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
  * <Everything under 'External commands' in 'git help --all'>

* Commands that might be affected, but who cares?

  * merge-file
  * merge-index
  * gitk?


=== Behavior classes ===

From the above there are a few classes of behavior:

  * "restrict"

    Commands in this class only read or write files in the working tree
    within the sparse specification.

    When moving to a new commit (e.g. switch, reset --hard), these commands
    may update index files outside the sparse specification as of the start
    of the operation, but by the end of the operation those index files
    will match HEAD again and thus those files will again be outside the
    sparse specification.

    When paths are explicitly specified, these paths are intersected with
    the sparse specification and will only operate on such paths.
    (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)

    Some of these commands may also attempt, at the end of their operation,
    to cull transient differences between the sparse specification and the
    sparsity patterns (see "Sparse specification vs. sparsity patterns" for
    details, but this basically means either removing unmodified files not
    matching the sparsity patterns and marking those files as
    SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
    marking those files as !SKIP_WORKTREE).

  * "restrict modulo conflicts"

    Commands in this class generally behave like the "restrict" class,
    except that:
      (1) they will ignore the sparse specification and write files with
          conflicts to the working tree (thus temporarily expanding the
          sparse specification to include such files.)
      (2) they are grouped with commands which move to a new commit, since
          they often create a commit and then move to it, even though we
          know there are many exceptions to moving to the new commit.
          (For example, the user may rebase a commit that becomes empty,
          or have a cherry-pick which conflicts, or a user could run
          `merge --no-commit`, and we also view `apply --index` kind of
          like `am --no-commit`.) As such, these commands can make changes
          to index files outside the sparse specification, though they'll
          mark such files with SKIP_WORKTREE.

  * "restrict also specially applied to untracked files"

    Commands in this class generally behave like the "restrict" class,
    except that they have to handle untracked files differently too, often
    because these commands are dealing with files changing state between
    'tracked' and 'untracked'. Often, this may mean printing an error
    message if the command had nothing to do, but the arguments may have
    referred to files whose tracked-ness state could have changed were it
    not for the sparsity patterns excluding them.

  * "no restrict"

    Commands in this class ignore the sparse specification entirely.

  * "restrict or no restrict dependent upon behavior A vs. behavior B"

    Commands in this class behave like "no restrict" for folks in the
    behavior B camp, and like "restrict" for folks in the behavior A camp.
    However, when behaving like "restrict" a warning of some sort might be
    provided that history queries have been limited by the sparse-checkout
    specification.
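
To make the "restrict" and "restrict modulo conflicts" classes more
concrete, here is a minimal Python sketch of the path selection they imply.
Both function names are hypothetical, the paths are invented, and
`fnmatchcase` is used only as an approximation of Git pathspec globbing
(its `*` also matches `/`); this is not how Git itself implements either
class.

```python
from fnmatch import fnmatchcase


def paths_to_operate_on(pathspec, tracked_paths, sparse_spec, restrict=True):
    """Select tracked paths matching a glob pathspec.

    With restrict=True the matches are intersected with the sparse
    specification, as described for the "restrict" class.
    """
    matched = {p for p in tracked_paths if fnmatchcase(p, pathspec)}
    if restrict:
        return matched & set(sparse_spec)
    return matched


def files_written_by_merge(changed_paths, conflicted_paths, sparse_spec):
    """Model "restrict modulo conflicts".

    Changed paths inside the sparse specification are written, and every
    conflicted path is written regardless of the sparse specification.
    """
    return (set(changed_paths) & set(sparse_spec)) | set(conflicted_paths)


# Made-up repository contents for illustration.
TRACKED = {"src/icon.png", "docs/img/logo.png", "src/main.c"}
SPARSE_SPEC = {"src/icon.png", "src/main.c"}
```

With these inputs, a pathspec of `*.png` selects only `src/icon.png` under
"restrict", while a merge that conflicts in a path outside the sparse
specification still writes that path to the working tree.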


=== Subcommand-dependent defaults ===

Note that we have different defaults depending on the command for the
desired behavior:

  * Commands defaulting to "restrict":
    * diff-files
    * diff (without --cached or REVISION arguments)
    * grep (without --cached or REVISION arguments)
    * switch
    * checkout (the switch-like half)
    * reset (<commit>)

    * restore
    * checkout (the restore-like half)
    * checkout-index
    * reset (with pathspec)

    This behavior makes sense; these interact with the working tree.

  * Commands defaulting to "restrict modulo conflicts":
    * merge
    * rebase
    * cherry-pick
    * revert

    * am
    * apply --index (which is kind of like an `am --no-commit`)

    * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
    * reset (<tree-ish>, due to similarity to read-tree)

    These also interact with the working tree, but require slightly
    different behavior either (a) so that conflicts can be resolved or
    (b) because they are kind of like a merge-without-commit operation.

    (See also the "Known bugs" section below regarding `am` and `apply`)

  * Commands defaulting to "no restrict":
    * archive
    * bundle
    * commit
    * format-patch
    * fast-export
    * fast-import
    * commit-tree

    * stash
    * apply (without `--index`)

    These have completely different defaults and perhaps deserve the most
    detailed explanation:

    In the case of commands in the first group (format-patch,
    fast-export, bundle, archive, etc.), these are commands for
    communicating history, which will be broken if they restrict to a
    subset of the repository. As such, they operate on full paths and
    have no `--restrict` option for overriding.
    Some of these commands may take paths for manually restricting what
    is exported, but it needs to be very explicit.

    In the case of stash, it needs to vivify files to avoid losing the
    user's changes.

    In the case of apply without `--index`, that command needs to update
    the working tree without the index (or the index without the working
    tree if `--cached` is passed), and if we restrict those updates to the
    sparse specification then we'll lose changes from the user.

  * Commands defaulting to "restrict also specially applied to untracked files":
    * add
    * rm
    * mv
    * update-index
    * status
    * clean (?)

    Our original implementation for the first three of these commands was
    "no restrict", but it had some severe usability issues:
      * `git add <somefile>` if honored and outside the sparse
        specification, can result in the file randomly disappearing later
        when some subsequent command is run (since various commands
        automatically clean up unmodified files outside the sparse
        specification).
      * `git rm '*.jpg'` could very negatively surprise users if it deletes
        files outside the range of the user's interest.
      * `git mv` has similar surprises when moving into or out of the cone,
        so best to restrict by default

    So, we switched `add` and `rm` to default to "restrict", which made
    usability problems much less severe and less frequent, but we still got
    complaints because commands like:
        git add <file-outside-sparse-specification>
        git rm <file-outside-sparse-specification>
    would silently do nothing. We should instead print an error in those
    cases to get usability right.

    update-index needs to be updated to match, and status and maybe clean
    also need to be updated to specially handle untracked paths.

    There may be a difference in here between behavior A and behavior B in
    terms of verbosity of errors or additional warnings.

  * Commands falling under "restrict or no restrict dependent upon behavior
    A vs. behavior B"

    * diff (with --cached or REVISION arguments)
    * grep (with --cached or REVISION arguments)
    * show (when given commit arguments)
    * blame (only matters when one or more -C flags passed)
      * and annotate
    * log
      * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
    * ls-files
    * diff-index
    * diff-tree
    * ls-tree

    For now, we default to behavior B for these, which want a default of
    "no restrict".

    Note that two of these commands -- diff and grep -- also appeared in a
    different list with a default of "restrict", but only when limited to
    searching the working tree.  The working tree vs. history distinction
    is fundamental in how behavior B operates, so this is expected.  Note,
    though, that for diff and grep with --cached, when doing "restrict"
    behavior, the difference between sparse specification and sparsity
    patterns is important to handle.

    "restrict" may make more sense as the long term default for these[12].
    Also, supporting "restrict" for these commands might be a fair amount
    of work to implement, meaning it might be implemented over multiple
    releases.  If that behavior were the default in the commands that
    supported it, behavior B users would be forced to gradually add
    flags to their commands, depending on git version, to get the
    behavior they want.  That gradual switchover would be painful, so we
    should avoid it at least until "restrict" is fully implemented.


=== Sparse specification vs. sparsity patterns ===

In a well-behaved situation, the sparse specification is given directly
by the $GIT_DIR/info/sparse-checkout file.  However, it can transiently
diverge for a few reasons:

  * needing to resolve conflicts (merging will vivify conflicted files)
  * running Git commands that implicitly vivify files (e.g. "git stash apply")
  * running Git commands that explicitly vivify files (e.g. "git checkout
    --ignore-skip-worktree-bits FILENAME")
  * other commands that write to these files (perhaps a user copies one
    from elsewhere)

For the last item, note that we do automatically clear the SKIP_WORKTREE
bit for files that are present in the working tree.  This has been true
since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
2022-03-09).

However, such a situation is transient because:

  * Such transient differences can and will be automatically removed as
    a side-effect of commands which call unpack_trees() (checkout,
    merge, reset, etc.).
  * Users can also request such transient differences be corrected via
    running `git sparse-checkout reapply`.  Various places recommend
    running that command.
  * Additional commands are also welcome to implicitly fix these
    differences; we may add more in the future.

While we avoid dropping unstaged changes or files which have conflicts,
we otherwise aggressively try to fix these transient differences.  If
users want these differences to persist, they should run the `set` or
`add` subcommands of `git sparse-checkout` to reflect their intended
sparse specification.

However, when we need to do a query on history restricted to the
"relevant subset of files" such a transiently expanded sparse
specification is ignored.
There are a couple reasons for this:

  * The behavior wanted when doing something like
        git grep expression REVISION
    is roughly what the users would expect from
        git checkout REVISION && git grep expression
    (modulo a "REVISION:" prefix), which has a couple ramifications:

  * REVISION may have paths not in the current index, so there is no
    path we can consult for a SKIP_WORKTREE setting for those paths.

  * Since `checkout` is one of those commands that tries to remove
    transient differences in the sparse specification, it makes sense
    to use the corrected sparse specification
    (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
    consult SKIP_WORKTREE anyway.

So, a transiently expanded (or restricted) sparse specification applies to
the working tree, but not to history queries where we always use the
sparsity patterns.  (See [16] for an early discussion of this.)

Similar to a transiently expanded sparse specification of the working tree
based on additional files being present in the working tree, we also need
to consider additional files being modified in the index.  In particular,
if the user has staged changes to files (relative to HEAD) that do not
match the sparsity patterns, and the file is not present in the working
tree, we still want to consider the file part of the sparse specification
if we are specifically performing a query related to the index (e.g. git
diff --cached [REVISION], git diff-index [REVISION], git restore --staged
--source=REVISION -- PATHS, etc.).  Note that a transiently expanded sparse
specification for the index usually only matters under behavior A, since
under behavior B index operations are lumped with history and tend to
operate full-tree.
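
The distinction drawn in this section, between the sparsity patterns
stored in $GIT_DIR/info/sparse-checkout and the per-file SKIP_WORKTREE
bits, can be watched in a throwaway repository.  What follows is only a
minimal sketch under stated assumptions: the repository layout (`keep/`,
`drop/`), the file names, and the committer identity are hypothetical,
and it assumes a Git recent enough to provide `git sparse-checkout` with
cone mode.  Whether the `S` tag is already cleared immediately after the
manual write depends on the Git version (see the 82386b4496 note above),
so the sketch does not rely on that intermediate state.

```shell
#!/bin/sh
# Hypothetical throwaway repository; all names below are made up.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir keep drop
echo a >keep/a.txt
echo b >drop/b.txt
git add .
git -c user.name=Example -c user.email=example@example.com \
	commit -q -m initial

# Sparsity patterns: keep only keep/ (plus top-level files), cone mode.
git sparse-checkout init --cone
git sparse-checkout set keep
bits_before=$(git ls-files -t)   # drop/b.txt is tagged S (SKIP_WORKTREE)

# Transient divergence: an out-of-cone file is written by hand.
mkdir -p drop
echo b >drop/b.txt
git status --porcelain >/dev/null             # any later Git command
patterns_during=$(git sparse-checkout list)   # patterns are unchanged

# To make the difference persist, update the patterns themselves.
git sparse-checkout add drop
bits_after=$(git ls-files -t)
patterns_after=$(git sparse-checkout list)

printf 'before:\n%s\nafter:\n%s\n' "$bits_before" "$bits_after"
```

The point of the sketch is that writing the file by hand leaves the
sparsity patterns alone (`patterns_during` still lists only `keep`), so
the expanded sparse specification exists only in the working tree and
index until `git sparse-checkout add` (or `set`) records it.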


=== Implementation Questions ===

  * Do the options --scope={sparse,all} sound good to others?  Are there better
    options?
    * Names in use, or appearing in patches, or previously suggested:
      * --sparse/--dense
      * --ignore-skip-worktree-bits
      * --ignore-skip-worktree-entries
      * --ignore-sparsity
      * --[no-]restrict-to-sparse-paths
      * --full-tree/--sparse-tree
      * --[no-]restrict
      * --scope={sparse,all}
      * --focus/--unfocus
      * --limit/--unlimited
    * Rationale making me lean slightly towards --scope={sparse,all}:
      * We want a name that works for many commands, so we need a name that
        does not conflict
      * We know that we have more than two possible usecases, so it is best
        to avoid a flag that appears to be binary.
      * --scope={sparse,all} isn't overly long and seems relatively
        explanatory
      * `--sparse`, as used in add/rm/mv, is totally backwards for
        grep/log/etc.  Changing the meaning of `--sparse` for these
        commands would fix the backwardness, but possibly break existing
        scripts.  Using a new name pairing would allow us to treat
        `--sparse` in these commands as a deprecated alias.
      * There is a different `--sparse`/`--dense` pair for commands using
        revision machinery, so using that naming might cause confusion
      * There is also a `--sparse` in both pack-objects and show-branch, which
        don't conflict but do suggest that `--sparse` is overloaded
      * The name --ignore-skip-worktree-bits is a double negative, is
        quite a mouthful, refers to an implementation detail that many
        users may not be familiar with, and we'd need a negation for it
        which would probably be even more ridiculously long.  (But we
        can make --ignore-skip-worktree-bits a deprecated alias for
        --no-restrict.)

  * If a config option is added (sparse.scope?)
    what should the values and
    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
    because even for Behaviors A and B we want some commands to be
    full-tree and others to operate sparsely, so the wording may need to be
    more tied to the usecases and somehow explain that.  Also, right now,
    the primary difference we are focusing on is just the history-querying
    commands (log/diff/grep).  Previous config suggestion here: [13]

  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
    because in non-cone mode it does nothing and in cone-mode it shows the
    sparse directory entries which are technically outside the sparse
    specification)

  * Under Behavior A:
    * Does ls-files' `--no-expand` override the default `--scope=all`, or
      does it need an extra flag?
    * Does ls-files' `-t` option imply `--scope=all`?
    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?

  * sparse-checkout: once behavior A is fully implemented, should we take
    an interim measure to ease people into switching the default?  Namely,
    if folks are not already in a sparse checkout, then require
    `sparse-checkout init/set` to take a
    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
    would set sparse.scope according to the setting given), and throw an
    error if the flag is not provided?  That error would be a great place
    to warn folks that the default may change in the future, and get them
    used to specifying what they want so that the eventual default switch
    is seamless for them.


=== Implementation Goals/Plans ===

  * Get buy-in on this document in general.

  * Figure out answers to the 'Implementation Questions' section (above)

  * Fix bugs in the 'Known bugs' section (below)

  * Provide some kind of method for backfilling the blobs within the sparse
    specification in a partial clone

  [Below here is kind of spitballing since the first two haven't been resolved]

  * update-index: flip the default to --no-ignore-skip-worktree-entries,
    nuke this stupid "Oh, there's a bug?  Let me add a flag to let users
    request that they not trigger this bug." flag

  * Flags & Config
    * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
    * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
      a deprecated alias for `--scope=all`
    * Create config option (sparse.scope?), tie it to the "Cliff notes"
      overview

  * Add --scope=sparse (and --scope=all) flag to each of the history querying
    commands.  IMPORTANT: make sure diff machinery changes don't mess with
    format-patch, fast-export, etc.

=== Known bugs ===

This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
been working on it.

0. Behavior A is not well supported in Git.  (Behavior B didn't use to
   be either, but was the easier of the two to implement.)

1. am and apply:

   apply, without `--index` or `--cached`, relies on files being present
   in the working copy, and also writes to them unconditionally.  As
   such, it should first check for the files' presence, and if found to
   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
   its work.  Currently, it just throws an error.

   apply, with either `--cached` or `--index`, will not preserve the
   SKIP_WORKTREE bit.
   This is fine if the file has conflicts, but
   otherwise SKIP_WORKTREE bits should be preserved for --cached and
   probably also for --index.

   am, if there are no conflicts, will vivify files and fail to preserve
   the SKIP_WORKTREE bit.  If there are conflicts and `-3` is not
   specified, it will vivify files and then complain the patch doesn't
   apply.  If there are conflicts and `-3` is specified, it will vivify
   files and then complain that those vivified files would be
   overwritten by merge.

2. reset --hard:

   reset --hard provides a confusing error message (works correctly, but
   misleads the user into believing it didn't):

       $ touch addme
       $ git add addme
       $ git ls-files -t
       H addme
       H tracked
       S tracked-but-maybe-skipped
       $ git reset --hard                           # usually works great
       error: Path 'addme' not uptodate; will not remove from working tree.
       HEAD is now at bdbbb6f third
       $ git ls-files -t
       H tracked
       S tracked-but-maybe-skipped
       $ ls -1
       tracked

   `git reset --hard` DID remove addme from the index and the working tree, contrary
   to the error message, but in line with how reset --hard should behave.

3. read-tree

   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
   entries it reads into the index, resulting in all your files suddenly
   appearing to be "deleted".

4. checkout, restore:

   These commands do not handle path & revision arguments appropriately:

       $ ls
       tracked
       $ git ls-files -t
       H tracked
       S tracked-but-maybe-skipped
       $ git status --porcelain
       $ git checkout -- '*skipped'
       error: pathspec '*skipped' did not match any file(s) known to git
       $ git ls-files -- '*skipped'
       tracked-but-maybe-skipped
       $ git checkout HEAD -- '*skipped'
       error: pathspec '*skipped' did not match any file(s) known to git
       $ git ls-tree HEAD | grep skipped
       100644 blob 276f5a64354b791b13840f02047738c77ad0584f	tracked-but-maybe-skipped
       $ git status --porcelain
       $ git checkout HEAD~1 -- '*skipped'
       $ git ls-files -t
       H tracked
       H tracked-but-maybe-skipped
       $ git status --porcelain
       M tracked-but-maybe-skipped
       $ git checkout HEAD -- '*skipped'
       $ git status --porcelain
       $

   Note that checkout without a revision (or restore --staged) fails to
   find a file to restore from the index, even though ls-files shows
   such a file certainly exists.

   Similar issues occur with HEAD (--source=HEAD in restore's case),
   but it suddenly works when HEAD~1 is specified.  And then after that it
   will work with HEAD specified, even though it didn't before.

   Directories are also an issue:

       $ git sparse-checkout set nomatches
       $ git status
       On branch main
       You are in a sparse checkout with 0% of tracked files present.

       nothing to commit, working tree clean
       $ git checkout .
       error: pathspec '.' did not match any file(s) known to git
       $ git checkout HEAD~1 .
       Updated 1 path from 58916d9
       $ git ls-files -t
       S tracked
       H tracked-but-maybe-skipped

5. checkout and restore --staged, continued:

   These commands do not correctly scope operations to the sparse
   specification, and make it worse by not setting important SKIP_WORKTREE
   bits:

       $ git restore --source OLDREV --staged outside-sparse-cone/
       $ git status --porcelain
       MD outside-sparse-cone/file1
       MD outside-sparse-cone/file2
       MD outside-sparse-cone/file3

   We can add a --scope=all mode to `git restore` to let it operate outside
   the sparse specification, but then it will be important to set the
   SKIP_WORKTREE bits appropriately.

6. Performance issues; see:
   https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/


=== Reference Emails ===

Emails that detail various bugs we've had in sparse-checkout:

[1] (Original descriptions of behavior A & behavior B)
    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
    https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
[3] (Present-despite-skipped entries)
    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
[4] (Clone --no-checkout interaction)
    https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/
[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
    https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
[6] (SKIP_WORKTREE is advisory, not mandatory)
    https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
[7] (`worktree add` should copy
    sparsity settings from current worktree)
    https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
[8] (Avoid negative surprises in add, rm, and mv)
    https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
    https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
[9] (Move from out-of-cone to in-cone)
    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
[10] (Unnecessarily downloading objects outside sparse specification)
     https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/

[11] (Stolee's comments on high-level usecases)
     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/

[12] Others commenting on eventually switching default to behavior A:
  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/

[13] Previous config name suggestion and description
  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/

[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
     https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/

[15] Lengthy email on grep behavior, covering what should be searched:
  * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/

[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
     search for the parenthetical comment starting "We do not check".
     https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/

[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/