Table of contents:

  * Terminology
  * Purpose of sparse-checkouts
  * Usecases of primary concern
  * Oversimplified mental models ("Cliff Notes" for this document!)
  * Desired behavior
  * Behavior classes
  * Subcommand-dependent defaults
  * Sparse specification vs. sparsity patterns
  * Implementation Questions
  * Implementation Goals/Plans
  * Known bugs
  * Reference Emails


== Terminology ==

*`cone mode`*::
        one of two modes for specifying the desired subset of files
        in a sparse-checkout. In cone-mode, the user specifies
        directories (getting both everything under that directory as
        well as everything in leading directories), while in non-cone
        mode, the user specifies gitignore-style patterns. Controlled
        by the --[no-]cone option to sparse-checkout init|set.

*`SKIP_WORKTREE`*::
        When tracked files do not match the sparse specification and
        are removed from the working tree, the file in the index is marked
        with a SKIP_WORKTREE bit. Note that if a tracked file has the
        SKIP_WORKTREE bit set but the file is later written by the user to
        the working tree anyway, the SKIP_WORKTREE bit will be cleared at
        the beginning of any subsequent Git operation.
+
Most sparse checkout users are unaware of this implementation
detail, and the term should generally be avoided in user-facing
descriptions and command flags. Unfortunately, prior to the
`sparse-checkout` subcommand this low-level detail was exposed,
and as of time of writing, is still exposed in various places.

*`sparse-checkout`*::
        a subcommand in git used to reduce the files present in
        the working tree to a subset of all tracked files. Also, the
        name of the file in the $GIT_DIR/info directory used to track
        the sparsity patterns corresponding to the user's desired
        subset.

*`sparse cone`*:: see cone mode

*`sparse directory`*::
        An entry in the index corresponding to a directory, which
        appears in the index instead of all the files under that directory
        that would normally appear. See also sparse-index. Something that
        can cause confusion is that the "sparse directory" does NOT match
        the sparse specification, i.e. the directory is NOT present in the
        working tree. May be renamed in the future (e.g. to "skipped
        directory").

*`sparse index`*::
        A special mode for sparse-checkout that also makes the
        index sparse by recording a directory entry in lieu of all the
        files underneath that directory (thus making that a "skipped
        directory" which unfortunately has also been called a "sparse
        directory"), and does this for potentially multiple
        directories. Controlled by the --[no-]sparse-index option to
        init|set|reapply.

*`sparsity patterns`*::
        patterns from $GIT_DIR/info/sparse-checkout used to
        define the set of files of interest. A warning: It is easy to
        over-use this term (or the shortened "patterns" term), for two
        reasons: (1) users in cone mode specify directories rather than
        patterns (their directories are transformed into patterns, but
        users may think you are talking about non-cone mode if you use the
        word "patterns"), and (2) the sparse specification might
        transiently differ in the working tree or index from the sparsity
        patterns (see "Sparse specification vs. sparsity patterns").

*`sparse specification`*::
        The set of paths in the user's area of focus. This
        is typically just the tracked files that match the sparsity
        patterns, but the sparse specification can temporarily differ and
        include additional files. (See also "Sparse specification
        vs. sparsity patterns")

        * When working with history, the sparse specification is exactly
          the set of files matching the sparsity patterns.
        * When interacting with the working tree, the sparse specification
          is the set of tracked files with a clear SKIP_WORKTREE bit or
          tracked files present in the working copy.
        * When modifying or showing results from the index, the sparse
          specification is the set of files with a clear SKIP_WORKTREE bit
          or that differ in the index from HEAD.
        * If working with the index and the working copy, the sparse
          specification is the union of the paths from above.

*`vivifying`*::
        When a command restores a tracked file to the working tree (and
        hopefully also clears the SKIP_WORKTREE bit in the index for that
        file), this is referred to as "vivifying" the file.


== Purpose of sparse-checkouts ==

sparse-checkouts exist to allow users to work with a subset of their
files.

You can think of sparse-checkouts as subdividing "tracked" files into two
categories -- a sparse subset, and all the rest. Implementationally, we
mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
out of the working tree. The SKIP_WORKTREE files are still tracked, just
not present in the working tree.

In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
is missing from the working tree but pretend the file contents match HEAD".
That was not only bogus (it actually meant the file missing from the
working tree matched the index rather than HEAD), but it was also a
low-level detail which only provided decent behavior for a few commands.
There were a surprising number of ways in which that guiding principle gave
command results that violated user expectations, and as such was a bad
mental model. However, it persisted for many years and may still be found
in some corners of the code base.
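
As a concrete reading of the context-dependent definition of the sparse
specification given under Terminology, here is a minimal Python sketch. It
is a toy model, not Git's implementation: the `TrackedPath` class, its
field names, and the `sparse_specification` helper are all assumptions made
up for illustration, and each path's state is supplied by hand rather than
read from a real index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedPath:
    """Toy stand-in for one tracked path (all field names are illustrative)."""
    path: str
    matches_patterns: bool         # path matches the sparsity patterns
    skip_worktree: bool            # SKIP_WORKTREE bit is set in the index
    in_worktree: bool              # file is present in the working tree
    index_differs_from_head: bool  # staged content differs from HEAD


def sparse_specification(entries, context):
    """Return the set of paths in the sparse specification for `context`.

    `context` is "history", "worktree", "index" or "both".
    """
    def worktree_rule(e):
        # clear SKIP_WORKTREE bit, or present in the working copy anyway
        return (not e.skip_worktree) or e.in_worktree

    def index_rule(e):
        # clear SKIP_WORKTREE bit, or index entry differs from HEAD
        return (not e.skip_worktree) or e.index_differs_from_head

    rules = {
        "history": lambda e: e.matches_patterns,
        "worktree": worktree_rule,
        "index": index_rule,
        "both": lambda e: worktree_rule(e) or index_rule(e),
    }
    keep = rules[context]  # raises KeyError for an unknown context
    return {e.path for e in entries if keep(e)}


ENTRIES = [
    # inside the sparsity patterns, checked out normally
    TrackedPath("src/a.c", True, False, True, False),
    # outside the patterns, but the index entry was changed relative to HEAD
    TrackedPath("docs/b.txt", False, True, False, True),
    # outside the patterns, but vivified (for example by a conflict)
    TrackedPath("docs/c.txt", False, False, True, False),
    # outside the patterns and untouched
    TrackedPath("vendor/d.c", False, True, False, False),
]

for ctx in ("history", "worktree", "index", "both"):
    print(ctx, sorted(sparse_specification(ENTRIES, ctx)))
```

The four contexts give different answers for the same repository state,
which is why the term "sparse specification" is kept distinct from
"sparsity patterns" throughout this document.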

Anyway, the idea of "working with a subset of files" is simple enough, but
there are multiple different high-level usecases which affect how some Git
subcommands should behave. Further, even if we only considered one of
those usecases, sparse-checkouts can modify different subcommands in over a
half dozen different ways. Let's start by considering the high level
usecases:

[horizontal]
A):: Users are _only_ interested in the sparse portion of the repo
A*):: Users are _only_ interested in the sparse portion of the repo
      that they have downloaded so far
B):: Users want a sparse working tree, but are working in a larger whole
C):: sparse-checkout is a behind-the-scenes implementation detail allowing
     Git to work with a specially crafted in-house virtual file system;
     users are actually working with a "full" working tree that is
     lazily populated, and sparse-checkout helps with the lazy population
     piece.

It may be worth explaining each of these in a bit more detail:


=== (Behavior A) Users are _only_ interested in the sparse portion of the repo

These folks might know there are other things in the repository, but
don't care. They are uninterested in other parts of the repository, and
only want to know about changes within their area of interest. Showing
them other files from history (e.g. from diff/log/grep/etc.) is a
usability annoyance, potentially a huge one since other changes in
history may dwarf the changes they are interested in.

Some of these users also arrive at this usecase from wanting to use partial
clones together with sparse checkouts (in a way where they have downloaded
blobs within the sparse specification) and do disconnected development.
Not only do these users generally not care about other parts of the
repository, but consider it a blocker for Git commands to try to operate on
those.
If commands attempt to access paths in history outside the sparsity
specification, then the partial clone will attempt to download additional
blobs on demand, fail, and then fail the user's command. (This may be
unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
reconcile outside the sparse specification, but we should limit how often
users are forced to connect to the network.)

Also, even for users using partial clones that do not mind being
always connected to the network, the need to download blobs as
side-effects of various other commands (such as the printed diffstat
after a merge or pull) can lead to worries about local repository size
growing unnecessarily[10].

=== (Behavior A*) Users are _only_ interested in the sparse portion of the repo that they have downloaded so far (a variant on the first usecase)

This variant is driven by folks who are using partial clones together with
sparse checkouts and do disconnected development (so far sounding like a
subset of behavior A users) and are doing so on very large repositories. The
reason for yet another variant is that downloading even just the blobs
through history within their sparse specification may be too much, so they
only download some. They would still like operations to succeed without
network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
or `git grep ${SEARCH_TERM} OLDREV` would need to be prepared to provide
partial results that depend on what happens to have been downloaded.

This variant could be viewed as Behavior A with the sparse specification
for history querying operations modified from "sparsity patterns" to
"sparsity patterns limited to the blobs we have already downloaded".
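
To illustrate what "partial results that depend on what happens to have
been downloaded" could look like, here is a small Python sketch of a
history grep under Behavior A*. Nothing like this exists in Git today; the
function name `grep_revision`, the dictionary-based tree and blob store,
and the object names are hypothetical and chosen only for illustration.

```python
def grep_revision(term, tree, matches_patterns, local_blobs):
    """Search one revision without touching the network.

    tree:             maps path -> blob id for the revision being searched
    matches_patterns: callable saying whether a path is in the sparsity patterns
    local_blobs:      maps blob id -> content, only for blobs already downloaded

    Returns (hits, skipped): paths containing `term`, and in-pattern paths
    that could not be searched because their blob is not available locally.
    """
    hits, skipped = [], []
    for path, oid in sorted(tree.items()):
        if not matches_patterns(path):
            continue  # Behavior A: outside the sparsity patterns
        content = local_blobs.get(oid)
        if content is None:
            skipped.append(path)  # Behavior A*: blob missing, do not fetch
            continue
        if term in content:
            hits.append(path)
    return hits, skipped


def in_src(path):
    """Hypothetical sparsity pattern: everything under src/."""
    return path.startswith("src/")


TREE = {"src/a.c": "oid-a", "src/b.c": "oid-b", "docs/readme.txt": "oid-r"}
LOCAL_BLOBS = {
    "oid-a": "int main(void) { return 0; }",
    "oid-r": "main documentation",
}

HITS, SKIPPED = grep_revision("main", TREE, in_src, LOCAL_BLOBS)
print("hits:", HITS)
print("not searched (blob not downloaded):", SKIPPED)
```

Returning the skipped paths separately is one possible design; it would let
a command tell the user that its results are incomplete instead of failing
outright.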

=== (Behavior B) Users want a sparse working tree, but are working in a larger whole

Stolee described this usecase this way[11]:

"I'm also focused on users that know that they are a part of a larger
whole. They know they are operating on a large repository but focus on
what they need to contribute their part. I expect multiple "roles" to
use very different, almost disjoint parts of the codebase. Some other
"architect" users operate across the entire tree or hop between different
sections of the codebase as necessary. In this situation, I'm wary of
scoping too many features to the sparse-checkout definition, especially
"git log," as it can be too confusing to have their view of the codebase
depend on your "point of view."

People might also end up wanting behavior B due to complex inter-project
dependencies. The initial attempts to use sparse-checkouts usually involve
the directories you are directly interested in plus what those directories
depend upon within your repository. But there's a monkey wrench here: if
you have integration tests, they invert the hierarchy: to run integration
tests, you need not only what you are interested in and its in-tree
dependencies, you also need everything that depends upon what you are
interested in or that depends upon one of your dependencies...AND you need
all the in-tree dependencies of that expanded group. That can easily
change your sparse-checkout into a nearly dense one.

Naturally, that tends to kill the benefits of sparse-checkouts. There are
a couple solutions to this conundrum: either avoid grabbing in-repo
dependencies (maybe have built versions of your in-repo dependencies pulled
from a CI cache somewhere), or say that users shouldn't run integration
tests directly and instead do it on the CI server when they submit a code
review. Or do both.
Regardless of whether you stub out your in-repo
dependencies or stub out the things that depend upon you, there is
certainly a reason to want to query and be aware of those other stubbed-out
parts of the repository, particularly when the dependencies are complex or
change relatively frequently. Thus, for such uses, sparse-checkouts can be
used to limit what you directly build and modify, but these users do not
necessarily want their sparse checkout paths to limit their queries of
versions in history.

Some people may also be interested in behavior B over behavior A simply as
a performance workaround: if they are using non-cone mode, then they have
to deal with its inherent quadratic performance problems. In that mode,
every operation that checks whether paths match the sparsity specification
can be expensive. As such, these users may only be willing to pay for
those expensive checks when interacting with the working copy, and may
prefer getting "unrelated" results from their history queries over having
slow commands.

=== (Behavior C) sparse-checkout is an implementational detail supporting a special VFS.

This usecase goes slightly against the traditional definition of
sparse-checkout in that it actually tries to present a full or dense
checkout to the user. However, this usecase utilizes the same underlying
technical underpinnings in a new way which does provide some performance
advantages to users. The basic idea is that a company can have an in-house
Git-aware Virtual File System which pretends all files are present in the
working tree, by intercepting all file system accesses and using those to
fetch and write accessed files on demand via partial clones. The VFS uses
sparse-checkout to prevent Git from writing or paying attention to many
files, and manually updates the sparse checkout patterns itself based on
user access and modification of files in the working tree.
See commit
ecc7c8841d ("repo_read_index: add config to expect files outside sparse
patterns", 2022-02-25) and the link at [17] for a more detailed description
of such a VFS.

The biggest difference here is that users are completely unaware that the
sparse-checkout machinery is even in use. The sparse patterns are not
specified by the user but rather are under the complete control of the VFS
(and the patterns are updated frequently and dynamically by it). The user
will perceive the checkout as dense, and commands should thus behave as if
all files are present.


== Usecases of primary concern ==

Most of the rest of this document will focus on Behavior A and Behavior
B. Some notes about the other two cases and why we are not focusing on
them:

=== (Behavior A*)

Supporting this usecase is estimated to be difficult and a lot of work.
There are no plans to implement it currently, but it may be a potential
future alternative. Knowing about the existence of additional alternatives
may affect our choice of command line flags (e.g. if we need tri-state or
quad-state flags rather than just binary flags), so it was still important
to at least note.

Further, I believe the descriptions below for Behavior A are probably still
valid for this usecase, with the only exception being that it redefines the
sparse specification to restrict it to already-downloaded blobs. The hard
part is in making commands capable of respecting that modified definition.

=== (Behavior C)

This usecase violates some of the early sparse-checkout documented
assumptions (since files marked as SKIP_WORKTREE will be displayed to users
as present in the working tree). That violation may mean various
sparse-checkout related behaviors are not well suited to this usecase and
we may need tweaks -- to both documentation and code -- to handle it.
However, this usecase is also perhaps the simplest model to support in that
everything behaves like a dense checkout with a few exceptions (e.g. branch
checkouts and switches write fewer things, knowing the VFS will lazily
write the rest on an as-needed basis).

Since there is no publicly available VFS-related code for folks to try,
the number of folks who can test such a usecase is limited.

The primary reason to note the Behavior C usecase is that as we fix things
to better support Behaviors A and B, there may be additional places where
we need to make tweaks allowing folks in this usecase to get the original
non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
add config to expect files outside sparse patterns", 2022-02-25). The
secondary reason to note Behavior C is so that folks taking advantage of
Behavior C do not assume they are part of the Behavior B camp and propose
patches that break things for the real Behavior B folks.


== Oversimplified mental models ==

An oversimplification of the differences in the above behaviors is:

(Behavior A):: Restrict worktree and history operations to sparse specification
(Behavior B):: Restrict worktree operations to sparse specification; have any
               history operations work across all files
(Behavior C):: Do not restrict either worktree or history operations to the
               sparse specification...with the exception of branch checkouts or
               switches which avoid writing files that will match the index so
               they can later lazily be populated instead.


== Desired behavior ==

As noted previously, despite the simple idea of just working with a subset
of files, there are a range of different behavioral changes that need to be
made to different subcommands to work well with such a feature. See
[1,2,3,4,5,6,7,8,9,10] for various examples.
In particular, at [2], we saw
that mere composition of other commands that individually worked correctly
in a sparse-checkout context did not imply that the higher level command
would work correctly; it sometimes requires further tweaks. So,
understanding these differences can be beneficial.

* Commands behaving the same regardless of high-level use-case

  ** commands that only look at files within the sparsity specification

      *** diff (without --cached or REVISION arguments)
      *** grep (without --cached or REVISION arguments)
      *** diff-files

  ** commands that restore files to the working tree that match sparsity
     patterns, and remove unmodified files that don't match those
     patterns:

      *** switch
      *** checkout (the switch-like half)
      *** read-tree
      *** reset --hard

  ** commands that write conflicted files to the working tree, but otherwise
     will omit writing files to the working tree that do not match the
     sparsity patterns:

      *** merge
      *** rebase
      *** cherry-pick
      *** revert

      *** `am` and `apply --cached` should probably be in this section but
          are buggy (see the "Known bugs" section below)

    The behavior for these commands somewhat depends upon the merge
    strategy being used:

      *** `ort` behaves as described above
      *** `octopus` and `resolve` will always vivify any file changed in the merge
          relative to the first parent, which is rather suboptimal.

    It is also important to note that these commands WILL update the index
    outside the sparse specification relative to when the operation began,
    BUT these commands often make a commit just before or after such that
    by the end of the operation there is no change to the index outside the
    sparse specification. Of course, if the operation hits conflicts or
    does not make a commit, then these operations clearly can modify the
    index outside the sparse specification.

    Finally, it is important to note that at least the first four of these
    commands also try to remove differences between the sparse
    specification and the sparsity patterns (much like the commands in the
    previous section).

  ** commands that always ignore sparsity since commits must be full-tree

      *** archive
      *** bundle
      *** commit
      *** format-patch
      *** fast-export
      *** fast-import
      *** commit-tree

  ** commands that write any modified file to the working tree (conflicted
     or not, and whether those paths match sparsity patterns or not):

      *** stash
      *** apply (without `--index` or `--cached`)

* Commands that may slightly differ for behavior A vs. behavior B:

  Commands in this category behave mostly the same between the two
  behaviors, but may differ in verbosity and types of warning and error
  messages.

  ** commands that make modifications to which files are tracked:

      *** add
      *** rm
      *** mv
      *** update-index

    The fact that files can move between the 'tracked' and 'untracked'
    categories means some commands will have to treat untracked files
    differently. But if we have to treat untracked files differently,
    then additional commands may also need changes:

      *** status
      *** clean

    In particular, `status` may need to report any untracked files outside
    the sparsity specification as an erroneous condition (especially to
    avoid the user trying to `git add` them, forcing `git add` to display
    an error).

    It's not clear to me exactly how (or even if) `clean` would change,
    but it's the other command that also affects untracked files.

    `update-index` may be slightly special. Its --[no-]skip-worktree flag
    may need to ignore the sparse specification by its nature. Also, its
    current --[no-]ignore-skip-worktree-entries default is totally bogus.

  ** commands for manually tweaking paths in both the index and the working tree

      *** `restore`
      *** the restore-like half of `checkout`

    These commands should be similar to add/rm/mv in that they should
    only operate on the sparse specification by default, and require a
    special flag to operate on all files.

    Also, note that these commands currently have a number of issues (see
    the "Known bugs" section below)

* Commands that significantly differ for behavior A vs. behavior B:

  ** commands that query history

      *** diff (with --cached or REVISION arguments)
      *** grep (with --cached or REVISION arguments)
      *** show (when given commit arguments)
      *** blame (only matters when one or more -C flags are passed)
      **** and annotate
      *** log
      *** whatchanged (may not exist anymore)
      *** ls-files
      *** diff-index
      *** diff-tree
      *** ls-tree

    Note: for log and whatchanged, revision walking logic is unaffected
    but displaying of patches is affected by scoping the command to the
    sparse-checkout. (The fact that revision walking is unaffected is
    why rev-list, shortlog, show-branch, and bisect are not in this
    list.)

    ls-files may be slightly special in that e.g. `git ls-files -t` is
    often used to see what is sparse and what is not. Perhaps -t should
    always work on the full tree?

* Commands I don't know how to classify

  ** range-diff

    Is this like `log` or `format-patch`?

  ** cherry

    See range-diff

* Commands unaffected by sparse-checkouts

  ** shortlog
  ** show-branch
  ** rev-list
  ** bisect

  ** branch
  ** describe
  ** fetch
  ** gc
  ** init
  ** maintenance
  ** notes
  ** pull (merge & rebase have the necessary changes)
  ** push
  ** submodule
  ** tag

  ** config
  ** filter-branch (works in separate checkout without sparse-checkout setup)
  ** pack-refs
  ** prune
  ** remote
  ** repack
  ** replace

  ** bugreport
  ** count-objects
  ** fsck
  ** gitweb
  ** help
  ** instaweb
  ** merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
  ** rerere
  ** verify-commit
  ** verify-tag

  ** commit-graph
  ** hash-object
  ** index-pack
  ** mktag
  ** mktree
  ** multi-pack-index
  ** pack-objects
  ** prune-packed
  ** symbolic-ref
  ** unpack-objects
  ** update-ref
  ** write-tree (operates on index, possibly optimized to use sparse dir entries)

  ** for-each-ref
  ** get-tar-commit-id
  ** ls-remote
  ** merge-base (merges are computed full tree, so merge base should be too)
  ** name-rev
  ** pack-redundant
  ** rev-parse
  ** show-index
  ** show-ref
  ** unpack-file
  ** var
  ** verify-pack

  ** <Everything under 'Interacting with Others' in 'git help --all'>
  ** <Everything under 'Low-level...Syncing' in 'git help --all'>
  ** <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
  ** <Everything under 'External commands' in 'git help --all'>

* Commands that might be affected, but who cares?

  ** merge-file
  ** merge-index
  ** gitk?


== Behavior classes ==

From the above there are a few classes of behavior:

  * "restrict"

    Commands in this class only read or write files in the working tree
    within the sparse specification.

    When moving to a new commit (e.g. switch, reset --hard), these commands
    may update index files outside the sparse specification as of the start
    of the operation, but by the end of the operation those index files
    will match HEAD again and thus those files will again be outside the
    sparse specification.

    When paths are explicitly specified, these paths are intersected with
    the sparse specification and will only operate on such paths.
    (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)

    Some of these commands may also attempt, at the end of their operation,
    to cull transient differences between the sparse specification and the
    sparsity patterns (see "Sparse specification vs. sparsity patterns" for
    details, but this basically means either removing unmodified files not
    matching the sparsity patterns and marking those files as
    SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
    marking those files as !SKIP_WORKTREE).

  * "restrict modulo conflicts"

    Commands in this class generally behave like the "restrict" class,
    except that:

    (1) they will ignore the sparse specification and write files with
        conflicts to the working tree (thus temporarily expanding the
        sparse specification to include such files.)
    (2) they are grouped with commands which move to a new commit, since
        they often create a commit and then move to it, even though we
        know there are many exceptions to moving to the new commit. (For
        example, the user may rebase a commit that becomes empty, or have
        a cherry-pick which conflicts, or a user could run `merge
        --no-commit`, and we also view `apply --index` kind of like `am
        --no-commit`.) As such, these commands can make changes to index
        files outside the sparse specification, though they'll mark such
        files with SKIP_WORKTREE.

  * "restrict also specially applied to untracked files"

    Commands in this class generally behave like the "restrict" class,
    except that they have to handle untracked files differently too, often
    because these commands are dealing with files changing state between
    'tracked' and 'untracked'. Often, this may mean printing an error
    message if the command had nothing to do, but the arguments may have
    referred to files whose tracked-ness state could have changed were it
    not for the sparsity patterns excluding them.

  * "no restrict"

    Commands in this class ignore the sparse specification entirely.

  * "restrict or no restrict dependent upon behavior A vs. behavior B"

    Commands in this class behave like "no restrict" for folks in the
    behavior B camp, and like "restrict" for folks in the behavior A camp.
    However, when behaving like "restrict" a warning of some sort might be
    provided that history queries have been limited by the sparse-checkout
    specification.


== Subcommand-dependent defaults ==

Note that we have different defaults depending on the command for the
desired behavior:

  * Commands defaulting to "restrict":

    ** diff-files
    ** diff (without --cached or REVISION arguments)
    ** grep (without --cached or REVISION arguments)
    ** switch
    ** checkout (the switch-like half)
    ** reset (<commit>)

    ** restore
    ** checkout (the restore-like half)
    ** checkout-index
    ** reset (with pathspec)

    This behavior makes sense; these interact with the working tree.

  * Commands defaulting to "restrict modulo conflicts":

    ** merge
    ** rebase
    ** cherry-pick
    ** revert

    ** am
    ** apply --index (which is kind of like an `am --no-commit`)

    ** read-tree (especially with -m or -u; is kind of like a --no-commit merge)
    ** reset (<tree-ish>, due to similarity to read-tree)

    These also interact with the working tree, but require slightly
    different behavior either so that (a) conflicts can be resolved or (b)
    because they are kind of like a merge-without-commit operation.

    (See also the "Known bugs" section below regarding `am` and `apply`)

  * Commands defaulting to "no restrict":

    ** archive
    ** bundle
    ** commit
    ** format-patch
    ** fast-export
    ** fast-import
    ** commit-tree

    ** stash
    ** apply (without `--index`)

    These have completely different defaults and perhaps deserve the most
    detailed explanation:

    In the case of commands in the first group (format-patch,
    fast-export, bundle, archive, etc.), these are commands for
    communicating history, which will be broken if they restrict to a
    subset of the repository. As such, they operate on full paths and
    have no `--restrict` option for overriding. Some of these commands may
    take paths for manually restricting what is exported, but it needs to
    be very explicit.

    In the case of stash, it needs to vivify files to avoid losing the
    user's changes.

    In the case of apply without `--index`, that command needs to update
    the working tree without the index (or the index without the working
    tree if `--cached` is passed), and if we restrict those updates to the
    sparse specification then we'll lose changes from the user.

  * Commands defaulting to "restrict also specially applied to untracked files":

    ** add
    ** rm
    ** mv
    ** update-index
    ** status
    ** clean (?)

....
    Our original implementation for the first three of these commands was
    "no restrict", but it had some severe usability issues:

      * `git add <somefile>` if honored and outside the sparse
        specification, can result in the file randomly disappearing later
        when some subsequent command is run (since various commands
        automatically clean up unmodified files outside the sparse
        specification).
      * `git rm '*.jpg'` could very negatively surprise users if it deletes
        files outside the range of the user's interest.
      * `git mv` has similar surprises when moving into or out of the cone,
        so best to restrict by default

    So, we switched `add` and `rm` to default to "restrict", which made
    usability problems much less severe and less frequent, but we still got
    complaints because commands like:

        git add <file-outside-sparse-specification>
        git rm <file-outside-sparse-specification>

    would silently do nothing. We should instead print an error in those
    cases to get usability right.

    update-index needs to be updated to match, and status and maybe clean
    also need to be updated to specially handle untracked paths.

    There may be a difference in here between behavior A and behavior B in
    terms of verboseness of errors or additional warnings.
....

  * Commands falling under "restrict or no restrict dependent upon behavior
    A vs. behavior B"

    ** diff (with --cached or REVISION arguments)
    ** grep (with --cached or REVISION arguments)
    ** show (when given commit arguments)
    ** blame (only matters when one or more -C flags passed)
    *** and annotate
    ** log
    *** and variants: shortlog, gitk, show-branch, whatchanged, rev-list
    ** ls-files
    ** diff-index
    ** diff-tree
    ** ls-tree

    For now, we default to behavior B for these, which want a default of
    "no restrict".

    Note that two of these commands -- diff and grep -- also appeared in a
    different list with a default of "restrict", but only when limited to
    searching the working tree. The working tree vs. history distinction
    is fundamental in how behavior B operates, so this is expected. Note,
    though, that for diff and grep with --cached, when doing "restrict"
    behavior, the difference between sparse specification and sparsity
    patterns is important to handle.

    "restrict" may make more sense as the long term default for these[12].
    Also, supporting "restrict" for these commands might be a fair amount
    of work to implement, meaning it might be implemented over multiple
    releases. If that behavior were the default in the commands that
    supported it, that would force behavior B users to need to learn to
    slowly add additional flags to their commands, depending on git
    version, to get the behavior they want. That gradual switchover would
    be painful, so we should avoid it at least until it's fully
    implemented.


== Sparse specification vs. sparsity patterns ==

In a well-behaved situation, the sparse specification is given directly
by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
diverge for a few reasons:

  * needing to resolve conflicts (merging will vivify conflicted files)
  * running Git commands that implicitly vivify files (e.g. "git stash apply")
  * running Git commands that explicitly vivify files (e.g. "git checkout
    --ignore-skip-worktree-bits FILENAME")
  * other commands that write to these files (perhaps a user copies it
    from elsewhere)

For the last item, note that we do automatically clear the SKIP_WORKTREE
bit for files that are present in the working tree.
This has been true
since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
2022-03-09).

However, such a situation is transient because:

  * Such transient differences can and will be automatically removed as
    a side-effect of commands which call unpack_trees() (checkout,
    merge, reset, etc.).
  * Users can also request such transient differences be corrected via
    running `git sparse-checkout reapply`. Various places recommend
    running that command.
  * Additional commands are also welcome to implicitly fix these
    differences; we may add more in the future.

While we avoid dropping unstaged changes or files which have conflicts,
we otherwise aggressively try to fix these transient differences. If
users want these differences to persist, they should run the `set` or
`add` subcommands of `git sparse-checkout` to reflect their intended
sparse specification.

However, when we need to do a query on history restricted to the
"relevant subset of files" such a transiently expanded sparse
specification is ignored. There are a couple of reasons for this:

  * The behavior wanted when doing something like
        git grep expression REVISION
    is roughly what the users would expect from
        git checkout REVISION && git grep expression
    (modulo a "REVISION:" prefix), which has a couple ramifications:

  * REVISION may have paths not in the current index, so there is no
    path we can consult for a SKIP_WORKTREE setting for those paths.

  * Since `checkout` is one of those commands that tries to remove
    transient differences in the sparse specification, it makes sense
    to use the corrected sparse specification
    (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
    consult SKIP_WORKTREE anyway.

So, a transiently expanded (or restricted) sparse specification applies to
the working tree, but not to history queries where we always use the
sparsity patterns.
(See [16] for an early discussion of this.)

Similar to a transiently expanded sparse specification of the working tree
based on additional files being present in the working tree, we also need
to consider additional files being modified in the index. In particular,
if the user has staged changes to files (relative to HEAD) that do not
match the sparsity patterns, and the file is not present in the working
tree, we still want to consider the file part of the sparse specification
if we are specifically performing a query related to the index (e.g. git
diff --cached [REVISION], git diff-index [REVISION], git restore --staged
--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse
specification for the index usually only matters under behavior A, since
under behavior B index operations are lumped with history and tend to
operate full-tree.


== Implementation Questions ==

  * Do the options --scope={sparse,all} sound good to others? Are there better options?

    ** Names in use, or appearing in patches, or previously suggested:

      *** --sparse/--dense
      *** --ignore-skip-worktree-bits
      *** --ignore-skip-worktree-entries
      *** --ignore-sparsity
      *** --[no-]restrict-to-sparse-paths
      *** --full-tree/--sparse-tree
      *** --[no-]restrict
      *** --scope={sparse,all}
      *** --focus/--unfocus
      *** --limit/--unlimited

    ** Rationale making me lean slightly towards --scope={sparse,all}:

      *** We want a name that works for many commands, so we need a name that
          does not conflict
      *** We know that we have more than two possible usecases, so it is best
          to avoid a flag that appears to be binary.
      *** --scope={sparse,all} isn't overly long and seems relatively
          explanatory
      *** `--sparse`, as used in add/rm/mv, is totally backwards for
          grep/log/etc.
          Changing the meaning of `--sparse` for these
          commands would fix the backwardness, but possibly break existing
          scripts. Using a new name pairing would allow us to treat
          `--sparse` in these commands as a deprecated alias.
      *** There is a different `--sparse`/`--dense` pair for commands using
          revision machinery, so using that naming might cause confusion
      *** There is also a `--sparse` in both pack-objects and show-branch, which
          don't conflict but do suggest that `--sparse` is overloaded
      *** The name --ignore-skip-worktree-bits is a double negative, is
          quite a mouthful, refers to an implementation detail that many
          users may not be familiar with, and we'd need a negation for it
          which would probably be even more ridiculously long. (But we
          can make --ignore-skip-worktree-bits a deprecated alias for
          --no-restrict.)

    ** If a config option is added (sparse.scope?) what should the values and
       description be? "sparse" (behavior A), "worktree-sparse-history-dense"
       (behavior B), "dense" (behavior C)? There's a risk of confusion,
       because even for Behaviors A and B we want some commands to be
       full-tree and others to operate sparsely, so the wording may need to be
       more tied to the usecases and somehow explain that. Also, right now,
       the primary difference we are focusing on is just the history-querying
       commands (log/diff/grep). Previous config suggestion here: [13]

    ** Is `--no-expand` a good alias for ls-files's `--sparse` option?
       (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
       because in non-cone mode it does nothing and in cone-mode it shows the
       sparse directory entries which are technically outside the sparse
       specification)

    ** Under Behavior A:

      *** Does ls-files' `--no-expand` override the default `--scope=all`, or
          does it need an extra flag?
      *** Does ls-files' `-t` option imply `--scope=all`?
      *** Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?

    ** sparse-checkout: once behavior A is fully implemented, should we take
       an interim measure to ease people into switching the default? Namely,
       if folks are not already in a sparse checkout, then require
       `sparse-checkout init/set` to take a
       `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
       would set sparse.scope according to the setting given), and throw an
       error if the flag is not provided? That error would be a great place
       to warn folks that the default may change in the future, and get them
       used to specifying what they want so that the eventual default switch
       is seamless for them.


== Implementation Goals/Plans ==

  * Get buy-in on this document in general.

  * Figure out answers to the 'Implementation Questions' section (above)

  * Fix bugs in the 'Known bugs' section (below)

  * Provide some kind of method for backfilling the blobs within the sparse
    specification in a partial clone

  [Below here is kind of spitballing since the first two haven't been resolved]

  * update-index: flip the default to --no-ignore-skip-worktree-entries,
    nuke this stupid "Oh, there's a bug? Let me add a flag to let users
    request that they not trigger this bug." flag

  * Flags & Config

    ** Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
    ** Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
       a deprecated alias for `--scope=all`
    ** Create config option (sparse.scope?), tie it to the "Cliff notes"
       overview

    ** Add --scope=sparse (and --scope=all) flag to each of the history querying
       commands. IMPORTANT: make sure diff machinery changes don't mess with
       format-patch, fast-export, etc.

== Known bugs ==

This list used to be a lot longer (see e.g.
[1,2,3,4,5,6,7,8,9]), but we've
been working on it.

1. Behavior A is not well supported in Git. (Behavior B didn't use to
   be either, but was the easier of the two to implement.)

2. am and apply:

   apply, without `--index` or `--cached`, relies on files being present
   in the working copy, and also writes to them unconditionally. As
   such, it should first check for the files' presence, and if found to
   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
   its work. Currently, it just throws an error.

   apply, with either `--cached` or `--index`, will not preserve the
   SKIP_WORKTREE bit. This is fine if the file has conflicts, but
   otherwise SKIP_WORKTREE bits should be preserved for --cached and
   probably also for --index.

   am, if there are no conflicts, will vivify files and fail to preserve
   the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
   specified, it will vivify files and then complain the patch doesn't
   apply. If there are conflicts and `-3` is specified, it will vivify
   files and then complain that those vivified files would be
   overwritten by merge.

3. reset --hard:

   reset --hard provides a confusing error message (it works correctly,
   but misleads the user into believing it didn't):

        $ touch addme
        $ git add addme
        $ git ls-files -t
        H addme
        H tracked
        S tracked-but-maybe-skipped
        $ git reset --hard                           # usually works great
        error: Path 'addme' not uptodate; will not remove from working tree.
        HEAD is now at bdbbb6f third
        $ git ls-files -t
        H tracked
        S tracked-but-maybe-skipped
        $ ls -1
        tracked

   `git reset --hard` DID remove addme from the index and the working tree, contrary
   to the error message, but in line with how reset --hard should behave.

4. read-tree

   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
   entries it reads into the index, resulting in all your files suddenly
   appearing to be "deleted".

5. Checkout, restore:

   These commands do not handle path & revision arguments appropriately:

        $ ls
        tracked
        $ git ls-files -t
        H tracked
        S tracked-but-maybe-skipped
        $ git status --porcelain
        $ git checkout -- '*skipped'
        error: pathspec '*skipped' did not match any file(s) known to git
        $ git ls-files -- '*skipped'
        tracked-but-maybe-skipped
        $ git checkout HEAD -- '*skipped'
        error: pathspec '*skipped' did not match any file(s) known to git
        $ git ls-tree HEAD | grep skipped
        100644 blob 276f5a64354b791b13840f02047738c77ad0584f	tracked-but-maybe-skipped
        $ git status --porcelain
        $ git checkout HEAD~1 -- '*skipped'
        $ git ls-files -t
        H tracked
        H tracked-but-maybe-skipped
        $ git status --porcelain
        M  tracked-but-maybe-skipped
        $ git checkout HEAD -- '*skipped'
        $ git status --porcelain
        $

   Note that checkout without a revision (or restore --staged) fails to
   find a file to restore from the index, even though ls-files shows
   such a file certainly exists.

   Similar issues occur with HEAD (--source=HEAD in restore's case),
   but suddenly works when HEAD~1 is specified. And then after that it
   will work with HEAD specified, even though it didn't before.

   Directories are also an issue:

        $ git sparse-checkout set nomatches
        $ git status
        On branch main
        You are in a sparse checkout with 0% of tracked files present.

        nothing to commit, working tree clean
        $ git checkout .
        error: pathspec '.' did not match any file(s) known to git
        $ git checkout HEAD~1 .
        Updated 1 path from 58916d9
        $ git ls-files -t
        S tracked
        H tracked-but-maybe-skipped

6. checkout and restore --staged, continued:

   These commands do not correctly scope operations to the sparse
   specification, and make it worse by not setting important SKIP_WORKTREE
   bits:

        $ git restore --source OLDREV --staged outside-sparse-cone/
        $ git status --porcelain
        MD outside-sparse-cone/file1
        MD outside-sparse-cone/file2
        MD outside-sparse-cone/file3

   We can add a --scope=all mode to `git restore` to let it operate outside
   the sparse specification, but then it will be important to set the
   SKIP_WORKTREE bits appropriately.

7. Performance issues; see:

   https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/


== Reference Emails ==

Emails that detail various bugs we've had in sparse-checkout:

[1] (Original descriptions of behavior A & behavior B):

https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/

[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences):

https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/

[3] (Present-despite-skipped entries):

https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/

[4] (Clone --no-checkout interaction):

https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)

[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`):

https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/

[6] (SKIP_WORKTREE is advisory, not mandatory):

https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/

[7] (`worktree add` should copy sparsity settings from
current worktree):

https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/

[8] (Avoid negative surprises in add, rm, and mv):

  * https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
  * https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/

[9] (Move from out-of-cone to in-cone):

  * https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
  * https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/

[10] (Unnecessarily downloading objects outside sparse specification):

https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/

[11] (Stolee's comments on high-level usecases):

https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/

[12] Others commenting on eventually switching default to behavior A:

  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/

[13] Previous config name suggestion and description:

  https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/

[14] Tangential issue: switch to cone mode as default sparse specification mechanism:

https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/

[15] Lengthy email on grep behavior, covering what should be searched:

https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/

[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
     search for the parenthetical comment starting "We do not check".

https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/

[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/