Table of contents:

 * Terminology
 * Purpose of sparse-checkouts
 * Usecases of primary concern
 * Oversimplified mental models ("Cliff Notes" for this document!)
 * Desired behavior
 * Behavior classes
 * Subcommand-dependent defaults
 * Sparse specification vs. sparsity patterns
 * Implementation Questions
 * Implementation Goals/Plans
 * Known bugs
 * Reference Emails


== Terminology ==

*`cone mode`*::
 one of two modes for specifying the desired subset of files
 in a sparse-checkout. In cone-mode, the user specifies
 directories (getting both everything under that directory as
 well as everything in leading directories), while in non-cone
 mode, the user specifies gitignore-style patterns. Controlled
 by the --[no-]cone option to sparse-checkout init|set.

*`SKIP_WORKTREE`*::
 When tracked files do not match the sparse specification and
 are removed from the working tree, the file in the index is marked
 with a SKIP_WORKTREE bit. Note that if a tracked file has the
 SKIP_WORKTREE bit set but the file is later written by the user to
 the working tree anyway, the SKIP_WORKTREE bit will be cleared at
 the beginning of any subsequent Git operation.
+
Most sparse checkout users are unaware of this implementation
detail, and the term should generally be avoided in user-facing
descriptions and command flags. Unfortunately, prior to the
`sparse-checkout` subcommand this low-level detail was exposed,
and as of time of writing, is still exposed in various places.

*`sparse-checkout`*::
 a subcommand in git used to reduce the files present in
 the working tree to a subset of all tracked files. Also, the
 name of the file in the $GIT_DIR/info directory used to track
 the sparsity patterns corresponding to the user's desired
 subset.

*`sparse cone`*:: see cone mode

*`sparse directory`*::
 An entry in the index corresponding to a directory, which
 appears in the index instead of all the files under that directory
 that would normally appear. See also sparse-index. Something that
 can cause confusion is that the "sparse directory" does NOT match
 the sparse specification, i.e. the directory is NOT present in the
 working tree. May be renamed in the future (e.g. to "skipped
 directory").

*`sparse index`*::
 A special mode for sparse-checkout that also makes the
 index sparse by recording a directory entry in lieu of all the
 files underneath that directory (thus making that a "skipped
 directory" which unfortunately has also been called a "sparse
 directory"), and does this for potentially multiple
 directories. Controlled by the --[no-]sparse-index option to
 init|set|reapply.

*`sparsity patterns`*::
 patterns from $GIT_DIR/info/sparse-checkout used to
 define the set of files of interest. A warning: It is easy to
 over-use this term (or the shortened "patterns" term), for two
 reasons: (1) users in cone mode specify directories rather than
 patterns (their directories are transformed into patterns, but
 users may think you are talking about non-cone mode if you use the
 word "patterns"), and (2) the sparse specification might
 transiently differ in the working tree or index from the sparsity
 patterns (see "Sparse specification vs. sparsity patterns").

*`sparse specification`*::
 The set of paths in the user's area of focus. This
 is typically just the tracked files that match the sparsity
 patterns, but the sparse specification can temporarily differ and
 include additional files. (See also "Sparse specification
 vs. sparsity patterns")

 * When working with history, the sparse specification is exactly
 the set of files matching the sparsity patterns.
 * When interacting with the working tree, the sparse specification
 is the set of tracked files with a clear SKIP_WORKTREE bit or
 tracked files present in the working copy.
 * When modifying or showing results from the index, the sparse
 specification is the set of files with a clear SKIP_WORKTREE bit
 or that differ in the index from HEAD.
 * If working with the index and the working copy, the sparse
 specification is the union of the paths from above.

*`vivifying`*::
 When a command restores a tracked file to the working tree (and
 hopefully also clears the SKIP_WORKTREE bit in the index for that
 file), this is referred to as "vivifying" the file.
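
As a rough illustration of how the sparse specification shifts with context, the four cases listed under "sparse specification" can be modeled with plain sets. This is only a sketch and not how Git computes anything internally: `IndexEntry`, its boolean fields, and `sparse_specification()` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    # Hypothetical, simplified view of one tracked file.
    path: str
    matches_patterns: bool   # path matches the sparsity patterns
    skip_worktree: bool      # SKIP_WORKTREE bit is set in the index
    in_worktree: bool        # file is present in the working tree
    differs_from_head: bool  # index version differs from HEAD


def sparse_specification(entries, context):
    """Return the paths in the sparse specification for one context.

    context is "history", "worktree", "index", or "index+worktree".
    """
    if context == "history":
        return {e.path for e in entries if e.matches_patterns}
    worktree = {e.path for e in entries
                if not e.skip_worktree or e.in_worktree}
    index = {e.path for e in entries
             if not e.skip_worktree or e.differs_from_head}
    if context == "worktree":
        return worktree
    if context == "index":
        return index
    if context == "index+worktree":
        return worktree | index
    raise ValueError(f"unknown context: {context}")


entries = [
    IndexEntry("src/main.c", True, False, True, False),
    # Written by hand outside the patterns (transient state).
    IndexEntry("docs/notes.txt", False, True, True, False),
    # Index changed outside the patterns, e.g. by a merge.
    IndexEntry("lib/util.c", False, True, False, True),
    IndexEntry("vendor/big.bin", False, True, False, False),
]
```

With these sample entries, the history context yields only `src/main.c`, while the combined index and working tree context also picks up `docs/notes.txt` and `lib/util.c`.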


== Purpose of sparse-checkouts ==

sparse-checkouts exist to allow users to work with a subset of their
files.

You can think of sparse-checkouts as subdividing "tracked" files into two
categories -- a sparse subset, and all the rest. Implementationally, we
mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
out of the working tree. The SKIP_WORKTREE files are still tracked, just
not present in the working tree.

In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
is missing from the working tree but pretend the file contents match HEAD".
That was not only bogus (it actually meant the file missing from the
working tree matched the index rather than HEAD), but it was also a
low-level detail which only provided decent behavior for a few commands.
There were a surprising number of ways in which that guiding principle gave
command results that violated user expectations, and as such was a bad
mental model. However, it persisted for many years and may still be found
in some corners of the code base.

Anyway, the idea of "working with a subset of files" is simple enough, but
there are multiple different high-level usecases which affect how some Git
subcommands should behave. Further, even if we only considered one of
those usecases, sparse-checkouts can modify different subcommands in over a
half dozen different ways. Let's start by considering the high level
usecases:

[horizontal]
A):: Users are _only_ interested in the sparse portion of the repo
A*):: Users are _only_ interested in the sparse portion of the repo
 that they have downloaded so far
B):: Users want a sparse working tree, but are working in a larger whole
C):: sparse-checkout is a behind-the-scenes implementation detail allowing
 Git to work with a specially crafted in-house virtual file system;
 users are actually working with a "full" working tree that is
 lazily populated, and sparse-checkout helps with the lazy population
 piece.

It may be worth explaining each of these in a bit more detail:


=== (Behavior A) Users are _only_ interested in the sparse portion of the repo

These folks might know there are other things in the repository, but
don't care. They are uninterested in other parts of the repository, and
only want to know about changes within their area of interest. Showing
them other files from history (e.g. from diff/log/grep/etc.) is a
usability annoyance, potentially a huge one since other changes in
history may dwarf the changes they are interested in.

Some of these users also arrive at this usecase from wanting to use partial
clones together with sparse checkouts (in a way where they have downloaded
blobs within the sparse specification) and do disconnected development.
Not only do these users generally not care about other parts of the
repository, but they consider it a blocker for Git commands to try to
operate on those. If commands attempt to access paths in history outside
the sparse specification, then the partial clone will attempt to download
additional blobs on demand, fail, and then fail the user's command. (This
may be unavoidable in some cases, e.g. when `git merge` has non-trivial
changes to reconcile outside the sparse specification, but we should limit
how often users are forced to connect to the network.)

Also, even for users using partial clones that do not mind being
always connected to the network, the need to download blobs as
side-effects of various other commands (such as the printed diffstat
after a merge or pull) can lead to worries about local repository size
growing unnecessarily[10].

=== (Behavior A*) Users are _only_ interested in the sparse portion of the repo that they have downloaded so far (a variant on the first usecase)

This variant is driven by folks who are using partial clones together with
sparse checkouts and doing disconnected development (so far sounding like a
subset of behavior A users), and who are doing so on very large
repositories. The reason for yet another variant is that downloading even
just the blobs through history within their sparse specification may be too
much, so they only download some. They would still like operations to
succeed without network connectivity, though, so things like
`git log -S${SEARCH_TERM} -p` or `git grep ${SEARCH_TERM} OLDREV` would need
to be prepared to provide partial results that depend on what happens to
have been downloaded.

This variant could be viewed as Behavior A with the sparse specification
for history querying operations modified from "sparsity patterns" to
"sparsity patterns limited to the blobs we have already downloaded".

=== (Behavior B) Users want a sparse working tree, but are working in a larger whole

Stolee described this usecase this way[11]:

"I'm also focused on users that know that they are a part of a larger
whole. They know they are operating on a large repository but focus on
what they need to contribute their part. I expect multiple "roles" to
use very different, almost disjoint parts of the codebase. Some other
"architect" users operate across the entire tree or hop between different
sections of the codebase as necessary. In this situation, I'm wary of
scoping too many features to the sparse-checkout definition, especially
"git log," as it can be too confusing to have their view of the codebase
depend on your "point of view."

People might also end up wanting behavior B due to complex inter-project
dependencies. The initial attempts to use sparse-checkouts usually involve
the directories you are directly interested in plus what those directories
depend upon within your repository. But there's a monkey wrench here: if
you have integration tests, they invert the hierarchy: to run integration
tests, you need not only what you are interested in and its in-tree
dependencies, you also need everything that depends upon what you are
interested in or that depends upon one of your dependencies...AND you need
all the in-tree dependencies of that expanded group. That can easily
change your sparse-checkout into a nearly dense one.

Naturally, that tends to kill the benefits of sparse-checkouts. There are
a couple solutions to this conundrum: either avoid grabbing in-repo
dependencies (maybe have built versions of your in-repo dependencies pulled
from a CI cache somewhere), or say that users shouldn't run integration
tests directly and instead do it on the CI server when they submit a code
review. Or do both. Regardless of whether you stub out your in-repo
dependencies or stub out the things that depend upon you, there is
certainly a reason to want to query and be aware of those other stubbed-out
parts of the repository, particularly when the dependencies are complex or
change relatively frequently. Thus, for such uses, sparse-checkouts can be
used to limit what you directly build and modify, but these users do not
necessarily want their sparse checkout paths to limit their queries of
versions in history.
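
The way integration tests inflate a sparse-checkout, as described above, can be sketched as three graph closures. The directory names and the `deps` graph are made up for illustration, and `closure()` and `invert()` are hypothetical helpers; a real project would derive such a graph from its build system.

```python
def closure(start, edges):
    """Return every node reachable from start by following edges."""
    seen, stack = set(start), list(start)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def invert(edges):
    """Turn "X depends on Y" edges into "Y is depended upon by X" edges."""
    reverse = {}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)
    return reverse


# Hypothetical in-tree dependency graph: each key depends on its values.
deps = {
    "app/billing": {"lib/db"},
    "app/search": {"lib/db", "lib/index"},
    "app/admin": {"app/billing", "lib/ui"},
    "lib/db": {"lib/core"},
    "lib/index": {"lib/core"},
    "lib/ui": set(),
    "lib/core": set(),
}

focus = {"app/billing"}
# What you are interested in, plus its in-tree dependencies.
build_set = closure(focus, deps)
# Everything that depends on any of those.
dependents = closure(build_set, invert(deps))
# Plus the in-tree dependencies of that expanded group.
integration_set = closure(dependents, deps)
```

In this toy graph, building `app/billing` needs three directories, while running integration tests pulls in all seven, which is the "nearly dense" outcome described above.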

Some people may also be interested in behavior B over behavior A simply as
a performance workaround: if they are using non-cone mode, then they have
to deal with its inherent quadratic performance problems. In that mode,
every operation that checks whether paths match the sparsity patterns
can be expensive. As such, these users may only be willing to pay for
those expensive checks when interacting with the working copy, and may
prefer getting "unrelated" results from their history queries over having
slow commands.
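
A rough sketch of why per-path matching cost differs between the two modes follows. It is not Git's matching code: `matches_non_cone()` uses Python's `fnmatchcase` as a stand-in for gitignore-style patterns (whose real semantics differ), and `in_cone()` and `leading_dirs()` are hypothetical helpers that model cone mode as set lookups on a path's own leading directories.

```python
from fnmatch import fnmatchcase
import posixpath


def matches_non_cone(path, patterns):
    """Non-cone stand-in: test the path against each pattern in turn,
    so the work per path grows with the number of patterns."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def leading_dirs(cone_dirs):
    """Collect every ancestor of the cone directories, including the
    repository root (represented as the empty string)."""
    parents = {""}
    for directory in cone_dirs:
        parent = posixpath.dirname(directory)
        while parent:
            parents.add(parent)
            parent = posixpath.dirname(parent)
    return parents


def in_cone(path, cone_dirs, parents):
    """Cone-mode stand-in: the work per path grows with the path depth,
    not with the number of directories the user specified."""
    directory = posixpath.dirname(path)
    if directory in parents:
        return True   # file sits directly in a leading directory
    while directory:
        if directory in cone_dirs:
            return True   # file sits somewhere below a cone directory
        directory = posixpath.dirname(directory)
    return False


cone_dirs = {"src/net/http"}
parents = leading_dirs(cone_dirs)
paths = [
    "README",
    "src/net/socket.c",
    "src/net/http/client/get.c",
    "src/net/tls/x.c",
    "docs/a.txt",
]
selected = [p for p in paths if in_cone(p, cone_dirs, parents)]
```

Here `selected` keeps `README` and `src/net/socket.c` (files in leading directories) and everything under `src/net/http`, while `src/net/tls/x.c` and `docs/a.txt` are left out.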

=== (Behavior C) sparse-checkout is an implementation detail supporting a special VFS

This usecase goes slightly against the traditional definition of
sparse-checkout in that it actually tries to present a full or dense
checkout to the user. However, this usecase uses the same technical
underpinnings in a new way which does provide some performance
advantages to users. The basic idea is that a company can have an in-house
Git-aware Virtual File System which pretends all files are present in the
working tree, by intercepting all file system accesses and using those to
fetch and write accessed files on demand via partial clones. The VFS uses
sparse-checkout to prevent Git from writing or paying attention to many
files, and manually updates the sparse checkout patterns itself based on
user access and modification of files in the working tree. See commit
ecc7c8841d ("repo_read_index: add config to expect files outside sparse
patterns", 2022-02-25) and the link at [17] for a more detailed description
of such a VFS.

The biggest difference here is that users are completely unaware that the
sparse-checkout machinery is even in use. The sparse patterns are not
specified by the user but rather are under the complete control of the VFS
(and the patterns are updated frequently and dynamically by it). The user
will perceive the checkout as dense, and commands should thus behave as if
all files are present.


== Usecases of primary concern ==

Most of the rest of this document will focus on Behavior A and Behavior
B. Some notes about the other two cases and why we are not focusing on
them:

=== (Behavior A*)

Supporting this usecase is estimated to be difficult and a lot of work.
There are no plans to implement it currently, but it may be a potential
future alternative. Knowing about the existence of additional alternatives
may affect our choice of command line flags (e.g. if we need tri-state or
quad-state flags rather than just binary flags), so it was still important
to at least note.

Further, I believe the descriptions below for Behavior A are probably still
valid for this usecase, with the only exception being that it redefines the
sparse specification to restrict it to already-downloaded blobs. The hard
part is in making commands capable of respecting that modified definition.

=== (Behavior C)

This usecase violates some of the early sparse-checkout documented
assumptions (since files marked as SKIP_WORKTREE will be displayed to users
as present in the working tree). That violation may mean various
sparse-checkout related behaviors are not well suited to this usecase and
we may need tweaks -- to both documentation and code -- to handle it.
However, this usecase is also perhaps the simplest model to support in that
everything behaves like a dense checkout with a few exceptions (e.g. branch
checkouts and switches write fewer things, knowing the VFS will lazily
write the rest on an as-needed basis).

Since there is no publicly available VFS-related code for folks to try,
the number of folks who can test such a usecase is limited.

The primary reason to note the Behavior C usecase is that as we fix things
to better support Behaviors A and B, there may be additional places where
we need to make tweaks allowing folks in this usecase to get the original
non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
add config to expect files outside sparse patterns", 2022-02-25). The
secondary reason to note Behavior C is so that folks taking advantage of
Behavior C do not assume they are part of the Behavior B camp and propose
patches that break things for the real Behavior B folks.


== Oversimplified mental models ==

An oversimplification of the differences in the above behaviors is:

(Behavior A):: Restrict worktree and history operations to sparse specification
(Behavior B):: Restrict worktree operations to sparse specification; have any
 history operations work across all files
(Behavior C):: Do not restrict either worktree or history operations to the
 sparse specification...with the exception of branch checkouts or
 switches which avoid writing files that will match the index so
 they can later lazily be populated instead.


== Desired behavior ==

As noted previously, despite the simple idea of just working with a subset
of files, there is a range of different behavioral changes that need to be
made to different subcommands to work well with such a feature. See
[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
that mere composition of other commands that individually worked correctly
in a sparse-checkout context did not imply that the higher level command
would work correctly; it sometimes requires further tweaks. So,
understanding these differences can be beneficial.

* Commands behaving the same regardless of high-level usecase

 ** commands that only look at files within the sparse specification

 *** diff (without --cached or REVISION arguments)
 *** grep (without --cached or REVISION arguments)
 *** diff-files

 ** commands that restore files to the working tree that match sparsity
 patterns, and remove unmodified files that don't match those
 patterns:

 *** switch
 *** checkout (the switch-like half)
 *** read-tree
 *** reset --hard

 ** commands that write conflicted files to the working tree, but otherwise
 will omit writing files to the working tree that do not match the
 sparsity patterns:

 *** merge
 *** rebase
 *** cherry-pick
 *** revert

 *** `am` and `apply --cached` should probably be in this section but
 are buggy (see the "Known bugs" section below)

 The behavior for these commands somewhat depends upon the merge
 strategy being used:

 *** `ort` behaves as described above
 *** `octopus` and `resolve` will always vivify any file changed in the merge
 relative to the first parent, which is rather suboptimal.

 It is also important to note that these commands WILL update the index
 outside the sparse specification relative to when the operation began,
 BUT these commands often make a commit just before or after such that
 by the end of the operation there is no change to the index outside the
 sparse specification. Of course, if the operation hits conflicts or
 does not make a commit, then these operations clearly can modify the
 index outside the sparse specification.

 Finally, it is important to note that at least the first four of these
 commands also try to remove differences between the sparse
 specification and the sparsity patterns (much like the commands in the
 previous section).

 ** commands that always ignore sparsity since commits must be full-tree

 *** archive
 *** bundle
 *** commit
 *** format-patch
 *** fast-export
 *** fast-import
 *** commit-tree

 ** commands that write any modified file to the working tree (conflicted
 or not, and whether those paths match sparsity patterns or not):

 *** stash
 *** apply (without `--index` or `--cached`)

* Commands that may slightly differ for behavior A vs. behavior B:

 Commands in this category behave mostly the same between the two
 behaviors, but may differ in verbosity and types of warning and error
 messages.

 ** commands that make modifications to which files are tracked:

 *** add
 *** rm
 *** mv
 *** update-index

 The fact that files can move between the 'tracked' and 'untracked'
 categories means some commands will have to treat untracked files
 differently. But if we have to treat untracked files differently,
 then additional commands may also need changes:

 *** status
 *** clean

 In particular, `status` may need to report any untracked files outside
 the sparse specification as an erroneous condition (especially to
 avoid the user trying to `git add` them, forcing `git add` to display
 an error).

 It's not clear to me exactly how (or even if) `clean` would change,
 but it's the other command that also affects untracked files.

 `update-index` may be slightly special. Its --[no-]skip-worktree flag
 may need to ignore the sparse specification by its nature. Also, its
 current --[no-]ignore-skip-worktree-entries default is totally bogus.

 ** commands for manually tweaking paths in both the index and the working tree

 *** `restore`
 *** the restore-like half of `checkout`

 These commands should be similar to add/rm/mv in that they should
 only operate on the sparse specification by default, and require a
 special flag to operate on all files.

 Also, note that these commands currently have a number of issues (see
 the "Known bugs" section below)

* Commands that significantly differ for behavior A vs. behavior B:

 ** commands that query history

 *** diff (with --cached or REVISION arguments)
 *** grep (with --cached or REVISION arguments)
 *** show (when given commit arguments)
 *** blame (only matters when one or more -C flags are passed)
 **** and annotate
 *** log
 *** whatchanged (may not exist anymore)
 *** ls-files
 *** diff-index
 *** diff-tree
 *** ls-tree

 Note: for log and whatchanged, revision walking logic is unaffected
 but displaying of patches is affected by scoping the command to the
 sparse-checkout. (The fact that revision walking is unaffected is
 why rev-list, shortlog, show-branch, and bisect are not in this
 list.)

 ls-files may be slightly special in that e.g. `git ls-files -t` is
 often used to see what is sparse and what is not. Perhaps -t should
 always work on the full tree?

* Commands I don't know how to classify

 ** range-diff

 Is this like `log` or `format-patch`?

 ** cherry

 See range-diff

* Commands unaffected by sparse-checkouts

 ** shortlog
 ** show-branch
 ** rev-list
 ** bisect

 ** branch
 ** describe
 ** fetch
 ** gc
 ** init
 ** maintenance
 ** notes
 ** pull (merge & rebase have the necessary changes)
 ** push
 ** submodule
 ** tag

 ** config
 ** filter-branch (works in separate checkout without sparse-checkout setup)
 ** pack-refs
 ** prune
 ** remote
 ** repack
 ** replace

 ** bugreport
 ** count-objects
 ** fsck
 ** gitweb
 ** help
 ** instaweb
 ** merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
 ** rerere
 ** verify-commit
 ** verify-tag

 ** commit-graph
 ** hash-object
 ** index-pack
 ** mktag
 ** mktree
 ** multi-pack-index
 ** pack-objects
 ** prune-packed
 ** symbolic-ref
 ** unpack-objects
 ** update-ref
 ** write-tree (operates on index, possibly optimized to use sparse dir entries)

 ** for-each-ref
 ** get-tar-commit-id
 ** ls-remote
 ** merge-base (merges are computed full tree, so merge base should be too)
 ** name-rev
 ** pack-redundant
 ** rev-parse
 ** show-index
 ** show-ref
 ** unpack-file
 ** var
 ** verify-pack

 ** <Everything under 'Interacting with Others' in 'git help --all'>
 ** <Everything under 'Low-level...Syncing' in 'git help --all'>
 ** <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
 ** <Everything under 'External commands' in 'git help --all'>

* Commands that might be affected, but who cares?

 ** merge-file
 ** merge-index
 ** gitk?


== Behavior classes ==

From the above there are a few classes of behavior:

 * "restrict"

 Commands in this class only read or write files in the working tree
 within the sparse specification.

 When moving to a new commit (e.g. switch, reset --hard), these commands
 may update index files outside the sparse specification as of the start
 of the operation, but by the end of the operation those index files
 will match HEAD again and thus those files will again be outside the
 sparse specification.

 When paths are explicitly specified, these paths are intersected with
 the sparse specification and will only operate on such paths.
 (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)

 Some of these commands may also attempt, at the end of their operation,
 to cull transient differences between the sparse specification and the
 sparsity patterns (see "Sparse specification vs. sparsity patterns" for
 details, but this basically means either removing unmodified files not
 matching the sparsity patterns and marking those files as
 SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
 marking those files as !SKIP_WORKTREE).

 * "restrict modulo conflicts"

 Commands in this class generally behave like the "restrict" class,
 except that:

 (1) they will ignore the sparse specification and write files with
 conflicts to the working tree (thus temporarily expanding the
 sparse specification to include such files.)
 (2) they are grouped with commands which move to a new commit, since
 they often create a commit and then move to it, even though we
 know there are many exceptions to moving to the new commit. (For
 example, the user may rebase a commit that becomes empty, or have
 a cherry-pick which conflicts, or a user could run `merge
 --no-commit`, and we also view `apply --index` kind of like `am
 --no-commit`.) As such, these commands can make changes to index
 files outside the sparse specification, though they'll mark such
 files with SKIP_WORKTREE.

 * "restrict also specially applied to untracked files"

 Commands in this class generally behave like the "restrict" class,
 except that they have to handle untracked files differently too, often
 because these commands are dealing with files changing state between
 'tracked' and 'untracked'. Often, this may mean printing an error
 message if the command had nothing to do, but the arguments may have
 referred to files whose tracked-ness state could have changed were it
 not for the sparsity patterns excluding them.

 * "no restrict"

 Commands in this class ignore the sparse specification entirely.

 * "restrict or no restrict dependent upon behavior A vs. behavior B"

 Commands in this class behave like "no restrict" for folks in the
 behavior B camp, and like "restrict" for folks in the behavior A camp.
 However, when behaving like "restrict" a warning of some sort might be
 provided that history queries have been limited by the sparse-checkout
 specification.
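
To make the difference between the first two classes concrete, here is a minimal sketch of which paths each class would touch in the working tree. It is an illustration only: the function names are hypothetical, paths are plain strings, and `fnmatchcase` stands in for Git's pathspec matching, which behaves differently in detail.

```python
from fnmatch import fnmatchcase


def restrict(pathspecs, tracked, sparse_spec):
    """'restrict': explicitly given pathspecs are intersected with the
    sparse specification."""
    selected = {path for path in tracked
                if any(fnmatchcase(path, spec) for spec in pathspecs)}
    return selected & sparse_spec


def restrict_modulo_conflicts(changed, conflicted, sparse_spec):
    """'restrict modulo conflicts': write changed paths inside the sparse
    specification, plus every conflicted path regardless of it."""
    return (changed & sparse_spec) | conflicted


def no_restrict(changed):
    """'no restrict': the sparse specification is ignored entirely."""
    return set(changed)


tracked = {"src/logo.png", "src/main.c", "docs/diagram.png", "lib/x.c"}
sparse_spec = {"src/logo.png", "src/main.c"}

restored = restrict(["*.png"], tracked, sparse_spec)
written = restrict_modulo_conflicts(
    changed={"src/main.c", "docs/diagram.png", "lib/x.c"},
    conflicted={"lib/x.c"},
    sparse_spec=sparse_spec,
)
```

In this example `restored` contains only `src/logo.png`, and `written` contains `src/main.c` plus the conflicted `lib/x.c`, which temporarily joins the sparse specification.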


== Subcommand-dependent defaults ==

Note that we have different defaults depending on the command for the
desired behavior:

 * Commands defaulting to "restrict":

 ** diff-files
 ** diff (without --cached or REVISION arguments)
 ** grep (without --cached or REVISION arguments)
 ** switch
 ** checkout (the switch-like half)
 ** reset (<commit>)

 ** restore
 ** checkout (the restore-like half)
 ** checkout-index
 ** reset (with pathspec)

 This behavior makes sense; these interact with the working tree.

 * Commands defaulting to "restrict modulo conflicts":

 ** merge
 ** rebase
 ** cherry-pick
 ** revert

 ** am
 ** apply --index (which is kind of like an `am --no-commit`)

 ** read-tree (especially with -m or -u; is kind of like a --no-commit merge)
 ** reset (<tree-ish>, due to similarity to read-tree)

 These also interact with the working tree, but require slightly
 different behavior either so that (a) conflicts can be resolved or (b)
 because they are kind of like a merge-without-commit operation.

 (See also the "Known bugs" section below regarding `am` and `apply`)

 * Commands defaulting to "no restrict":

 ** archive
 ** bundle
 ** commit
 ** format-patch
 ** fast-export
 ** fast-import
 ** commit-tree

 ** stash
 ** apply (without `--index`)

 These have completely different defaults and perhaps deserve the most
 detailed explanation:

 In the case of commands in the first group (format-patch,
 fast-export, bundle, archive, etc.), these are commands for
 communicating history, which will be broken if they restrict to a
 subset of the repository. As such, they operate on full paths and
 have no `--restrict` option for overriding. Some of these commands may
 take paths for manually restricting what is exported, but it needs to
 be very explicit.

 In the case of stash, it needs to vivify files to avoid losing the
 user's changes.

 In the case of apply without `--index`, that command needs to update
 the working tree without the index (or the index without the working
 tree if `--cached` is passed), and if we restrict those updates to the
 sparse specification then we'll lose changes from the user.

 * Commands defaulting to "restrict also specially applied to untracked files":

 ** add
 ** rm
 ** mv
 ** update-index
 ** status
 ** clean (?)

....
 Our original implementation for the first three of these commands was
 "no restrict", but it had some severe usability issues:

 * `git add <somefile>` if honored and outside the sparse
 specification, can result in the file randomly disappearing later
 when some subsequent command is run (since various commands
 automatically clean up unmodified files outside the sparse
 specification).
 * `git rm '*.jpg'` could very negatively surprise users if it deletes
 files outside the range of the user's interest.
 * `git mv` has similar surprises when moving into or out of the cone,
 so best to restrict by default

 So, we switched `add` and `rm` to default to "restrict", which made
 usability problems much less severe and less frequent, but we still got
 complaints because commands like:

 git add <file-outside-sparse-specification>
 git rm <file-outside-sparse-specification>

 would silently do nothing. We should instead print an error in those
 cases to get usability right.

 update-index needs to be updated to match, and status and maybe clean
 also need to be updated to specially handle untracked paths.

 There may be a difference in here between behavior A and behavior B in
 terms of verbosity of errors or additional warnings.
....

 * Commands falling under "restrict or no restrict dependent upon behavior
 A vs. behavior B"

 ** diff (with --cached or REVISION arguments)
 ** grep (with --cached or REVISION arguments)
 ** show (when given commit arguments)
 ** blame (only matters when one or more -C flags passed)
 *** and annotate
 ** log
 *** and variants: shortlog, gitk, show-branch, whatchanged, rev-list
 ** ls-files
 ** diff-index
 ** diff-tree
 ** ls-tree

 For now, we default to behavior B for these, which means a default of
 "no restrict".

 Note that two of these commands -- diff and grep -- also appeared in a
 different list with a default of "restrict", but only when limited to
 searching the working tree. The working tree vs. history distinction
 is fundamental in how behavior B operates, so this is expected. Note,
 though, that for diff and grep with --cached, when doing "restrict"
 behavior, the difference between sparse specification and sparsity
 patterns is important to handle.

 "restrict" may make more sense as the long term default for these[12].
 Also, supporting "restrict" for these commands might be a fair amount
 of work to implement, meaning it might be implemented over multiple
 releases. If that behavior were the default in the commands that
 supported it, that would force behavior B users to need to learn to
 slowly add additional flags to their commands, depending on git
 version, to get the behavior they want. That gradual switchover would
 be painful, so we should avoid it at least until it's fully
 implemented.


== Sparse specification vs. sparsity patterns ==

In a well-behaved situation, the sparse specification is given directly
by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
diverge for a few reasons:

  * needing to resolve conflicts (merging will vivify conflicted files)
  * running Git commands that implicitly vivify files (e.g. "git stash apply")
  * running Git commands that explicitly vivify files (e.g. "git checkout
    --ignore-skip-worktree-bits FILENAME")
  * other commands that write to these files (perhaps a user copies a file
    from elsewhere)

For the last item, note that we do automatically clear the SKIP_WORKTREE
bit for files that are present in the working tree. This has been true
since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
2022-03-09).

However, such a situation is transient because:

  * Such transient differences can and will be automatically removed as
    a side-effect of commands which call unpack_trees() (checkout,
    merge, reset, etc.).
  * Users can also request such transient differences be corrected via
    running `git sparse-checkout reapply`. Various places recommend
    running that command.
  * Additional commands are also welcome to implicitly fix these
    differences; we may add more in the future.

While we avoid dropping unstaged changes or files which have conflicts,
we otherwise aggressively try to fix these transient differences. If
users want these differences to persist, they should run the `set` or
`add` subcommands of `git sparse-checkout` to reflect their intended
sparse specification.

However, when we need to do a query on history restricted to the
"relevant subset of files" such a transiently expanded sparse
specification is ignored. There are a couple of reasons for this:

  * The behavior wanted when doing something like
        git grep expression REVISION
    is roughly what the users would expect from
        git checkout REVISION && git grep expression
    (modulo a "REVISION:" prefix), which has a couple of ramifications:

  * REVISION may have paths not in the current index, so there is no
    path we can consult for a SKIP_WORKTREE setting for those paths.

  * Since `checkout` is one of those commands that tries to remove
    transient differences in the sparse specification, it makes sense
    to use the corrected sparse specification
    (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
    consult SKIP_WORKTREE anyway.

So, a transiently expanded (or restricted) sparse specification applies to
the working tree, but not to history queries where we always use the
sparsity patterns. (See [16] for an early discussion of this.)

Similar to a transiently expanded sparse specification of the working tree
based on additional files being present in the working tree, we also need
to consider additional files being modified in the index. In particular,
if the user has staged changes to files (relative to HEAD) that do not
match the sparsity patterns, and the file is not present in the working
tree, we still want to consider the file part of the sparse specification
if we are specifically performing a query related to the index (e.g. git
diff --cached [REVISION], git diff-index [REVISION], git restore --staged
--source=REVISION -- PATHS, etc.). Note that a transiently expanded sparse
specification for the index usually only matters under behavior A, since
under behavior B index operations are lumped with history and tend to
operate full-tree.


== Implementation Questions ==

  * Do the options --scope={sparse,all} sound good to others? Are there better options?

    ** Names in use, or appearing in patches, or previously suggested:

       *** --sparse/--dense
       *** --ignore-skip-worktree-bits
       *** --ignore-skip-worktree-entries
       *** --ignore-sparsity
       *** --[no-]restrict-to-sparse-paths
       *** --full-tree/--sparse-tree
       *** --[no-]restrict
       *** --scope={sparse,all}
       *** --focus/--unfocus
       *** --limit/--unlimited

    ** Rationale making me lean slightly towards --scope={sparse,all}:

       *** We want a name that works for many commands, so we need a name that
           does not conflict
       *** We know that we have more than two possible usecases, so it is best
           to avoid a flag that appears to be binary.
       *** --scope={sparse,all} isn't overly long and seems relatively
           explanatory
       *** `--sparse`, as used in add/rm/mv, is totally backwards for
           grep/log/etc. Changing the meaning of `--sparse` for these
           commands would fix the backwardness, but possibly break existing
           scripts. Using a new name pairing would allow us to treat
           `--sparse` in these commands as a deprecated alias.
       *** There is a different `--sparse`/`--dense` pair for commands using
           revision machinery, so using that naming might cause confusion
       *** There is also a `--sparse` in both pack-objects and show-branch, which
           don't conflict but do suggest that `--sparse` is overloaded
       *** The name --ignore-skip-worktree-bits is a double negative, is
           quite a mouthful, refers to an implementation detail that many
           users may not be familiar with, and we'd need a negation for it
           which would probably be even more ridiculously long. (But we
           can make --ignore-skip-worktree-bits a deprecated alias for
           --no-restrict.)

    ** If a config option is added (sparse.scope?) what should the values and
       description be? "sparse" (behavior A), "worktree-sparse-history-dense"
       (behavior B), "dense" (behavior C)? There's a risk of confusion,
       because even for Behaviors A and B we want some commands to be
       full-tree and others to operate sparsely, so the wording may need to be
       more tied to the usecases and somehow explain that. Also, right now,
       the primary difference we are focusing on is just the history-querying
       commands (log/diff/grep). Previous config suggestion here: [13]

    ** Is `--no-expand` a good alias for ls-files's `--sparse` option?
       (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
       because in non-cone mode it does nothing and in cone-mode it shows the
       sparse directory entries which are technically outside the sparse
       specification)

    ** Under Behavior A:

       *** Does ls-files' `--no-expand` override the default `--scope=all`, or
           does it need an extra flag?
       *** Does ls-files' `-t` option imply `--scope=all`?
       *** Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?

    ** sparse-checkout: once behavior A is fully implemented, should we take
       an interim measure to ease people into switching the default? Namely,
       if folks are not already in a sparse checkout, then require
       `sparse-checkout init/set` to take a
       `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
       would set sparse.scope according to the setting given), and throw an
       error if the flag is not provided? That error would be a great place
       to warn folks that the default may change in the future, and get them
       used to specifying what they want so that the eventual default switch
       is seamless for them.


== Implementation Goals/Plans ==

 * Get buy-in on this document in general.

 * Figure out answers to the 'Implementation Questions' section (above)

 * Fix bugs in the 'Known bugs' section (below)

 * Provide some kind of method for backfilling the blobs within the sparse
   specification in a partial clone

 [Below here is kind of spitballing since the first two haven't been resolved]

 * update-index: flip the default to --no-ignore-skip-worktree-entries,
   nuke this stupid "Oh, there's a bug? Let me add a flag to let users
   request that they not trigger this bug." flag

 * Flags & Config

   ** Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
   ** Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
      a deprecated alias for `--scope=all`
   ** Create config option (sparse.scope?), tie it to the "Cliff notes"
      overview

   ** Add --scope=sparse (and --scope=all) flag to each of the history querying
      commands. IMPORTANT: make sure diff machinery changes don't mess with
      format-patch, fast-export, etc.

== Known bugs ==

This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
been working on it.

1. Behavior A is not well supported in Git. (Behavior B didn't use to
   be either, but was the easier of the two to implement.)

2. am and apply:

   apply, without `--index` or `--cached`, relies on files being present
   in the working copy, and also writes to them unconditionally. As
   such, it should first check for the files' presence, and if found to
   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
   its work. Currently, it just throws an error.

   apply, with either `--cached` or `--index`, will not preserve the
   SKIP_WORKTREE bit. This is fine if the file has conflicts, but
   otherwise SKIP_WORKTREE bits should be preserved for --cached and
   probably also for --index.

   am, if there are no conflicts, will vivify files and fail to preserve
   the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
   specified, it will vivify files and then complain the patch doesn't
   apply. If there are conflicts and `-3` is specified, it will vivify
   files and then complain that those vivified files would be
   overwritten by merge.

3. reset --hard:

   reset --hard provides a confusing error message (it works correctly,
   but misleads the user into believing it didn't):

      $ touch addme
      $ git add addme
      $ git ls-files -t
      H addme
      H tracked
      S tracked-but-maybe-skipped
      $ git reset --hard                  # usually works great
      error: Path 'addme' not uptodate; will not remove from working tree.
      HEAD is now at bdbbb6f third
      $ git ls-files -t
      H tracked
      S tracked-but-maybe-skipped
      $ ls -1
      tracked

   `git reset --hard` DID remove addme from the index and the working tree, contrary
   to the error message, but in line with how reset --hard should behave.

4. read-tree

   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
   entries it reads into the index, resulting in all your files suddenly
   appearing to be "deleted".

5. Checkout, restore:

   These commands do not handle path & revision arguments appropriately:

      $ ls
      tracked
      $ git ls-files -t
      H tracked
      S tracked-but-maybe-skipped
      $ git status --porcelain
      $ git checkout -- '*skipped'
      error: pathspec '*skipped' did not match any file(s) known to git
      $ git ls-files -- '*skipped'
      tracked-but-maybe-skipped
      $ git checkout HEAD -- '*skipped'
      error: pathspec '*skipped' did not match any file(s) known to git
      $ git ls-tree HEAD | grep skipped
      100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped
      $ git status --porcelain
      $ git checkout HEAD~1 -- '*skipped'
      $ git ls-files -t
      H tracked
      H tracked-but-maybe-skipped
      $ git status --porcelain
      M  tracked-but-maybe-skipped
      $ git checkout HEAD -- '*skipped'
      $ git status --porcelain
      $

   Note that checkout without a revision (or restore --staged) fails to
   find a file to restore from the index, even though ls-files shows
   such a file certainly exists.

   Similar issues occur with HEAD (--source=HEAD in restore's case),
   but it suddenly works when HEAD~1 is specified. And then after that it
   will work with HEAD specified, even though it didn't before.

   Directories are also an issue:

      $ git sparse-checkout set nomatches
      $ git status
      On branch main
      You are in a sparse checkout with 0% of tracked files present.

      nothing to commit, working tree clean
      $ git checkout .
      error: pathspec '.' did not match any file(s) known to git
      $ git checkout HEAD~1 .
      Updated 1 path from 58916d9
      $ git ls-files -t
      S tracked
      H tracked-but-maybe-skipped

6. checkout and restore --staged, continued:

   These commands do not correctly scope operations to the sparse
   specification, and make it worse by not setting important SKIP_WORKTREE
   bits:

      $ git restore --source OLDREV --staged outside-sparse-cone/
      $ git status --porcelain
      MD outside-sparse-cone/file1
      MD outside-sparse-cone/file2
      MD outside-sparse-cone/file3

   We can add a --scope=all mode to `git restore` to let it operate outside
   the sparse specification, but then it will be important to set the
   SKIP_WORKTREE bits appropriately.

7. Performance issues; see:

   https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/


== Reference Emails ==

Emails that detail various bugs we've had in sparse-checkout:

[1] (Original descriptions of behavior A & behavior B):

https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/

[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences):

https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/

[3] (Present-despite-skipped entries):

https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/

[4] (Clone --no-checkout interaction):

https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/

[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`):

https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/

[6] (SKIP_WORKTREE is advisory, not mandatory):

https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/

[7] (`worktree add` should copy sparsity settings from current worktree):

https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/

[8] (Avoid negative surprises in add, rm, and mv):

  * https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
  * https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/

[9] (Move from out-of-cone to in-cone):

  * https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
  * https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/

[10] (Unnecessarily downloading objects outside sparse specification):

https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/

[11] (Stolee's comments on high-level usecases):

https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/

[12] Others commenting on eventually switching default to behavior A:

  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/

[13] Previous config name suggestion and description:

  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/

[14] Tangential issue: switch to cone mode as default sparse specification mechanism:

https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/

[15] Lengthy email on grep behavior, covering what should be searched:

https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/

[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
     search for the parenthetical comment starting "We do not check".

https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/

[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/