Table of contents:

  * Terminology
  * Purpose of sparse-checkouts
  * Usecases of primary concern
  * Oversimplified mental models ("Cliff Notes" for this document!)
  * Desired behavior
  * Behavior classes
  * Subcommand-dependent defaults
  * Sparse specification vs. sparsity patterns
  * Implementation Questions
  * Implementation Goals/Plans
  * Known bugs
  * Reference Emails


=== Terminology ===

cone mode: one of two modes for specifying the desired subset of files
        in a sparse-checkout. In cone mode, the user specifies
        directories (getting both everything under that directory as
        well as everything in leading directories), while in non-cone
        mode, the user specifies gitignore-style patterns. Controlled
        by the --[no-]cone option to sparse-checkout init|set.
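
As a rough illustration of the cone mode rule, the following Python sketch
decides whether a path would be selected by a set of cone directories. It
is a toy model, not Git's implementation: the function name `in_cone` and
the example paths are hypothetical, and it assumes that "everything in
leading directories" means the files directly inside each parent directory
(including the top level), not the other subdirectories of those parents.

```python
from pathlib import PurePosixPath


def in_cone(path, cone_dirs):
    """Return True if `path` is selected by the given cone directories.

    Toy model (an assumption, not Git's code): a file is selected when it
    lives anywhere under a listed directory, or directly inside one of
    that directory's leading (parent) directories, including the top level.
    """
    parent = PurePosixPath(path).parent
    for d in cone_dirs:
        cone = PurePosixPath(d)
        # Everything under the listed directory, recursively.
        if parent == cone or cone in parent.parents:
            return True
        # Files sitting directly in a leading directory of the cone.
        if parent in cone.parents:
            return True
    return False
```

With `cone_dirs=["src/lib"]`, this model selects `README`, `src/Makefile`
and `src/lib/x/b.c`, but not `src/app/main.c` or `docs/guide.txt`.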

SKIP_WORKTREE: When a tracked file does not match the sparse specification
        and is removed from the working tree, its entry in the index is
        marked with a SKIP_WORKTREE bit. Note that if a tracked file has the
        SKIP_WORKTREE bit set but the file is later written by the user to
        the working tree anyway, the SKIP_WORKTREE bit will be cleared at
        the beginning of any subsequent Git operation.

        Most sparse checkout users are unaware of this implementation
        detail, and the term should generally be avoided in user-facing
        descriptions and command flags. Unfortunately, prior to the
        `sparse-checkout` subcommand this low-level detail was exposed,
        and as of the time of writing, it is still exposed in various
        places.
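
The clearing rule from the first paragraph of this entry can be sketched in
a few lines of Python. This is only a model under stated assumptions:
`IndexEntry`, `clear_stale_skip_worktree` and the `files_on_disk` set are
hypothetical names standing in for Git's index entries and a real
filesystem check.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    path: str
    skip_worktree: bool


def clear_stale_skip_worktree(entries, files_on_disk):
    """Clear SKIP_WORKTREE on entries whose file exists in the working tree.

    `files_on_disk` stands in for a real filesystem check (an assumption
    of this sketch). Returns the paths whose bit was cleared.
    """
    cleared = []
    for entry in entries:
        if entry.skip_worktree and entry.path in files_on_disk:
            entry.skip_worktree = False
            cleared.append(entry.path)
    return cleared
```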

sparse-checkout: a subcommand in git used to reduce the files present in
        the working tree to a subset of all tracked files. Also, the
        name of the file in the $GIT_DIR/info directory used to track
        the sparsity patterns corresponding to the user's desired
        subset.

sparse cone: see cone mode

sparse directory: An entry in the index corresponding to a directory, which
        appears in the index instead of all the files under that directory
        that would normally appear. See also sparse-index. Something that
        can cause confusion is that the "sparse directory" does NOT match
        the sparse specification, i.e. the directory is NOT present in the
        working tree. May be renamed in the future (e.g. to "skipped
        directory").

sparse index: A special mode for sparse-checkout that also makes the
        index sparse by recording a directory entry in lieu of all the
        files underneath that directory (thus making that a "skipped
        directory" which unfortunately has also been called a "sparse
        directory"), and does this for potentially multiple
        directories. Controlled by the --[no-]sparse-index option to
        init|set|reapply.
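
To make the "directory entry in lieu of all the files underneath" idea
concrete, here is a minimal Python sketch. It is not how Git stores its
index: entries are plain strings, a trailing "/" marks a directory entry,
and the function name `collapse_index` and the `skipped_dirs` argument are
hypothetical. The sketch also assumes every file under a skipped directory
is outside the sparse specification.

```python
def collapse_index(paths, skipped_dirs):
    """Replace every path under a skipped directory with one directory entry.

    Toy model: entries are plain strings, and a trailing "/" marks a
    directory entry (a "sparse directory" in the terminology above).
    """
    result = []
    for path in sorted(paths):
        for d in skipped_dirs:
            prefix = d.rstrip("/") + "/"
            if path.startswith(prefix):
                # Record the directory once, however many files it covers.
                if prefix not in result:
                    result.append(prefix)
                break
        else:
            # Not under any skipped directory: keep the file entry as-is.
            result.append(path)
    return result
```

For example, collapsing `docs` in an index holding `README`, `docs/a.txt`,
`docs/b.txt` and `src/main.c` leaves three entries: `README`, `docs/` and
`src/main.c`.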

sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
        define the set of files of interest. A warning: It is easy to
        over-use this term (or the shortened "patterns" term), for two
        reasons: (1) users in cone mode specify directories rather than
        patterns (their directories are transformed into patterns, but
        users may think you are talking about non-cone mode if you use the
        word "patterns"), and (2) the sparse specification might
        transiently differ in the working tree or index from the sparsity
        patterns (see "Sparse specification vs. sparsity patterns").

sparse specification: The set of paths in the user's area of focus. This
        is typically just the tracked files that match the sparsity
        patterns, but the sparse specification can temporarily differ and
        include additional files. (See also "Sparse specification
        vs. sparsity patterns")

        * When working with history, the sparse specification is exactly
          the set of files matching the sparsity patterns.
        * When interacting with the working tree, the sparse specification
          is the set of tracked files with a clear SKIP_WORKTREE bit or
          tracked files present in the working copy.
        * When modifying or showing results from the index, the sparse
          specification is the set of files with a clear SKIP_WORKTREE bit
          or that differ in the index from HEAD.
        * If working with the index and the working copy, the sparse
          specification is the union of the paths from above.
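
The four cases above can be written down as set computations. The Python
sketch below is a model for discussion, not Git code: the `Entry` fields and
the `sparse_specification` function are hypothetical, and the last case is
read here as the union of the working-tree and index sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str
    matches_patterns: bool   # path matches the sparsity patterns
    skip_worktree: bool      # SKIP_WORKTREE bit set in the index
    in_worktree: bool        # file is present in the working copy
    differs_from_head: bool  # index version differs from HEAD


def sparse_specification(entries, context):
    """Return the set of paths in the sparse specification for a context.

    `context` is one of "history", "worktree", "index", or "both"
    (index and working copy together).
    """
    if context == "history":
        return {e.path for e in entries if e.matches_patterns}
    if context == "worktree":
        return {e.path for e in entries
                if not e.skip_worktree or e.in_worktree}
    if context == "index":
        return {e.path for e in entries
                if not e.skip_worktree or e.differs_from_head}
    if context == "both":
        # Assumption: "the paths from above" means the two sets just defined.
        return (sparse_specification(entries, "worktree")
                | sparse_specification(entries, "index"))
    raise ValueError(f"unknown context: {context}")
```

Note how a file that is marked SKIP_WORKTREE but was written to the working
copy anyway, or one that has staged changes, joins the sparse specification
for the relevant context even though it does not match the sparsity
patterns.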

vivifying: When a command restores a tracked file to the working tree (and
        hopefully also clears the SKIP_WORKTREE bit in the index for that
        file), this is referred to as "vivifying" the file.


=== Purpose of sparse-checkouts ===

sparse-checkouts exist to allow users to work with a subset of their
files.

You can think of sparse-checkouts as subdividing "tracked" files into two
categories -- a sparse subset, and all the rest. Implementationally, we
mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
out of the working tree. The SKIP_WORKTREE files are still tracked, just
not present in the working tree.

In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
is missing from the working tree but pretend the file contents match HEAD".
That was not only bogus (it actually meant the file missing from the
working tree matched the index rather than HEAD), but it was also a
low-level detail which only provided decent behavior for a few commands.
There were a surprising number of ways in which that guiding principle gave
command results that violated user expectations, and as such it was a bad
mental model. However, it persisted for many years and may still be found
in some corners of the code base.

Anyway, the idea of "working with a subset of files" is simple enough, but
there are multiple different high-level usecases which affect how some Git
subcommands should behave. Further, even if we only considered one of
those usecases, sparse-checkouts can modify different subcommands in over a
half dozen different ways. Let's start by considering the high-level
usecases:

  A) Users are _only_ interested in the sparse portion of the repo

  A*) Users are _only_ interested in the sparse portion of the repo
      that they have downloaded so far

  B) Users want a sparse working tree, but are working in a larger whole

  C) sparse-checkout is a behind-the-scenes implementation detail allowing
     Git to work with a specially crafted in-house virtual file system;
     users are actually working with a "full" working tree that is
     lazily populated, and sparse-checkout helps with the lazy population
     piece.

It may be worth explaining each of these in a bit more detail:


  (Behavior A) Users are _only_ interested in the sparse portion of the repo

These folks might know there are other things in the repository, but
don't care. They are uninterested in other parts of the repository, and
only want to know about changes within their area of interest. Showing
them other files from history (e.g. from diff/log/grep/etc.) is a
usability annoyance, potentially a huge one since other changes in
history may dwarf the changes they are interested in.

Some of these users also arrive at this usecase from wanting to use partial
clones together with sparse checkouts (in a way where they have downloaded
blobs within the sparse specification) and do disconnected development.
Not only do these users generally not care about other parts of the
repository, but they consider it a blocker for Git commands to try to
operate on those. If commands attempt to access paths in history outside
the sparse specification, then the partial clone will attempt to download
additional blobs on demand, fail, and then fail the user's command. (This
may be unavoidable in some cases, e.g. when `git merge` has non-trivial
changes to reconcile outside the sparse specification, but we should limit
how often users are forced to connect to the network.)

Also, even for users using partial clones that do not mind being
always connected to the network, the need to download blobs as
side-effects of various other commands (such as the printed diffstat
after a merge or pull) can lead to worries about local repository size
growing unnecessarily[10].

  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
      that they have downloaded so far (a variant on the first usecase)

This variant is driven by folks who use partial clones together with
sparse checkouts and do disconnected development (so far sounding like a
subset of behavior A users), and who do so on very large repositories. The
reason for yet another variant is that downloading even just the blobs
through history within their sparse specification may be too much, so they
only download some. They would still like operations to succeed without
network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
or `git grep ${SEARCH_TERM} OLDREV` would need to be prepared to provide
partial results that depend on what happens to have been downloaded.

This variant could be viewed as Behavior A with the sparse specification
for history querying operations modified from "sparsity patterns" to
"sparsity patterns limited to the blobs we have already downloaded".
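
A small Python sketch may help show what "partial results" would mean for a
history query under Behavior A*. Nothing like this exists in Git today, and
every name here (`history_search`, `blobs_at_rev`, `downloaded`,
`matches_patterns`) is hypothetical; the point is only that paths whose
blobs are missing locally get reported rather than fetched.

```python
def history_search(blobs_at_rev, downloaded, matches_patterns, term):
    """Search one revision for `term`, restricted as in Behavior A*.

    blobs_at_rev:     dict mapping path -> blob id at the revision
    downloaded:       dict mapping blob id -> contents of local blobs
    matches_patterns: callable(path) -> bool for the sparsity patterns
    Returns (paths containing term, paths skipped for lack of a blob).
    """
    hits, skipped = [], []
    for path, blob_id in sorted(blobs_at_rev.items()):
        if not matches_patterns(path):
            continue  # outside the sparsity patterns: never examined
        if blob_id not in downloaded:
            skipped.append(path)  # would need the network; report instead
            continue
        if term in downloaded[blob_id]:
            hits.append(path)
    return hits, skipped
```

Returning the skipped paths alongside the hits is one possible design; it
lets a caller warn that the answer is incomplete instead of silently
omitting files.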

  (Behavior B) Users want a sparse working tree, but are working in a
      larger whole

Stolee described this usecase this way[11]:

"I'm also focused on users that know that they are a part of a larger
whole. They know they are operating on a large repository but focus on
what they need to contribute their part. I expect multiple "roles" to
use very different, almost disjoint parts of the codebase. Some other
"architect" users operate across the entire tree or hop between different
sections of the codebase as necessary. In this situation, I'm wary of
scoping too many features to the sparse-checkout definition, especially
"git log," as it can be too confusing to have their view of the codebase
depend on your "point of view."

People might also end up wanting behavior B due to complex inter-project
dependencies. The initial attempts to use sparse-checkouts usually involve
the directories you are directly interested in plus what those directories
depend upon within your repository. But there's a monkey wrench here: if
you have integration tests, they invert the hierarchy: to run integration
tests, you need not only what you are interested in and its in-tree
dependencies, you also need everything that depends upon what you are
interested in or that depends upon one of your dependencies...AND you need
all the in-tree dependencies of that expanded group. That can easily
change your sparse-checkout into a nearly dense one.
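
The expansion described in the paragraph above is a three-step graph
closure, which the following Python sketch spells out. The directory names
in the usage note and the `deps` mapping are made up for illustration, and
the sketch assumes "depends upon" is meant transitively.

```python
def closure(start, edges):
    """Return `start` plus everything reachable from it via `edges`."""
    seen = set(start)
    stack = list(start)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def dirs_needed_for_integration_tests(interest, deps):
    """Expand a set of directories for running integration tests.

    deps maps a directory to the directories it depends upon.
    """
    # Reverse edges: rdeps[x] lists the directories that depend upon x.
    rdeps = {}
    for src, targets in deps.items():
        for dst in targets:
            rdeps.setdefault(dst, set()).add(src)

    base = closure(interest, deps)    # what you want plus its dependencies
    expanded = closure(base, rdeps)   # plus everything depending on those
    return closure(expanded, deps)    # plus dependencies of that group
```

With `deps = {"app1": {"libcore"}, "app2": {"libcore", "libnet"},
"app3": {"libui"}, "libnet": {"libcore"}}` and an interest in only `app1`,
the result is `app1`, `app2`, `libcore` and `libnet`: four of the six
directories, because the shared `libcore` pulls in its other users.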

Naturally, that tends to kill the benefits of sparse-checkouts. There are
a couple of solutions to this conundrum: either avoid grabbing in-repo
dependencies (maybe have built versions of your in-repo dependencies pulled
from a CI cache somewhere), or say that users shouldn't run integration
tests directly and instead do it on the CI server when they submit a code
review. Or do both. Regardless of whether you stub out your in-repo
dependencies or stub out the things that depend upon you, there is
certainly a reason to want to query and be aware of those other stubbed-out
parts of the repository, particularly when the dependencies are complex or
change relatively frequently. Thus, for such uses, sparse-checkouts can be
used to limit what you directly build and modify, but these users do not
necessarily want their sparse checkout paths to limit their queries of
versions in history.

Some people may also be interested in behavior B over behavior A simply as
a performance workaround: if they are using non-cone mode, then they have
to deal with its inherent quadratic performance problems. In that mode,
every operation that checks whether paths match the sparse specification
can be expensive. As such, these users may only be willing to pay for
those expensive checks when interacting with the working copy, and may
prefer getting "unrelated" results from their history queries over having
slow commands.

  (Behavior C) sparse-checkout is an implementation detail supporting a
      special VFS.

This usecase goes slightly against the traditional definition of
sparse-checkout in that it actually tries to present a full or dense
checkout to the user. However, this usecase utilizes the same technical
underpinnings in a new way which does provide some performance
advantages to users. The basic idea is that a company can have an in-house
Git-aware Virtual File System which pretends all files are present in the
working tree, by intercepting all file system accesses and using those to
fetch and write accessed files on demand via partial clones. The VFS uses
sparse-checkout to prevent Git from writing or paying attention to many
files, and manually updates the sparse checkout patterns itself based on
user access and modification of files in the working tree. See commit
ecc7c8841d ("repo_read_index: add config to expect files outside sparse
patterns", 2022-02-25) and the link at [17] for a more detailed description
of such a VFS.

The biggest difference here is that users are completely unaware that the
sparse-checkout machinery is even in use. The sparse patterns are not
specified by the user but rather are under the complete control of the VFS
(and the patterns are updated frequently and dynamically by it). The user
will perceive the checkout as dense, and commands should thus behave as if
all files are present.


=== Usecases of primary concern ===

Most of the rest of this document will focus on Behavior A and Behavior
B. Some notes about the other two cases and why we are not focusing on
them:

  (Behavior A*)

Supporting this usecase is estimated to be difficult and a lot of work.
There are no plans to implement it currently, but it may be a potential
future alternative. Knowing about the existence of additional alternatives
may affect our choice of command line flags (e.g. if we need tri-state or
quad-state flags rather than just binary flags), so it was still important
to at least note.

Further, I believe the descriptions below for Behavior A are probably still
valid for this usecase, with the only exception being that it redefines the
sparse specification to restrict it to already-downloaded blobs. The hard
part is in making commands capable of respecting that modified definition.

  (Behavior C)

This usecase violates some of the early sparse-checkout documented
assumptions (since files marked as SKIP_WORKTREE will be displayed to users
as present in the working tree). That violation may mean various
sparse-checkout related behaviors are not well suited to this usecase and
we may need tweaks -- to both documentation and code -- to handle it.
However, this usecase is also perhaps the simplest model to support in that
everything behaves like a dense checkout with a few exceptions (e.g. branch
checkouts and switches write fewer things, knowing the VFS will lazily
write the rest on an as-needed basis).

Since there is no publicly available VFS-related code for folks to try,
the number of folks who can test such a usecase is limited.

The primary reason to note the Behavior C usecase is that as we fix things
to better support Behaviors A and B, there may be additional places where
we need to make tweaks allowing folks in this usecase to get the original
non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
add config to expect files outside sparse patterns", 2022-02-25). The
secondary reason to note Behavior C is so that folks taking advantage of
Behavior C do not assume they are part of the Behavior B camp and propose
patches that break things for the real Behavior B folks.


=== Oversimplified mental models ===

An oversimplification of the differences in the above behaviors is:

  Behavior A: Restrict worktree and history operations to sparse specification
  Behavior B: Restrict worktree operations to sparse specification; have any
              history operations work across all files
  Behavior C: Do not restrict either worktree or history operations to the
              sparse specification...with the exception of branch checkouts or
              switches which avoid writing files that will match the index so
              they can later lazily be populated instead.
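
The same summary can be written as a small lookup table. This Python
sketch is just a restatement of the three lines above with hypothetical
names (`MENTAL_MODEL`, `is_restricted`); it deliberately leaves out the
Behavior C exception for branch checkouts and switches.

```python
# behavior -> (restrict worktree operations, restrict history operations)
MENTAL_MODEL = {
    "A": (True, True),
    "B": (True, False),
    "C": (False, False),  # ignores the checkout/switch exception noted above
}


def is_restricted(behavior, operation):
    """Return True if `operation` ("worktree" or "history") is limited
    to the sparse specification under the given behavior."""
    if operation not in ("worktree", "history"):
        raise ValueError(f"unknown operation: {operation}")
    worktree, history = MENTAL_MODEL[behavior]
    return worktree if operation == "worktree" else history
```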


=== Desired behavior ===

As noted previously, despite the simple idea of just working with a subset
of files, there is a range of different behavioral changes that need to be
made to different subcommands to work well with such a feature. See
[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
that mere composition of other commands that individually worked correctly
in a sparse-checkout context did not imply that the higher level command
would work correctly; it sometimes requires further tweaks. So,
understanding these differences can be beneficial.

* Commands behaving the same regardless of high-level usecase

  * commands that only look at files within the sparse specification

      * diff (without --cached or REVISION arguments)
      * grep (without --cached or REVISION arguments)
      * diff-files

  * commands that restore files to the working tree that match sparsity
    patterns, and remove unmodified files that don't match those
    patterns:

      * switch
      * checkout (the switch-like half)
      * read-tree
      * reset --hard

  * commands that write conflicted files to the working tree, but otherwise
    will omit writing files to the working tree that do not match the
    sparsity patterns:

      * merge
      * rebase
      * cherry-pick
      * revert

      * `am` and `apply --cached` should probably be in this section but
        are buggy (see the "Known bugs" section below)

    The behavior for these commands somewhat depends upon the merge
    strategy being used:
      * `ort` behaves as described above
      * `recursive` tries to not vivify files unnecessarily, but does sometimes
        vivify files without conflicts.
      * `octopus` and `resolve` will always vivify any file changed in the merge
        relative to the first parent, which is rather suboptimal.

    It is also important to note that these commands WILL update the index
    outside the sparse specification relative to when the operation began,
    BUT these commands often make a commit just before or after such that
    by the end of the operation there is no change to the index outside the
    sparse specification. Of course, if the operation hits conflicts or
    does not make a commit, then these operations clearly can modify the
    index outside the sparse specification.

    Finally, it is important to note that at least the first four of these
    commands also try to remove differences between the sparse
    specification and the sparsity patterns (much like the commands in the
    previous section).

  * commands that always ignore sparsity since commits must be full-tree

      * archive
      * bundle
      * commit
      * format-patch
      * fast-export
      * fast-import
      * commit-tree

  * commands that write any modified file to the working tree (conflicted
    or not, and whether those paths match sparsity patterns or not):

      * stash
      * apply (without `--index` or `--cached`)

* Commands that may slightly differ for behavior A vs. behavior B:

  Commands in this category behave mostly the same between the two
  behaviors, but may differ in verbosity and types of warning and error
  messages.

  * commands that make modifications to which files are tracked:
      * add
      * rm
      * mv
      * update-index

    The fact that files can move between the 'tracked' and 'untracked'
    categories means some commands will have to treat untracked files
    differently. But if we have to treat untracked files differently,
    then additional commands may also need changes:

      * status
      * clean

    In particular, `status` may need to report any untracked files outside
    the sparse specification as an erroneous condition (especially to
    avoid the user trying to `git add` them, forcing `git add` to display
    an error).

    It's not clear to me exactly how (or even if) `clean` would change,
    but it's the other command that also affects untracked files.

    `update-index` may be slightly special. Its --[no-]skip-worktree flag
    may need to ignore the sparse specification by its nature. Also, its
    current --[no-]ignore-skip-worktree-entries default is totally bogus.

  * commands for manually tweaking paths in both the index and the working tree
      * `restore`
      * the restore-like half of `checkout`

    These commands should be similar to add/rm/mv in that they should
    only operate on the sparse specification by default, and require a
    special flag to operate on all files.

    Also, note that these commands currently have a number of issues (see
    the "Known bugs" section below).

* Commands that significantly differ for behavior A vs. behavior B:

  * commands that query history
      * diff (with --cached or REVISION arguments)
      * grep (with --cached or REVISION arguments)
      * show (when given commit arguments)
      * blame (only matters when one or more -C flags are passed)
        * and annotate
      * log
      * whatchanged
      * ls-files
      * diff-index
      * diff-tree
      * ls-tree

    Note: for log and whatchanged, revision walking logic is unaffected
    but displaying of patches is affected by scoping the command to the
    sparse-checkout. (The fact that revision walking is unaffected is
    why rev-list, shortlog, show-branch, and bisect are not in this
    list.)

    ls-files may be slightly special in that e.g. `git ls-files -t` is
    often used to see what is sparse and what is not. Perhaps -t should
    always work on the full tree?

* Commands I don't know how to classify

  * range-diff

    Is this like `log` or `format-patch`?

  * cherry

    See range-diff

* Commands unaffected by sparse-checkouts

  * shortlog
  * show-branch
  * rev-list
  * bisect

  * branch
  * describe
  * fetch
  * gc
  * init
  * maintenance
  * notes
  * pull (merge & rebase have the necessary changes)
  * push
  * submodule
  * tag

  * config
  * filter-branch (works in separate checkout without sparse-checkout setup)
  * pack-refs
  * prune
  * remote
  * repack
  * replace

  * bugreport
  * count-objects
  * fsck
  * gitweb
  * help
  * instaweb
  * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
  * rerere
  * verify-commit
  * verify-tag

  * commit-graph
  * hash-object
  * index-pack
  * mktag
  * mktree
  * multi-pack-index
  * pack-objects
  * prune-packed
  * symbolic-ref
  * unpack-objects
  * update-ref
  * write-tree (operates on index, possibly optimized to use sparse dir entries)

  * for-each-ref
  * get-tar-commit-id
  * ls-remote
  * merge-base (merges are computed full tree, so merge base should be too)
  * name-rev
  * pack-redundant
  * rev-parse
  * show-index
  * show-ref
  * unpack-file
  * var
  * verify-pack

  * <Everything under 'Interacting with Others' in 'git help --all'>
  * <Everything under 'Low-level...Syncing' in 'git help --all'>
  * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
  * <Everything under 'External commands' in 'git help --all'>

* Commands that might be affected, but who cares?

  * merge-file
  * merge-index
  * gitk?


=== Behavior classes ===

From the above there are a few classes of behavior:

  * "restrict"

    Commands in this class only read or write files in the working tree
    within the sparse specification.

    When moving to a new commit (e.g. switch, reset --hard), these commands
    may update index files outside the sparse specification as of the start
    of the operation, but by the end of the operation those index files
    will match HEAD again and thus those files will again be outside the
    sparse specification.

    When paths are explicitly specified, these paths are intersected with
    the sparse specification and will only operate on such paths.
    (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)

    Some of these commands may also attempt, at the end of their operation,
    to cull transient differences between the sparse specification and the
    sparsity patterns (see "Sparse specification vs. sparsity patterns" for
    details, but this basically means either removing unmodified files not
    matching the sparsity patterns and marking those files as
    SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
    marking those files as !SKIP_WORKTREE).

  * "restrict modulo conflicts"

    Commands in this class generally behave like the "restrict" class,
    except that:
      (1) they will ignore the sparse specification and write files with
          conflicts to the working tree (thus temporarily expanding the
          sparse specification to include such files.)
      (2) they are grouped with commands which move to a new commit, since
          they often create a commit and then move to it, even though we
          know there are many exceptions to moving to the new commit. (For
          example, the user may rebase a commit that becomes empty, or have
          a cherry-pick which conflicts, or a user could run `merge
          --no-commit`, and we also view `apply --index` kind of like `am
          --no-commit`.) As such, these commands can make changes to index
          files outside the sparse specification, though they'll mark such
          files with SKIP_WORKTREE.

  * "restrict also specially applied to untracked files"

    Commands in this class generally behave like the "restrict" class,
    except that they have to handle untracked files differently too, often
    because these commands are dealing with files changing state between
    'tracked' and 'untracked'. Often, this may mean printing an error
    message if the command had nothing to do, but the arguments may have
    referred to files whose tracked-ness state could have changed were it
    not for the sparsity patterns excluding them.

  * "no restrict"

    Commands in this class ignore the sparse specification entirely.

  * "restrict or no restrict dependent upon behavior A vs. behavior B"

    Commands in this class behave like "no restrict" for folks in the
    behavior B camp, and like "restrict" for folks in the behavior A camp.
    However, when behaving like "restrict" a warning of some sort might be
    provided that history queries have been limited by the sparse-checkout
    specification.
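
Two of the rules above reduce to simple set operations, sketched here in
Python. This is an illustration only: the function names are hypothetical,
and `fnmatchcase` is used as a stand-in for Git's pathspec matching, which
it only approximates (here `*` is allowed to match across `/`).

```python
from fnmatch import fnmatchcase


def restrict(pathspec, tracked, sparse_spec):
    """Model the "restrict" class for an explicitly specified path:
    the matching paths are intersected with the sparse specification."""
    selected = {p for p in tracked if fnmatchcase(p, pathspec)}
    return selected & set(sparse_spec)


def written_modulo_conflicts(changed, conflicted, sparse_spec):
    """Model the "restrict modulo conflicts" class: changed files are
    written only inside the sparse specification, except that files
    with conflicts are always written."""
    return {p for p in changed if p in sparse_spec or p in conflicted}
```

For instance, with `src/logo.png` inside the sparse specification and
`docs/img/a.png` outside it, `restrict("*.png", ...)` yields only
`src/logo.png`, which mirrors the `git restore -- '*.png'` example above.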
612612+613613+614614+=== Subcommand-dependent defaults ===
615615+616616+Note that we have different defaults depending on the command for the
617617+desired behavior :
618618+619619+ * Commands defaulting to "restrict":
620620+ * diff-files
621621+ * diff (without --cached or REVISION arguments)
622622+ * grep (without --cached or REVISION arguments)
623623+ * switch
624624+ * checkout (the switch-like half)
625625+ * reset (<commit>)
626626+627627+ * restore
628628+ * checkout (the restore-like half)
629629+ * checkout-index
630630+ * reset (with pathspec)
631631+632632+ This behavior makes sense; these interact with the working tree.
633633+634634+ * Commands defaulting to "restrict modulo conflicts":
635635+ * merge
636636+ * rebase
637637+ * cherry-pick
638638+ * revert
639639+640640+ * am
641641+ * apply --index (which is kind of like an `am --no-commit`)
642642+643643+ * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
644644+ * reset (<tree-ish>, due to similarity to read-tree)
645645+646646+ These also interact with the working tree, but require slightly
647647+ different behavior either so that (a) conflicts can be resolved or (b)
648648+ because they are kind of like a merge-without-commit operation.
649649+650650+ (See also the "Known bugs" section below regarding `am` and `apply`)
651651+652652+ * Commands defaulting to "no restrict":
653653+ * archive
654654+ * bundle
655655+ * commit
656656+ * format-patch
657657+ * fast-export
658658+ * fast-import
659659+ * commit-tree
660660+661661+ * stash
662662+ * apply (without `--index`)
663663+664664+ These have completely different defaults and perhaps deserve the most
665665+ detailed explanation:
666666+667667+ In the case of commands in the first group (format-patch,
668668+ fast-export, bundle, archive, etc.), these are commands for
669669+ communicating history, which will be broken if they restrict to a
670670+ subset of the repository. As such, they operate on full paths and
671671+ have no `--restrict` option for overriding. Some of these commands may
672672+ take paths for manually restricting what is exported, but it needs to
673673+ be very explicit.
674674+675675+ In the case of stash, it needs to vivify files to avoid losing the
676676+ user's changes.
677677+678678+ In the case of apply without `--index`, that command needs to update
679679+ the working tree without the index (or the index without the working
680680+ tree if `--cached` is passed), and if we restrict those updates to the
681681+ sparse specification then we'll lose changes from the user.
682682+683683+ * Commands defaulting to "restrict also specially applied to untracked files":
684684+ * add
685685+ * rm
686686+ * mv
687687+ * update-index
688688+ * status
689689+ * clean (?)
690690+691691+ Our original implementation for the first three of these commands was
692692+ "no restrict", but it had some severe usability issues:
693693+ * `git add <somefile>`, if honored for a file outside the sparse
694694+ specification, can result in the file randomly disappearing later
695695+ when some subsequent command is run (since various commands
696696+ automatically clean up unmodified files outside the sparse
697697+ specification).
698698+ * `git rm '*.jpg'` could very negatively surprise users if it deletes
699699+ files outside the range of the user's interest.
700700+ * `git mv` has similar surprises when moving into or out of the cone,
701701+ so it is best to restrict by default.
702702+703703+ So, we switched `add` and `rm` to default to "restrict", which made
704704+ usability problems much less severe and less frequent, but we still got
705705+ complaints because commands like:
706706+ git add <file-outside-sparse-specification>
707707+ git rm <file-outside-sparse-specification>
708708+ would silently do nothing. We should instead print an error in those
709709+ cases to get usability right.
710710+711711+ update-index needs to be updated to match, and status and maybe clean
712712+ also need to be updated to specially handle untracked paths.
713713+714714+ Behavior A and behavior B may differ here in how verbose the errors
715715+ are and in which additional warnings are printed.
716716+717717+ * Commands falling under "restrict or no restrict dependent upon behavior
718718+ A vs. behavior B"
719719+720720+ * diff (with --cached or REVISION arguments)
721721+ * grep (with --cached or REVISION arguments)
722722+ * show (when given commit arguments)
723723+ * blame (only matters when one or more -C flags passed)
724724+ * and annotate
725725+ * log
726726+ * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
727727+ * ls-files
728728+ * diff-index
729729+ * diff-tree
730730+ * ls-tree
731731+732732+ For now, we default to behavior B for these, which want a default of
733733+ "no restrict".
734734+735735+ Note that two of these commands -- diff and grep -- also appeared in a
736736+ different list with a default of "restrict", but only when limited to
737737+ searching the working tree. The working tree vs. history distinction
738738+ is fundamental in how behavior B operates, so this is expected. Note,
739739+ though, that for diff and grep with --cached, when doing "restrict"
740740+ behavior, the difference between sparse specification and sparsity
741741+ patterns is important to handle.
742742+743743+ "restrict" may make more sense as the long term default for these[12].
744744+ Also, supporting "restrict" for these commands might be a fair amount
745745+ of work to implement, meaning it might be implemented over multiple
746746+ releases. If that behavior were the default in the commands that
747747+ supported it, that would force behavior B users to need to learn to
748748+ slowly add additional flags to their commands, depending on git
749749+ version, to get the behavior they want. That gradual switchover would
750750+ be painful, so we should avoid it at least until it's fully
751751+ implemented.
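
 As a rough illustration of today's "no restrict" default for history
 queries, the sketch below builds a throwaway repository, limits the
 working tree to one directory, and then asks `ls-tree` and `log` about
 the commit. The directory names `in` and `out`, the file names, and the
 identity settings are invented for the example. Both commands are
 expected to keep reporting the path outside the sparse specification.

```shell
set -e
# Throwaway repository; every name below is made up for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
mkdir in out
echo inside > in/a
echo outside > out/b
git add in/a out/b
git commit -q -m "add in/a and out/b"

# Restrict the working tree to the "in" directory (cone mode).
git sparse-checkout init --cone
git sparse-checkout set in

# Only "in" remains in the working tree ...
ls

# ... but history queries still walk the full tree of the commit.
tree_paths=$(git ls-tree -r --name-only HEAD)
log_paths=$(git log --format= --name-only HEAD)
printf '%s\n' "$tree_paths" "$log_paths"
```

 A "restrict" default for these commands would instead limit the listed
 paths to those matching the sparsity patterns.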
752752+753753+754754+=== Sparse specification vs. sparsity patterns ===
755755+756756+In a well-behaved situation, the sparse specification is given directly
757757+by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
758758+diverge for a few reasons:
759759+760760+ * needing to resolve conflicts (merging will vivify conflicted files)
761761+ * running Git commands that implicitly vivify files (e.g. "git stash apply")
762762+ * running Git commands that explicitly vivify files (e.g. "git checkout
763763+ --ignore-skip-worktree-bits FILENAME")
764764+ * other commands that write to these files (perhaps a user copies it
765765+ from elsewhere)
766766+767767+For the last item, note that we do automatically clear the SKIP_WORKTREE
768768+bit for files that are present in the working tree. This has been true
769769+since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
770770+2022-03-09)
771771+772772+However, such a situation is transient because:
773773+774774+ * Such transient differences can and will be automatically removed as
775775+ a side-effect of commands which call unpack_trees() (checkout,
776776+ merge, reset, etc.).
777777+ * Users can also request such transient differences be corrected via
778778+ running `git sparse-checkout reapply`. Various places recommend
779779+ running that command.
780780+ * Additional commands are also welcome to implicitly fix these
781781+ differences; we may add more in the future.
782782+783783+While we avoid dropping unstaged changes or files which have conflicts,
784784+we otherwise aggressively try to fix these transient differences. If
785785+users want these differences to persist, they should run the `set` or
786786+`add` subcommands of `git sparse-checkout` to reflect their intended
787787+sparse specification.
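
The policy of keeping unstaged changes but otherwise fixing transient
differences can be observed with a sketch like the following. It assumes
a Git new enough to have `git sparse-checkout reapply`; the repository
layout (`in/a`, `out/b`) and the identity settings are invented for the
example. The `ls-files -t` output is captured before and after so the
tag for `out/b` can be compared.

```shell
set -e
# Throwaway repository; every name below is made up for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
mkdir in out
echo inside > in/a
echo outside > out/b
git add in/a out/b
git commit -q -m "add in/a and out/b"

# An unstaged edit outside the directory we are about to select.
echo "local edit" >> out/b

# Git warns about out/b and leaves it in place rather than drop the edit.
git sparse-checkout init --cone
git sparse-checkout set in
before=$(git ls-files -t -- out/b)

# Once the edit is discarded, nothing protects the file any more ...
git checkout HEAD -- out/b

# ... so reapplying the sparsity patterns removes it from the working tree.
git sparse-checkout reapply
after=$(git ls-files -t -- out/b)
printf 'before: %s\nafter:  %s\n' "$before" "$after"
```

In `ls-files -t` output, `H` marks an ordinary index entry and `S` marks
one with the SKIP_WORKTREE bit set.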
788788+789789+However, when we need to do a query on history restricted to the
790790+"relevant subset of files" such a transiently expanded sparse
791791+specification is ignored. There are a couple reasons for this:
792792+793793+ * The behavior wanted when doing something like
794794+ git grep expression REVISION
795795+ is roughly what the users would expect from
796796+ git checkout REVISION && git grep expression
797797+ (modulo a "REVISION:" prefix), which has a couple ramifications:
798798+799799+ * REVISION may have paths not in the current index, so there is no
800800+ path we can consult for a SKIP_WORKTREE setting for those paths.
801801+802802+ * Since `checkout` is one of those commands that tries to remove
803803+ transient differences in the sparse specification, it makes sense
804804+ to use the corrected sparse specification
805805+ (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
806806+ consult SKIP_WORKTREE anyway.
807807+808808+So, a transiently expanded (or restricted) sparse specification applies to
809809+the working tree, but not to history queries where we always use the
810810+sparsity patterns. (See [16] for an early discussion of this.)
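
For comparison, here is a sketch of what the history form of `git grep`
does today, where there is no option to restrict the query: it searches
the whole tree of the revision and prefixes each hit with "REVISION:".
The repository layout and the search term `needle` are invented for the
example. Under the "restrict" behavior described above, the `out/b` hit
would be filtered out using the sparsity patterns rather than
SKIP_WORKTREE bits.

```shell
set -e
# Throwaway repository; every name below is made up for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
mkdir in out
echo "needle inside" > in/a
echo "needle outside" > out/b
git add in/a out/b
git commit -q -m "add in/a and out/b"
git sparse-checkout init --cone
git sparse-checkout set in

# History query: searches the tree of HEAD, not the working tree.
rev_hits=$(git grep needle HEAD)
printf '%s\n' "$rev_hits"
```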
811811+812812+Similar to a transiently expanded sparse specification of the working tree
813813+based on additional files being present in the working tree, we also need
814814+to consider additional files being modified in the index. In particular,
815815+if the user has staged changes to files (relative to HEAD) that do not
816816+match the sparsity patterns, and the file is not present in the working
817817+tree, we still want to consider the file part of the sparse specification
818818+if we are specifically performing a query related to the index (e.g. git
819819+diff --cached [REVISION], git diff-index [REVISION], git restore --staged
820820+--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse
821821+specification for the index usually only matters under behavior A, since
822822+under behavior B index operations are lumped with history and tend to
823823+operate full-tree.
824824+825825+826826+=== Implementation Questions ===
827827+828828+ * Do the options --scope={sparse,all} sound good to others? Are there better
829829+ options?
830830+ * Names in use, or appearing in patches, or previously suggested:
831831+ * --sparse/--dense
832832+ * --ignore-skip-worktree-bits
833833+ * --ignore-skip-worktree-entries
834834+ * --ignore-sparsity
835835+ * --[no-]restrict-to-sparse-paths
836836+ * --full-tree/--sparse-tree
837837+ * --[no-]restrict
838838+ * --scope={sparse,all}
839839+ * --focus/--unfocus
840840+ * --limit/--unlimited
841841+ * Rationale making me lean slightly towards --scope={sparse,all}:
842842+ * We want a name that works for many commands, so we need a name that
843843+ does not conflict
844844+ * We know that we have more than two possible usecases, so it is best
845845+ to avoid a flag that appears to be binary.
846846+ * --scope={sparse,all} isn't overly long and seems relatively
847847+ explanatory
848848+ * `--sparse`, as used in add/rm/mv, is totally backwards for
849849+ grep/log/etc. Changing the meaning of `--sparse` for these
850850+ commands would fix the backwardness, but possibly break existing
851851+ scripts. Using a new name pairing would allow us to treat
852852+ `--sparse` in these commands as a deprecated alias.
853853+ * There is a different `--sparse`/`--dense` pair for commands using
854854+ revision machinery, so using that naming might cause confusion
855855+ * There is also a `--sparse` in both pack-objects and show-branch, which
856856+ don't conflict but do suggest that `--sparse` is overloaded
857857+ * The name --ignore-skip-worktree-bits is a double negative, is
858858+ quite a mouthful, refers to an implementation detail that many
859859+ users may not be familiar with, and we'd need a negation for it
860860+ which would probably be even more ridiculously long. (But we
861861+ can make --ignore-skip-worktree-bits a deprecated alias for
862862+ --no-restrict.)
863863+864864+ * If a config option is added (sparse.scope?) what should the values and
865865+ description be? "sparse" (behavior A), "worktree-sparse-history-dense"
866866+ (behavior B), "dense" (behavior C)? There's a risk of confusion,
867867+ because even for Behaviors A and B we want some commands to be
868868+ full-tree and others to operate sparsely, so the wording may need to be
869869+ more tied to the usecases and somehow explain that. Also, right now,
870870+ the primary difference we are focusing on is just the history-querying
871871+ commands (log/diff/grep). Previous config suggestion here: [13]
872872+873873+ * Is `--no-expand` a good alias for ls-files's `--sparse` option?
874874+ (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
875875+ because in non-cone mode it does nothing and in cone-mode it shows the
876876+ sparse directory entries which are technically outside the sparse
877877+ specification)
878878+879879+ * Under Behavior A:
880880+ * Does ls-files' `--no-expand` override the default `--scope=all`, or
881881+ does it need an extra flag?
882882+ * Does ls-files' `-t` option imply `--scope=all`?
883883+ * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
884884+885885+ * sparse-checkout: once behavior A is fully implemented, should we take
886886+ an interim measure to ease people into switching the default? Namely,
887887+ if folks are not already in a sparse checkout, then require
888888+ `sparse-checkout init/set` to take a
889889+ `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
890890+ would set sparse.scope according to the setting given), and throw an
891891+ error if the flag is not provided? That error would be a great place
892892+ to warn folks that the default may change in the future, and get them
893893+ used to specifying what they want so that the eventual default switch
894894+ is seamless for them.
895895+896896+897897+=== Implementation Goals/Plans ===
898898+899899+ * Get buy-in on this document in general.
900900+901901+ * Figure out answers to the 'Implementation Questions' section (above)
902902+903903+ * Fix bugs in the 'Known bugs' section (below)
904904+905905+ * Provide some kind of method for backfilling the blobs within the sparse
906906+ specification in a partial clone
907907+908908+ [Below here is kind of spitballing since the first two haven't been resolved]
909909+910910+ * update-index: flip the default to --no-ignore-skip-worktree-entries,
911911+ nuke this stupid "Oh, there's a bug? Let me add a flag to let users
912912+ request that they not trigger this bug." flag
913913+914914+ * Flags & Config
915915+ * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
916916+ * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
917917+ a deprecated alias for `--scope=all`
918918+ * Create config option (sparse.scope?), tie it to the "Cliff notes"
919919+ overview
920920+921921+ * Add --scope=sparse (and --scope=all) flag to each of the history querying
922922+ commands. IMPORTANT: make sure diff machinery changes don't mess with
923923+ format-patch, fast-export, etc.
924924+925925+=== Known bugs ===
926926+927927+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
928928+been working on it.
929929+930930+0. Behavior A is not well supported in Git. (Behavior B didn't use to
931931+ be either, but was the easier of the two to implement.)
932932+933933+1. am and apply:
934934+935935+ apply, without `--index` or `--cached`, relies on files being present
936936+ in the working copy, and also writes to them unconditionally. As
937937+ such, it should first check for the files' presence, and if found to
938938+ be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
939939+ its work. Currently, it just throws an error.
940940+941941+ apply, with either `--cached` or `--index`, will not preserve the
942942+ SKIP_WORKTREE bit. This is fine if the file has conflicts, but
943943+ otherwise SKIP_WORKTREE bits should be preserved for --cached and
944944+ probably also for --index.
945945+946946+ am, if there are no conflicts, will vivify files and fail to preserve
947947+ the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
948948+ specified, it will vivify files and then complain the patch doesn't
949949+ apply. If there are conflicts and `-3` is specified, it will vivify
950950+ files and then complain that those vivified files would be
951951+ overwritten by merge.
952952+953953+2. reset --hard:
954954+955955+ reset --hard provides a confusing error message (works correctly, but
956956+ misleads the user into believing it didn't):
957957+958958+ $ touch addme
959959+ $ git add addme
960960+ $ git ls-files -t
961961+ H addme
962962+ H tracked
963963+ S tracked-but-maybe-skipped
964964+ $ git reset --hard # usually works great
965965+ error: Path 'addme' not uptodate; will not remove from working tree.
966966+ HEAD is now at bdbbb6f third
967967+ $ git ls-files -t
968968+ H tracked
969969+ S tracked-but-maybe-skipped
970970+ $ ls -1
971971+ tracked
972972+973973+ `git reset --hard` DID remove addme from the index and the working tree, contrary
974974+ to the error message, but in line with how reset --hard should behave.
975975+976976+3. read-tree
977977+978978+ `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
979979+ entries it reads into the index, resulting in all your files suddenly
980980+ appearing to be "deleted".
981981+982982+4. checkout, restore:
983983+984984+ These commands do not handle path & revision arguments appropriately:
985985+986986+ $ ls
987987+ tracked
988988+ $ git ls-files -t
989989+ H tracked
990990+ S tracked-but-maybe-skipped
991991+ $ git status --porcelain
992992+ $ git checkout -- '*skipped'
993993+ error: pathspec '*skipped' did not match any file(s) known to git
994994+ $ git ls-files -- '*skipped'
995995+ tracked-but-maybe-skipped
996996+ $ git checkout HEAD -- '*skipped'
997997+ error: pathspec '*skipped' did not match any file(s) known to git
998998+ $ git ls-tree HEAD | grep skipped
999999+ 100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped
10001000+ $ git status --porcelain
10011001+ $ git checkout HEAD~1 -- '*skipped'
10021002+ $ git ls-files -t
10031003+ H tracked
10041004+ H tracked-but-maybe-skipped
10051005+ $ git status --porcelain
10061006+ M tracked-but-maybe-skipped
10071007+ $ git checkout HEAD -- '*skipped'
10081008+ $ git status --porcelain
10091009+ $
10101010+10111011+ Note that checkout without a revision (or restore --staged) fails to
10121012+ find a file to restore from the index, even though ls-files shows
10131013+ such a file certainly exists.
10141014+10151015+ Similar issues occur with HEAD (--source=HEAD in restore's case),
10161016+ but the command suddenly works when HEAD~1 is specified. After that, it
10171017+ also works with HEAD specified, even though it didn't before.
10181018+10191019+ Directories are also an issue:
10201020+10211021+ $ git sparse-checkout set nomatches
10221022+ $ git status
10231023+ On branch main
10241024+ You are in a sparse checkout with 0% of tracked files present.
10251025+10261026+ nothing to commit, working tree clean
10271027+ $ git checkout .
10281028+ error: pathspec '.' did not match any file(s) known to git
10291029+ $ git checkout HEAD~1 .
10301030+ Updated 1 path from 58916d9
10311031+ $ git ls-files -t
10321032+ S tracked
10331033+ H tracked-but-maybe-skipped
10341034+10351035+5. checkout and restore --staged, continued:
10361036+10371037+ These commands do not correctly scope operations to the sparse
10381038+ specification, and make it worse by not setting important SKIP_WORKTREE
10391039+ bits:
10401040+10411041+ $ git restore --source OLDREV --staged outside-sparse-cone/
10421042+ $ git status --porcelain
10431043+ MD outside-sparse-cone/file1
10441044+ MD outside-sparse-cone/file2
10451045+ MD outside-sparse-cone/file3
10461046+10471047+ We can add a --scope=all mode to `git restore` to let it operate outside
10481048+ the sparse specification, but then it will be important to set the
10491049+ SKIP_WORKTREE bits appropriately.
10501050+10511051+6. Performance issues; see:
10521052+ https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
10531053+10541054+10551055+=== Reference Emails ===
10561056+10571057+Emails that detail various bugs we've had in sparse-checkout:
10581058+10591059+[1] (Original descriptions of behavior A & behavior B)
10601060+ https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
10611061+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
10621062+ https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
10631063+[3] (Present-despite-skipped entries)
10641064+ https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
10651065+[4] (Clone --no-checkout interaction)
10661066+ https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/
10671067+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
10681068+ https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
10691069+[6] (SKIP_WORKTREE is advisory, not mandatory)
10701070+ https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
10711071+[7] (`worktree add` should copy sparsity settings from current worktree)
10721072+ https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
10731073+[8] (Avoid negative surprises in add, rm, and mv)
10741074+ https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
10751075+ https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
10761076+[9] (Move from out-of-cone to in-cone)
10771077+ https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
10781078+ https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
10791079+[10] (Unnecessarily downloading objects outside sparse specification)
10801080+ https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
10811081+10821082+[11] (Stolee's comments on high-level usecases)
10831083+ https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
10841084+10851085+[12] Others commenting on eventually switching default to behavior A:
10861086+ * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
10871087+ * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
10881088+ * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
10891089+10901090+[13] Previous config name suggestion and description
10911091+ * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
10921092+10931093+[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
10941094+ https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
10951095+10961096+[15] Lengthy email on grep behavior, covering what should be searched:
10971097+ * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
10981098+10991099+[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
11001100+ search for the parenthetical comment starting "We do not check".
11011101+ https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
11021102+11031103+[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/