Git fork

pack-objects: introduce '--stdin-packs=follow'

When invoked with '--stdin-packs', pack-objects will generate a pack
which contains the objects found in the "included" packs, less any
objects from "excluded" packs.

Packs that exist in the repository but weren't specified as either
included or excluded are in practice treated like the latter, at least
in the sense that pack-objects won't include objects from those packs.
This behavior forces us to include any cruft pack(s) in a repository's
multi-pack index for the reasons described in ddee3703b3
(builtin/repack.c: add cruft packs to MIDX during geometric repack,
2022-05-20).

The full details are in ddee3703b3, but the gist is if you
have a once-unreachable object in a cruft pack which later becomes
reachable via one or more commits in a pack generated with
'--stdin-packs', you *have* to include that object in the MIDX via the
copy in the cruft pack, otherwise we cannot generate reachability
bitmaps for any commits which reach that object.

Note that the traversal here is best-effort, similar to the existing
traversal which provides name-hash hints. This means that the object
traversal may hand us back a blob that does not actually exist. We
*won't* see missing trees/commits with 'ignore_missing_links' because:

- missing commit parents are discarded at the commit traversal stage by
revision.c::process_parents()

- missing tag objects are discarded by revision.c::handle_commit()

- missing tree objects are discarded by the list-objects code in
list-objects.c::process_tree()

But we have to handle potentially-missing blobs specially by making a
separate check to ensure they exist in the repository. Failing to do so
would mean that we'd add an object to the packing list which doesn't
actually exist, rendering us unable to write out the pack.

This prepares us for new repacking behavior which will "resurrect"
objects found in cruft or otherwise unspecified packs when generating
new packs. In the context of geometric repacking, this may be used to
maintain a sequence of geometrically-repacked packs, the union of which
is closed under reachability, even in the case described earlier.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

authored by

Taylor Blau and committed by
Junio C Hamano
cd846bac 63195f01

+193 -23
+9 -1
Documentation/git-pack-objects.adoc
··· 87 reference was included in the resulting packfile. This 88 can be useful to send new tags to native Git clients. 89 90 - --stdin-packs:: 91 Read the basenames of packfiles (e.g., `pack-1234abcd.pack`) 92 from the standard input, instead of object names or revision 93 arguments. The resulting pack contains all objects listed in the 94 included packs (those not beginning with `^`), excluding any 95 objects listed in the excluded packs (beginning with `^`). 96 + 97 Incompatible with `--revs`, or options that imply `--revs` (such as 98 `--all`), with the exception of `--unpacked`, which is compatible.
··· 87 reference was included in the resulting packfile. This 88 can be useful to send new tags to native Git clients. 89 90 + --stdin-packs[=<mode>]:: 91 Read the basenames of packfiles (e.g., `pack-1234abcd.pack`) 92 from the standard input, instead of object names or revision 93 arguments. The resulting pack contains all objects listed in the 94 included packs (those not beginning with `^`), excluding any 95 objects listed in the excluded packs (beginning with `^`). 96 + + 97 + When `mode` is "follow", objects from packs not listed on stdin receive 98 + special treatment. Objects within unlisted packs will be included if 99 + those objects are (1) reachable from the included packs, and (2) not 100 + found in any excluded packs. This mode is useful, for example, to 101 + resurrect once-unreachable objects found in cruft packs to generate 102 + packs which are closed under reachability up to the boundary set by the 103 + excluded packs. 104 + 105 Incompatible with `--revs`, or options that imply `--revs` (such as 106 `--all`), with the exception of `--unpacked`, which is compatible.
+64 -22
builtin/pack-objects.c
··· 284 static struct oidset excluded_by_config; 285 static int name_hash_version = -1; 286 287 /** 288 * Check whether the name_hash_version chosen by user input is appropriate, 289 * and also validate whether it is compatible with other features. ··· 3749 } 3750 3751 static void show_object_pack_hint(struct object *object, const char *name, 3752 - void *data UNUSED) 3753 { 3754 - struct object_entry *oe = packlist_find(&to_pack, &object->oid); 3755 - if (!oe) 3756 - return; 3757 3758 - /* 3759 - * Our 'to_pack' list was constructed by iterating all objects packed in 3760 - * included packs, and so doesn't have a non-zero hash field that you 3761 - * would typically pick up during a reachability traversal. 3762 - * 3763 - * Make a best-effort attempt to fill in the ->hash and ->no_try_delta 3764 - * fields here in order to perhaps improve the delta selection 3765 - * process. 3766 - */ 3767 - oe->hash = pack_name_hash_fn(name); 3768 - oe->no_try_delta = name && no_try_delta(name); 3769 3770 - stdin_packs_hints_nr++; 3771 } 3772 3773 - static void show_commit_pack_hint(struct commit *commit UNUSED, 3774 - void *data UNUSED) 3775 { 3776 /* nothing to do; commits don't have a namehash */ 3777 } 3778 3779 static int pack_mtime_cmp(const void *_a, const void *_b) ··· 3881 3882 static void add_unreachable_loose_objects(struct rev_info *revs); 3883 3884 - static void read_stdin_packs(int rev_list_unpacked) 3885 { 3886 struct rev_info revs; 3887 ··· 3913 traverse_commit_list(&revs, 3914 show_commit_pack_hint, 3915 show_object_pack_hint, 3916 - NULL); 3917 3918 trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found", 3919 stdin_packs_found_nr); ··· 4795 return is_not_in_promisor_pack_obj((struct object *) commit, data); 4796 } 4797 4798 int cmd_pack_objects(int argc, 4799 const char **argv, 4800 const char *prefix, ··· 4805 struct strvec rp = STRVEC_INIT; 4806 int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0; 4807 int rev_list_index = 0; 4808 - int stdin_packs = 0; 4809 struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; 4810 struct list_objects_filter_options filter_options = 4811 LIST_OBJECTS_FILTER_INIT; ··· 4860 OPT_SET_INT_F(0, "indexed-objects", &rev_list_index, 4861 N_("include objects referred to by the index"), 4862 1, PARSE_OPT_NONEG), 4863 OPT_BOOL(0, "stdin-packs", &stdin_packs, 4864 N_("read packs from stdin")), 4865 OPT_BOOL(0, "stdout", &pack_to_stdout, ··· 5150 progress_state = start_progress(the_repository, 5151 _("Enumerating objects"), 0); 5152 if (stdin_packs) { 5153 - read_stdin_packs(rev_list_unpacked); 5154 } else if (cruft) { 5155 read_cruft_objects(); 5156 } else if (!use_internal_rev_list) {
··· 284 static struct oidset excluded_by_config; 285 static int name_hash_version = -1; 286 287 + enum stdin_packs_mode { 288 + STDIN_PACKS_MODE_NONE, 289 + STDIN_PACKS_MODE_STANDARD, 290 + STDIN_PACKS_MODE_FOLLOW, 291 + }; 292 + 293 /** 294 * Check whether the name_hash_version chosen by user input is appropriate, 295 * and also validate whether it is compatible with other features. ··· 3755 } 3756 3757 static void show_object_pack_hint(struct object *object, const char *name, 3758 + void *data) 3759 { 3760 + enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data; 3761 + if (mode == STDIN_PACKS_MODE_FOLLOW) { 3762 + if (object->type == OBJ_BLOB && 3763 + !has_object(the_repository, &object->oid, 0)) 3764 + return; 3765 + add_object_entry(&object->oid, object->type, name, 0); 3766 + } else { 3767 + struct object_entry *oe = packlist_find(&to_pack, &object->oid); 3768 + if (!oe) 3769 + return; 3770 3771 + /* 3772 + * Our 'to_pack' list was constructed by iterating all 3773 + * objects packed in included packs, and so doesn't have 3774 + * a non-zero hash field that you would typically pick 3775 + * up during a reachability traversal. 3776 + * 3777 + * Make a best-effort attempt to fill in the ->hash and 3778 + * ->no_try_delta fields here in order to perhaps 3779 + * improve the delta selection process. 3780 + */ 3781 + oe->hash = pack_name_hash_fn(name); 3782 + oe->no_try_delta = name && no_try_delta(name); 3783 3784 + stdin_packs_hints_nr++; 3785 + } 3786 } 3787 3788 + static void show_commit_pack_hint(struct commit *commit, void *data) 3789 { 3790 + enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data; 3791 + 3792 + if (mode == STDIN_PACKS_MODE_FOLLOW) { 3793 + show_object_pack_hint((struct object *)commit, "", data); 3794 + return; 3795 + } 3796 + 3797 /* nothing to do; commits don't have a namehash */ 3798 + 3799 } 3800 3801 static int pack_mtime_cmp(const void *_a, const void *_b) ··· 3903 3904 static void add_unreachable_loose_objects(struct rev_info *revs); 3905 3906 + static void read_stdin_packs(enum stdin_packs_mode mode, int rev_list_unpacked) 3907 { 3908 struct rev_info revs; 3909 ··· 3935 traverse_commit_list(&revs, 3936 show_commit_pack_hint, 3937 show_object_pack_hint, 3938 + &mode); 3939 3940 trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found", 3941 stdin_packs_found_nr); ··· 4817 return is_not_in_promisor_pack_obj((struct object *) commit, data); 4818 } 4819 4820 + static int parse_stdin_packs_mode(const struct option *opt, const char *arg, 4821 + int unset) 4822 + { 4823 + enum stdin_packs_mode *mode = opt->value; 4824 + 4825 + if (unset) 4826 + *mode = STDIN_PACKS_MODE_NONE; 4827 + else if (!arg || !*arg) 4828 + *mode = STDIN_PACKS_MODE_STANDARD; 4829 + else if (!strcmp(arg, "follow")) 4830 + *mode = STDIN_PACKS_MODE_FOLLOW; 4831 + else 4832 + die(_("invalid value for '%s': '%s'"), opt->long_name, arg); 4833 + 4834 + return 0; 4835 + } 4836 + 4837 int cmd_pack_objects(int argc, 4838 const char **argv, 4839 const char *prefix, ··· 4844 struct strvec rp = STRVEC_INIT; 4845 int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0; 4846 int rev_list_index = 0; 4847 + enum stdin_packs_mode stdin_packs = STDIN_PACKS_MODE_NONE; 4848 struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; 4849 struct list_objects_filter_options filter_options = 4850 LIST_OBJECTS_FILTER_INIT; ··· 4899 OPT_SET_INT_F(0, "indexed-objects", &rev_list_index, 4900 N_("include objects referred to by the index"), 4901 1, PARSE_OPT_NONEG), 4902 + OPT_CALLBACK_F(0, "stdin-packs", &stdin_packs, N_("mode"), 4903 + N_("read packs from stdin"), 4904 + PARSE_OPT_OPTARG, parse_stdin_packs_mode), 4905 OPT_BOOL(0, "stdin-packs", &stdin_packs, 4906 N_("read packs from stdin")), 4907 OPT_BOOL(0, "stdout", &pack_to_stdout, ··· 5192 progress_state = start_progress(the_repository, 5193 _("Enumerating objects"), 0); 5194 if (stdin_packs) { 5195 + read_stdin_packs(stdin_packs, rev_list_unpacked); 5196 } else if (cruft) { 5197 read_cruft_objects(); 5198 } else if (!use_internal_rev_list) {
+120
t/t5331-pack-objects-stdin.sh
··· 236 test_cmp expected-objects actual-objects 237 ' 238 239 test_done
··· 236 test_cmp expected-objects actual-objects 237 ' 238 239 + objdir=.git/objects 240 + packdir=$objdir/pack 241 + 242 + objects_in_packs () { 243 + for p in "$@" 244 + do 245 + git show-index <"$packdir/pack-$p.idx" || return 1 246 + done >objects.raw && 247 + 248 + cut -d' ' -f2 objects.raw | sort && 249 + rm -f objects.raw 250 + } 251 + 252 + test_expect_success '--stdin-packs=follow walks into unknown packs' ' 253 + test_when_finished "rm -fr repo" && 254 + 255 + git init repo && 256 + ( 257 + cd repo && 258 + 259 + for c in A B C D 260 + do 261 + test_commit "$c" || return 1 262 + done && 263 + 264 + A="$(echo A | git pack-objects --revs $packdir/pack)" && 265 + B="$(echo A..B | git pack-objects --revs $packdir/pack)" && 266 + C="$(echo B..C | git pack-objects --revs $packdir/pack)" && 267 + D="$(echo C..D | git pack-objects --revs $packdir/pack)" && 268 + test_commit E && 269 + 270 + git prune-packed && 271 + 272 + cat >in <<-EOF && 273 + pack-$B.pack 274 + ^pack-$C.pack 275 + pack-$D.pack 276 + EOF 277 + 278 + # With just --stdin-packs, pack "A" is unknown to us, so 279 + # only objects from packs "B" and "D" are included in 280 + # the output pack. 281 + P=$(git pack-objects --stdin-packs $packdir/pack <in) && 282 + objects_in_packs $B $D >expect && 283 + objects_in_packs $P >actual && 284 + test_cmp expect actual && 285 + 286 + # But with --stdin-packs=follow, objects from both 287 + # included packs reach objects from the unknown pack, so 288 + # objects from pack "A" is included in the output pack 289 + # in addition to the above. 290 + P=$(git pack-objects --stdin-packs=follow $packdir/pack <in) && 291 + objects_in_packs $A $B $D >expect && 292 + objects_in_packs $P >actual && 293 + test_cmp expect actual && 294 + 295 + # And with --unpacked, we will pick up objects from unknown 296 + # packs that are reachable from loose objects. Loose object E 297 + # reaches objects in pack A, but there are three excluded packs 298 + # in between. 299 + # 300 + # The resulting pack should include objects reachable from E 301 + # that are not present in packs B, C, or D, along with those 302 + # present in pack A. 303 + cat >in <<-EOF && 304 + ^pack-$B.pack 305 + ^pack-$C.pack 306 + ^pack-$D.pack 307 + EOF 308 + 309 + P=$(git pack-objects --stdin-packs=follow --unpacked \ 310 + $packdir/pack <in) && 311 + 312 + { 313 + objects_in_packs $A && 314 + git rev-list --objects --no-object-names D..E 315 + }>expect.raw && 316 + sort expect.raw >expect && 317 + objects_in_packs $P >actual && 318 + test_cmp expect actual 319 + ) 320 + ' 321 + 322 + stdin_packs__follow_with_only () { 323 + rm -fr stdin_packs__follow_with_only && 324 + git init stdin_packs__follow_with_only && 325 + ( 326 + cd stdin_packs__follow_with_only && 327 + 328 + test_commit A && 329 + test_commit B && 330 + 331 + git rev-parse "$@" >B.objects && 332 + 333 + echo A | git pack-objects --revs $packdir/pack && 334 + B="$(git pack-objects $packdir/pack <B.objects)" && 335 + 336 + git cat-file --batch-check="%(objectname)" --batch-all-objects >objs && 337 + for obj in $(cat objs) 338 + do 339 + rm -f $objdir/$(test_oid_to_path $obj) || return 1 340 + done && 341 + 342 + ( cd $packdir && ls pack-*.pack ) >in && 343 + git pack-objects --stdin-packs=follow --stdout >/dev/null <in 344 + ) 345 + } 346 + 347 + test_expect_success '--stdin-packs=follow tolerates missing blobs' ' 348 + stdin_packs__follow_with_only HEAD HEAD^{tree} 349 + ' 350 + 351 + test_expect_success '--stdin-packs=follow tolerates missing trees' ' 352 + stdin_packs__follow_with_only HEAD HEAD:B.t 353 + ' 354 + 355 + test_expect_success '--stdin-packs=follow tolerates missing commits' ' 356 + stdin_packs__follow_with_only HEAD HEAD^{tree} 357 + ' 358 + 359 test_done