pack-objects: introduce '--stdin-packs=follow'

When invoked with '--stdin-packs', pack-objects will generate a pack
which contains the objects found in the "included" packs, less any
objects from "excluded" packs.
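
As a rough illustration of that rule, the sketch below models the
selection in C. It is not Git's implementation: `pack_role`,
`object_copy`, and `wanted_standard()` are hypothetical names, objects
are reduced to integer IDs, and each array entry stands for one copy of
an object in some pack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How a pack was named on stdin (hypothetical model, not Git's types). */
enum pack_role { PACK_UNLISTED, PACK_INCLUDED, PACK_EXCLUDED };

/* One copy of an object: which object it is, and the role of its pack. */
struct object_copy {
	int object_id;
	enum pack_role role;
};

/*
 * Return true if `object_id` belongs in the output pack under plain
 * '--stdin-packs': some copy lives in an included pack and no copy
 * lives in an excluded pack. Copies in unlisted packs never count.
 */
bool wanted_standard(const struct object_copy *copies, size_t nr,
		     int object_id)
{
	bool in_included = false;

	assert(copies || !nr);
	for (size_t i = 0; i < nr; i++) {
		if (copies[i].object_id != object_id)
			continue;
		if (copies[i].role == PACK_EXCLUDED)
			return false; /* excluded always wins */
		if (copies[i].role == PACK_INCLUDED)
			in_included = true;
	}
	return in_included;
}
```

With copies {1, included}, {2, included}, {2, excluded}, and
{3, unlisted}, only object 1 is selected under this model.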

Packs that exist in the repository but weren't specified as either
included or excluded are in practice treated like the latter, at least
in the sense that pack-objects won't include objects from those packs.
This behavior forces us to include any cruft pack(s) in a repository's
multi-pack index for the reasons described in ddee3703b3
(builtin/repack.c: add cruft packs to MIDX during geometric repack,
2022-05-20).

The full details are in ddee3703b3, but the gist is that if you have a
once-unreachable object in a cruft pack which later becomes reachable
via one or more commits in a pack generated with '--stdin-packs', you
*have* to include that object in the MIDX via the copy in the cruft
pack. Otherwise we cannot generate reachability bitmaps for any commits
which reach that object.

Note that the traversal performed by '--stdin-packs=follow' is
best-effort, similar to the existing traversal which provides name-hash
hints. This means that the object traversal may hand us back a blob that
does not actually exist. We *won't* see missing trees/commits with
'ignore_missing_links' because:

  - missing commit parents are discarded at the commit traversal stage by
    revision.c::process_parents()

  - missing tag objects are discarded by revision.c::handle_commit()

  - missing tree objects are discarded by the list-objects code in
    list-objects.c::process_tree()

But we have to handle potentially-missing blobs specially by making a
separate check to ensure they exist in the repository. Failing to do so
would mean that we'd add an object to the packing list which doesn't
actually exist, rendering us unable to write out the pack.
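
The rule above can be sketched as a small predicate. This is a
simplified model, not the code in builtin/pack-objects.c: `obj_type`,
`may_add_to_packing_list()`, and the `present` array (a stand-in for an
existence lookup such as has_object()) are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Object types handed back by the traversal (hypothetical model). */
enum obj_type { TYPE_COMMIT, TYPE_TREE, TYPE_BLOB, TYPE_TAG };

/*
 * Decide whether a traversed object may be added to the packing list.
 * `present[id]` says whether object `id` exists in the repository.
 */
bool may_add_to_packing_list(enum obj_type type, size_t id,
			     const bool *present, size_t nr)
{
	assert(present && id < nr);

	/*
	 * Only blobs are checked: in this model, missing commits, tags
	 * and trees are assumed to have been dropped by the traversal
	 * itself, so they never reach this point.
	 */
	if (type == TYPE_BLOB && !present[id])
		return false;
	return true;
}
```

Skipping the lookup for non-blobs mirrors the reasoning in the list
above; the model simply trusts the traversal for those types.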

This prepares us for new repacking behavior which will "resurrect"
objects found in cruft or otherwise unspecified packs when generating
new packs. In the context of geometric repacking, this may be used to
maintain a sequence of geometrically-repacked packs, the union of which
is closed under reachability, even in the case described earlier.
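
To make "closed under reachability" concrete, here is a minimal sketch
of the documented follow rule: an object is packed if it is reachable
from an object in an included pack and has no copy in an excluded pack.
Everything in it is hypothetical (`object_graph`, `select_follow()`,
integer object indices, one listing state per object), and continuing
the walk through excluded objects is an assumption of this model rather
than a statement about Git's traversal code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OBJECTS 16

/* Where an object is stored, from the point of view of stdin. */
enum where { ONLY_UNLISTED, IN_INCLUDED, IN_EXCLUDED };

struct object_graph {
	size_t nr;                           /* number of objects in use */
	enum where where[MAX_OBJECTS];       /* listing state per object */
	bool edge[MAX_OBJECTS][MAX_OBJECTS]; /* edge[a][b]: a refers to b */
};

/* Set out[i] to true for every object selected by the follow rule. */
void select_follow(const struct object_graph *g, bool out[MAX_OBJECTS])
{
	bool seen[MAX_OBJECTS] = { false };
	size_t stack[MAX_OBJECTS]; /* each object is pushed at most once */
	size_t top = 0;

	assert(g && g->nr <= MAX_OBJECTS);

	for (size_t i = 0; i < MAX_OBJECTS; i++)
		out[i] = false;

	/* Tips of the walk: every object stored in an included pack. */
	for (size_t i = 0; i < g->nr; i++) {
		if (g->where[i] == IN_INCLUDED) {
			seen[i] = true;
			stack[top++] = i;
		}
	}

	while (top) {
		size_t cur = stack[--top];

		/* A copy in an excluded pack keeps the object out. */
		if (g->where[cur] != IN_EXCLUDED)
			out[cur] = true;

		/* Keep walking, even through excluded objects. */
		for (size_t next = 0; next < g->nr; next++) {
			if (g->edge[cur][next] && !seen[next]) {
				seen[next] = true;
				stack[top++] = next;
			}
		}
	}
}
```

For a chain D -> C -> B -> A with B and D included, C excluded, and A
unlisted, the model selects A, B, and D. An unlisted object that nothing
refers to stays out.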

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

Authored by Taylor Blau and committed by Junio C Hamano
cd846bac 63195f01

+193 -23
+9 -1
Documentation/git-pack-objects.adoc
···
 	reference was included in the resulting packfile. This
 	can be useful to send new tags to native Git clients.

---stdin-packs::
+--stdin-packs[=<mode>]::
 	Read the basenames of packfiles (e.g., `pack-1234abcd.pack`)
 	from the standard input, instead of object names or revision
 	arguments. The resulting pack contains all objects listed in the
 	included packs (those not beginning with `^`), excluding any
 	objects listed in the excluded packs (beginning with `^`).
++
+When `mode` is "follow", objects from packs not listed on stdin receive
+special treatment. Objects within unlisted packs will be included if
+those objects are (1) reachable from the included packs, and (2) not
+found in any excluded packs. This mode is useful, for example, to
+resurrect once-unreachable objects found in cruft packs to generate
+packs which are closed under reachability up to the boundary set by the
+excluded packs.
 +
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
+64 -22
builtin/pack-objects.c
···
 static struct oidset excluded_by_config;
 static int name_hash_version = -1;

+enum stdin_packs_mode {
+	STDIN_PACKS_MODE_NONE,
+	STDIN_PACKS_MODE_STANDARD,
+	STDIN_PACKS_MODE_FOLLOW,
+};
+
 /**
  * Check whether the name_hash_version chosen by user input is appropriate,
  * and also validate whether it is compatible with other features.
···
 }

 static void show_object_pack_hint(struct object *object, const char *name,
-				  void *data UNUSED)
+				  void *data)
 {
-	struct object_entry *oe = packlist_find(&to_pack, &object->oid);
-	if (!oe)
-		return;
+	enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data;
+	if (mode == STDIN_PACKS_MODE_FOLLOW) {
+		if (object->type == OBJ_BLOB &&
+		    !has_object(the_repository, &object->oid, 0))
+			return;
+		add_object_entry(&object->oid, object->type, name, 0);
+	} else {
+		struct object_entry *oe = packlist_find(&to_pack, &object->oid);
+		if (!oe)
+			return;

-	/*
-	 * Our 'to_pack' list was constructed by iterating all objects packed in
-	 * included packs, and so doesn't have a non-zero hash field that you
-	 * would typically pick up during a reachability traversal.
-	 *
-	 * Make a best-effort attempt to fill in the ->hash and ->no_try_delta
-	 * fields here in order to perhaps improve the delta selection
-	 * process.
-	 */
-	oe->hash = pack_name_hash_fn(name);
-	oe->no_try_delta = name && no_try_delta(name);
+		/*
+		 * Our 'to_pack' list was constructed by iterating all
+		 * objects packed in included packs, and so doesn't have
+		 * a non-zero hash field that you would typically pick
+		 * up during a reachability traversal.
+		 *
+		 * Make a best-effort attempt to fill in the ->hash and
+		 * ->no_try_delta fields here in order to perhaps
+		 * improve the delta selection process.
+		 */
+		oe->hash = pack_name_hash_fn(name);
+		oe->no_try_delta = name && no_try_delta(name);

-	stdin_packs_hints_nr++;
+		stdin_packs_hints_nr++;
+	}
 }

-static void show_commit_pack_hint(struct commit *commit UNUSED,
-				  void *data UNUSED)
+static void show_commit_pack_hint(struct commit *commit, void *data)
 {
+	enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data;
+
+	if (mode == STDIN_PACKS_MODE_FOLLOW) {
+		show_object_pack_hint((struct object *)commit, "", data);
+		return;
+	}
+
 	/* nothing to do; commits don't have a namehash */
+
 }

 static int pack_mtime_cmp(const void *_a, const void *_b)
···

 static void add_unreachable_loose_objects(struct rev_info *revs);

-static void read_stdin_packs(int rev_list_unpacked)
+static void read_stdin_packs(enum stdin_packs_mode mode, int rev_list_unpacked)
 {
 	struct rev_info revs;

···
 	traverse_commit_list(&revs,
 			     show_commit_pack_hint,
 			     show_object_pack_hint,
-			     NULL);
+			     &mode);

 	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found",
 			   stdin_packs_found_nr);
···
 	return is_not_in_promisor_pack_obj((struct object *) commit, data);
 }

+static int parse_stdin_packs_mode(const struct option *opt, const char *arg,
+				  int unset)
+{
+	enum stdin_packs_mode *mode = opt->value;
+
+	if (unset)
+		*mode = STDIN_PACKS_MODE_NONE;
+	else if (!arg || !*arg)
+		*mode = STDIN_PACKS_MODE_STANDARD;
+	else if (!strcmp(arg, "follow"))
+		*mode = STDIN_PACKS_MODE_FOLLOW;
+	else
+		die(_("invalid value for '%s': '%s'"), opt->long_name, arg);
+
+	return 0;
+}
+
 int cmd_pack_objects(int argc,
 		     const char **argv,
 		     const char *prefix,
···
 	struct strvec rp = STRVEC_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	int rev_list_index = 0;
-	int stdin_packs = 0;
+	enum stdin_packs_mode stdin_packs = STDIN_PACKS_MODE_NONE;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct list_objects_filter_options filter_options =
 		LIST_OBJECTS_FILTER_INIT;
···
 		OPT_SET_INT_F(0, "indexed-objects", &rev_list_index,
 			      N_("include objects referred to by the index"),
 			      1, PARSE_OPT_NONEG),
+		OPT_CALLBACK_F(0, "stdin-packs", &stdin_packs, N_("mode"),
+			       N_("read packs from stdin"),
+			       PARSE_OPT_OPTARG, parse_stdin_packs_mode),
 		OPT_BOOL(0, "stdin-packs", &stdin_packs,
 			 N_("read packs from stdin")),
 		OPT_BOOL(0, "stdout", &pack_to_stdout,
···
 	progress_state = start_progress(the_repository,
 					_("Enumerating objects"), 0);
 	if (stdin_packs) {
-		read_stdin_packs(rev_list_unpacked);
+		read_stdin_packs(stdin_packs, rev_list_unpacked);
 	} else if (cruft) {
 		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
+120
t/t5331-pack-objects-stdin.sh
···
 	test_cmp expected-objects actual-objects
 '

+objdir=.git/objects
+packdir=$objdir/pack
+
+objects_in_packs () {
+	for p in "$@"
+	do
+		git show-index <"$packdir/pack-$p.idx" || return 1
+	done >objects.raw &&
+
+	cut -d' ' -f2 objects.raw | sort &&
+	rm -f objects.raw
+}
+
+test_expect_success '--stdin-packs=follow walks into unknown packs' '
+	test_when_finished "rm -fr repo" &&
+
+	git init repo &&
+	(
+		cd repo &&
+
+		for c in A B C D
+		do
+			test_commit "$c" || return 1
+		done &&
+
+		A="$(echo A | git pack-objects --revs $packdir/pack)" &&
+		B="$(echo A..B | git pack-objects --revs $packdir/pack)" &&
+		C="$(echo B..C | git pack-objects --revs $packdir/pack)" &&
+		D="$(echo C..D | git pack-objects --revs $packdir/pack)" &&
+		test_commit E &&
+
+		git prune-packed &&
+
+		cat >in <<-EOF &&
+		pack-$B.pack
+		^pack-$C.pack
+		pack-$D.pack
+		EOF
+
+		# With just --stdin-packs, pack "A" is unknown to us, so
+		# only objects from packs "B" and "D" are included in
+		# the output pack.
+		P=$(git pack-objects --stdin-packs $packdir/pack <in) &&
+		objects_in_packs $B $D >expect &&
+		objects_in_packs $P >actual &&
+		test_cmp expect actual &&
+
+		# But with --stdin-packs=follow, objects from both
+		# included packs reach objects from the unknown pack, so
+		# objects from pack "A" is included in the output pack
+		# in addition to the above.
+		P=$(git pack-objects --stdin-packs=follow $packdir/pack <in) &&
+		objects_in_packs $A $B $D >expect &&
+		objects_in_packs $P >actual &&
+		test_cmp expect actual &&
+
+		# And with --unpacked, we will pick up objects from unknown
+		# packs that are reachable from loose objects. Loose object E
+		# reaches objects in pack A, but there are three excluded packs
+		# in between.
+		#
+		# The resulting pack should include objects reachable from E
+		# that are not present in packs B, C, or D, along with those
+		# present in pack A.
+		cat >in <<-EOF &&
+		^pack-$B.pack
+		^pack-$C.pack
+		^pack-$D.pack
+		EOF
+
+		P=$(git pack-objects --stdin-packs=follow --unpacked \
+			$packdir/pack <in) &&
+
+		{
+			objects_in_packs $A &&
+			git rev-list --objects --no-object-names D..E
+		}>expect.raw &&
+		sort expect.raw >expect &&
+		objects_in_packs $P >actual &&
+		test_cmp expect actual
+	)
+'
+
+stdin_packs__follow_with_only () {
+	rm -fr stdin_packs__follow_with_only &&
+	git init stdin_packs__follow_with_only &&
+	(
+		cd stdin_packs__follow_with_only &&
+
+		test_commit A &&
+		test_commit B &&
+
+		git rev-parse "$@" >B.objects &&
+
+		echo A | git pack-objects --revs $packdir/pack &&
+		B="$(git pack-objects $packdir/pack <B.objects)" &&
+
+		git cat-file --batch-check="%(objectname)" --batch-all-objects >objs &&
+		for obj in $(cat objs)
+		do
+			rm -f $objdir/$(test_oid_to_path $obj) || return 1
+		done &&
+
+		( cd $packdir && ls pack-*.pack ) >in &&
+		git pack-objects --stdin-packs=follow --stdout >/dev/null <in
+	)
+}
+
+test_expect_success '--stdin-packs=follow tolerates missing blobs' '
+	stdin_packs__follow_with_only HEAD HEAD^{tree}
+'
+
+test_expect_success '--stdin-packs=follow tolerates missing trees' '
+	stdin_packs__follow_with_only HEAD HEAD:B.t
+'
+
+test_expect_success '--stdin-packs=follow tolerates missing commits' '
+	stdin_packs__follow_with_only HEAD HEAD^{tree}
+'
+
 test_done