Git fork

repack: avoid combining cruft packs with `--max-cruft-size`

In 37dc6d8104 (builtin/repack.c: implement support for
`--max-cruft-size`, 2023-10-02), we exposed new functionality that
allowed repositories to specify the behavior of when we should combine
multiple cruft packs together.

This feature was designed to ensure that we never repacked cruft packs
which were larger than the given threshold in order to provide tighter
I/O bounds for repositories that have many unreachable objects. In
essence, specifying '--max-cruft-size=N' instructed 'repack' to
aggregate cruft packs together (in order of ascending size) until the
combine size grows past 'N', and then make a new cruft pack whose
contents includes the packs we rolled up.

But this isn't quite how it works in practice. Suppose for example that
we have two cruft packs which are each 100MiB in size. One might expect
specifying "--max-cruft-size=200M" would combine these two packs
together, and then avoid repacking them until a pruning GC takes place.
In reality, 'repack' would try and aggregate these together, but writing
a pack that is strictly smaller than 200 MiB (since pack-objects'
"--max-pack-size" provides a strict bound for packs containing more than
one object).

So instead we'll write out a pack that is, say, 199 MiB in size, and
then another 1 MiB pack containing the balance. If we later repack the
repository without adding any new unreachable objects, we'll repeat the
same exercise again, making the same 199 MiB and 1 MiB packs each time.

This happens because of a poor choice to bolt the '--max-cruft-size'
functionality onto pack-objects' '--max-pack-size', forcing us to
generate packs which are always smaller than the provided threshold and
thus subject to repacking.

The following commit will introduce a new flag that implements something
similar to the behavior above. Let's prepare for that by making repack's
'--max-cruft-size' flag behave as an cruft pack-specific override for
'--max-pack-size'.

Do so by temporarily repurposing the 'collapse_small_cruft_packs()'
function to instead generate a cruft pack using the same instructions as
if we didn't specify any maximum pack size. The calling code looks
something like:

if (args->max_pack_size && !cruft_expiration) {
collapse_small_cruft_packs(in, args->max_pack_size, existing);
} else {
for_each_string_list_item(item, &existing->non_kept_packs)
fprintf(in, "-%s.pack\n", item->string);
for_each_string_list_item(item, &existing->cruft_packs)
fprintf(in, "-%s.pack\n", item->string);
}

This patch makes collapse_small_cruft_packs() behave identically to the
'else' arm of the conditional above. This repurposing of
'collapse_small_cruft_packs()' is intentional, since it will set us up
nicely to introduce the new behavior in the following commit.

Naturally, there is some test fallout in the test which exercises the
old meaning of '--max-cruft-size'. Mark that test as failing for now to
be dealt with in the following commit. Likewise, add a new test which
explicitly tests the behavior of '--max-cruft-size' to place a hard
limit on the size of any generated cruft pack(s).

Note that this is a breaking change, as it alters the user-visible
behavior of '--max-cruft-size'. But I'm OK changing this behavior in
this instance, since the behavior wasn't accurate to begin with.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

authored by

Taylor Blau and committed by
Junio C Hamano
0855ed96 7fb12bb2

+67 -55
+3 -9
Documentation/git-repack.adoc
··· 77 Only useful with `--cruft -d`. 78 79 --max-cruft-size=<n>:: 80 - Repack cruft objects into packs as large as `<n>` bytes before 81 - creating new packs. As long as there are enough cruft packs 82 - smaller than `<n>`, repacking will cause a new cruft pack to 83 - be created containing objects from any combined cruft packs, 84 - along with any new unreachable objects. Cruft packs larger than 85 - `<n>` will not be modified. When the new cruft pack is larger 86 - than `<n>` bytes, it will be split into multiple packs, all of 87 - which are guaranteed to be at most `<n>` bytes in size. Only 88 - useful with `--cruft -d`. 89 90 --expire-to=<dir>:: 91 Write a cruft pack containing pruned objects (if any) to the
··· 77 Only useful with `--cruft -d`. 78 79 --max-cruft-size=<n>:: 80 + Overrides `--max-pack-size` for cruft packs. Inherits the value of 81 + `--max-pack-size` (if any) by default. See the documentation for 82 + `--max-pack-size` for more details. 83 84 --expire-to=<dir>:: 85 Write a cruft pack containing pruned objects (if any) to the
+9 -41
builtin/repack.c
··· 1022 return finish_pack_objects_cmd(&cmd, names, local); 1023 } 1024 1025 - static int existing_cruft_pack_cmp(const void *va, const void *vb) 1026 - { 1027 - struct packed_git *a = *(struct packed_git **)va; 1028 - struct packed_git *b = *(struct packed_git **)vb; 1029 - 1030 - if (a->pack_size < b->pack_size) 1031 - return -1; 1032 - if (a->pack_size > b->pack_size) 1033 - return 1; 1034 - return 0; 1035 - } 1036 - 1037 - static void collapse_small_cruft_packs(FILE *in, size_t max_size, 1038 struct existing_packs *existing) 1039 { 1040 - struct packed_git **existing_cruft, *p; 1041 struct strbuf buf = STRBUF_INIT; 1042 - size_t total_size = 0; 1043 - size_t existing_cruft_nr = 0; 1044 size_t i; 1045 1046 - ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr); 1047 1048 for (p = get_all_packs(the_repository); p; p = p->next) { 1049 if (!(p->is_cruft && p->pack_local)) ··· 1056 if (!string_list_has_string(&existing->cruft_packs, buf.buf)) 1057 continue; 1058 1059 - if (existing_cruft_nr >= existing->cruft_packs.nr) 1060 - BUG("too many cruft packs (found %"PRIuMAX", but knew " 1061 - "of %"PRIuMAX")", 1062 - (uintmax_t)existing_cruft_nr + 1, 1063 - (uintmax_t)existing->cruft_packs.nr); 1064 - existing_cruft[existing_cruft_nr++] = p; 1065 - } 1066 - 1067 - QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp); 1068 - 1069 - for (i = 0; i < existing_cruft_nr; i++) { 1070 - size_t proposed; 1071 - 1072 - p = existing_cruft[i]; 1073 - proposed = st_add(total_size, p->pack_size); 1074 - 1075 - if (proposed <= max_size) { 1076 - total_size = proposed; 1077 - fprintf(in, "-%s\n", pack_basename(p)); 1078 - } else { 1079 - retain_cruft_pack(existing, p); 1080 - fprintf(in, "%s\n", pack_basename(p)); 1081 - } 1082 } 1083 1084 for (i = 0; i < existing->non_kept_packs.nr; i++) ··· 1086 existing->non_kept_packs.items[i].string); 1087 1088 strbuf_release(&buf); 1089 - free(existing_cruft); 1090 } 1091 1092 static int write_cruft_pack(const struct pack_objects_args *args,
··· 1022 return finish_pack_objects_cmd(&cmd, names, local); 1023 } 1024 1025 + static void collapse_small_cruft_packs(FILE *in, size_t max_size UNUSED, 1026 struct existing_packs *existing) 1027 { 1028 + struct packed_git *p; 1029 struct strbuf buf = STRBUF_INIT; 1030 size_t i; 1031 1032 + /* 1033 + * Squelch a -Wunused-function warning while we rationalize 1034 + * the behavior of --max-cruft-size. This function will become 1035 + * used again in a future commit. 1036 + */ 1037 + (void)retain_cruft_pack; 1038 1039 for (p = get_all_packs(the_repository); p; p = p->next) { 1040 if (!(p->is_cruft && p->pack_local)) ··· 1047 if (!string_list_has_string(&existing->cruft_packs, buf.buf)) 1048 continue; 1049 1050 + fprintf(in, "-%s.pack\n", buf.buf); 1051 } 1052 1053 for (i = 0; i < existing->non_kept_packs.nr; i++) ··· 1055 existing->non_kept_packs.items[i].string); 1056 1057 strbuf_release(&buf); 1058 } 1059 1060 static int write_cruft_pack(const struct pack_objects_args *args,
+52
t/t5329-pack-objects-cruft.sh
··· 695 ) 696 ' 697 698 test_done
··· 695 ) 696 ' 697 698 + test_expect_success 'split cruft packs with --max-cruft-size' ' 699 + repo=cruft-with--max-cruft-size && 700 + test_when_finished "rm -fr $repo" && 701 + 702 + git init "$repo" && 703 + 704 + ( 705 + cd "$repo" && 706 + 707 + git config core.compression 0 && 708 + 709 + sz=$((1024 * 1024)) && # 1MiB 710 + test-tool genrandom foo $sz >foo && 711 + test-tool genrandom bar $sz >bar && 712 + foo="$(git hash-object -w -t blob foo)" && 713 + bar="$(git hash-object -w -t blob bar)" && 714 + 715 + to=$packdir/pack && 716 + # Pack together foo and bar into a single 2MiB pack. 717 + pack="$(git pack-objects $to <<-EOF 718 + $foo 719 + $bar 720 + EOF 721 + )" && 722 + 723 + # Then generate a cruft pack containing foo and bar. 724 + # 725 + # Generate the pack with --max-pack-size equal to the 726 + # size of one object, forcing us to write two cruft 727 + # packs. 728 + git pack-objects --cruft --max-pack-size=$sz $to <<-EOF && 729 + -pack-$pack.pack 730 + EOF 731 + 732 + ls $packdir/pack-*.mtimes >crufts && 733 + test_line_count = 2 crufts && 734 + 735 + for cruft in $(cat crufts) 736 + do 737 + test-tool pack-mtimes "$(basename "$cruft")" || return 1 738 + done >actual.raw && 739 + 740 + cut -d" " -f1 <actual.raw | sort >actual && 741 + sort >expect <<-EOF && 742 + $foo 743 + $bar 744 + EOF 745 + 746 + test_cmp expect actual 747 + ) 748 + ' 749 + 750 test_done
+3 -5
t/t7704-repack-cruft.sh
··· 194 ) 195 ' 196 197 - test_expect_success '--max-cruft-size combines smaller packs first' ' 198 git init max-cruft-size-consume-small && 199 ( 200 cd max-cruft-size-consume-small && ··· 354 done >actual.raw && 355 sort actual.raw >actual && 356 357 - # Among the set of all cruft packs, we should see both 358 - # mtimes for object $foo and $bar, as well as the 359 # single new copy of $baz. 360 sort >expect <<-EOF && 361 - $foo $(cat foo.old) 362 $foo $(cat foo.new) 363 - $bar $(cat bar.old) 364 $bar $(cat bar.new) 365 $baz $(cat baz.old) 366 $quux $(cat quux.new)
··· 194 ) 195 ' 196 197 + test_expect_failure '--max-cruft-size combines smaller packs first' ' 198 git init max-cruft-size-consume-small && 199 ( 200 cd max-cruft-size-consume-small && ··· 354 done >actual.raw && 355 sort actual.raw >actual && 356 357 + # Among the set of all cruft packs, we should see the 358 + # new mtimes for object $foo and $bar, as well as the 359 # single new copy of $baz. 360 sort >expect <<-EOF && 361 $foo $(cat foo.new) 362 $bar $(cat bar.new) 363 $baz $(cat baz.old) 364 $quux $(cat quux.new)