Git fork

repack: avoid combining cruft packs with `--max-cruft-size`

In 37dc6d8104 (builtin/repack.c: implement support for
`--max-cruft-size`, 2023-10-02), we exposed new functionality that
allowed repositories to specify the behavior of when we should combine
multiple cruft packs together.

This feature was designed to ensure that we never repacked cruft packs
which were larger than the given threshold in order to provide tighter
I/O bounds for repositories that have many unreachable objects. In
essence, specifying '--max-cruft-size=N' instructed 'repack' to
aggregate cruft packs together (in order of ascending size) until the
combine size grows past 'N', and then make a new cruft pack whose
contents includes the packs we rolled up.

But this isn't quite how it works in practice. Suppose for example that
we have two cruft packs which are each 100MiB in size. One might expect
specifying "--max-cruft-size=200M" would combine these two packs
together, and then avoid repacking them until a pruning GC takes place.
In reality, 'repack' would try and aggregate these together, but writing
a pack that is strictly smaller than 200 MiB (since pack-objects'
"--max-pack-size" provides a strict bound for packs containing more than
one object).

So instead we'll write out a pack that is, say, 199 MiB in size, and
then another 1 MiB pack containing the balance. If we later repack the
repository without adding any new unreachable objects, we'll repeat the
same exercise again, making the same 199 MiB and 1 MiB packs each time.

This happens because of a poor choice to bolt the '--max-cruft-size'
functionality onto pack-objects' '--max-pack-size', forcing us to
generate packs which are always smaller than the provided threshold and
thus subject to repacking.

The following commit will introduce a new flag that implements something
similar to the behavior above. Let's prepare for that by making repack's
'--max-cruft-size' flag behave as an cruft pack-specific override for
'--max-pack-size'.

Do so by temporarily repurposing the 'collapse_small_cruft_packs()'
function to instead generate a cruft pack using the same instructions as
if we didn't specify any maximum pack size. The calling code looks
something like:

if (args->max_pack_size && !cruft_expiration) {
collapse_small_cruft_packs(in, args->max_pack_size, existing);
} else {
for_each_string_list_item(item, &existing->non_kept_packs)
fprintf(in, "-%s.pack\n", item->string);
for_each_string_list_item(item, &existing->cruft_packs)
fprintf(in, "-%s.pack\n", item->string);
}

This patch makes collapse_small_cruft_packs() behave identically to the
'else' arm of the conditional above. This repurposing of
'collapse_small_cruft_packs()' is intentional, since it will set us up
nicely to introduce the new behavior in the following commit.

Naturally, there is some test fallout in the test which exercises the
old meaning of '--max-cruft-size'. Mark that test as failing for now to
be dealt with in the following commit. Likewise, add a new test which
explicitly tests the behavior of '--max-cruft-size' to place a hard
limit on the size of any generated cruft pack(s).

Note that this is a breaking change, as it alters the user-visible
behavior of '--max-cruft-size'. But I'm OK changing this behavior in
this instance, since the behavior wasn't accurate to begin with.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

authored by

Taylor Blau and committed by
Junio C Hamano
0855ed96 7fb12bb2

+67 -55
+3 -9
Documentation/git-repack.adoc
··· 77 77 Only useful with `--cruft -d`. 78 78 79 79 --max-cruft-size=<n>:: 80 - Repack cruft objects into packs as large as `<n>` bytes before 81 - creating new packs. As long as there are enough cruft packs 82 - smaller than `<n>`, repacking will cause a new cruft pack to 83 - be created containing objects from any combined cruft packs, 84 - along with any new unreachable objects. Cruft packs larger than 85 - `<n>` will not be modified. When the new cruft pack is larger 86 - than `<n>` bytes, it will be split into multiple packs, all of 87 - which are guaranteed to be at most `<n>` bytes in size. Only 88 - useful with `--cruft -d`. 80 + Overrides `--max-pack-size` for cruft packs. Inherits the value of 81 + `--max-pack-size` (if any) by default. See the documentation for 82 + `--max-pack-size` for more details. 89 83 90 84 --expire-to=<dir>:: 91 85 Write a cruft pack containing pruned objects (if any) to the
+9 -41
builtin/repack.c
··· 1022 1022 return finish_pack_objects_cmd(&cmd, names, local); 1023 1023 } 1024 1024 1025 - static int existing_cruft_pack_cmp(const void *va, const void *vb) 1026 - { 1027 - struct packed_git *a = *(struct packed_git **)va; 1028 - struct packed_git *b = *(struct packed_git **)vb; 1029 - 1030 - if (a->pack_size < b->pack_size) 1031 - return -1; 1032 - if (a->pack_size > b->pack_size) 1033 - return 1; 1034 - return 0; 1035 - } 1036 - 1037 - static void collapse_small_cruft_packs(FILE *in, size_t max_size, 1025 + static void collapse_small_cruft_packs(FILE *in, size_t max_size UNUSED, 1038 1026 struct existing_packs *existing) 1039 1027 { 1040 - struct packed_git **existing_cruft, *p; 1028 + struct packed_git *p; 1041 1029 struct strbuf buf = STRBUF_INIT; 1042 - size_t total_size = 0; 1043 - size_t existing_cruft_nr = 0; 1044 1030 size_t i; 1045 1031 1046 - ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr); 1032 + /* 1033 + * Squelch a -Wunused-function warning while we rationalize 1034 + * the behavior of --max-cruft-size. This function will become 1035 + * used again in a future commit. 1036 + */ 1037 + (void)retain_cruft_pack; 1047 1038 1048 1039 for (p = get_all_packs(the_repository); p; p = p->next) { 1049 1040 if (!(p->is_cruft && p->pack_local)) ··· 1056 1047 if (!string_list_has_string(&existing->cruft_packs, buf.buf)) 1057 1048 continue; 1058 1049 1059 - if (existing_cruft_nr >= existing->cruft_packs.nr) 1060 - BUG("too many cruft packs (found %"PRIuMAX", but knew " 1061 - "of %"PRIuMAX")", 1062 - (uintmax_t)existing_cruft_nr + 1, 1063 - (uintmax_t)existing->cruft_packs.nr); 1064 - existing_cruft[existing_cruft_nr++] = p; 1065 - } 1066 - 1067 - QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp); 1068 - 1069 - for (i = 0; i < existing_cruft_nr; i++) { 1070 - size_t proposed; 1071 - 1072 - p = existing_cruft[i]; 1073 - proposed = st_add(total_size, p->pack_size); 1074 - 1075 - if (proposed <= max_size) { 1076 - total_size = proposed; 1077 - fprintf(in, "-%s\n", pack_basename(p)); 1078 - } else { 1079 - retain_cruft_pack(existing, p); 1080 - fprintf(in, "%s\n", pack_basename(p)); 1081 - } 1050 + fprintf(in, "-%s.pack\n", buf.buf); 1082 1051 } 1083 1052 1084 1053 for (i = 0; i < existing->non_kept_packs.nr; i++) ··· 1086 1055 existing->non_kept_packs.items[i].string); 1087 1056 1088 1057 strbuf_release(&buf); 1089 - free(existing_cruft); 1090 1058 } 1091 1059 1092 1060 static int write_cruft_pack(const struct pack_objects_args *args,
+52
t/t5329-pack-objects-cruft.sh
··· 695 695 ) 696 696 ' 697 697 698 + test_expect_success 'split cruft packs with --max-cruft-size' ' 699 + repo=cruft-with--max-cruft-size && 700 + test_when_finished "rm -fr $repo" && 701 + 702 + git init "$repo" && 703 + 704 + ( 705 + cd "$repo" && 706 + 707 + git config core.compression 0 && 708 + 709 + sz=$((1024 * 1024)) && # 1MiB 710 + test-tool genrandom foo $sz >foo && 711 + test-tool genrandom bar $sz >bar && 712 + foo="$(git hash-object -w -t blob foo)" && 713 + bar="$(git hash-object -w -t blob bar)" && 714 + 715 + to=$packdir/pack && 716 + # Pack together foo and bar into a single 2MiB pack. 717 + pack="$(git pack-objects $to <<-EOF 718 + $foo 719 + $bar 720 + EOF 721 + )" && 722 + 723 + # Then generate a cruft pack containing foo and bar. 724 + # 725 + # Generate the pack with --max-pack-size equal to the 726 + # size of one object, forcing us to write two cruft 727 + # packs. 728 + git pack-objects --cruft --max-pack-size=$sz $to <<-EOF && 729 + -pack-$pack.pack 730 + EOF 731 + 732 + ls $packdir/pack-*.mtimes >crufts && 733 + test_line_count = 2 crufts && 734 + 735 + for cruft in $(cat crufts) 736 + do 737 + test-tool pack-mtimes "$(basename "$cruft")" || return 1 738 + done >actual.raw && 739 + 740 + cut -d" " -f1 <actual.raw | sort >actual && 741 + sort >expect <<-EOF && 742 + $foo 743 + $bar 744 + EOF 745 + 746 + test_cmp expect actual 747 + ) 748 + ' 749 + 698 750 test_done
+3 -5
t/t7704-repack-cruft.sh
··· 194 194 ) 195 195 ' 196 196 197 - test_expect_success '--max-cruft-size combines smaller packs first' ' 197 + test_expect_failure '--max-cruft-size combines smaller packs first' ' 198 198 git init max-cruft-size-consume-small && 199 199 ( 200 200 cd max-cruft-size-consume-small && ··· 354 354 done >actual.raw && 355 355 sort actual.raw >actual && 356 356 357 - # Among the set of all cruft packs, we should see both 358 - # mtimes for object $foo and $bar, as well as the 357 + # Among the set of all cruft packs, we should see the 358 + # new mtimes for object $foo and $bar, as well as the 359 359 # single new copy of $baz. 360 360 sort >expect <<-EOF && 361 - $foo $(cat foo.old) 362 361 $foo $(cat foo.new) 363 - $bar $(cat bar.old) 364 362 $bar $(cat bar.new) 365 363 $baz $(cat baz.old) 366 364 $quux $(cat quux.new)