Git fork

repack: exclude cruft pack(s) from the MIDX where possible

In ddee3703b3 (builtin/repack.c: add cruft packs to MIDX during
geometric repack, 2022-05-20), repack began adding cruft pack(s) to the
MIDX with '--write-midx' to ensure that the resulting MIDX was always
closed under reachability in order to generate reachability bitmaps.

While the previous patch added the '--stdin-packs=follow' option to
pack-objects, it is not yet on by default. Given that, suppose you have
a once-unreachable object packed in a cruft pack, which later becomes
reachable from one or more objects in a geometrically repacked pack.
That once-unreachable object *won't* appear in the new pack, since the
cruft pack was not specified as included or excluded when the
geometrically repacked pack was created with 'pack-objects
--stdin-packs' (*not* '--stdin-packs=follow', which is not on). If that
new pack is included in a MIDX without the cruft pack, then trying to
generate bitmaps for that MIDX may fail. This happens when the bitmap
selection process picks one or more commits which reach the
once-unreachable objects.

To mitigate this failure mode, commit ddee3703b3 ensures that the MIDX
will be closed under reachability by including cruft pack(s). If cruft
pack(s) were not included, we would fail to generate a MIDX bitmap. But
ddee3703b3 alludes to the fact that this is sub-optimal by saying

[...] it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are
likely to get thrown away.

, which is true, but hides an even larger problem. If repositories
rarely prune their unreachable objects and/or have many of them, the
MIDX must keep track of a large number of objects which bloats the MIDX
and slows down object lookup.

This is doubly unfortunate because the vast majority of objects in cruft
pack(s) are unlikely to be read. But any object lookups that go through
the MIDX must binary search over them anyway, slowing down object
lookups using the MIDX.

This patch causes geometrically-repacked packs to contain a copy of any
once-unreachable object(s) with 'git pack-objects --stdin-packs=follow',
allowing us to avoid including any cruft packs in the MIDX. This is
because a sequence of geometrically-repacked packs that were all
generated with '--stdin-packs=follow' are guaranteed to have their union
be closed under reachability.

Note that you cannot guarantee that a collection of packs is closed
under reachability if not all of them were generated with "following" as
above. One tell-tale sign that not all geometrically-repacked packs in
the MIDX were generated with "following" is to see if there is a pack in
the existing MIDX that is not going to be somehow represented (either
verbatim or as part of a geometric rollup) in the new MIDX.

If there is, then starting to generate packs with "following" during
geometric repacking won't work, since it's open to the same race as
described above.

But if you're starting from scratch (e.g., building the first MIDX after
an all-into-one '--cruft' repack), then you can guarantee that the union
of subsequently generated packs from geometric repacking *is* closed
under reachability.

(One exception here is when "starting from scratch" results in a noop
repack, e.g., because the non-cruft pack(s) in a repository already form
a geometric progression. Since we can't tell whether or not those were
generated with '--stdin-packs=follow', they may depend on
once-unreachable objects, so we have to include the cruft pack in the
MIDX in this case.)

Detect when this is the case and avoid including cruft packs in the MIDX
where possible. The existing behavior remains the default, and the new
behavior is available with the config 'repack.midxMustIncludeCruft' set
to 'false'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

authored by

Taylor Blau and committed by
Junio C Hamano
5ee86c27 cd846bac

+319 -20
+7
Documentation/config/repack.adoc
··· 39 39 a cruft pack and the respective parameters are not given over 40 40 the command line. See similarly named `pack.*` configuration 41 41 variables for defaults and meaning. 42 + 43 + repack.midxMustContainCruft:: 44 + When set to true, linkgit:git-repack[1] will unconditionally include 45 + cruft pack(s), if any, in the multi-pack index when invoked with 46 + `--write-midx`. When false, cruft packs are only included in the MIDX 47 + when necessary (e.g., because they might be required to form a 48 + reachability closure with MIDX bitmaps). Defaults to true.
+167 -20
builtin/repack.c
··· 39 39 static int use_delta_islands; 40 40 static int run_update_server_info = 1; 41 41 static char *packdir, *packtmp_name, *packtmp; 42 + static int midx_must_contain_cruft = 1; 42 43 43 44 static const char *const git_repack_usage[] = { 44 45 N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n" ··· 107 108 if (!strcmp(var, "repack.cruftthreads")) { 108 109 free(cruft_po_args->threads); 109 110 return git_config_string(&cruft_po_args->threads, var, value); 111 + } 112 + if (!strcmp(var, "repack.midxmustcontaincruft")) { 113 + midx_must_contain_cruft = git_config_bool(var, value); 114 + return 0; 110 115 } 111 116 return git_default_config(var, value, ctx, cb); 112 117 } ··· 690 695 free(geometry->pack); 691 696 } 692 697 698 + static int midx_has_unknown_packs(char **midx_pack_names, 699 + size_t midx_pack_names_nr, 700 + struct string_list *include, 701 + struct pack_geometry *geometry, 702 + struct existing_packs *existing) 703 + { 704 + size_t i; 705 + 706 + string_list_sort(include); 707 + 708 + for (i = 0; i < midx_pack_names_nr; i++) { 709 + const char *pack_name = midx_pack_names[i]; 710 + 711 + /* 712 + * Determine whether or not each MIDX'd pack from the existing 713 + * MIDX (if any) is represented in the new MIDX. For each pack 714 + * in the MIDX, it must either be: 715 + * 716 + * - In the "include" list of packs to be included in the new 717 + * MIDX. Note this function is called before the include 718 + * list is populated with any cruft pack(s). 719 + * 720 + * - Below the geometric split line (if using pack geometry), 721 + * indicating that the pack won't be included in the new 722 + * MIDX, but its contents were rolled up as part of the 723 + * geometric repack. 724 + * 725 + * - In the existing non-kept packs list (if not using pack 726 + * geometry), and marked as non-deleted. 727 + */ 728 + if (string_list_has_string(include, pack_name)) { 729 + continue; 730 + } else if (geometry) { 731 + struct strbuf buf = STRBUF_INIT; 732 + uint32_t j; 733 + 734 + for (j = 0; j < geometry->split; j++) { 735 + strbuf_reset(&buf); 736 + strbuf_addstr(&buf, pack_basename(geometry->pack[j])); 737 + strbuf_strip_suffix(&buf, ".pack"); 738 + strbuf_addstr(&buf, ".idx"); 739 + 740 + if (!strcmp(pack_name, buf.buf)) { 741 + strbuf_release(&buf); 742 + break; 743 + } 744 + } 745 + 746 + strbuf_release(&buf); 747 + 748 + if (j < geometry->split) 749 + continue; 750 + } else { 751 + struct string_list_item *item; 752 + 753 + item = string_list_lookup(&existing->non_kept_packs, 754 + pack_name); 755 + if (item && !pack_is_marked_for_deletion(item)) 756 + continue; 757 + } 758 + 759 + /* 760 + * If we got to this point, the MIDX includes some pack that we 761 + * don't know about. 762 + */ 763 + return 1; 764 + } 765 + 766 + return 0; 767 + } 768 + 693 769 struct midx_snapshot_ref_data { 694 770 struct tempfile *f; 695 771 struct oidset seen; ··· 758 834 759 835 static void midx_included_packs(struct string_list *include, 760 836 struct existing_packs *existing, 837 + char **midx_pack_names, 838 + size_t midx_pack_names_nr, 761 839 struct string_list *names, 762 840 struct pack_geometry *geometry) 763 841 { ··· 811 889 } 812 890 } 813 891 814 - for_each_string_list_item(item, &existing->cruft_packs) { 892 + if (midx_must_contain_cruft || 893 + midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr, 894 + include, geometry, existing)) { 815 895 /* 816 - * When doing a --geometric repack, there is no need to check 817 - * for deleted packs, since we're by definition not doing an 818 - * ALL_INTO_ONE repack (hence no packs will be deleted). 819 - * Otherwise we must check for and exclude any packs which are 820 - * enqueued for deletion. 896 + * If there are one or more unknown pack(s) present (see 897 + * midx_has_unknown_packs() for what makes a pack 898 + * "unknown") in the MIDX before the repack, keep them 899 + * as they may be required to form a reachability 900 + * closure if the MIDX is bitmapped. 821 901 * 822 - * So we could omit the conditional below in the --geometric 823 - * case, but doing so is unnecessary since no packs are marked 824 - * as pending deletion (since we only call 825 - * `mark_packs_for_deletion()` when doing an all-into-one 826 - * repack). 902 + * For example, a cruft pack can be required to form a 903 + * reachability closure if the MIDX is bitmapped and one 904 + * or more of the bitmap's selected commits reaches a 905 + * once-cruft object that was later made reachable. 827 906 */ 828 - if (pack_is_marked_for_deletion(item)) 829 - continue; 907 + for_each_string_list_item(item, &existing->cruft_packs) { 908 + /* 909 + * When doing a --geometric repack, there is no 910 + * need to check for deleted packs, since we're 911 + * by definition not doing an ALL_INTO_ONE 912 + * repack (hence no packs will be deleted). 913 + * Otherwise we must check for and exclude any 914 + * packs which are enqueued for deletion. 915 + * 916 + * So we could omit the conditional below in the 917 + * --geometric case, but doing so is unnecessary 918 + * since no packs are marked as pending 919 + * deletion (since we only call 920 + * `mark_packs_for_deletion()` when doing an 921 + * all-into-one repack). 922 + */ 923 + if (pack_is_marked_for_deletion(item)) 924 + continue; 830 925 831 - strbuf_reset(&buf); 832 - strbuf_addf(&buf, "%s.idx", item->string); 833 - string_list_insert(include, buf.buf); 926 + strbuf_reset(&buf); 927 + strbuf_addf(&buf, "%s.idx", item->string); 928 + string_list_insert(include, buf.buf); 929 + } 930 + } else { 931 + /* 932 + * Modern versions of Git (with the appropriate 933 + * configuration setting) will write new copies of 934 + * once-cruft objects when doing a --geometric repack. 935 + * 936 + * If the MIDX has no cruft pack, new packs written 937 + * during a --geometric repack will not rely on the 938 + * cruft pack to form a reachability closure, so we can 939 + * avoid including them in the MIDX in that case. 940 + */ 941 + ; 834 942 } 835 943 836 944 strbuf_release(&buf); ··· 1145 1253 struct tempfile *refs_snapshot = NULL; 1146 1254 int i, ext, ret; 1147 1255 int show_progress; 1256 + char **midx_pack_names = NULL; 1257 + size_t midx_pack_names_nr = 0; 1148 1258 1149 1259 /* variables to be filled by option parsing */ 1150 1260 int delete_redundant = 0; ··· 1361 1471 !(pack_everything & PACK_CRUFT)) 1362 1472 strvec_push(&cmd.args, "--pack-loose-unreachable"); 1363 1473 } else if (geometry.split_factor) { 1364 - strvec_push(&cmd.args, "--stdin-packs"); 1474 + if (midx_must_contain_cruft) 1475 + strvec_push(&cmd.args, "--stdin-packs"); 1476 + else 1477 + strvec_push(&cmd.args, "--stdin-packs=follow"); 1365 1478 strvec_push(&cmd.args, "--unpacked"); 1366 1479 } else { 1367 1480 strvec_push(&cmd.args, "--unpacked"); ··· 1401 1514 if (ret) 1402 1515 goto cleanup; 1403 1516 1404 - if (!names.nr && !po_args.quiet) 1405 - printf_ln(_("Nothing new to pack.")); 1517 + if (!names.nr) { 1518 + if (!po_args.quiet) 1519 + printf_ln(_("Nothing new to pack.")); 1520 + /* 1521 + * If we didn't write any new packs, the non-cruft packs 1522 + * may refer to once-unreachable objects in the cruft 1523 + * pack(s). 1524 + * 1525 + * If there isn't already a MIDX, the one we write 1526 + * must include the cruft pack(s), in case the 1527 + * non-cruft pack(s) refer to once-cruft objects. 1528 + * 1529 + * If there is already a MIDX, we can punt here, since 1530 + * midx_has_unknown_packs() will make the decision for 1531 + * us. 1532 + */ 1533 + if (!get_local_multi_pack_index(the_repository)) 1534 + midx_must_contain_cruft = 1; 1535 + } 1406 1536 1407 1537 if (pack_everything & PACK_CRUFT) { 1408 1538 const char *pack_prefix = find_pack_prefix(packdir, packtmp); ··· 1483 1613 1484 1614 string_list_sort(&names); 1485 1615 1616 + if (get_local_multi_pack_index(the_repository)) { 1617 + struct multi_pack_index *m = 1618 + get_local_multi_pack_index(the_repository); 1619 + 1620 + ALLOC_ARRAY(midx_pack_names, 1621 + m->num_packs + m->num_packs_in_base); 1622 + 1623 + for (; m; m = m->base_midx) 1624 + for (uint32_t i = 0; i < m->num_packs; i++) 1625 + midx_pack_names[midx_pack_names_nr++] = 1626 + xstrdup(m->pack_names[i]); 1627 + } 1628 + 1486 1629 close_object_store(the_repository->objects); 1487 1630 1488 1631 /* ··· 1524 1667 1525 1668 if (write_midx) { 1526 1669 struct string_list include = STRING_LIST_INIT_DUP; 1527 - midx_included_packs(&include, &existing, &names, &geometry); 1670 + midx_included_packs(&include, &existing, midx_pack_names, 1671 + midx_pack_names_nr, &names, &geometry); 1528 1672 1529 1673 ret = write_midx_included_packs(&include, &geometry, &names, 1530 1674 refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL, ··· 1575 1719 string_list_clear(&names, 1); 1576 1720 existing_packs_release(&existing); 1577 1721 free_pack_geometry(&geometry); 1722 + for (size_t i = 0; i < midx_pack_names_nr; i++) 1723 + free(midx_pack_names[i]); 1724 + free(midx_pack_names); 1578 1725 pack_objects_args_release(&po_args); 1579 1726 pack_objects_args_release(&cruft_po_args); 1580 1727
+145
t/t7704-repack-cruft.sh
··· 724 724 ) 725 725 ' 726 726 727 + setup_cruft_exclude_tests() { 728 + git init "$1" && 729 + ( 730 + cd "$1" && 731 + 732 + git config repack.midxMustContainCruft false && 733 + 734 + test_commit one && 735 + 736 + test_commit --no-tag two && 737 + two="$(git rev-parse HEAD)" && 738 + test_commit --no-tag three && 739 + three="$(git rev-parse HEAD)" && 740 + git reset --hard one && 741 + git reflog expire --all --expire=all && 742 + 743 + GIT_TEST_MULTI_PACK_INDEX=0 git repack --cruft -d && 744 + 745 + git merge $two && 746 + test_commit four 747 + ) 748 + } 749 + 750 + test_expect_success 'repack --write-midx excludes cruft where possible' ' 751 + setup_cruft_exclude_tests exclude-cruft-when-possible && 752 + ( 753 + cd exclude-cruft-when-possible && 754 + 755 + GIT_TEST_MULTI_PACK_INDEX=0 \ 756 + git repack -d --geometric=2 --write-midx --write-bitmap-index && 757 + 758 + test-tool read-midx --show-objects $objdir >midx && 759 + cruft="$(ls $packdir/*.mtimes)" && 760 + test_grep ! "$(basename "$cruft" .mtimes).idx" midx && 761 + 762 + git rev-list --all --objects --no-object-names >reachable.raw && 763 + sort reachable.raw >reachable.objects && 764 + awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && 765 + 766 + test_cmp reachable.objects midx.objects 767 + ) 768 + ' 769 + 770 + test_expect_success 'repack --write-midx includes cruft when instructed' ' 771 + setup_cruft_exclude_tests exclude-cruft-when-instructed && 772 + ( 773 + cd exclude-cruft-when-instructed && 774 + 775 + GIT_TEST_MULTI_PACK_INDEX=0 \ 776 + git -c repack.midxMustContainCruft=true repack \ 777 + -d --geometric=2 --write-midx --write-bitmap-index && 778 + 779 + test-tool read-midx --show-objects $objdir >midx && 780 + cruft="$(ls $packdir/*.mtimes)" && 781 + test_grep "$(basename "$cruft" .mtimes).idx" midx && 782 + 783 + git cat-file --batch-check="%(objectname)" --batch-all-objects \ 784 + >all.objects && 785 + awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && 786 + 787 + test_cmp all.objects midx.objects 788 + ) 789 + ' 790 + 791 + test_expect_success 'repack --write-midx includes cruft when necessary' ' 792 + setup_cruft_exclude_tests exclude-cruft-when-necessary && 793 + ( 794 + cd exclude-cruft-when-necessary && 795 + 796 + test_path_is_file $(ls $packdir/pack-*.mtimes) && 797 + ( cd $packdir && ls pack-*.idx ) | sort >packs.all && 798 + git multi-pack-index write --stdin-packs --bitmap <packs.all && 799 + 800 + test_commit five && 801 + GIT_TEST_MULTI_PACK_INDEX=0 \ 802 + git repack -d --geometric=2 --write-midx --write-bitmap-index && 803 + 804 + test-tool read-midx --show-objects $objdir >midx && 805 + awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && 806 + git cat-file --batch-all-objects --batch-check="%(objectname)" \ 807 + >expect.objects && 808 + test_cmp expect.objects midx.objects && 809 + 810 + grep "^pack-" midx >midx.packs && 811 + test_line_count = "$(($(wc -l <packs.all) + 1))" midx.packs 812 + ) 813 + ' 814 + 815 + test_expect_success 'repack --write-midx includes cruft when already geometric' ' 816 + git init repack--write-midx-geometric-noop && 817 + ( 818 + cd repack--write-midx-geometric-noop && 819 + 820 + git branch -M main && 821 + test_commit A && 822 + test_commit B && 823 + 824 + git checkout -B side && 825 + test_commit --no-tag C && 826 + C="$(git rev-parse HEAD)" && 827 + 828 + git checkout main && 829 + git branch -D side && 830 + git reflog expire --all --expire=all && 831 + 832 + # At this point we have two packs: one containing the 833 + # objects belonging to commits A and B, and another 834 + # (cruft) pack containing the objects belonging to 835 + # commit C. 836 + git repack --cruft -d && 837 + 838 + # Create a third pack which contains a merge commit 839 + # making commit C reachable again. 840 + # 841 + # --no-ff is important here, as it ensures that we 842 + # actually write a new object and subsequently a new 843 + # pack to contain it. 844 + git merge --no-ff $C && 845 + git repack -d && 846 + 847 + ls $packdir/pack-*.idx | sort >packs.all && 848 + cruft="$(ls $packdir/pack-*.mtimes)" && 849 + cruft="${cruft%.mtimes}.idx" && 850 + 851 + for idx in $(grep -v $cruft <packs.all) 852 + do 853 + git show-index <$idx >out && 854 + wc -l <out || return 1 855 + done >sizes.raw && 856 + 857 + # Make sure that there are two non-cruft packs, and 858 + # that one of them contains at least twice as many 859 + # objects as the other, ensuring that they are already 860 + # in a geometric progression. 861 + sort -n sizes.raw >sizes && 862 + test_line_count = 2 sizes && 863 + s1=$(head -n 1 sizes) && 864 + s2=$(tail -n 1 sizes) && 865 + test "$s2" -gt "$((2 * $s1))" && 866 + 867 + git -c repack.midxMustContainCruft=false repack --geometric=2 \ 868 + --write-midx --write-bitmap-index 869 + ) 870 + ' 871 + 727 872 test_done