Git fork

bundle: fix non-linear performance scaling with refs

The 'git bundle create' command scales non-linearly with the number of
refs in the repository. Benchmarking the command shows that a large
portion of the time (~75%) is spent in the
`object_array_remove_duplicates()` function.

The `object_array_remove_duplicates()` function was added in
b2a6d1c686 (bundle: allow the same ref to be given more than once,
2009-01-17) to keep duplicate refs provided by the user from being
written to the bundle. Since this is an O(N^2) algorithm, it can take up
a large amount of time in repos with a large number of references.
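
To get a feel for the quadratic cost, the sketch below mimics the
pairwise scan on a plain array of names and counts the string
comparisons it performs. It is a simplified illustration, not code from
git: the function name `count_dedup_comparisons` is hypothetical, and it
compares names only, whereas the removed helper also compared object
pointers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the removed pairwise scan: every entry is
 * compared against the entries kept so far. Returns the number of
 * strcmp() calls and stores the number of unique names in *kept.
 */
uint64_t count_dedup_comparisons(const char **names, size_t nr, size_t *kept)
{
	uint64_t comparisons = 0;
	size_t out = 0;

	for (size_t src = 0; src < nr; src++) {
		int dup = 0;

		for (size_t i = 0; i < out; i++) {
			comparisons++;
			if (!strcmp(names[i], names[src])) {
				dup = 1;
				break;
			}
		}
		if (!dup)
			names[out++] = names[src];
	}
	*kept = out;
	return comparisons;
}
```

With N distinct names the inner loop runs 0 + 1 + ... + (N - 1) =
N(N - 1)/2 times, so 100000 distinct refs need 4999950000 comparisons.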

Let's instead use a 'strset' to skip duplicates inside
`write_bundle_refs()`. This improves the performance by around 6 times
when tested against a repository with 100000 refs:

  Benchmark 1: bundle (refcount = 100000, revision = master)
    Time (mean ± σ):     14.653 s ±  0.203 s    [User: 13.940 s, System: 0.762 s]
    Range (min … max):   14.237 s … 14.920 s    10 runs

  Benchmark 2: bundle (refcount = 100000, revision = HEAD)
    Time (mean ± σ):      2.394 s ±  0.023 s    [User: 1.684 s, System: 0.798 s]
    Range (min … max):    2.364 s …  2.425 s    10 runs

  Summary
    bundle (refcount = 100000, revision = HEAD) ran
      6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
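
The idea behind the set-based approach can be sketched in standalone C.
This is not git's strset: `struct name_set`, `name_set_add()` and
`dedup_names()` are hypothetical names for a minimal open-addressing
string set, shown only to illustrate why each duplicate check takes
constant expected time instead of a scan over all earlier entries.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal string set; git's own strset is more general. */
struct name_set {
	const char **slots;	/* NULL marks an empty slot */
	size_t cap;		/* always a power of two */
};

static uint32_t hash_name(const char *s)
{
	uint32_t h = 2166136261u;	/* FNV-1a */

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

/* Returns 1 if name was newly added, 0 if it was already present. */
static int name_set_add(struct name_set *set, const char *name)
{
	size_t i = hash_name(name) & (set->cap - 1);

	while (set->slots[i]) {
		if (!strcmp(set->slots[i], name))
			return 0;
		i = (i + 1) & (set->cap - 1);	/* linear probing */
	}
	set->slots[i] = name;
	return 1;
}

/*
 * Keep the first occurrence of each name, preserving order. Returns the
 * new count, or (size_t)-1 if the table cannot be allocated.
 */
size_t dedup_names(const char **names, size_t nr)
{
	struct name_set set = { NULL, 16 };
	size_t out = 0;

	while (set.cap < 2 * nr)	/* at most half the slots get used */
		set.cap *= 2;
	set.slots = calloc(set.cap, sizeof(*set.slots));
	if (!set.slots)
		return (size_t)-1;

	for (size_t i = 0; i < nr; i++)
		if (name_set_add(&set, names[i]))
			names[out++] = names[i];

	free(set.slots);
	return out;
}
```

Each name is hashed once and probed against a handful of slots, so the
total work grows roughly linearly with the number of refs.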

Previously, `object_array_remove_duplicates()` ensured that both the
refname and the object it pointed to were checked for duplicates. The
new approach, implemented within `write_bundle_refs()`, eliminates
duplicate refnames without comparing the objects they reference. This
works because, for bundle creation, we only need to prevent duplicate
refs from being written to the bundle header. The `revs->pending` array
can contain duplicates of multiple types.

First, references which resolve to the same refname. For example, "git
bundle create out.bdl master master" or "git bundle create out.bdl
refs/heads/master refs/heads/master" or "git bundle create out.bdl
master refs/heads/master". In these scenarios we want to prevent writing
"refs/heads/master" twice to the bundle header. Since both the refnames
here would point to the same object (unless there is a race), we do not
need to check equality of the object.

Second, refnames which are duplicates but do not point to the same
object. This can happen when we use an exclusion criterion. For example,
with "git bundle create out.bdl master master^!", `revs->pending` would
contain two elements, both with refname set to "master". However, the
first would point to an INTERESTING object and the second to an
UNINTERESTING one. Since we only write refnames with INTERESTING objects
to the bundle header, we perform our duplicate checks only on such
objects.
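
The second case can be sketched as follows. This is a simplified
illustration under stated assumptions, not the code in `bundle.c`:
`struct pending_entry`, `UNINTERESTING_FLAG` and `count_header_refs()`
are hypothetical, and the POSIX `hsearch()` table stands in for git's
strset.

```c
#include <search.h>
#include <stddef.h>

#define UNINTERESTING_FLAG (1u << 1)	/* hypothetical flag bit */

/* Hypothetical, simplified stand-in for a pending entry. */
struct pending_entry {
	const char *refname;
	unsigned flags;
};

/*
 * Count the refnames that would reach the header. Entries marked
 * uninteresting are skipped before the duplicate check, so only
 * interesting entries are ever added to the table.
 * Returns -1 if the hash table cannot be created or filled.
 */
int count_header_refs(const struct pending_entry *pending, size_t nr)
{
	int written = 0;

	if (!hcreate(nr ? 2 * nr : 1))
		return -1;

	for (size_t i = 0; i < nr; i++) {
		ENTRY probe = { .key = (char *)pending[i].refname, .data = NULL };

		if (pending[i].flags & UNINTERESTING_FLAG)
			continue;	/* exclusions never reach the header */
		if (hsearch(probe, FIND))
			continue;	/* refname already counted */
		if (!hsearch(probe, ENTER)) {
			written = -1;
			break;
		}
		written++;
	}

	hdestroy();
	return written;
}
```

For two entries named "master", one interesting and one uninteresting,
the function counts "master" once, which mirrors the single header line
expected for "master master^!".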

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

Authored by Karthik Nayak and committed by Junio C Hamano
a52d459e 09d86e0b

+7 -44
+7 -1
bundle.c
···
 {
 	int i;
 	int ref_count = 0;
+	struct strset objects = STRSET_INIT;
 
 	for (i = 0; i < revs->pending.nr; i++) {
 		struct object_array_entry *e = revs->pending.objects + i;
···
 		if (refs_read_ref_full(get_main_ref_store(the_repository), e->name, RESOLVE_REF_READING, &oid, &flag))
 			flag = 0;
 		display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
+
+		if (strset_contains(&objects, display_ref))
+			goto skip_write_ref;
 
 		if (e->item->type == OBJ_TAG &&
 				!is_tag_in_date_range(e->item, revs)) {
···
 		}
 
 		ref_count++;
+		strset_add(&objects, display_ref);
 		write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
 		write_or_die(bundle_fd, " ", 1);
 		write_or_die(bundle_fd, display_ref, strlen(display_ref));
···
 	skip_write_ref:
 		free(ref);
 	}
+
+	strset_clear(&objects);
 
 	/* end header */
 	write_or_die(bundle_fd, "\n", 1);
···
 	 */
 	revs.blob_objects = revs.tree_objects = 0;
 	traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
-	object_array_remove_duplicates(&revs_copy.pending);
 
 	/* write bundle refs */
 	ref_count = write_bundle_refs(bundle_fd, &revs_copy);
-33
object.c
···
 	array->nr = array->alloc = 0;
 }
 
-/*
- * Return true if array already contains an entry.
- */
-static int contains_object(struct object_array *array,
-			   const struct object *item, const char *name)
-{
-	unsigned nr = array->nr, i;
-	struct object_array_entry *object = array->objects;
-
-	for (i = 0; i < nr; i++, object++)
-		if (item == object->item && !strcmp(object->name, name))
-			return 1;
-	return 0;
-}
-
-void object_array_remove_duplicates(struct object_array *array)
-{
-	unsigned nr = array->nr, src;
-	struct object_array_entry *objects = array->objects;
-
-	array->nr = 0;
-	for (src = 0; src < nr; src++) {
-		if (!contains_object(array, objects[src].item,
-				     objects[src].name)) {
-			if (src != array->nr)
-				objects[array->nr] = objects[src];
-			array->nr++;
-		} else {
-			object_array_release_entry(&objects[src]);
-		}
-	}
-}
-
 void clear_object_flags(unsigned flags)
 {
 	int i;
-6
object.h
···
 			 object_array_each_func_t want, void *cb_data);
 
 /*
- * Remove from array all but the first entry with a given name.
- * Warning: this function uses an O(N^2) algorithm.
- */
-void object_array_remove_duplicates(struct object_array *array);
-
-/*
  * Remove any objects from the array, freeing all used memory; afterwards
  * the array is ready to store more objects with add_object_array().
  */
-4
t/t6020-bundle-misc.sh
···
 	test_cmp expect actual
 '
 
-# This exhibits a bug, since the same refname is now added to the bundle twice.
 test_expect_success 'create bundle with duplicate refnames and --all' '
 	git bundle create out.bdl --all "main" "main" &&
 
···
 	<TAG-2> refs/tags/v2
 	<TAG-3> refs/tags/v3
 	<COMMIT-P> HEAD
-	<COMMIT-P> refs/heads/main
 	EOF
 	test_cmp expect actual
 '
···
 	test_cmp expect actual
 '
 
-# This exhibits a bug, since the same refname is now added to the bundle twice.
 test_expect_success 'create bundle with duplicate refname short-form' '
 	git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
 
 	git bundle list-heads out.bdl |
 		make_user_friendly_and_stable_output >actual &&
 	cat >expect <<-\EOF &&
-	<COMMIT-P> refs/heads/main
 	<COMMIT-P> refs/heads/main
 	EOF
 	test_cmp expect actual