Bundle URIs
===========

Git bundles are files that store a pack-file along with some extra metadata,
including a set of refs and a (possibly empty) set of necessary commits. See
linkgit:git-bundle[1] and linkgit:gitformat-bundle[5] for more information.

Bundle URIs are locations where Git can download one or more bundles in
order to bootstrap the object database in advance of fetching the remaining
objects from a remote.

One goal is to speed up clones and fetches for users with poor network
connectivity to the origin server. Another benefit is to allow heavy users,
such as CI build farms, to use local resources for the majority of Git data
and thereby reduce the load on the origin server.

To enable the bundle URI feature, users can specify a bundle URI using
command-line options or the origin server can advertise one or more URIs
via a protocol v2 capability.

Design Goals
------------

The bundle URI standard aims to be flexible enough to satisfy multiple
workloads. The bundle provider and the Git client have several choices in
how they create and consume bundle URIs.

* Bundles can have whatever name the server desires. This name could refer
  to immutable data by using a hash of the bundle contents. However, this
  means that a new URI will be needed after every update of the content.
  This might be acceptable if the server is advertising the URI (and the
  server is aware of new bundles being generated) but would not be
  ergonomic for users using the command-line option.

* The bundles could be organized specifically for bootstrapping full
  clones, but could also be organized with the intention of bootstrapping
  incremental fetches. The bundle provider must decide on one of several
  organization schemes to minimize client downloads during incremental
  fetches, but the Git client can also choose whether to use bundles for
  either of these operations.

* The bundle provider can choose to support full clones, partial clones,
  or both. The client can detect which bundles are appropriate for the
  repository's partial clone filter, if any.

* The bundle provider can use a single bundle (for clones only), or a
  list of bundles. When using a list of bundles, the provider can specify
  whether or not the client needs _all_ of the bundle URIs for a full
  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
  bundle provider to use different URIs for different geographies.

* The bundle provider can organize the bundles using heuristics, such as
  creation tokens, to help the client avoid downloading bundles it does
  not need. When the bundle provider does not provide these heuristics,
  the client can use optimizations to minimize how much of the data is
  downloaded.

* The bundle provider does not need to be associated with the Git server.
  The client can choose to use the bundle provider without it being
  advertised by the Git server.

* The client can choose to discover bundle providers that are advertised
  by the Git server. This could happen during `git clone`, during
  `git fetch`, both, or neither. The user can choose which combination
  works best for them.

* The client can choose to configure a bundle provider manually at any
  time. The client can also choose to specify a bundle provider manually
  as a command-line option to `git clone`.

Each repository is different and every Git server has different needs.
Hopefully the bundle URI feature is flexible enough to satisfy all needs.
If not, then the feature can be extended through its versioning mechanism.

Server requirements
-------------------

To provide a server-side implementation of bundle servers, no other parts
of the Git protocol are required. This allows server maintainers to use
static content solutions such as CDNs in order to serve the bundle files.

At the current scope of the bundle URI feature, all URIs are expected to
be HTTP(S) URLs where content is downloaded to a local file using a `GET`
request to that URL. The server could include authentication requirements
to those requests with the aim of triggering the configured credential
helper for secure access. (Future extensions could use "file://" URIs or
SSH URIs.)

Assuming a `200 OK` response from the server, the content at the URL is
inspected. First, Git attempts to parse the file as a bundle file of
version 2 or higher. If the file is not a bundle, then the file is parsed
as a plain-text file using Git's config parser. The key-value pairs in
that config file are expected to describe a list of bundle URIs. If
neither of these parse attempts succeeds, then Git will report an error to
the user that the bundle URI provided erroneous data.

Any other data provided by the server is considered erroneous.

Bundle Lists
------------

The Git server can advertise bundle URIs using a set of `key=value` pairs.
A bundle URI can also serve a plain-text file in the Git config format
containing these same `key=value` pairs. In both cases, we consider this
to be a _bundle list_. The pairs specify information about the bundles
that the client can use to make decisions for which bundles to download
and which to ignore.

A few keys focus on properties of the list itself.

bundle.version::
	(Required) This value provides a version number for the bundle
	list. If a future Git change enables a feature that needs the Git
	client to react to a new key in the bundle list file, then this version
	will increment. The only current version number is 1, and if any other
	value is specified then Git will fail to use this file.

bundle.mode::
	(Required) This value has one of two values: `all` and `any`.
	When `all` is specified, then the client should expect to need all of the
	listed bundle URIs that match their repository's requirements. When `any`
	is specified, then the client should expect that any one of the bundle
	URIs that match their repository's requirements will suffice. Typically,
	the `any` option is used to list a number of different bundle servers
	located in different geographies.

bundle.heuristic::
	If this string-valued key exists, then the bundle list is designed to
	work well with incremental `git fetch` commands. The heuristic signals
	that there are additional keys available for each bundle that help
	determine which subset of bundles the client should download. The only
	heuristic currently planned is `creationToken`.

The remaining keys include an `<id>` segment which is a server-designated
name for each available bundle. The `<id>` must contain only alphanumeric
and `-` characters.

bundle.<id>.uri::
	(Required) This string value is the URI for downloading bundle `<id>`.
	If the URI begins with a protocol (`http://` or `https://`) then the URI
	is absolute. Otherwise, the URI is interpreted as relative to the URI
	used for the bundle list. If the URI begins with `/`, then that relative
	path is relative to the domain name used for the bundle list. (This use
	of relative paths is intended to make it easier to distribute a set of
	bundles across a large number of servers or CDNs with different domain
	names.)

bundle.<id>.filter::
	This string value represents an object filter that should also appear in
	the header of this bundle. The server uses this value to differentiate
	different kinds of bundles from which the client can choose those that
	match their object filters.

bundle.<id>.creationToken::
	This value is a nonnegative 64-bit integer used for sorting the bundles
	list.
	This is used to download a subset of bundles during a fetch when
	`bundle.heuristic=creationToken`.

bundle.<id>.location::
	This string value advertises a real-world location from where the bundle
	URI is served. This can be used to present the user with an option for
	which bundle URI to use or simply as an informative indicator of which
	bundle URI was selected by Git. This is only valuable when
	`bundle.mode` is `any`.

Here is an example bundle list using the Git config format:

	[bundle]
		version = 1
		mode = all
		heuristic = creationToken

	[bundle "2022-02-09-1644442601-daily"]
		uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
		creationToken = 1644442601

	[bundle "2022-02-02-1643842562"]
		uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
		creationToken = 1643842562

	[bundle "2022-02-09-1644442631-daily-blobless"]
		uri = 2022-02-09-1644442631-daily-blobless.bundle
		creationToken = 1644442631
		filter = blob:none

	[bundle "2022-02-02-1643842568-blobless"]
		uri = /git/git/2022-02-02-1643842568-blobless.bundle
		creationToken = 1643842568
		filter = blob:none

This example uses `bundle.mode=all` as well as the
`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
options to present two parallel sets of bundles: one for full clones and
another for blobless partial clones.
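
As an illustration only, the following Python sketch reads a bundle list in
the config format, keeps the bundles whose `filter` matches the repository's
partial clone filter, and resolves relative `uri` values against the URI of
the list. It is not how Git implements this: the names `parse_bundle_list`
and `select_bundles` are hypothetical, and the parser is a deliberately
simplified stand-in for Git's config parser.

```python
import re
from urllib.parse import urljoin

# Hypothetical, simplified parser: it understands only `[bundle]` and
# `[bundle "<id>"]` sections with `key = value` lines, not the full
# Git config syntax.
SECTION = re.compile(r'^\[bundle(?:\s+"(?P<id>[A-Za-z0-9-]+)")?\]$')


def parse_bundle_list(text):
    """Return (list_keys, bundles) parsed from config-format text."""
    list_keys, bundles, current = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        match = SECTION.match(line)
        if match:
            bundle_id = match.group("id")
            if bundle_id is None:
                current = list_keys
            else:
                current = bundles.setdefault(bundle_id, {})
            continue
        if current is None or "=" not in line:
            raise ValueError(f"unparsable line: {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        current[key.lower()] = value  # config keys are case-insensitive
    if list_keys.get("version") != "1":
        raise ValueError("unsupported bundle.version")
    if list_keys.get("mode") not in ("all", "any"):
        raise ValueError("bundle.mode must be 'all' or 'any'")
    return list_keys, bundles


def select_bundles(bundles, list_uri, repo_filter=None):
    """Keep bundles matching repo_filter and resolve their URIs."""
    return {
        bundle_id: urljoin(list_uri, keys["uri"])
        for bundle_id, keys in bundles.items()
        if keys.get("filter") == repo_filter
    }
```

Calling `select_bundles(bundles, list_uri, "blob:none")` on the example above
would keep only the two blobless bundles, while the default `repo_filter=None`
keeps the two full bundles.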

Suppose that this bundle list was found at the URI
`https://bundles.example.com/git/git/`. Then the two blobless bundles have
the following fully-expanded URIs:

* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`

Advertising Bundle URIs
-----------------------

If a user knows a bundle URI for the repository they are cloning, then
they can specify that URI manually through a command-line option. However,
a Git host may want to advertise bundle URIs during the clone operation,
helping users unaware of the feature.

The only thing required for this feature is that the server can advertise
one or more bundle URIs. This advertisement takes the form of a new
protocol v2 capability specifically for discovering bundle URIs.

The client could choose an arbitrary bundle URI as an option _or_ select
the URI with best performance by some exploratory checks. It is up to the
bundle provider to decide if having multiple URIs is preferable to a
single URI that is geodistributed through server-side infrastructure.

Cloning with Bundle URIs
------------------------

The primary need for bundle URIs is to speed up clones. The Git client
will interact with bundle URIs according to the following flow:

1. The user specifies a bundle URI with the `--bundle-uri` command-line
   option _or_ the client discovers a bundle list advertised by the
   Git server.

2. If the downloaded data from a bundle URI is a bundle, then the client
   inspects the bundle headers to check that the prerequisite commit OIDs
   are present in the client repository. If some are missing, then the
   client delays unbundling until other bundles have been unbundled,
   making those OIDs present. When all required OIDs are present, the
   client unbundles that data using a refspec.
   The refspec used is `+refs/*:refs/bundles/*`. These refs are stored so
   that later `git fetch` negotiations can communicate each bundled ref as
   a `have`, reducing the size of the fetch over the Git protocol. To allow
   pruning refs from this ref namespace, Git may introduce a numbered
   namespace (such as `refs/bundles/<i>/*`) such that stale bundle refs can
   be deleted.

3. If the file is instead a bundle list, then the client inspects the
   `bundle.mode` to see if the list is of the `all` or `any` form.

   a. If `bundle.mode=all`, then the client considers all bundle
      URIs. The list is reduced based on the `bundle.<id>.filter` options
      matching the client repository's partial clone filter. Then, all
      bundle URIs are requested. If the `bundle.<id>.creationToken`
      heuristic is provided, then the bundles are downloaded in decreasing
      order by the creation token, stopping when a bundle has all required
      OIDs. The bundles can then be unbundled in increasing creation token
      order. The client stores the latest creation token as a heuristic
      for avoiding future downloads if the bundle list does not advertise
      bundles with larger creation tokens.

   b. If `bundle.mode=any`, then the client can choose any one of the
      bundle URIs to inspect. The client can use a variety of ways to
      choose among these URIs. The client can also fall back to another URI
      if the initial choice fails to return a result.

Note that during a clone we expect that all bundles will be required, and
heuristics such as `bundle.<id>.creationToken` can be used to download
bundles in chronological order or in parallel.

If a given bundle URI is a bundle list with a `bundle.heuristic`
value, then the client can choose to store that URI as its chosen bundle
URI. The client can then navigate directly to that URI during later `git
fetch` calls.
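
The `creationToken` ordering in step 3a can be sketched in Python as follows.
This is a simplified model under stated assumptions, not Git's
implementation: each bundle is a dictionary with the hypothetical keys
`token`, `prereqs` and `tips`, a bundle is assumed to make exactly its `tips`
available once unbundled, and `plan_unbundle` is an illustrative name.

```python
def plan_unbundle(bundles, have, stored_token=0):
    """Pick bundles to download and return them in unbundle order.

    bundles: list of dicts with hypothetical keys "token" (int),
             "prereqs" (set of OIDs) and "tips" (set of OIDs).
    have:    set of commit OIDs already in the client repository.
    Returns (bundles in unbundle order, creation token to store).
    """
    # Skip everything at or below the token stored by a previous run.
    candidates = [b for b in bundles if b["token"] > stored_token]

    # Download newest first, stopping once a bundle's prerequisites
    # are already present locally.
    downloaded = []
    for bundle in sorted(candidates, key=lambda b: b["token"], reverse=True):
        downloaded.append(bundle)
        if bundle["prereqs"] <= have:
            break

    # Unbundle oldest first so each bundle finds its prerequisites.
    order = sorted(downloaded, key=lambda b: b["token"])
    present = set(have)
    for bundle in order:
        if not bundle["prereqs"] <= present:
            raise ValueError(f"missing prerequisites for token {bundle['token']}")
        present |= bundle["tips"]

    new_token = max((b["token"] for b in order), default=stored_token)
    return order, new_token
```

With an empty `have` set and no stored token, as in a fresh clone, the loop
walks back to the oldest bundle, which has no prerequisites, and every bundle
is unbundled. When `stored_token` already equals the largest advertised
token, nothing is downloaded.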

When downloading bundle URIs, the client can choose to inspect the initial
content before committing to downloading the entire content. This may
provide enough information to determine if the URI is a bundle list or
a bundle. In the case of a bundle, the client may inspect the bundle
header to determine that all advertised tips are already in the client
repository and cancel the remaining download.

Fetching with Bundle URIs
-------------------------

When the client fetches new data, it can decide to fetch from bundle
servers before fetching from the origin remote. This could be done via a
command-line option, but it is more likely useful to use a config value
such as the one specified during the clone.

The fetch operation follows the same procedure to download bundles from a
bundle list (although we do _not_ want to use parallel downloads here). We
expect that the process will end when all prerequisite commit OIDs in a
thin bundle are already in the object database.

When using the `creationToken` heuristic, the client can avoid downloading
any bundles if their creation tokens are not larger than the stored
creation token. After fetching new bundles, Git updates this local
creation token.

If the bundle provider does not provide a heuristic, then the client
should attempt to inspect the bundle headers before downloading the full
bundle data in case the bundle tips already exist in the client
repository.

Error Conditions
----------------

If the Git client discovers something unexpected while downloading
information according to a bundle URI or the bundle list found at that
location, then Git can ignore that data and continue as if it was not
given a bundle URI. The remote Git server is the ultimate source of truth,
not the bundle URI.
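
A minimal Python sketch of this fallback rule follows. The exception class
and both callables are hypothetical names introduced for illustration, and
logging a warning is an assumption of the sketch rather than documented Git
behavior.

```python
import logging


class BundleURIError(Exception):
    """Hypothetical error for any failure while using a bundle URI."""


def fetch_with_bundle_uri(use_bundle_uri, fetch_from_remote):
    """Try the bundle URI first, then always fetch from the remote.

    Both arguments are caller-supplied callables (assumptions for this
    sketch); use_bundle_uri may raise BundleURIError.
    """
    try:
        use_bundle_uri()
    except BundleURIError as err:
        # Bundle data is only an optimization: note the problem and
        # continue as if no bundle URI had been given.
        logging.warning("ignoring bundle URI: %s", err)
    # The remote is contacted whether or not the bundles were usable.
    return fetch_from_remote()
```

The point of the design is that a bundle failure never changes the result of
the operation, only how much data has to come from the origin server.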

Here are a few example error conditions:

* The client fails to connect with a server at the given URI or a connection
  is lost without any chance to recover.

* The client receives a 400-level response (such as `404 Not Found` or
  `401 Unauthorized`). The client should use the credential helper to
  find and provide a credential for the URI, but match the semantics of
  Git's other HTTP protocols in terms of handling specific 400-level
  errors.

* The server reports any other failure response.

* The client receives data that is not parsable as a bundle or bundle list.

* A bundle includes a filter that does not match expectations.

* The client cannot unbundle the bundles because the prerequisite commit OIDs
  are not in the object database and there are no more bundles to download.

There are also situations that could be seen as wasteful, but are not
error conditions:

* The downloaded bundles contain more information than is requested by
  the clone or fetch request. A primary example is if the user requests
  a clone with `--single-branch` but downloads bundles that store every
  reachable commit from all `refs/heads/*` references. This might be
  initially wasteful, but perhaps these objects will become reachable by
  a later ref update that the client cares about.

* A bundle download during a `git fetch` contains objects already in the
  object database. This is probably unavoidable if we are using bundles
  for fetches, since the client will almost always be slightly ahead of
  the bundle servers after performing its "catch-up" fetch to the remote
  server. This extra work is most wasteful when the client is fetching
  much more frequently than the server is computing bundles, such as if
  the client is using hourly prefetches with background maintenance, but
  the server is computing bundles weekly.
  For this reason, the client should not use bundle URIs for fetch unless
  the server has explicitly recommended it through a `bundle.heuristic`
  value.

Example Bundle Provider organization
------------------------------------

The bundle URI feature is intentionally designed to be flexible to
different ways a bundle provider wants to organize the object data.
However, it can be helpful to have a complete organization model described
here so providers can start from that base.

This example organization is a simplified model of what is used by the
GVFS Cache Servers (see section near the end of this document) which have
been beneficial in speeding up clones and fetches for very large
repositories, although using extra software outside of Git.

The bundle provider deploys servers across multiple geographies. Each
server manages its own bundle set. The server can track a number of Git
repositories, but provides a bundle list for each based on a pattern. For
example, when mirroring a repository at `https://<domain>/<org>/<repo>`
the bundle server could have its bundle list available at
`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
list all of these servers under the "any" mode:

	[bundle]
		version = 1
		mode = any

	[bundle "eastus"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>

	[bundle "europe"]
		uri = https://europe.example.com/<domain>/<org>/<repo>

	[bundle "apac"]
		uri = https://apac.example.com/<domain>/<org>/<repo>

This "list of lists" is static and only changes if a bundle server is
added or removed.

Each bundle server manages its own set of bundles. The initial bundle list
contains only a single bundle, containing all of the objects received from
cloning the repository from the origin server.
The list uses the `creationToken` heuristic and a `creationToken` is made
for the bundle based on the server's timestamp.

The bundle server runs regularly-scheduled updates for the bundle list,
such as once a day. During this task, the server fetches the latest
contents from the origin server and generates a bundle containing the
objects reachable from the latest origin refs, but not contained in a
previously-computed bundle. This bundle is added to the list, with care
that the `creationToken` is strictly greater than the previous maximum
`creationToken`.

When the bundle list grows too large, say more than 30 bundles, then the
oldest "_N_ minus 30" bundles are combined into a single bundle. This
bundle's `creationToken` is equal to the maximum `creationToken` among the
merged bundles.

An example bundle list is provided here, although it only has two daily
bundles and not a full list of 30:

	[bundle]
		version = 1
		mode = all
		heuristic = creationToken

	[bundle "2022-02-13-1644770820-daily"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-13-1644770820-daily.bundle
		creationToken = 1644770820

	[bundle "2022-02-09-1644442601-daily"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
		creationToken = 1644442601

	[bundle "2022-02-02-1643842562"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
		creationToken = 1643842562

To avoid storing and serving object data in perpetuity despite becoming
unreachable in the origin server, this bundle merge can be more careful.
Instead of taking an absolute union of the old bundles, the bundle
can be created by looking at the newer bundles and ensuring that their
necessary commits are all available in this merged bundle (or in another
one of the newer bundles).
This allows "expiring" object data that is not being used by new commits in
this window of time. That data could be reintroduced by a later push.

This data organization has two main goals. First, initial
clones of the repository become faster by downloading precomputed object
data from a closer source. Second, `git fetch` commands can be faster,
especially if the client has not fetched for a few days. However, if a
client does not fetch for 30 days, then the bundle list organization would
cause redownloading a large amount of object data.

One way to make this organization more useful to users who fetch frequently
is to have more frequent bundle creation. For example, bundles could be
created every hour, and then once a day those "hourly" bundles could be
merged into a "daily" bundle. The daily bundles are merged into the
oldest bundle after 30 days.

It is recommended that this bundle strategy is repeated with the `blob:none`
filter if clients of this repository are expecting to use blobless partial
clones. This list of blobless bundles stays in the same list as the full
bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
For very large repositories, the bundle provider may want to _only_ provide
blobless bundles.

Implementation Plan
-------------------

This design document is being submitted on its own as an aspirational
document, with the goal of implementing all of the mentioned client
features over the course of several patch series. Here is a potential
outline for submitting these features:

1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
   This will include a new `git fetch --bundle-uri` mode for use as the
   implementation underneath `git clone`. The initial version here will
   expect a single bundle at the given URI.

2. Implement the ability to parse a bundle list from a bundle URI and
   update the `git fetch --bundle-uri` logic to properly distinguish
   between `bundle.mode` options. Specifically design the feature so
   that the config format parsing feeds a list of key-value pairs into the
   bundle list logic.

3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
   bundle URIs using the key-value pairs. Plug into the existing key-value
   input to the bundle list logic. Allow `git clone` to discover these
   bundle URIs and bootstrap the client repository from the bundle data.
   (This choice is an opt-in via a config option and a command-line
   option.)

4. Allow the client to understand the `bundle.heuristic` configuration key
   and the `bundle.<id>.creationToken` heuristic. When `git clone`
   discovers a bundle URI with `bundle.heuristic`, it configures the client
   repository to check that bundle URI during later `git fetch <remote>`
   commands.

5. Allow clients to discover bundle URIs during `git fetch` and configure
   a bundle URI for later fetches if `bundle.heuristic` is set.

6. Implement the "inspect headers" heuristic to reduce data downloads when
   the `bundle.<id>.creationToken` heuristic is not available.

As these features are reviewed, this plan might be updated. We also expect
that new designs will be discovered and implemented as this feature
matures and becomes used in real-world scenarios.

Related Work: Packfile URIs
---------------------------

The Git protocol already has a capability where the Git server can list
a set of URLs along with the packfile response when serving a client
request. The client is then expected to download the packfiles at those
locations in order to have a complete understanding of the response.

This mechanism is used by the Gerrit server (implemented with JGit) and
has been effective at reducing CPU load and improving user performance for
clones.

A major downside to this mechanism is that the origin server needs to know
_exactly_ what is in those packfiles, and the packfiles need to be available
to the user for some time after the server has responded. This coupling
between the origin and the packfile data is difficult to manage.

Further, this implementation is extremely hard to make work with fetches.

Related Work: GVFS Cache Servers
--------------------------------

The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
the Git project before Git's partial clone was created. One feature of this
protocol is the idea of a "cache server" which can be colocated with build
machines or developer offices to transfer Git data without overloading the
central server.

The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
endpoint, which allows downloading an object on-demand. This is a critical
piece of the filesystem virtualization of that product.

However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
endpoint. Given an optional timestamp, the cache server responds with a list
of precomputed packfiles containing the commits and trees that were introduced
in those time intervals.

The cache server computes these "prefetch" packfiles using the following
strategy:

1. Every hour, an "hourly" pack is generated with a given timestamp.
2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
3. Nightly, all prefetch packs more than 30 days old are rolled up into
   one pack.

When a user runs `gvfs clone` or `scalar clone` against a repo with cache
servers, the client requests all prefetch packfiles, which is at most
`24 + 30 + 1` packfiles downloading only commits and trees. The client
then follows with a request to the origin server for the references, and
attempts to check out that tip reference. (There is an extra endpoint that
helps get all reachable trees from a given commit, in case that commit
was not already in a prefetch packfile.)

During a `git fetch`, a hook requests the prefetch endpoint using the
most-recent timestamp from a previously-downloaded prefetch packfile.
Only the packfiles with later timestamps are downloaded. Most
users fetch hourly, so they get at most one hourly prefetch pack. Users
whose machines have been off or otherwise have not fetched in over 30 days
might redownload all prefetch packfiles. This is rare.

It is important to note that the clients always contact the origin server
for the refs advertisement, so the refs are frequently "ahead" of the
prefetched pack data. The missing objects are downloaded on-demand using
the `GET /gvfs/objects/{oid}` requests, when needed by a command such as
`git checkout` or `git log`. Some Git optimizations disable checks that
would cause these on-demand downloads to be too aggressive.

See Also
--------

[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
    An earlier RFC for a bundle URI feature.

[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
    The GVFS Protocol