Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository. This includes all commits, trees, and
blobs for the complete life of the repository. For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need such as:

  1. files outside of the user's work area in the tree. For example, in
     a repository with 500K directories and 3.5M files in every commit,
     we can avoid downloading many objects if the user only needs a
     narrow "cone" of the source tree.

  2. large binary assets. For example, in a repository where large build
     artifacts are checked into the tree, we can avoid downloading all
     previous versions of these non-mergeable binary assets and only
     download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage. Missing objects can later be "demand fetched"
if/when needed.

A remote that can later provide the missing objects is called a
promisor remote, as it promises to send the objects when
requested. Initially Git supported only one promisor remote, the origin
remote from which the user cloned and that was configured in the
"extensions.partialClone" config option. Later support for more than
one promisor remote has been implemented.

Use of partial clone requires that the user be online and the origin
remote or other promisor remotes be available for on-demand fetching
of missing objects.
This may or may not be problematic for the user.
For example, if the user can stay within the pre-selected subset of
the source tree, they may not encounter any missing objects.
Alternatively, the user could try to pre-fetch various objects if they
know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.
+
This uses the existing capability discovery mechanism.
See "filter" in linkgit:gitprotocol-pack[5].

- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.
+
There are various filters available to accommodate different situations.
See "--filter=<filter-spec>" in Documentation/rev-list-options.adoc.

- On the server pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.
+
These filtered packfiles are *incomplete* in the traditional sense because
they may contain objects that reference objects not contained in the
packfile and that the client doesn't already have. For example, the
filtered packfile may contain trees or tags that reference missing blobs
or commits that reference missing trees.

- On the client these incomplete packfiles are marked as "promisor packfiles"
  and treated differently by various commands.

- On the client a repository extension is added to the local config to
  prevent older versions of git from failing mid-operation because of
  missing objects that they cannot handle.
  See `extensions.partialClone` in linkgit:git-config[1].


Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing
  due to repository corruption. To differentiate these cases, the
  local repository specially indicates such filtered packfiles
  obtained from promisor remotes as "promisor packfiles".
+
These promisor packfiles consist of a "<name>.promisor" file with
arbitrary contents (like the "<name>.keep" files), in addition to
their "<name>.pack" and "<name>.idx" files.

- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that promisor remotes have promised
  that they have, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately. If not, Git can report a corruption.
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from promisor remotes.
+
When the normal object lookup fails to find an object, Git invokes
promisor_remote_get_direct() to try to get the object from a promisor
remote and then retry the object lookup. This allows objects to be
"faulted in" without complicated prediction algorithms.
+
For efficiency reasons, no check as to whether the missing object is
actually a promisor object is performed.
+
Dynamic object fetching tends to be slow as objects are fetched one at
a time.

- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.
+
This can be used by other commands to bulk prefetch objects.
For example, a "git log -p A..B" may internally want to first do
something like "git rev-list --objects --quiet --missing=print A..B"
and prefetch those objects in bulk.

- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.
+
We are not happy with this global variable and would like to remove it,
but that requires significant refactoring of the object code to pass an
additional flag.
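
The bulk prefetch idea described for `rev-list` above can be sketched in
shell. This is a minimal sketch and not what Git does internally: the
helper names `missing_oids` and `prefetch_missing` are hypothetical, and
it assumes a Git recent enough to support `git fetch --stdin` and
`--no-write-fetch-head`, as well as a promisor remote that allows
objects to be requested by ID.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of Git.

# Read "git rev-list --objects --missing=print" output on stdin and print
# only the IDs of missing objects. rev-list marks those lines with "?".
missing_oids() {
    sed -n 's/^?//p'
}

# Prefetch, in one batch, every object missing from a revision range.
#   $1 = promisor remote name (for example "origin")
#   $2 = revision range (for example "A..B")
prefetch_missing() {
    list=$(mktemp) || return 1
    git rev-list --objects --missing=print "$2" | missing_oids >"$list"
    status=0
    # Skip the fetch when nothing is missing; with empty input "git fetch"
    # would fall back to the remote's default refspecs.
    if [ -s "$list" ]; then
        git fetch --no-tags --no-write-fetch-head --stdin "$1" <"$list"
        status=$?
    fi
    rm -f "$list"
    return "$status"
}
```

Running `prefetch_missing origin A..B` before "git log -p A..B" would
request the missing objects in a single fetch instead of one fetch per
object.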


Fetching Missing Objects
------------------------

- Fetching of objects is done by invoking a "git fetch" subprocess.

- The local repository sends a request with the hashes of all requested
  objects, and does not perform any packfile negotiation.
  It then receives a packfile.

- Because we are reusing the existing fetch mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.

- Fetching with `--refetch` will request a complete new filtered packfile from
  the remote, which can be used to change a filter without needing to
  dynamically fetch missing objects.

Using many promisor remotes
---------------------------

Many promisor remotes can be configured and used.

This allows for example a user to have multiple geographically-close
cache servers for fetching missing blobs while continuing to do
filtered `git-fetch` commands from the central server.

When fetching objects, promisor remotes are tried one after the other
until all the objects have been fetched.

Remotes that are considered "promisor" remotes are those specified by
the following configuration variables:

- `extensions.partialClone = <name>`

- `remote.<name>.promisor = true`

- `remote.<name>.partialCloneFilter = ...`

Only one promisor remote can be configured using the
`extensions.partialClone` config variable. This promisor remote will
be the last one tried when fetching objects.

We decided to make it the last one we try, because it is likely that
someone using many promisor remotes is doing so because the other
promisor remotes are better for some reason (maybe they are closer or
faster for some kind of objects) than the origin, and the origin is
likely to be the remote specified by extensions.partialClone.

This justification is not very strong, but one choice had to be made,
and anyway the long term plan should be to make the order somehow
fully configurable.

For now though the other promisor remotes will be tried in the order
they appear in the config file.

Current Limitations
-------------------

- It is not possible to specify the order in which the promisor
  remotes are tried other than by the order in which they appear
  in the config file.
+
It is also not possible to specify an order to be used when fetching
from one remote and a different order when fetching from another
remote.

- It is not possible to push only specific objects to a promisor
  remote.
+
It is not possible to push at the same time to multiple promisor
remotes in a specific order.

- Dynamic object fetching will only ask promisor remotes for missing
  objects. We assume that promisor remotes have a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as 2
  distinct partitions and does not mix them.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work. This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0
  which means that each object is requested via fetch-pack. The server
  will send a full set of info/refs when the connection is established.
  If there are a large number of refs, this may incur significant overhead.


Future Work
-----------

- Improve the way to specify the order in which promisor remotes are
  tried.
+
For example this could allow specifying explicitly something like:
"When fetching from this remote, I want to use these promisor remotes
in this order, though, when pushing or fetching to that remote, I want
to use those promisor remotes in that order."

- Allow pushing to promisor remotes.
+
The user might want to work in a triangular work flow with multiple
promisor remotes that each have an incomplete view of the repository.

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present). This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as proposed in [5,6] to reduce process startup and
  overhead costs.
+
It would be nice if pack protocol V2 could allow that long-running
process to make a series of requests over a single long-running
connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.
+
Objects in promisor packfiles are allowed to reference missing objects
that can be dynamically fetched from the server. An assumption was
made that loose objects are only created locally and therefore should
not reference a missing object. We may need to revisit that assumption
if, for example, we dynamically fetch a missing tree and store it as a
loose object rather than a single object packfile.
+
This does not necessarily mean we need to mark loose objects as promisor;
it may be sufficient to relax the object lookup or is-promisor functions.


Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.
+
No work has gone into actually doing that; we're just documenting that
it is a common suggestion. We're not sure how it would work and have
no plans to work on it.
+
It is valid for the server to send more objects than requested (even
for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects: Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup.
It would need to be read, updated, and re-written (like the .git/index)
on every explicit "git fetch" command *and* on any dynamic object fetch.

The cost to read, update, and write this file could add significant
overhead to every command if there are many missing objects. For example,
if there are 100M missing blobs, this file would be about 2GB on disk
(20 bytes per SHA-1 OID).

With the "promisor" concept, we *infer* a missing object based upon the
type of packfile that references it.
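
The size estimate in the footnote above can be checked with a little
arithmetic. The sketch below assumes the list would store raw binary
OIDs with no header or per-entry overhead (20 bytes for SHA-1, 32 bytes
for SHA-256); `oid_list_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# Size in bytes of a flat list of raw OIDs: count * bytes-per-OID.
#   $1 = number of missing objects
#   $2 = raw OID length in bytes
oid_list_bytes() {
    echo $(( $1 * $2 ))
}

oid_list_bytes 100000000 20   # SHA-1:   prints 2000000000 (about 2GB)
oid_list_bytes 100000000 32   # SHA-256: prints 3200000000 (about 3.2GB)
```

Under these assumptions the whole file would have to be read and
rewritten on every fetch, which is the cost the "promisor" concept
avoids.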


Related Links
-------------
[0] https://crbug.com/git/2
    Bug#2: Partial Clone

[1] https://lore.kernel.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
    Subject: [RFC] Add support for downloading blobs on demand +
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://lore.kernel.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://lore.kernel.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
    Subject: Proposal for missing blob support in Git repos +
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://lore.kernel.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
    Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://lore.kernel.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
    Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
    Date: Fri, 14 Jul 2017 09:26:50 -0400