One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensive as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and we
are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees while
also achieving a similar goal. It also downloads in batches based on similar
path names, presenting a resumable download if things are interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout. Non-cone mode can describe the included files
using both positive and negative patterns, which changes the possible return
values of path_matches_pattern_list(). Test both kinds of matches for
increased coverage.

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see how
much data is downloaded. The general steps are

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the sparse-checkout,
we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from GitHub
is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux repository is
currently over 6 GB, so this is a valuable way to download a fraction of the
repository and no longer need network access for all reachable objects within
the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the user's
network speed or reliability, the repository's file structure, and how many
versions there are of the file within the sparse-checkout scope. There will
not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

Path-Walk API
=============

The path-walk API walks reachable objects, visiting them in batches grouped
by object type or by the path at which they appear.

For example, all reachable commits are visited in a group. All tags are
visited in a group. Then, all root trees are visited. At some point, all
blobs reachable via a path `my/dir/to/A` are visited. When there are
multiple paths possible to reach the same object, then only one of those
paths is used to visit the object.

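
The batching behavior described above can be pictured with a small
standalone model. This is only a sketch under stated assumptions: the
`toy_*` names are hypothetical, plain strings stand in for object IDs,
and none of this is Git's implementation. It shows two properties from
the description: the callback runs once per path with all objects
grouped under that path, and an object reachable through several paths
is visited through only one of them.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical toy types for illustration only; plain strings stand in
 * for the object IDs that Git itself would pass to a callback.
 */
struct toy_entry {
	const char *path; /* path through which the object was reached */
	const char *oid;  /* stand-in for an object ID */
};

typedef void (*toy_path_fn)(const char *path, const char **oids,
			    size_t nr, void *data);

#define TOY_MAX_ENTRIES 64

/* Return 1 if `oid` was already visited through some earlier path. */
static int toy_seen(const char **seen, size_t nr, const char *oid)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(seen[i], oid))
			return 1;
	return 0;
}

/*
 * Call `fn` once per distinct path with every not-yet-visited object
 * reached through that path. Returns the number of batches emitted,
 * or 0 if the input exceeds the toy's fixed capacity.
 */
static size_t toy_walk_by_path(const struct toy_entry *entries, size_t nr,
			       toy_path_fn fn, void *data)
{
	const char *seen[TOY_MAX_ENTRIES];
	char grouped[TOY_MAX_ENTRIES] = { 0 };
	size_t seen_nr = 0, batches = 0, i, j;

	if (nr > TOY_MAX_ENTRIES)
		return 0;

	for (i = 0; i < nr; i++) {
		const char *batch[TOY_MAX_ENTRIES];
		size_t batch_nr = 0;

		if (grouped[i])
			continue; /* consumed by an earlier batch */

		/* Collect every entry that shares this entry's path. */
		for (j = i; j < nr; j++) {
			if (grouped[j] ||
			    strcmp(entries[j].path, entries[i].path))
				continue;
			grouped[j] = 1;
			if (toy_seen(seen, seen_nr, entries[j].oid))
				continue; /* visited via another path */
			seen[seen_nr++] = entries[j].oid;
			batch[batch_nr++] = entries[j].oid;
		}

		if (batch_nr) {
			fn(entries[i].path, batch, batch_nr, data);
			batches++;
		}
	}
	return batches;
}

/* Example callback: print each batch and add its size to a total. */
static void toy_print_batch(const char *path, const char **oids,
			    size_t nr, void *data)
{
	size_t i;

	printf("%s:", path);
	for (i = 0; i < nr; i++)
		printf(" %s", oids[i]);
	printf("\n");
	*(size_t *)data += nr;
}
```

Given the hypothetical entries (`my/dir/to/A`, `oid1`), (`other/B`,
`oid2`), (`my/dir/to/A`, `oid3`) and (`copy/of/A`, `oid1`), this model
emits one batch for `my/dir/to/A` and one for `other/B`. No batch is
emitted for `copy/of/A`, because its only object was already visited
through an earlier path.
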
Basics
------

To use the path-walk API, include `path-walk.h` and call
`walk_objects_by_path()` with a customized `path_walk_info` struct. The
struct is used to set all of the options for how the walk should proceed.
Let's dig into the different options and their use.

`path_fn` and `path_fn_data`::
The most important option is the `path_fn` option, which is a
function pointer to the callback that can execute logic on the
object IDs for objects grouped by type and path. This function
also receives a `data` value that corresponds to the
`path_fn_data` member, for providing custom data structures to
this callback function.

`revs`::
To configure the exact details of the reachable set of objects,
use the `revs` member and initialize it using the revision
machinery in `revision.h`. Initialize `revs` using calls such as
`setup_revisions()` or `parse_revision_opt()`. Do not call
`prepare_revision_walk()`, as that will be called within
`walk_objects_by_path()`.
+
It is also important that you do not specify the `--objects` flag for the
`revs` struct. The revision walk should only be used to walk commits, and
the objects will be walked in a separate way based on those starting
commits.

`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
the objects or to have fewer calls to `path_fn`.
+
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.

`prune_all_uninteresting`::
By default, all reachable paths are emitted by the path-walk API.
This option allows consumers to declare that they are not
interested in paths where all included objects are marked with the
`UNINTERESTING` flag. This requires using the `boundary` option in
the revision walk so that the walk emits commits marked with the
`UNINTERESTING` flag.

`pl`::
This pattern list pointer allows focusing the path-walk search to
a set of patterns, only emitting paths that match the given
patterns. See linkgit:gitignore[5] or
linkgit:git-sparse-checkout[1] for details about pattern lists.
When the pattern list uses cone-mode patterns, then the path-walk
API can prune the set of paths it walks to improve performance.

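
The interplay between `path_fn`, `path_fn_data` and the per-type members
can be sketched with a standalone model. Everything prefixed `demo_` is
hypothetical and exists only for this illustration; it borrows the member
names from the list above but is not Git's `path_walk_info` and does not
use the revision machinery. The model also assumes that a nonzero return
from the callback stops the walk, which is a choice made for this sketch.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in Git's headers. */
enum demo_type { DEMO_COMMIT, DEMO_TREE, DEMO_BLOB, DEMO_TAG };

/* One group of objects that share a type and a path. */
struct demo_batch {
	enum demo_type type;
	const char *path;
	size_t nr; /* number of objects in the group */
};

typedef int (*demo_path_fn)(const char *path, size_t nr,
			    enum demo_type type, void *data);

struct demo_walk_info {
	demo_path_fn path_fn;
	void *path_fn_data;
	int commits, trees, blobs, tags;
};

/* Mirror the documented default: every object type starts enabled. */
static void demo_walk_info_init(struct demo_walk_info *info)
{
	info->path_fn = NULL;
	info->path_fn_data = NULL;
	info->commits = info->trees = info->blobs = info->tags = 1;
}

static int demo_type_enabled(const struct demo_walk_info *info,
			     enum demo_type type)
{
	switch (type) {
	case DEMO_COMMIT: return info->commits;
	case DEMO_TREE:   return info->trees;
	case DEMO_BLOB:   return info->blobs;
	case DEMO_TAG:    return info->tags;
	}
	return 0;
}

/*
 * Invoke the callback for every batch whose type is enabled. Returns
 * the number of callback invocations, or -1 if no callback is set or
 * the callback reports an error (an assumption of this sketch).
 */
static int demo_walk(const struct demo_walk_info *info,
		     const struct demo_batch *batches, size_t nr)
{
	int calls = 0;
	size_t i;

	if (!info->path_fn)
		return -1;

	for (i = 0; i < nr; i++) {
		if (!demo_type_enabled(info, batches[i].type))
			continue; /* type disabled: no callback */
		if (info->path_fn(batches[i].path, batches[i].nr,
				  batches[i].type, info->path_fn_data))
			return -1;
		calls++;
	}
	return calls;
}

/* Example callback: sum the object counts of all batches it sees. */
static int demo_count_objects(const char *path, size_t nr,
			      enum demo_type type, void *data)
{
	(void)path;
	(void)type;
	*(size_t *)data += nr;
	return 0;
}
```

A caller would run `demo_walk_info_init()`, set `path_fn` to
`demo_count_objects` and `path_fn_data` to the address of a `size_t`
counter, and then clear `commits` and `trees` to receive callbacks only
for tag and blob batches. That mirrors the point that disabling types
leads to fewer calls to `path_fn`.
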
Examples
--------

See example usages in:
`t/helper/test-path-walk.c`,
`builtin/backfill.c`
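
The cone-mode pruning mentioned for `pl` can also be pictured without
Git's pattern machinery. The sketch below is a simplified model, not the
code used by the path-walk API: the `cone_*` helpers are hypothetical,
directories are written without a trailing slash, and the empty string
stands for the root. It assumes the usual cone-mode rule that a listed
directory is included recursively, while each of its parent directories
contributes only the files directly inside it.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `dir` equals `base` or lies beneath it. "" is the root. */
static int cone_is_within(const char *dir, const char *base)
{
	size_t len = strlen(base);

	if (!len)
		return 1; /* everything is within the root */
	return !strncmp(dir, base, len) &&
	       (dir[len] == '\0' || dir[len] == '/');
}

/*
 * Decide whether a walk needs to descend into `dir`. A directory is
 * kept if it is inside a cone directory or is an ancestor of one;
 * any other directory can be skipped along with everything below it.
 */
static int cone_should_walk_dir(const char *dir, const char **cones,
				size_t nr)
{
	size_t i;

	if (!*dir)
		return 1; /* the root is always walked */

	for (i = 0; i < nr; i++) {
		if (cone_is_within(dir, cones[i]))
			return 1; /* inside a recursively included dir */
		if (cone_is_within(cones[i], dir))
			return 1; /* ancestor of an included dir */
	}
	return 0;
}

/*
 * Decide whether the file at `path` belongs to the cone. In this model
 * a file is included exactly when its containing directory is walked.
 */
static int cone_includes_file(const char *path, const char **cones,
			      size_t nr)
{
	const char *slash = strrchr(path, '/');
	size_t dir_len = slash ? (size_t)(slash - path) : 0;
	char dir[256];

	if (dir_len >= sizeof(dir))
		return 0; /* path too long for this toy buffer */
	memcpy(dir, path, dir_len);
	dir[dir_len] = '\0';

	return cone_should_walk_dir(dir, cones, nr);
}
```

With the single hypothetical cone directory `my/dir`, this model keeps
`my/dir/to/A` and a file directly inside `my`, and it skips a sibling
directory such as `my/other` without looking at anything beneath it.
Non-cone patterns can mix positive and negative matches, so a prefix test
like this one would not be enough for them.
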