Files
git/Documentation/technical/api-path-walk.adoc
Derrick Stolee 70664d2865 pack-objects: add --path-walk option
In order to more easily compute delta bases among objects that appear at
the exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given
by the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

 * Object filters are not currently integrated with the path-walk API,
   such as sparse-checkout or tree depth. A blobless packfile could be
   integrated easily, but that is deferred for later.

 * Server-focused features such as delta islands, shallow packs, and
   using a bitmap index are incompatible with the path-walk API.

 * The path walk API is only compatible with the --revs option, not
   taking object lists or pack lists over stdin. These alternative ways
   to specify the objects currently ignores the --path-walk option
   without even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 12:15:38 -07:00

74 lines
2.9 KiB
Plaintext

Path-Walk API
=============
The path-walk API is used to walk reachable objects, but to visit objects
in batches based on a common path they appear in, or by type.
For example, all reachable commits are visited in a group. All tags are
visited in a group. Then, all root trees are visited. At some point, all
blobs reachable via a path `my/dir/to/A` are visited. When there are
multiple paths possible to reach the same object, then only one of those
paths is used to visit the object.
Basics
------
To use the path-walk API, include `path-walk.h` and call
`walk_objects_by_path()` with a customized `path_walk_info` struct. The
struct is used to set all of the options for how the walk should proceed.
Let's dig into the different options and their use.
`path_fn` and `path_fn_data`::
The most important option is the `path_fn` option, which is a
function pointer to the callback that can execute logic on the
object IDs for objects grouped by type and path. This function
also receives a `data` value that corresponds to the
`path_fn_data` member, for providing custom data structures to
this callback function.
`revs`::
To configure the exact details of the reachable set of objects,
use the `revs` member and initialize it using the revision
machinery in `revision.h`. Initialize `revs` using calls such as
`setup_revisions()` or `parse_revision_opt()`. Do not call
`prepare_revision_walk()`, as that will be called within
`walk_objects_by_path()`.
+
It is also important that you do not specify the `--objects` flag for the
`revs` struct. The revision walk should only be used to walk commits, and
the objects will be walked in a separate way based on those starting
commits.
`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
the objects or to have fewer calls to `path_fn`.
+
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
`prune_all_uninteresting`::
By default, all reachable paths are emitted by the path-walk API.
This option allows consumers to declare that they are not
interested in paths where all included objects are marked with the
`UNINTERESTING` flag. This requires using the `boundary` option in
the revision walk so that the walk emits commits marked with the
`UNINTERESTING` flag.
`pl`::
This pattern list pointer allows focusing the path-walk search to
a set of patterns, only emitting paths that match the given
patterns. See linkgit:gitignore[5] or
linkgit:git-sparse-checkout[1] for details about pattern lists.
When the pattern list uses cone-mode patterns, then the path-walk
API can prune the set of paths it walks to improve performance.
Examples
--------
See example usages in:
`t/helper/test-path-walk.c`,
`builtin/pack-objects.c`,
`builtin/backfill.c`